Friday, January 26, 2007

The enigma that is Google

Yes, it's the end of an era: Googling for "miserable failure" will no longer take you to George Bush's official bio. For that matter, "waffles" will no longer take you to John Kerry. Apparently Google has updated its search algorithm to fight Googlebomb attacks by bloggers.

I've always wondered how much of Google's accuracy depends on specific, hand-coded tweaks, which a researcher like me tends to regard as a cheat compared to automated algorithmic techniques (albeit a necessary one in the real world). Personally, I was surprised that Googlebombing worked as well as it did to begin with; it seemed to me that Google's accuracy relied a lot more on the ad hoc stuff they've applied on top of algorithms like PageRank than they normally let on. However, the new rankings apparently were done entirely through automation. I'm curious what they did. Some very informal observations of mine about Google techniques:

  • They seem to consciously favor known "reference sources" - Wikipedia links seem to come up high for a great many searches, for instance.
  • They also seem to put an automatic penalty on any page that might be classified as pornography. Do a search for "boobies", and you will see pages about the bird dominate the top results; I strongly suspect that would never be a natural result of PageRank alone.*
  • They're disproportionately friendly to academics, and particularly folks in technical disciplines. For example, at a recent machine learning reading group, we were amused that the UC Berkeley faculty member Michael Jordan actually manages to come up fourth in a search for "michael jordan"** - as opposed to, you know, another page about that other Michael Jordan. Perhaps this is the bias resulting from folks like us running Google.


* No, I don't actually sit around searching for things like "boobies" all day. Honest.
** I may have just helped increase his PageRank, actually.


Wednesday, January 17, 2007

The world's most boring publication

A classic from 1955, courtesy of the RAND Corporation: A Million Random Digits with 100,000 Normal Deviates. It's hard to believe that, at one point in time, a book containing nothing more than an obscenely long string of random digits was a genuinely significant contribution to humankind, but be not deceived: effective random number generation is hard. It's hard enough to require a 131-page specification courtesy of the National Institute of Standards and Technology. And in an era when one couldn't count on having a cheap, fast PC to run well-established algorithms on, a book like this was indeed a blessing. I'm tempted to buy this to read on airplanes, just to creep out the person sitting next to me.
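For the curious, here is a rough illustration of why "hard" is not an exaggeration. This is only a sketch of a textbook linear congruential generator, not anything taken from the RAND book or the NIST document; the function names and the constants are my own illustrative choices (an odd multiplier, an odd increment, and a power-of-two modulus, in the style of older C library generators). The output looks random enough at a glance, but its lowest bit simply flips back and forth forever.

```python
def lcg(seed, a=1103515245, c=12345, m=2**31):
    """Yield an endless stream of values from a linear congruential generator.

    The constants are illustrative assumptions: a and c are odd and m is a
    power of two, which is exactly what makes the low bit so predictable.
    """
    state = seed
    while True:
        state = (a * state + c) % m
        yield state


def low_bits(seed, count):
    """Return the lowest bit of the first `count` outputs."""
    gen = lcg(seed)
    return [next(gen) & 1 for _ in range(count)]


if __name__ == "__main__":
    # The low bit alternates strictly, whatever the seed.
    print(low_bits(42, 8))
```

With an odd multiplier and an odd increment, each step turns an even state into an odd one and vice versa, so anyone using the raw output for coin flips would get a perfectly alternating sequence.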
