Netuality

Taming the big, bad, nasty websites

January 12 linkdump: Reddit on Hadoop on steroids, Hadoop lessons learned


A great Hadoop story, and a great read too, from Lau Jensen on the Best In Class blog:

Hadoop opens a world of fun with the promise of some heavy lifting and in order to feed the beast I’ve written a Reddit-scraper in just 30 lines of Clojure.

[...]

Now that we’re sitting with almost unlimited insight into the posts which make Redditors tick, we can think of many stats that would be fun to compute. Since this is a tutorial I’ll go with the simplest version, i.e. something like calculating total number of upvotes per domain/author, but for a future experiment it would be fun to pull out the top authors/posts and also scrape the URLs they link, categorizing them after content length, keywords, number of graphical elements etc, just to get the recipe for a successful post.
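To make the "total number of upvotes per domain/author" idea concrete, here is a minimal sketch of the same aggregation as a map step followed by a reduce step. It is plain Python rather than Lau's Clojure, and it runs in memory rather than on Hadoop. The record fields (`author`, `url`, `upvotes`), the sample data, and the function names are all assumptions made up for illustration, not the scraper's actual output.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical sample records; the field names are assumptions,
# not the real output format of the Reddit scraper.
POSTS = [
    {"author": "alice", "url": "http://example.com/a", "upvotes": 10},
    {"author": "bob", "url": "http://example.com/b", "upvotes": 5},
    {"author": "alice", "url": "http://example.org/c", "upvotes": 7},
]


def map_post(post):
    """Map step: emit one (key, upvotes) pair per domain and per author."""
    domain = urlparse(post["url"]).netloc
    yield ("domain", domain), post["upvotes"]
    yield ("author", post["author"]), post["upvotes"]


def reduce_totals(pairs):
    """Reduce step: sum the emitted values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def total_upvotes(posts):
    """Run the map step over every post, then reduce the emitted pairs."""
    pairs = (pair for post in posts for pair in map_post(post))
    return reduce_totals(pairs)


TOTALS = total_upvotes(POSTS)
for (kind, name), upvotes in sorted(TOTALS.items()):
    print(kind, name, upvotes)
```

On a real cluster the map and reduce functions would run as separate distributed phases, with the framework grouping pairs by key between them; the single dictionary here only stands in for that grouping.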

Alex Popescu has a few notes and questions about ReadPath's use of Hadoop in production:

If you thought using NoSQL solutions would automatically address and solve backup and restore policies, you were wrong. [...]

Written by Adrian

January 12th, 2010 at 9:25 pm

Posted in Linkdump
