Netuality

Taming the big, bad, nasty websites

M/R vs DBMS benchmark paper rebutted

one comment

In a recent ACM article, Jeffrey Dean and Sanjay Ghemawat are discussing some pitfalls in the Hadoop vs DBMS comparison benchmarks that I’ve mentioned in one of my previous posts. They are clarifying three M/R misconceptions from the article:

  • MapReduce cannot use indexes and implies a full scan of all input data;
  • MapReduce input and outputs are always simple files in a file system;
  • MapReduce requires the use of inefficient textual data formats.

and also they emphasize some Hadoop strong points not covered by the benchmark paper.

The biggest drawback which is lack of indexes, while partially compensated in certain use cases by the range query feature, is typically solved by using an external indexing service such as Lucene/SOLR or even a dedicated RDBMS. One can employ vertical and horizontal sharding techniques on indexes in order to answer queries on these pre-canned indexes, instead of scanning the whole data-set as the authors of the comparison paper imply.

Some performance assumptions are also discussed in the second part of the paper. While the benchmarks results were not challenged per se, here’s Jeffrey and Sanjay’s conclusion:

“In our experience, MapReduce is a highly effective and efficient tool for large-scale fault-tolerant data analysis.

[...]

MapReduce provides many significant advantages over parallel databases. First and foremost, it provides fine-grain fault tolerance for large jobs; failure in the middle of a multi-hour execution does not require restarting the job from scratch. Second, MapReduce is very useful for handling data processing and data loading in a heterogenous system with many different storage systems. Third, MapReduce provides a good framework for the execution of more complicated functions than are supported directly in SQL.”

Written by Adrian

January 7th, 2010 at 9:53 am

Posted in Articles

Tagged with , , , ,

One Response to 'M/R vs DBMS benchmark paper rebutted'

Subscribe to comments with RSS or TrackBack to 'M/R vs DBMS benchmark paper rebutted'.

  1. [...] This post was mentioned on Twitter by Ilya Khaprov, Adrian Spinei. Adrian Spinei said: M/R vs DBMS benchmark paper rebutted in an ACM article http://bit.ly/4wD0dE [...]

Leave a Reply

*
To prove you're a person (not a spam script), type the security word shown in the picture. Click on the picture to hear an audio file of the word.
Click to hear an audio file of the anti-spam word