Netuality

Taming the big, bad, nasty websites

Archive for the ‘benchmark’ tag

Benchmarking the cloud: not simple


Understanding the impact of using virtualized servers instead of real ones is perhaps one of the most complex issues when migrating from a traditional configuration to a cloud-based setup, especially because virtualized servers are created equal … but only on paper.

A Rackspace-funded “report” tries to find out the performance differences between Rackspace Cloud Servers and Amazon EC2. I guess the only conclusion we can draw from their so-called report is that Cloud Servers’ disk throughput is better than EC2’s. As the “CPU test” is a kernel compile, which also stresses the disk, I don’t think we can reliably draw any conclusion from it.

An intrepid commenter ran a CPU-only test (Geekbench) and found that EC2 performs slightly better than Rackspace in terms of raw processor performance. The same commenter, affiliated with Cloud Harmony, mentions that a simple hdparm test shows the Rackspace disk has more than twice the throughput of the EC2 disk, at least in terms of buffered reads. Last but not least, don’t forget that for better disk performance Amazon recommends EBS instead of the VM’s local disk.
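
For the curious, the buffered-read figure hdparm reports can be roughly approximated in a few lines of Python. This is only a sketch, not a replacement for hdparm: the device path below is an assumption, reading it needs the right permissions, and the OS page cache can inflate the number on repeated runs.

    import time

    PATH = "/dev/xvdb"          # hypothetical data disk; a multi-GB file works too
    CHUNK = 1024 * 1024         # 1 MiB reads
    TOTAL = 256 * CHUNK         # read 256 MiB in total

    done = 0
    start = time.time()
    with open(PATH, "rb") as f:
        while done < TOTAL:
            buf = f.read(CHUNK)
            if not buf:         # smaller files simply end early
                break
            done += len(buf)
    elapsed = time.time() - start
    print("%.1f MB/s sequential read (%d MiB read)" % (done / elapsed / 1e6, done // CHUNK))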

We cannot reliably make an informed cloud vendor choice using VM benchmarks alone. Ideally, you should benchmark your own app on each cloud infrastructure and choose the one that gives you the best user-facing performance, because at the end of the day that is what matters most. Sadly, today this means experimenting with sometimes wildly different APIs and provisioning models.
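
The measurement itself is the easy part. Here’s a minimal sketch of the “benchmark your own app” idea; both URLs are hypothetical deployments of the same application on two providers, and a real run would of course need many more requests and more representative traffic.

    import time
    import urllib.request

    DEPLOYMENTS = {
        "provider-a": "http://app-on-a.example.com/search?q=test",
        "provider-b": "http://app-on-b.example.com/search?q=test",
    }

    for name, url in DEPLOYMENTS.items():
        timings = []
        for _ in range(20):                     # 20 requests per deployment
            start = time.time()
            urllib.request.urlopen(url).read()  # full user-facing round trip
            timings.append(time.time() - start)
        timings.sort()
        print("%s: median %.0f ms, worst %.0f ms"
              % (name, timings[len(timings) // 2] * 1000, timings[-1] * 1000))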

Written by Adrian

January 18th, 2010 at 10:02 am

Posted in Datacenter


M/R vs DBMS benchmark paper rebutted


In a recent ACM article, Jeffrey Dean and Sanjay Ghemawat discuss some pitfalls in the Hadoop vs DBMS comparison benchmarks that I’ve mentioned in one of my previous posts. They clarify three misconceptions about M/R found in the comparison paper:

  • MapReduce cannot use indexes and implies a full scan of all input data;
  • MapReduce input and outputs are always simple files in a file system;
  • MapReduce requires the use of inefficient textual data formats.

They also emphasize some Hadoop strong points not covered by the benchmark paper.

The biggest drawback, the lack of indexes, is partially compensated for in certain use cases by range queries, and is typically solved by using an external indexing service such as Lucene/SOLR or even a dedicated RDBMS. One can apply vertical and horizontal sharding to these indexes and answer queries from the pre-built indexes instead of scanning the whole data set, as the authors of the comparison paper imply.
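
A minimal sketch of that pattern, assuming a hypothetical Solr core (“events”) and field names: ask the index for the record ids matching a query, then hand only those records to the M/R job instead of scanning everything.

    import json
    import urllib.parse
    import urllib.request

    SOLR_SELECT = "http://solr.example.com:8983/solr/events/select"

    def matching_ids(query, rows=10000):
        # Ask the external index for the record ids matching the query.
        params = urllib.parse.urlencode(
            {"q": query, "fl": "id", "wt": "json", "rows": rows})
        resp = urllib.request.urlopen(SOLR_SELECT + "?" + params)
        docs = json.loads(resp.read().decode("utf-8"))["response"]["docs"]
        return [doc["id"] for doc in docs]

    # Only records whose ids came back from the index are fed to the M/R job;
    # the rest of the data set is never read.
    ids = matching_ids("page_rank:[10 TO *]")
    print("would process %d records instead of the full data set" % len(ids))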

Some performance assumptions are also discussed in the second part of the paper. While the benchmark results were not challenged per se, here’s Jeffrey and Sanjay’s conclusion:

“In our experience, MapReduce is a highly effective and efficient tool for large-scale fault-tolerant data analysis.

[...]

MapReduce provides many significant advantages over parallel databases. First and foremost, it provides fine-grain fault tolerance for large jobs; failure in the middle of a multi-hour execution does not require restarting the job from scratch. Second, MapReduce is very useful for handling data processing and data loading in a heterogeneous system with many different storage systems. Third, MapReduce provides a good framework for the execution of more complicated functions than are supported directly in SQL.”

Written by Adrian

January 7th, 2010 at 9:53 am

Posted in Articles


Hadoop Map/Reduce versus DBMS, benchmarks


Here’s a recent benchmark published at SIGMOD ’09 by a team of researchers and students from Brown, M.I.T. and Wisconsin-Madison. The details of their setup are here, and this is the paper (PDF).

They ran a few simple tasks such as loading, “grepping” (as described in the original M/R paper), aggregation, selection and join on a total of 1TB of data. On the same 100-node Red Hat cluster they compared Vertica (a well-known MPP database), “plain” Hadoop with custom-coded Map/Reduce tasks and an unnamed DBMS-X (probably Oracle Exadata, which is mentioned in the article).
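
For readers who haven’t seen the original paper, the “grep” task is the simplest of the bunch; here is a pure-Python sketch of its map and reduce steps, run locally on made-up records rather than on a Hadoop cluster.

    import re

    PATTERN = re.compile(b"XYZ")        # the task scans for a rare 3-character pattern

    def map_records(records):
        # Mapper: emit every (key, record) pair whose record contains the pattern.
        for key, value in records:
            if PATTERN.search(value):
                yield key, value

    def reduce_identity(pairs):
        # Reducer: identity -- matching records are passed through unchanged.
        for key, value in pairs:
            yield key, value

    # Made-up 10,000-record input standing in for the benchmark's 1TB data set.
    records = [(i, b"row %d %s" % (i, b"XYZ" if i % 97 == 0 else b"abc"))
               for i in range(10000)]
    matches = list(reduce_identity(map_records(records)))
    print("matched %d of %d records" % (len(matches), len(records)))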

The final result shows Vertica and DBMS-X being (not astonishing at all!) two and three times faster, respectively, than the brute-force M/R approach. They also mention that Hadoop was surprisingly easy to install and run, while the DBMS-X installation was relatively complex and had to be followed by tuning. The parallel databases used space more efficiently thanks to compression, while Hadoop needed at least three times the space because of its replication mechanism. A good point for Hadoop was its failure model, which allows quick recovery from faults and keeps long-running jobs from being restarted.

The authors recommend parallel DBMSes over “brute force” models: “[…] we are wary of devoting huge computational clusters and “brute force” approaches to computation when sophisticated software could do the same processing with far less hardware and consume far less energy, or in less time, thereby obviating the need for a sophisticated fault tolerance model. A multi-thousand-node cluster of the sort Google, Microsoft, and Yahoo! run uses huge amounts of energy, and as our results show, for many data processing tasks a parallel DBMS can often achieve the same performance using far fewer nodes. As such, the desirable approach is to use high-performance algorithms with modest parallelism rather than brute force approaches on much larger clusters.”

What do you think, dear reader? I would be curious to see the same benchmark replicated on other NoSQL systems. Also, I find 1TB too small to be representative of most web-scale apps today.

Written by Adrian

January 3rd, 2010 at 10:40 pm

Posted in Articles
