Netuality

Taming the big, bad, nasty websites

Benchmarking the cloud: not simple


Understanding the impact of using virtualized servers instead of real ones is perhaps one of the most complex issues when migrating from a traditional configuration to a cloud-based setup, especially because virtualized servers are created equal … but only on paper.

A Rackspace-funded “report” tries to measure the performance differences between Rackspace Cloud Servers and Amazon EC2. I guess the only conclusion we can draw from their so-called report is that Cloud Servers disk throughput is better than EC2’s. As the “CPU test” is a kernel compile, which also stresses the disk, I don’t think we can reliably draw any conclusion from it.

An intrepid commenter ran a CPU-only test (Geekbench) and found that EC2 performs slightly better than Rackspace in terms of raw processor performance. The same commenter, affiliated with Cloud Harmony, mentions that a simple hdparm test shows the Rackspace disk has more than twice the throughput of the EC2 disk, at least in terms of buffered reads. Last but not least, don’t forget that for better disk performance Amazon recommends EBS instead of the VM disk.

We cannot reliably make an informed cloud vendor choice just using VM benchmarks. Ideally, you should benchmark your own app on each cloud infrastructure and choose the one which gives you the best user-facing performance, because at the end of the day this is what matters most. Sadly, today this means experimenting with sometimes wildly different APIs and provisioning models.
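To make the "benchmark your own app" advice a bit more concrete, here is a minimal sketch of a latency harness in Python. Everything in it is an assumption for illustration: the function names `measure_latencies` and `summarize`, the labels `provider_a` and `provider_b`, and the `time.sleep` stand-ins, which would be replaced by a real request against your app deployed on each cloud.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latencies(request: Callable[[], object],
                      runs: int = 50, warmup: int = 5) -> List[float]:
    """Call `request` repeatedly; return per-call wall-clock latencies in seconds."""
    for _ in range(warmup):
        request()  # warm caches and connections; these calls are not recorded
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        request()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples: List[float]) -> Dict[str, float]:
    """Reduce raw samples to the median and the nearest-rank 95th percentile."""
    ordered = sorted(samples)
    rank = max(1, (95 * len(ordered) + 99) // 100)  # integer ceil(0.95 * n)
    return {"median": statistics.median(ordered), "p95": ordered[rank - 1]}


if __name__ == "__main__":
    # Hypothetical stand-ins: swap each lambda for a real user-facing request
    # (for example an HTTP GET) against the same app on each provider.
    candidates = {
        "provider_a": lambda: time.sleep(0.002),
        "provider_b": lambda: time.sleep(0.001),
    }
    for name, request in candidates.items():
        stats = summarize(measure_latencies(request, runs=20, warmup=2))
        print(f"{name}: median={stats['median']:.4f}s p95={stats['p95']:.4f}s")
```

Comparing medians and tail percentiles of the same user-facing request is closer to what visitors experience than a synthetic CPU or disk score, although a serious comparison would also need repeated runs at different times of day.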

Written by Adrian

January 18th, 2010 at 10:02 am

Posted in Datacenter


January 13 linkdump: KDD, EC2 congested, Coherence, Zimbra


Call to arms for the annual ACM KDD Conference. KDD stands for Knowledge Discovery and Data Mining, so if you’re looking for some hardcore use cases and new algorithms to apply, this is definitely the place to be (Washington, July 25-28):

KDD-2010 will feature keynote presentations, oral paper presentations, poster sessions, workshops, tutorials, panels, exhibits, demonstrations, and the KDD Cup competition.

There’s a rumor on the street that Amazon EC2 is over-subscribed. From the trenches it appears that their scalability is … well, duh … not infinite, and their elasticity is a tiny bit rigid:

Anyone that uses virtualized computing, whether it is in the cloud or in their own private setup (VMWare for example) knows you take a performance hit. These performance hits can be considerable, but on the whole, are tolerable and can be built into an architecture from the start.

The problems that we are starting to see from Amazon, are more than just the overhead of a virtualized environment. They are deep rooted scalability problems at their end that need to be addressed sooner rather than later.

My Adobe colleague Ricky Ho has posted some notes on Oracle’s Coherence (formerly Tangosol), a distributed Java cache rich in features. A great read especially if you want a technical intro to the product (code snippets and everything).

The acquisition of the day is Zimbra being bought by VMWare. Yahoo is selling Zimbra at a loss, it seems. Analysts wonder what exactly VMWare is planning to do; they’re probably going up the stack and working on providing their own cloud ecosystem and related services. “VMWare Applications”, soon?

Under the terms of the agreement, Yahoo can continue to use Zimbra technology in its communications services. VMWare’s interest in Zimbra is a bit of a mystery since VMWare focuses on selling virtualization technology; in the release, VMWare offers somewhat of an explanation saying that the purchase furthers its “mission of taking complexity out of the datacenter, desktop, application development and core IT services”

Written by Adrian

January 13th, 2010 at 8:23 pm

Posted in Linkdump


January 12 linkdump: Reddit on Hadoop on steroids, Hadoop lessons learned


Great Hadoop story, and a great read too, from Lau Jensen on the Best In Class blog:

Hadoop opens a world of fun with the promise of some heavy lifting and in order to feed the beast I’ve written a Reddit-scraper in just 30 lines of Clojure.

[...]

Now that we’re sitting with almost unlimited insight into the posts which make Redditors tick, we can think of many stats that would be fun to compute. Since this is a tutorial I’ll go with the simplest version, ie. something like calculating total number of upvotes per domain/author, but for a future experiment it would be fun to pull out the top authors/posts and also scrape the URLs they link, categorizing them after content length, keywords, number of graphical elements etc, just to get the recipe for a succesful post.
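The "total number of upvotes per domain" statistic mentioned in the quote is a classic map/reduce aggregation. The sketch below is not Lau Jensen's Clojure code and does not use Hadoop; it is a plain Python illustration of the same map, shuffle and reduce phases, and the post fields `url` and `upvotes` are assumed names for the scraped data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple
from urllib.parse import urlparse


def map_post(post: dict) -> Iterator[Tuple[str, int]]:
    """Map phase: emit one (domain, upvotes) pair per scraped post."""
    domain = urlparse(post["url"]).netloc
    yield domain, post["upvotes"]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle phase: group all emitted values by their key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_upvotes(domain: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce phase: sum the upvotes collected for one domain."""
    return domain, sum(counts)


def upvotes_per_domain(posts: Iterable[dict]) -> Dict[str, int]:
    """Run the three phases in-process over an iterable of posts."""
    pairs = (pair for post in posts for pair in map_post(post))
    return dict(reduce_upvotes(d, c) for d, c in shuffle(pairs).items())


if __name__ == "__main__":
    sample = [  # made-up posts, for illustration only
        {"url": "http://example.com/a", "upvotes": 10},
        {"url": "http://example.org/b", "upvotes": 3},
        {"url": "http://example.com/c", "upvotes": 5},
    ]
    print(upvotes_per_domain(sample))
```

On a real cluster the framework performs the shuffle across machines; only the map and reduce functions are written by hand.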

Alex Popescu has a few notes and questions about ReadPath’s usage of Hadoop in production:

If you thought using NoSQL solutions would automatically address and solve backup and restore policies, you were wrong. [...]

Written by Adrian

January 12th, 2010 at 9:25 pm

Posted in Linkdump


M/R vs DBMS benchmark paper rebutted


In a recent ACM article, Jeffrey Dean and Sanjay Ghemawat discuss some pitfalls in the Hadoop vs DBMS comparison benchmarks that I’ve mentioned in one of my previous posts. They clarify three M/R misconceptions from the comparison paper:

  • MapReduce cannot use indexes and implies a full scan of all input data;
  • MapReduce input and outputs are always simple files in a file system;
  • MapReduce requires the use of inefficient textual data formats.

They also emphasize some Hadoop strong points not covered by the benchmark paper.

The biggest drawback, the lack of indexes, is partially compensated in certain use cases by the range query feature, and is typically solved by using an external indexing service such as Lucene/SOLR or even a dedicated RDBMS. One can employ vertical and horizontal sharding techniques on these indexes and answer queries from the pre-canned indexes, instead of scanning the whole data set as the authors of the comparison paper imply.

Some performance assumptions are also discussed in the second part of the paper. While the benchmark results were not challenged per se, here’s Jeffrey and Sanjay’s conclusion:
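As a rough illustration of answering a query from a pre-built, horizontally sharded index rather than a full scan, here is a small in-memory sketch in Python. It is only a stand-in for the idea: it is not how Lucene/SOLR or any RDBMS implements indexing, and the names `ShardedIndex` and `build_index` are hypothetical.

```python
import zlib
from typing import Dict, List


class ShardedIndex:
    """Maps a key to record positions, split across shards by key hash."""

    def __init__(self, num_shards: int = 4) -> None:
        self.shards: List[Dict[str, List[int]]] = [{} for _ in range(num_shards)]

    def _shard_for(self, key: str) -> Dict[str, List[int]]:
        # crc32 is stable across runs, unlike Python's salted built-in hash()
        return self.shards[zlib.crc32(key.encode("utf-8")) % len(self.shards)]

    def add(self, key: str, record_id: int) -> None:
        self._shard_for(key).setdefault(key, []).append(record_id)

    def lookup(self, key: str) -> List[int]:
        """Touch a single shard instead of scanning every record."""
        return list(self._shard_for(key).get(key, []))


def build_index(records: List[dict], field: str, num_shards: int = 4) -> ShardedIndex:
    """One-off indexing pass; in practice this could itself be an M/R job."""
    index = ShardedIndex(num_shards)
    for record_id, record in enumerate(records):
        index.add(record[field], record_id)
    return index


if __name__ == "__main__":
    records = [  # made-up data, for illustration only
        {"domain": "example.com", "hits": 12},
        {"domain": "example.org", "hits": 7},
        {"domain": "example.com", "hits": 3},
    ]
    index = build_index(records, "domain")
    print([records[i] for i in index.lookup("example.com")])
```

In a real deployment each shard would live on a different node, so a lookup costs one network hop to one shard rather than a pass over the whole data set.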

“In our experience, MapReduce is a highly effective and efficient tool for large-scale fault-tolerant data analysis.

[...]

MapReduce provides many significant advantages over parallel databases. First and foremost, it provides fine-grain fault tolerance for large jobs; failure in the middle of a multi-hour execution does not require restarting the job from scratch. Second, MapReduce is very useful for handling data processing and data loading in a heterogenous system with many different storage systems. Third, MapReduce provides a good framework for the execution of more complicated functions than are supported directly in SQL.”

Written by Adrian

January 7th, 2010 at 9:53 am

Posted in Articles


How big is your meat cloud? The golden number for servers


Just went through a recent thread on Slashdot discussing “how many admins per user computer” or how many desktops per admin to be more specific. While the client desktop subject is totally uninteresting, I found in the comment noise a few interesting tidbits about the meat cloud size in different server environments.

On the low, non-automated end there were figures such as “1 admin per 70 Linux boxes or 30 Windows machines” (are Windows servers really twice as difficult to manage as Linux servers?) – confirmed by another commenter working for a Government facility. Of course, it depends on how many different hardware brands and software services you have to manage…

Another sysadmin, allegedly with 12 years of experience, commented that the larger the organization, the bigger the ratio: from 50 servers per sysadmin in small organizations to 250 in corporations (but his company revenue “definitions” are a bit weird). An insightful comment cites Facebook’s Jeff Rothschild, according to whom Facebook has roughly 130 servers per admin or (interesting metric) 1 million or more users per engineer.

Of course, in specific cases this number can go way higher, especially when you have to deal with quasi-identical hardware and software configurations running in a very large cluster. On the extreme scale there’s the Microsoft container data center in Chicago, which supposedly has a total of 30 employees supporting some 300,000 servers. That’s 10,000 servers/employee! At this point I suspect they basically only change faulty hardware and wire new capacity when needed; everything else should be fully automated.

Written by Adrian

January 5th, 2010 at 7:16 pm

Posted in Datacenter
