Netuality

Taming the big, bad, nasty websites

Archive for the ‘EC2’ tag

Linkdump: Cassandra lovers, blowing the circuit breaker and Oracle clouds

2 comments

Good points (as always) on Alexandru’s blog discussing the SQL scalability isn’t for everyone topic.

NoSQL as RDBMS are just tools for our job and there is nothing about the death of one of the other. But as we’ve learned over years, every new programming language is the death of all its precursors, every new programming paradigm is the death of everything that existed before and so on. The part that some seem to be missing or ignoring deliberately is that in most of these cases this death have never really happened.

For large-scale performance testing of a production environment check out how Facebook MySpace simulated 1 million concurrent users with a huge EC2 cluster, described on the High Scalability blog. While the article is a guest post from a company selling “cloud testing” solutions and has a bit of “sales juice” in it, it’s still a very good read:

Large-scale testing using EC2

Someone is in love with Cassandra after only 4 months. Hoping Cassandra doesn’t get too fat after the wedding:

Traditional sharding and replication with databases like MySQL and PostgreSQL have been shown to work even on the largest scale websites — but come at a large operational cost. Setting up replication for MySQL can be done quickly, but there are many issues you need to be aware of, such as slave replication lag. Sharding can be done once you reach write throughput limits, but you are almost always stuck writing your own sharding layer to fit how your data is created and operationally, it takes a lot of time to set everything up correctly. We skipped that step all together and added a couple hooks to make our data aggregation service siphon to both PostgreSQL and Cassandra for the initial integration.

Distributed data war stories from Anders @ bandwidth.com, HBase and Hadoop on commodity hardware:

As mentioned before, the commodity machines I used were very basic but I was able to insert conservatively about 500 records per second with this setup. I kept blowing the circuit breaker at the office as well forcing me to spread the machines across several power circuits but it proved that the system was at least fault tolerant!

SourceForge chooses Python, TurboGears and … MongoDB for a new version of their website. Looks like Mongo is becoming quite mainstream.

Don’t believe the rumors, Oracle is into cloud computing after all – at least according to Forrester. Well, as long as the clouds are private. And as long as you can live with “coming soon” tooling. And it’s not like they really have a clear long-term strategy for cloud computing:

I believe that cloud is a revolution for Oracle, IBM, SAP, and the other big vendors with direct sales forces (despite what they say). Cloud computing has the potential to undermine the account-management practices and pricing models these big companies are founded on. I think it will take years for each of the big vendors to adapt to cloud computing. Oracle is just beginning this journey; I think other vendors are further down the track.

The igvita blog hits NoSQL in the groin by showing a simple way of having a schema-free data store … in MySQL. It’s a sort of proxy that translates schemas into denormalized data placed in distinct tables:

Instead of defining columns on a table, each attribute has its own table (new tables are created on the fly), which means that we can add and remove attributes at will. In turn, performing a select simply means joining all of the tables on that individual key. To the client this is completely transparent, and while the proxy server does the actual work, this functionality could be easily extracted into a proper MySQL engine – I’m just surprised that no one has done so already.

While an interesting idea, not sure how effective this will be in practice, as joins are among the most time-consuming operations in the database world. I’m pretty sure that replacing a 10-column table get on the primary key with joins on 10 tables will add an important overhead.

Written by Adrian

March 4th, 2010 at 9:31 pm

Posted in Linkdump

Tagged with , , , ,

Benchmarking the cloud: not simple

2 comments

Understanding the impact of using virtualized servers instead of real ones is perhaps one of the most complex issues when migrating from a traditional configuration to a cloud-based setup. Especially because virtualized servers are created equal … but only on paper.

A Rackspace-funded “report” tries to find out the performance differences between Rackspace Cloud Servers and Amazon EC2. I guess the only conclusion we can get from their so-called report is that Cloud Server disk throughput is better than EC2’s. As the “CPU test” is a kernel compile which also stresses the disk, I don’t think we can reliably get any conclusion from these.

An intrepid commenter ran a CPU-only test (Geekbench) and found out that EC2 performs slightly better than Rackspace in terms of raw processor performance. The same commenter, affiliated with Cloud Harmony,  mentions that a simple hdparm test shows that Rackspace hdd has more than twice the throughput of EC2 hdd, at least in terms of buffered reads. Last but not least, don’t forget that for better disk performance Amazon recommends EBS instead of the VM disk.

We cannot reliably make an informed cloud vendor choice just using VM benchmarks. Ideally, you should benchmark your own app on each cloud infrastructure and choose the one which gives you the best user-facing performance, because at the end of the day this is what matters most. Sadly, today this means experimenting with sometimes wildly different APIs and provisioning models.

Written by Adrian

January 18th, 2010 at 10:02 am

Posted in Datacenter

Tagged with , , , , ,

January 13 linkdump: KDD, EC2 congested, Coherence, Zimbra

leave a comment

Call to arms for the annual ACM KDD Conference. KDD stands for Knowledge Discovery and Data Mining, so if you’re looking for some hardcore use cases and new algorithms to apply, this is definitely the place to be (Washington, July 25-28):

KDD-2010 will feature keynote presentations, oral paper presentations, poster sessions, workshops, tutorials, panels, exhibits, demonstrations, and the KDD Cup competition.

There’s rumor on the street that Amazon EC2 is over-subscribed. From the trenches it appears that their scalability is … well, duh … not infinite and elasticity is a tiny bit rigid:

Anyone that uses virtualized computing, whether it is in the cloud or in their own private setup (VMWare for example) knows you take a performance hit. These performance hits can be considerable, but on the whole, are tolerable and can be built into an architecture from the start.

The problems that we are starting to see from Amazon, are more than just the overhead of a virtualized environment. They are deep rooted scalability problems at their end that need to be addressed sooner rather than later.

My Adobe colleague Ricky Ho has posted some notes on Oracle’s Coherence (formerly Tangosol), a distributed Java cache rich in features. A great read especially if you want a technical intro to the product (code snippets and everything).

The acquisition of the day is Zimbra being bought by VMWare. Yahoo is selling Zimbra a loss, it seems. Analysts wonder what exactly is VMWare planning to do, well they’re probably going up the stack and working on providing their own cloud ecosystem and related services. “VMWare Applications”, soon?

Under the terms of the agreement, Yahoo can continue to use Zimbra technology in its communications services. VMWare’s interest in Zimbra is a bit of a mystery since VMWare focuses on selling virtualization technology; in the release, VMWare offers somewhat of an explanation saying that the purchase furthers its “mission of taking complexity out of the datacenter, desktop, application development and core IT services”

Written by Adrian

January 13th, 2010 at 8:23 pm

Posted in Linkdump

Tagged with , , , , , ,