Netuality

Taming the big, bad, nasty websites

Archive for the ‘Oracle’ tag

January 13 linkdump: KDD, EC2 congested, Coherence, Zimbra

leave a comment

Call to arms for the annual ACM KDD Conference. KDD stands for Knowledge Discovery and Data Mining, so if you’re looking for some hardcore use cases and new algorithms to apply, this is definitely the place to be (Washington, July 25-28):

KDD-2010 will feature keynote presentations, oral paper presentations, poster sessions, workshops, tutorials, panels, exhibits, demonstrations, and the KDD Cup competition.

There’s rumor on the street that Amazon EC2 is over-subscribed. From the trenches it appears that their scalability is … well, duh … not infinite and elasticity is a tiny bit rigid:

Anyone that uses virtualized computing, whether it is in the cloud or in their own private setup (VMWare for example) knows you take a performance hit. These performance hits can be considerable, but on the whole, are tolerable and can be built into an architecture from the start.

The problems that we are starting to see from Amazon, are more than just the overhead of a virtualized environment. They are deep rooted scalability problems at their end that need to be addressed sooner rather than later.

My Adobe colleague Ricky Ho has posted some notes on Oracle’s Coherence (formerly Tangosol), a distributed Java cache rich in features. A great read especially if you want a technical intro to the product (code snippets and everything).

The acquisition of the day is Zimbra being bought by VMWare. Yahoo is selling Zimbra a loss, it seems. Analysts wonder what exactly is VMWare planning to do, well they’re probably going up the stack and working on providing their own cloud ecosystem and related services. “VMWare Applications”, soon?

Under the terms of the agreement, Yahoo can continue to use Zimbra technology in its communications services. VMWare’s interest in Zimbra is a bit of a mystery since VMWare focuses on selling virtualization technology; in the release, VMWare offers somewhat of an explanation saying that the purchase furthers its “mission of taking complexity out of the datacenter, desktop, application development and core IT services”

Written by Adrian

January 13th, 2010 at 8:23 pm

Posted in Linkdump

Tagged with , , , , , ,

Hadoop Map/Reduce versus DBMS, benchmarks

leave a comment

Here’s a recent benchmark published at SIGMOD ’09 by a team of researchers and students from Brown, M.I.T. and Wisconsin-Madison universities. The details of their setup here and this is the paper (PDF).

They ran a few simple tasks such as loading, „grepping” (as described in the original M/R paper), aggregation, selection and join on a total of 1TB of data. On the same 100-nodes RedHat cluster they compared Vertica (a well-known MPP), „plain” Hadoop with custom-coded Map/Reduce tasks and an unnamed DBMS-X (probably Oracle Exadata, which is mentioned in the article).

The final result shows Vertica and DBMS-X being (not astonishing at all!) 2, respectively 3 times faster than the brute M/R approach. What they also mention is that Hadoop was surprisingly easy to install and run, while the DBMS-X installation process was a relatively complex one, followed by tuning. Parallel databases were using space more efficiently due to compression, while Hadoop needed at least 3 times the space due to redundancy mechanism. A good point for Hadoop was the failure model allowing for quick recovery from faults and uninterrupted long-running jobs.

The authors recommend parallel DBMS-es against „brute force” models. “[…] we are wary of devoting huge computational clusters and “brute force” approaches to computation when sophisticated software would could do the same processing with far less hardware and consume far less energy, or in less time, thereby obviating the need for a sophisticated fault tolerance model. A multithousand- node cluster of the sort Google, Microsoft, and Yahoo! run uses huge amounts of energy, and as our results show, for many data processing tasks a parallel DBMS can often achieve the same performance using far fewer nodes. As such, the desirable approach is to use high-performance algorithms with modest parallelism rather than brute force approaches on much larger clusters.

What do you think, dear reader? I would be curious to see the same benchmark replicated on other NoSQL systems. Also, I find 1TB too low for most web-scale apps today.

Written by Adrian

January 3rd, 2010 at 10:40 pm

Posted in Articles

Tagged with , , , , ,