Netuality

Taming the big bad websites

Archive for the ‘Articles’ Category

Google’s Map/Reduce patent and impact on Hadoop: none expected


From the GigaOm analysis:

Fortunately for them, it seems unlikely that Google will take to the courts to enforce its new intellectual property. A big reason is that “map” and “reduce” functions have been part of parallel programming for decades, and vendors with deep pockets certainly could make arguments that Google didn’t invent MapReduce at all.

Should Hadoop come under fire, any defendants (or interveners like Yahoo and/or IBM) could have strong technical arguments over whether the open-source Hadoop even is an infringement. Then there is the question of money: Google has been making plenty of it without the patent, so why risk the legal and monetary consequences of losing any hypothetical lawsuit? Plus, Google supports Hadoop, which lets university students learn webscale programming (so they can become future Googlers) without getting access to Google’s proprietary MapReduce language.

[...]

A Google spokeswoman emailed this in response to our questions about why Google sought the patent, and whether or not Google would seek to enforce its patent rights, attributing it to Michelle Lee, Deputy General Counsel:

“Like other responsible, innovative companies, Google files patent applications on a variety of technologies it develops. While we do not comment about the use of this or any part of our portfolio, we feel that our behavior to date has been in line with our corporate values and priorities.”

From Ars Technica:

Hadoop isn’t the only open source project that uses MapReduce technology. As some readers may know, I’ve recently been experimenting with CouchDB, an open source database system that allows developers to perform queries with map and reduce functions. Another place where I’ve seen MapReduce is Nokia’s QtConcurrent framework, an extremely elegant parallel programming library for Qt desktop applications.

It’s unclear what Google’s patent will mean for all of these MapReduce adopters. Fortunately, Google does not have a history of aggressive patent enforcement. It’s certainly possible that the company obtained the patent for “defensive” purposes. Like virtually all major software companies, Google is frequently the target of patent lawsuits. Many companies in technical fields attempt to collect as many broad patents as they can so that they will have ammunition with which to retaliate when they are faced with patent infringement lawsuits.

Google’s MapReduce patent raises some troubling questions for software like Hadoop, but it looks unlikely that Google will assert the patent in the near future; Google itself uses Hadoop for its Code University program.

Even if Google takes the unlikely course of action and does decide to target Hadoop users with patent litigation, the company would face significant resistance from the open source project’s deep-pocketed backers—including IBM, which holds the industry’s largest patent arsenal.

Another dimension of this issue is the patent’s validity. On one hand, it’s unclear if taking age-old principles of functional software development and applying them to a cluster constitutes a patentable innovation.

Still nothing from the big analysts, Gartner and the gang…

Written by Adrian

January 22nd, 2010 at 7:39 pm

Posted in Articles


M/R vs DBMS benchmark paper rebutted


In a recent ACM article, Jeffrey Dean and Sanjay Ghemawat discuss some pitfalls of the Hadoop vs. DBMS comparison benchmarks that I mentioned in one of my previous posts. They clarify three M/R misconceptions found in the benchmark paper:

  • MapReduce cannot use indexes and implies a full scan of all input data;
  • MapReduce input and outputs are always simple files in a file system;
  • MapReduce requires the use of inefficient textual data formats.

and they also emphasize some Hadoop strong points that the benchmark paper does not cover. (The sketch below illustrates the third misconception.)
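The CACM rebuttal makes the third point in terms of Google's own infrastructure; as an illustration on the Hadoop side (my example, not theirs), here is a minimal sketch, assuming Hadoop's 0.20-era Java API. It writes binary key/value records to a SequenceFile, which a job can later consume through SequenceFileInputFormat without any textual parsing; the class name and output path are made up.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class BinaryRecordsDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("records.seq"); // hypothetical output path

            // Append binary key/value pairs; no text parsing is involved
            // when a MapReduce job later reads this file.
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, path, Text.class, IntWritable.class);
            try {
                writer.append(new Text("some-key"), new IntWritable(42));
            } finally {
                writer.close();
            }
        }
    }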

The biggest drawback, the lack of indexes, is partially compensated in certain use cases by the range-query feature; it is typically solved by using an external indexing service such as Lucene/SOLR, or even a dedicated RDBMS. One can apply vertical and horizontal sharding techniques to these indexes and answer queries from the pre-canned indexes, instead of scanning the whole data set as the authors of the comparison paper imply one must.
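For illustration, here is a minimal sketch of answering a query from such a pre-built external index, assuming the Lucene 3.x-era API (the index path and field names are hypothetical):

    import java.io.File;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class IndexLookup {
        public static void main(String[] args) throws Exception {
            // Answer the query from the index instead of scanning the data set.
            IndexSearcher searcher =
                    new IndexSearcher(FSDirectory.open(new File("/data/index")));
            TopDocs hits = searcher.search(
                    new TermQuery(new Term("userId", "12345")), 10);
            for (ScoreDoc sd : hits.scoreDocs) {
                Document doc = searcher.doc(sd.doc);
                System.out.println(doc.get("url")); // assumes a stored "url" field
            }
            searcher.close();
        }
    }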

Some performance assumptions are also discussed in the second part of the paper. While the benchmark results were not challenged per se, here is Jeffrey and Sanjay's conclusion:

“In our experience, MapReduce is a highly effective and efficient tool for large-scale fault-tolerant data analysis.

[...]

MapReduce provides many significant advantages over parallel databases. First and foremost, it provides fine-grain fault tolerance for large jobs; failure in the middle of a multi-hour execution does not require restarting the job from scratch. Second, MapReduce is very useful for handling data processing and data loading in a heterogeneous system with many different storage systems. Third, MapReduce provides a good framework for the execution of more complicated functions than are supported directly in SQL.”

Written by Adrian

January 7th, 2010 at 9:53 am

Posted in Articles


Hadoop Map/Reduce versus DBMS, benchmarks


Here’s a recent benchmark published at SIGMOD ’09 by a team of researchers and students from Brown, M.I.T., and Wisconsin-Madison. The details of their setup are here, and this is the paper (PDF).

They ran a few simple tasks, such as loading, “grepping” (as described in the original M/R paper), aggregation, selection, and join, on a total of 1 TB of data. On the same 100-node Red Hat cluster they compared Vertica (a well-known MPP database), “plain” Hadoop with custom-coded Map/Reduce tasks, and an unnamed DBMS-X (probably Oracle Exadata, which is mentioned in the article).
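As a point of reference, the grep task boils down to a map-only job. Here is a minimal sketch of how it could be coded, assuming Hadoop's 0.20-era API; the literal pattern and the class names are my own, and the paper's actual pattern and record layout may differ.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class Grep {
        public static class GrepMapper
                extends Mapper<LongWritable, Text, Text, NullWritable> {
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                // Emit only the records containing the pattern.
                if (value.toString().contains("XYZ")) {
                    ctx.write(value, NullWritable.get());
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "grep");
            job.setJarByClass(Grep.class);
            job.setMapperClass(GrepMapper.class);
            job.setNumReduceTasks(0); // map-only: no shuffle, no reduce phase
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }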

The final result shows Vertica and DBMS-X to be (not astonishing at all!) two and three times faster, respectively, than the brute-force M/R approach. They also mention that Hadoop was surprisingly easy to install and run, while the DBMS-X installation was a relatively complex process, followed by tuning. The parallel databases used space more efficiently thanks to compression, while Hadoop needed at least three times the space due to its redundancy mechanism. A good point for Hadoop was its failure model, which allows quick recovery from faults and keeps long-running jobs uninterrupted.

The authors recommend parallel DBMSes over “brute force” models: “[…] we are wary of devoting huge computational clusters and ‘brute force’ approaches to computation when sophisticated software could do the same processing with far less hardware and consume far less energy, or in less time, thereby obviating the need for a sophisticated fault tolerance model. A multi-thousand-node cluster of the sort Google, Microsoft, and Yahoo! run uses huge amounts of energy, and as our results show, for many data processing tasks a parallel DBMS can often achieve the same performance using far fewer nodes. As such, the desirable approach is to use high-performance algorithms with modest parallelism rather than brute force approaches on much larger clusters.”

What do you think, dear reader? I would be curious to see the same benchmark replicated on other NoSQL systems. Also, I find 1 TB too low for most web-scale apps today.

Written by Adrian

January 3rd, 2010 at 10:40 pm

Posted in Articles


Rebuttal: What’s Wrong with the Eclipse Plugin Infrastructure?


Mr. Philipp K. Janert, author of “What's Wrong with the Eclipse Plugin Infrastructure?” (version 1.1, 2004/03/13), hasn't answered the email I sent six weeks ago, nor modified a single word of his article. With such a long delay, I presume he is enjoying a well-deserved retirement in the Bahamas. Given his absence and his inability to correct the erroneous information on his site, someone else (me, for instance!) has to correct the errors in the article. I strongly encourage everybody else who thinks that the Eclipse plugin architecture is not as 'wrong' as Philipp claims to write him a few words by email; maybe (when he comes back from the Bahamas) he'll manage to fix the problems with his document.

Here we go:

Dear Philipp,

Unfortunately, most of the points in your article “What's Wrong with the Eclipse Plugin Infrastructure?” are wrong:

1. There is a way (albeit undocumented) to keep the locally installed plugins between different Eclipse installs. It is covered in some weblogs, and this is one of them. (See the first sketch after point 13 below.)

2. Using the workaround described at point 1, you can partially solve point 2 by creating per-user links to the local plugins directory.

3. This is already addressed in 3.0M8 and will be fixed in the 3.0 final release.

4. This is a design choice, also explained by point 5.

5. Plugins are radically different from simple libraries. The corresponding class hierarchies are loaded (and can be unloaded) on demand by a special Eclipse classloader, so putting the plugin artifacts on the JVM classpath is not useful in this case. (The second sketch after point 13 illustrates this.)

6. Yes, there is a difference. Please, please, research the documentation thoroughly before writing an article with public exposure.

7. Valid point; it will be addressed before the 3.0 final release.

8. and 9. are plugin developers' issues. The Eclipse team does have naming conventions for plugins, but it cannot enforce them upon external plugin developers.

10. If a plugin has a corresponding feature, you may use the Help / Software Updates / Manage Configuration page.

11. Again, this is the plugin developer's responsibility. Each plugin should have a 'welcome' page detailing its functionality; this is clearly stated in the Eclipse guides.

12. See 11

13. Please use the 'Error Log' view from the 'PDE Runtime' plugin. I have to repeat that documentation, contrary to popular opinion, is a useful thing.
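Two quick sketches to back up points 1 and 5 above. First, the link-file mechanism from point 1, as I understand it: you drop a *.link file into the links/ folder of the Eclipse install, pointing at a directory that itself contains an eclipse/plugins subdirectory (the file name and path below are made up):

    # eclipse/links/shared-plugins.link -- the file name is arbitrary
    # The target must contain an eclipse/plugins (and optionally an
    # eclipse/features) subdirectory holding the shared plugins.
    path=/home/adrian/eclipse-extensions

Second, the on-demand classloading from point 5, assuming the Eclipse 3.0 OSGi runtime and code running inside the platform (the plugin and class names are hypothetical):

    import org.eclipse.core.runtime.Platform;
    import org.osgi.framework.Bundle;

    public class LazyLoadDemo {
        public static Object instantiate() throws Exception {
            // The plugin's own classloader resolves the class on first use;
            // nothing is loaded at JVM startup from a global classpath.
            Bundle bundle = Platform.getBundle("com.example.someplugin");
            Class clazz = bundle.loadClass("com.example.someplugin.SomeClass");
            return clazz.newInstance();
        }
    }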

Please make the necessary corrections to your article. The site has a decent Google PageRank and is referenced in several news weblogs, so it has significant public exposure.

Sincerely,

Adrian Spinei

Written by Adrian

May 12th, 2004 at 9:49 pm

Posted in Articles


Two of the Eclipse books to appear in 2004


Eric Clayberg and Dan Rubel, both from Instantiations (the company that develops and distributes SWT-Designer and Codepro, two very good payware Eclipse plugins), are preparing the book “Eclipse: Building Commercial Quality Plugins”, which should be published by Addison-Wesley in Q2 2004. Draft chapters are available at qualityeclipse.com.

Eclipse powered is a site by Ed Burnette, co-author of “Eclipse in Action” from Manning. The new book will be targeted at building applications architected around RCP (the Rich Client Platform). Unfortunately, the project is currently on hold (no reference to it on Manning's site either!), but the site has a nice, albeit short, collection of links to RCP resources. There is also an associated SourceForge project with four example plugins, though nothing very “rich-client” yet. I hope it's going to be published this year, because RCP is a very hot topic. Next week I'll have my first RCP assignment: an integration feasibility study to establish whether we're going to use RCP as an alternative container for the supply chain & procurement modules of an ERP targeted at agro-industrial customers (the current container is a simple collection of tabs and a menu). It's going to be a fun week :)

Written by Adrian

January 31st, 2004 at 6:50 pm

Posted in Articles


… want to know what trouble means for a distributed project?


“The project was complex: the USS Monitor had 47 different patentable inventions on board at launch. (Think you have integration problems?) When the pieces arrived at the shipyard from the foundries, they fit together poorly and much retrofitting had to be done. This affected the schedule and standards of quality. Thus, there was a rushed final integration effort and a couple of trial-run retrofits. Although the ship's first battle in 1862 against the larger but less maneuverable ironclad CSS Virginia was successful (both sides declared victory, and wooden warships became a thing of the past), the USS Monitor sank a few months later on New Year's Eve while under tow in rough seas.”

Quote from a must-read article: “Distributed Development Lessons Learned” by Michael Turnlund, Cisco Systems, ACM Queue vol. 1, no. 9, December/January 2003-2004.

I have only recently discovered ACM Queue (via Miguel) and think it's a great resource. I have also found out that a new free, limited account is available on the ACM Digital Library. It nicely complements my IEEE Computer subscription (although I never have the time to read everything I want to, it's a warm, nice, fuzzy feeling to know that more and more articles are available :) ).

Written by Adrian

January 31st, 2004 at 10:36 am

Posted in Articles
