Netuality

Taming the big, bad, nasty websites

Archive for the ‘Google’ tag

Google’s Map/Reduce patent and impact on Hadoop: none expected

From the GigaOm analysis:

Fortunately, for them, it seems unlikely that Google will take to the courts to enforce its new intellectual property. A big reason is that “map” and “reduce” functions have been part of parallel programming for decades, and vendors with deep pockets certainly could make arguments that Google didn’t invent MapReduce at all.

Should Hadoop come under fire, any defendants (or interveners like Yahoo and/or IBM) could have strong technical arguments over whether the open-source Hadoop even is an infringement. Then there is the question of money: Google has been making plenty of it without the patent, so why risk the legal and monetary consequences of losing any hypothetical lawsuit? Plus, Google supports Hadoop, which lets university students learn webscale programming (so they can become future Googlers) without getting access to Google’s proprietary MapReduce language.

[...]

A Google spokeswoman emailed this in response to our questions about why Google sought the patent, and whether or not Google would seek to enforce its patent rights, attributing it to Michelle Lee, Deputy General Counsel:

“Like other responsible, innovative companies, Google files patent applications on a variety of technologies it develops. While we do not comment about the use of this or any part of our portfolio, we feel that our behavior to date has been inline with our corporate values and priorities.”

From Ars Technica:

Hadoop isn’t the only open source project that uses MapReduce technology. As some readers may know, I’ve recently been experimenting with CouchDB, an open source database system that allows developers to perform queries with map and reduce functions. Another place where I’ve seen MapReduce is Nokia’s QtConcurrent framework, an extremely elegant parallel programming library for Qt desktop applications.

It’s unclear what Google’s patent will mean for all of these MapReduce adopters. Fortunately, Google does not have a history of aggressive patent enforcement. It’s certainly possible that the company obtained the patent for “defensive” purposes. Like virtually all major software companies, Google is frequently the target of patent lawsuits. Many companies in technical fields attempt to collect as many broad patents as they can so that they will have ammunition with which to retaliate when they are faced with patent infringement lawsuits.

Google’s MapReduce patent raises some troubling questions for software like Hadoop, but it looks unlikely that Google will assert the patent in the near future; Google itself uses Hadoop for its Code University program.

Even if Google takes the unlikely course of action and does decide to target Hadoop users with patent litigation, the company would face significant resistance from the open source project’s deep-pocketed backers—including IBM, which holds the industry’s largest patent arsenal.

Another dimension of this issue is the patent’s validity. On one hand, it’s unclear if taking age-old principles of functional software development and applying them to a cluster constitutes a patentable innovation.

Still nothing from the big analysts, Gartner and the gang…

Written by Adrian

January 22nd, 2010 at 7:39 pm

Posted in Articles

Google: sorry, but Lisp/Ruby/Erlang not on the menu

Yes, language propaganda again. Ain’t it fun?

Here comes a nice quote from the latest Steve Yegge post (read the whole thing if you have the time; it’s both fun and educational, at least for me). So, there:

I made the famously, horribly, career-shatteringly bad mistake of trying to use Ruby at Google, for this project. And I became, very quickly, I mean almost overnight, the Most Hated Person At Google. And, uh, and I’d have arguments with people about it, and they’d be like Nooooooo, WHAT IF… And ultimately, you know, ultimately they actually convinced me that they were right, in the sense that there actually were a few things. There were some taxes that I was imposing on the systems people, where they were gonna have to have some maintenance issues that they wouldn’t have. [...] But, you know, Google’s all about getting stuff done.

[...]

Is it allowed at Google to use Lisp and other languages?

No. No, it’s not OK. At Google you can use C++, Java, Python, JavaScript… I actually found a legal loophole and used server-side JavaScript for a project.

Mmmmm … ’kay?

Written by Adrian

May 29th, 2008 at 12:35 am

Posted in Tools

Java going down, Python way up, and more …

According to O’Reilly Radar, sales of Java books have declined by almost 50% over the last 4 years. C# sells more books every year and will probably draw level with Java in 2008. JavaScript is on the rise (thanks to AJAX, for sure) and PHP is in a surprising decline (although the job statistics indicate quite the contrary).

In 2007, more Ruby books were sold than Python books. The article classifies Ruby as a “mid-major programming language” and Python as a “mid-minor programming language”. However, after the announcement of Google App Engine, the number of Python downloads from ActiveState tripled in May. This should become visible in the book sales statistics pretty soon.

Written by Adrian

May 24th, 2008 at 5:36 pm

Posted in Tools

memcached vs tugela vs memcachedb

This presentation was planned for an older Wurbe event, but as this never quite happened in the last 4 months I am publishing it now, before it becomes totally obsolete.

My original contribution here is a comparison between the original memcached server from Danga and the tugela fork from the MediaWiki programmers. I’ve also tried memcachedb, but the pre-1.0 version (from Google Code) I tested in November 2007 was quite unstable and unpredictable.

In a nutshell, these memcache variants use BerkeleyDB instead of the in-memory slab allocator. There are 2 direct consequences:

  • when the memory is large enough for the whole cache, database-backed servers will be slower (my tests showed 10-15%, which might be tolerable, or not, for your app)
  • when you’ve got lots of data to cache and your server’s memory is low, relying on bdb is significantly better than letting the swap mechanism do its job (in my benchmarks it was up to 10 times faster, especially under very high concurrency)

Tugela will prove especially useful when running it on virtualized servers with very low memory.

My tests were performed with the “Tummy” Python client and Stackless for the multithreaded version. In the coming weeks I’ll update the benchmarks for memcachedb 1.0.x, and I promise never ever to wait 4 months for a presentation again …

Written by Adrian

March 17th, 2008 at 12:38 am

SEO eye for the Tapestry guy

One of my previous customers has a Jakarta Tapestry [3.0.x] based site. The site is subscription-based, but it also has a public area: if you browse each and every link you should be able to view a few thousand [dynamically generated] pages. No SEO* consulting was involved in building the site. To put it simply: I got some specs and HTML templates, then developed, deployed, bugfixed and hasta la vista…

More than 6 months later [!], the site is still alive, which is good, but it doesn’t really sport impressive traffic figures or growth. Basically, all the traffic it gathers seems to come from existing subscribers and paid ads, with very little organic traffic and almost zero traffic from major engines such as Google (although it was submitted to quite a lot of engines and directories).
Lo and behold, there must be something really nasty going on, since a quick query on Google with site:www.javaguysite.com** gives only one freaking result: the home page. Which means Google has indexed ONLY the entry page, and the same thing happens with all the other major search indexes. And guess what: nobody is going to find the content if it isn’t even indexed by search engines.

Making friends with your public URLs

The problem: Tapestry URLs are too ugly for search engines. Looking at the source of my navigation menu, I found little beasts such as

http://www.javaguysite.com/b2b?service=direct/1/Home/$Border.$Header.$LoggedMenu.$CheckedMenu$1.$DirectLink&sp=SSearchOffers#a

For a Tapestry programmer it is a simple direct link from inside a component embedded in other components, but for search engine bots it is an overly complex link to a dynamic page, which will NOT be followed. Thus, if you want these little buggers to crawl all over your site and index all the pages, make ’em think it’s a simple static site such as:

http://www.javaguysite.com/b2b/bla/bla/bla/SearchOffers.html

In SEO consultants’ slang, these are called “friendly URLs”***.

You don’t have to make all your links friendlier. For instance, there is no need to touch the pages available only to subscribers, as they’ll never be available for public searching. In the public area, create friendly URLs only for those pages containing relevant content.
The method is called URL rewriting. Rewriting means that the web server transforms the request URL behind the scenes, using regular expressions, in a totally transparent manner. Thus, the client browser or the bot “thinks” it reaches a certain URL, but a different address is sent to the servlet container. The rewriting is performed either:

1. with a servlet filter such as urlrewrite,

or

2. with mod_rewrite in Apache. I already use Apache as a proxy server, in order to perform transparent and efficient gzip-ing on the fly as described in one of my previous blog posts. So I only had to add the mod_rewrite rules and I was ready to go.

Only minor syntax differences exist between the regular expressions in the filter and those in the Apache module. I was able to switch seamlessly between the two, as I use the servlet filter in the development environment and Apache proxying in production.
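To make the rewriting idea concrete, here is a minimal sketch in plain Java of what such a rule does, independent of both urlrewrite and mod_rewrite. The class name, the regular expression and the target URL format (a page-service style address) are assumptions made for illustration; the real rules depend on how your pages and services are laid out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: not the urlrewrite filter or mod_rewrite configuration itself.
class FriendlyUrlSketch {
    // Hypothetical rule: /b2b/<any folders>/<PageName>.html maps to
    // /b2b?service=page/<PageName> (target format is an assumption).
    private static final Pattern FRIENDLY =
            Pattern.compile("^/b2b/(?:[^/]+/)*([A-Za-z0-9]+)\\.html$");

    /** Returns the internal URL for a friendly path, or the path unchanged if the rule does not match. */
    static String rewrite(String path) {
        Matcher m = FRIENDLY.matcher(path);
        if (!m.matches()) {
            return path; // leave everything else (CSS, images, private pages) alone
        }
        return "/b2b?service=page/" + m.group(1);
    }

    public static void main(String[] args) {
        System.out.println(rewrite("/b2b/bla/bla/bla/SearchOffers.html"));
        System.out.println(rewrite("/b2b/styles/site.css"));
    }
}
```

The bot only ever sees the first form; the servlet container only ever sees the second. Paths that do not match the rule pass through untouched, which is how the subscriber-only pages can keep their original URLs.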

The devil is in the details

Now we’re sure that dynamic pages from the public area will be searchable after the Google bot crawls them. Problem is: all pages of a single category will have the same title. For instance, “Company details” for all the pages containing … umm, company details. And when you have thousands of companies in the database, that makes a helluva lot of pages with the same title! Besides, keywords contained in the page title play an important role in how the page ranks for those keywords. The conclusion: make your page titles as customised as possible. Put in not only the page type, but also relevant data from the page content; in our case, the company name and, why not, the city where the business is located. This is easy with Tapestry:

<html jwcid="@Shell" title="ognl:page.pageTitle">

and then define a customized

public String getPageTitle();

in all the Page classes (with maybe an abstract getPageTitle in the base page class, supposing you have one defined in the project, which one normally should).
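As a rough sketch of what such a customised title method might look like, here is a plain Java version. The class names are hypothetical stand-ins: in a real project the base class would extend the Tapestry page class, and the company name and city would come from the page’s own properties rather than a constructor.

```java
// Stand-in for the project's base page class (assumption: in a real
// project this would extend the framework's page class).
abstract class BasePageSketch {
    /** Each page supplies its own title, which the template reads via OGNL. */
    public abstract String getPageTitle();
}

// Hypothetical "Company details" page carrying the data shown on it.
class CompanyDetailsPageSketch extends BasePageSketch {
    private final String companyName;
    private final String city;

    CompanyDetailsPageSketch(String companyName, String city) {
        this.companyName = companyName;
        this.city = city;
    }

    @Override
    public String getPageTitle() {
        // Start with the page type, then append whatever content data is available.
        StringBuilder title = new StringBuilder("Company details");
        if (companyName != null && !companyName.trim().isEmpty()) {
            title.append(": ").append(companyName.trim());
        }
        if (city != null && !city.trim().isEmpty()) {
            title.append(" (").append(city.trim()).append(")");
        }
        return title.toString();
    }
}

class PageTitleDemo {
    public static void main(String[] args) {
        System.out.println(new CompanyDetailsPageSketch("Acme Ltd", "Lyon").getPageTitle());
    }
}
```

Falling back to the bare page type when the data is missing keeps the title sensible even for incomplete records.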

The same reasoning applies to the page keywords and description metatags, as they are taken into account by most search engines. Use them, make them dynamic and insert content-relevant data. Don’t just rely on generic keywords, as the competition for these is huge: a bit of keyword long tail can do wonders. Don’t overdo it and don’t try to trick Google, as you may get some nasty surprises in the process. And if you can afford it, get some SEO consulting for the content of the keywords and titles.
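A dynamic description can be derived from the page content in much the same way as the title. The helper below is a minimal sketch under my own assumptions: the class and method names are hypothetical, and the maximum length is simply whatever the caller decides, not a value prescribed by any search engine.

```java
// Hypothetical helper: builds a description metatag value from page text.
class MetaDescriptionSketch {
    /**
     * Collapses whitespace and, if the text is longer than maxLength,
     * cuts it at the last space at or before maxLength and appends "...".
     */
    static String describe(String text, int maxLength) {
        String collapsed = text.trim().replaceAll("\\s+", " ");
        if (collapsed.length() <= maxLength) {
            return collapsed;
        }
        int cut = collapsed.lastIndexOf(' ', maxLength);
        if (cut <= 0) {
            cut = maxLength; // no space found: hard cut
        }
        return collapsed.substring(0, cut) + "...";
    }

    public static void main(String[] args) {
        System.out.println(describe("Acme Ltd   sells\nindustrial pumps in Lyon", 20));
    }
}
```

The page class could then expose the result through a getter, just like the title, so that the template picks it up for the description metatag.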

There’s another rather obsolete but nevertheless important HTML tag: the H1. Who on Earth needs H1 when you’ve got CSS and can name your headings according to a coherent naming scheme? Well, apparently Google needs H1 tags, reads H1 tags and uses the content of H1 tags to compute search relevancy. So make sure to restyle H1 in your CSS and use it inside the page content. People seem to believe it has something to do with a sort of HTML semantics …

That’s it, at least from a basic technical point of view. For a longer discussion about SEO, read Roger Johansson’s Basics of search engine optimisation as well as the massive amount of literature on the subject that is freely available on the web. Just … search for it.

*SEO = Search Engine Optimization.

**Names changed to protect the innocent.

***Supposedly, Tapestry 3.1, currently in alpha stage, has much friendlier URLs than 3.0. However, don’t use an alpha API on a production site.

Written by Adrian

September 25th, 2005 at 8:21 pm

Posted in AndEverythingElse
