Netuality

Taming the big bad websites

Archive for the ‘Google’ tag

Comic strips and contextual advertising

one comment

As seen today on Google Reader. A strip is a strip is a strip is a strip:

Written by Adrian

June 13th, 2010 at 10:44 am

Posted in AndEverythingElse


Linkdump: Coop, HBase performance and a bit of Warcraft

one comment

Riptano is to Cassandra what Cloudera is to Hadoop or Percona to MySQL. Mmmkey?

A great, insightful post from Pingdom (as usual) lets us take a peek behind the doors of the largest web sites in the world, just by reading selected bits from their respective developer blogs.

Yahoo cut data-center cooling costs from 50 cents for every dollar spent on power to only one cent per dollar. The figure comes from their newest Yahoo Computing Coop data center, built in Lockport, New York.

The data center operates with no chillers, and will require water for only a handful of days each year. Yahoo projects that the new facility will operate at a Power Usage Effectiveness (PUE) of 1.1, placing it among the most efficient in the industry. [...]

If it looks like a chicken coop, it’s because some of the design principles were adapted from …. well, chicken coops. “Tyson Foods has done research involving facilities with the heat source in the center of the facility, looking at how to evacuate the hot air,” said Noteboom. “We applied a lot of similar thought to our data center.”

The Lockport site is ideal for fresh air cooling, with a climate that allows Yahoo to operate for nearly the entire year without using air conditioning for its servers.

The High Scalability blog dissects a paper describing Dapper, Google’s tracing system, used to instrument all the components of a software system in order to understand its behavior. Immensely interesting:

As you might expect, Google has produced an elegant and well thought out tracing system. In many ways it is similar to other tracing systems, but it has that unique Google twist. A tree structure, probabilistically unique keys, sampling, emphasising common infrastructure insertion points, technically minded data exploration tools, a global system perspective, MapReduce integration, sensitivity to index size, enforcement of system wide invariants, an open API—all seem very Googlish.
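Not Dapper’s actual API, obviously, but the basic shape is easy to picture: a span carries a probabilistically unique trace id, a parent pointer that gives the tree structure, and a coin-flip sampling decision that keeps the index size down. A toy sketch in Java, with every name made up:

import java.util.Random;

// Toy illustration of the concepts listed above – not Google's API.
public final class ToySpan {
    private static final Random RNG = new Random();

    final long traceId;    // probabilistically unique id shared by the whole trace tree
    final long spanId;     // id of this node
    final long parentId;   // 0 for the root span – this is what gives the tree structure
    final boolean sampled; // only sampled traces get logged, keeping the index small

    private ToySpan(long traceId, long spanId, long parentId, boolean sampled) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentId = parentId;
        this.sampled = sampled;
    }

    static ToySpan root(double sampleRate) {
        long id = RNG.nextLong();
        return new ToySpan(id, id, 0L, RNG.nextDouble() < sampleRate);
    }

    // propagated across RPC boundaries by the common infrastructure insertion points
    ToySpan child() {
        return new ToySpan(traceId, RNG.nextLong(), spanId, sampled);
    }

    public static void main(String[] args) {
        ToySpan root = ToySpan.root(0.01); // sample roughly 1% of traces
        ToySpan rpc = root.child();        // a downstream call inherits the decision
        System.out.println(rpc.traceId + " sampled=" + rpc.sampled);
    }
}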

On my favorite blog :) – HStack.org – Andrei wrote a great post about real-life performance testing of HBase:

The numbers are the tip of the iceberg; things become really interesting once we start looking under the hood, and interpreting the results.

When investigating performance issues you have to assume that “everybody lies”. It is crucial that you don’t stop at a simple capacity or latency result; you need to investigate every layer: the performance tool, your code, their code, third-party libraries, the OS and the hardware. Here’s how we went about it:

The first potential liar is your test, then your test tool – they could both have bugs so you need to double-check.

But the most interesting distributed system of the week is World of Warcraft. Ars Technica describes a tour of the Blizzard campus and here’s a peek at the best NOC screen ever:

For the hooorde!

Written by Adrian

April 27th, 2010 at 10:35 pm

Posted in Linkdump


Google’s Map/Reduce patent and impact on Hadoop: none expected

one comment

From the GigaOm analysis:

Fortunately, for them, it seems unlikely that Google will take to the courts to enforce its new intellectual property. A big reason is that “map” and “reduce” functions have been part of parallel programming for decades, and vendors with deep pockets certainly could make arguments that Google didn’t invent MapReduce at all.

Should Hadoop come under fire, any defendants (or interveners like Yahoo and/or IBM) could have strong technical arguments over whether the open-source Hadoop even is an infringement. Then there is the question of money: Google has been making plenty of it without the patent, so why risk the legal and monetary consequences of losing any hypothetical lawsuit? Plus, Google supports Hadoop, which lets university students learn webscale programming (so they can become future Googlers) without getting access to Google’s proprietary MapReduce language.

[...]

A Google spokeswoman emailed this in response to our questions about why Google sought the patent, and whether or not Google would seek to enforce its patent rights, attributing it to Michelle Lee, Deputy General Counsel:

“Like other responsible, innovative companies, Google files patent applications on a variety of technologies it develops. While we do not comment about the use of this or any part of our portfolio, we feel that our behavior to date has been inline with our corporate values and priorities.”

From Ars Technica:

Hadoop isn’t the only open source project that uses MapReduce technology. As some readers may know, I’ve recently been experimenting with CouchDB, an open source database system that allows developers to perform queries with map and reduce functions. Another place where I’ve seen MapReduce is Nokia’s QtConcurrent framework, an extremely elegant parallel programming library for Qt desktop applications.

It’s unclear what Google’s patent will mean for all of these MapReduce adopters. Fortunately, Google does not have a history of aggressive patent enforcement. It’s certainly possible that the company obtained the patent for “defensive” purposes. Like virtually all major software companies, Google is frequently the target of patent lawsuits. Many companies in technical fields attempt to collect as many broad patents as they can so that they will have ammunition with which to retaliate when they are faced with patent infringement lawsuits.

Google’s MapReduce patent raises some troubling questions for software like Hadoop, but it looks unlikely that Google will assert the patent in the near future; Google itself uses Hadoop for its Code University program.

Even if Google takes the unlikely course of action and does decide to target Hadoop users with patent litigation, the company would face significant resistance from the open source project’s deep-pocketed backers—including IBM, which holds the industry’s largest patent arsenal.

Another dimension of this issue is the patent’s validity. On one hand, it’s unclear if taking age-old principles of functional software development and applying them to a cluster constitutes a patentable innovation.
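For what it’s worth, those “age-old principles” boil down to two higher-order functions: a map step that turns each input record into key/value pairs and a reduce step that folds the values per key. Stripped of all the distribution machinery, a word count in that style fits in a few lines of Java (a sequential toy, nothing Hadoop- or Google-specific):

import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ToyWordCount {
    public static void main(String[] args) {
        List<String> docs = Arrays.asList("the quick brown fox", "the lazy dog");

        // "map": emit a (word, 1) pair for every word in every document
        List<Map.Entry<String, Integer>> pairs = new ArrayList<Map.Entry<String, Integer>>();
        for (String doc : docs) {
            for (String word : doc.split("\\s+")) {
                pairs.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
            }
        }

        // "reduce": group by key and sum the counts
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (Map.Entry<String, Integer> pair : pairs) {
            Integer previous = counts.get(pair.getKey());
            counts.put(pair.getKey(), previous == null ? pair.getValue() : previous + pair.getValue());
        }

        System.out.println(counts); // {brown=1, dog=1, fox=1, lazy=1, quick=1, the=2}
    }
}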

Still nothing from the big analysts, Gartner and the gang…

Written by Adrian

January 22nd, 2010 at 7:39 pm

Posted in Articles


Google: sorry, but Lisp/Ruby/Erlang not on the menu

7 comments

Yes, language propaganda again. Ain’t it fun?

Here comes a nice quote from the latest Steve Yegge post (read it in full if you have the time – it’s both fun and educational, at least for me). So, here it is:

I made the famously, horribly, career-shatteringly bad mistake of trying to use Ruby at Google, for this project. And I became, very quickly, I mean almost overnight, the Most Hated Person At Google. And, uh, and I’d have arguments with people about it, and they’d be like Nooooooo, WHAT IF… And ultimately, you know, ultimately they actually convinced me that they were right, in the sense that there actually were a few things. There were some taxes that I was imposing on the systems people, where they were gonna have to have some maintenance issues that they wouldn’t have. [...] But, you know, Google’s all about getting stuff done.

[...]

Is it allowed at Google to use Lisp and other languages?

No. No, it’s not OK. At Google you can use C++, Java, Python, JavaScript… I actually found a legal loophole and used server-side JavaScript for a project.

Mmmmm … key ?

Written by Adrian

May 29th, 2008 at 12:35 am

Posted in Tools


Java going down, Python way up, and more …

8 comments

According to O’Reilly Radar, sales of Java books have declined by almost 50% over the last four years. C# is selling more books year after year and will probably pull level with Java in 2008. JavaScript is on the rise (due to AJAX, for sure) and PHP is surprisingly on a downward path (although the job statistics indicate quite the contrary).


In 2007, more Ruby books were sold than Python books. In their article they describe Ruby as a “mid-major programming language” and Python as a “mid-minor programming language”. However, after the announcement of Google App Engine, Python downloads from ActiveState tripled in May. This should show up in the book sales statistics pretty soon.

Written by Adrian

May 24th, 2008 at 5:36 pm

Posted in Tools


memcached vs tugela vs memcachedb

2 comments

This presentation was planned for an earlier Wurbe event, but since that never quite happened over the last four months, I am publishing it now, before it becomes totally obsolete.

My original contribution here is a comparison between the original memcached server from Danga and the tugela fork from the MediaWiki programmers. I also tried memcachedb, but the pre-1.0 version (from Google Code) available in November 2007 was quite unstable and unpredictable.

In a nutshell, these memcached variants use BerkeleyDB instead of the in-memory slab allocator. There are two direct consequences:

  • when the memory is large enough for the whole cache, the database-backed servers will be slower (my tests showed 10–15%, which might – or might not – be tolerable for your app)
  • when you’ve got lots of data to cache and your server’s memory is low, relying on bdb is significantly better than letting the swap mechanism do its job (in my benchmarks – see the sketch just below – the difference can be up to 10 times, especially under very high concurrency)
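The benchmarks themselves are nothing fancy: time a loop of set/get round-trips at a given value size and concurrency level, then repeat with a data set larger than RAM. My tests used a Python client (see below), but just to show the shape of such a test, here is a rough equivalent with the classic Danga Java client – the client API is quoted from memory, so double-check it against the version you use:

import com.danga.MemCached.MemCachedClient;
import com.danga.MemCached.SockIOPool;

public class CacheBench {
    public static void main(String[] args) {
        // point this at memcached, tugela or memcachedb – same protocol, same client
        SockIOPool pool = SockIOPool.getInstance();
        pool.setServers(new String[] { "127.0.0.1:11211" });
        pool.initialize();

        MemCachedClient client = new MemCachedClient();
        byte[] value = new byte[1024]; // 1 KB payload; grow it to push the cache past RAM

        int iterations = 100000;
        long start = System.currentTimeMillis();
        for (int i = 0; i < iterations; i++) {
            client.set("key" + i, value);
            client.get("key" + (i / 2)); // mix writes with reads of older keys
        }
        long elapsed = System.currentTimeMillis() - start;
        System.out.println((2L * iterations * 1000 / elapsed) + " ops/sec");

        pool.shutDown(); // release the pool's sockets and maintenance thread
    }
}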

Tugela will prove especially useful when running on virtualized servers with very little memory.

My tests were performed with the “Tummy” Python client and Stackless for the multithreaded version. In one of the coming weeks I’ll update the benchmarks for memcachedb 1.0.x – and I promise never, ever to sit on a presentation for four months again …

Written by Adrian

March 17th, 2008 at 12:38 am

SEO eye for the Tapestry guy

leave a comment

One of my previous customers has a Jakarta Tapestry [3.0.x] based site. The site is subscription-based, but it also has a public area – if you browse each and every link you should be able to view a few thousand [dynamically generated] pages. No SEO* consulting was involved in building the site. To put it simply: I got some specs and HTML templates, then developed, deployed, bugfixed and hasta la vista…

More than 6 months later [!], the site is still alive, which is good, but it doesn’t really show impressive traffic figures or growth. Basically, all the traffic it gets seems to come from existing subscribers and paid ads, with very little organic traffic and almost none from major engines such as Google (although it was submitted to quite a lot of engines and directories).
Lo and behold, there must be something really nasty going on, since a quick query on Google with site:www.javaguysite.com** gives only one freaking result: the home page. Which means Google has indexed ONLY the entry page – and the same thing happens with all the other major search indexes. And guess what: nobody is going to find the content if it isn’t even indexed by search engines.

Making friends with your public URLs

The problem: Tapestry URLs are too ugly for search engines. Looking at the source of my navigation menu, I found little beasts such as

http://www.javaguysite.com/b2b?service=direct/1/Home/$Border.$Header.$LoggedMenu.$CheckedMenu$1.$DirectLink&sp=SSearchOffers#a

For a Tapestry programmer this is a simple direct link from inside a component embedded in other components, but for search engine bots it is an overly complex link to a dynamic page, which will NOT be followed. Thus, if you want these little buggers to crawl all over your site and index all the pages, make ’em think it’s a simple static site, such as:

http://www.javaguysite.com/b2b/bla/bla/bla/SearchOffers.html

In SEO consultants’ slang, these are called “friendly URLs”***.

You don’t have to make all your links friendlier. For instance, there’s no need to touch the pages available only to subscribers, as they’ll never be available for public searching. In the public area, create friendly URLs only for the pages containing relevant content.
The method is called URL rewriting. Rewriting means that the web server transforms the request URL behind the scenes, using regular expressions, in a totally transparent manner. Thus, the client browser or the bot “thinks” it reaches a certain URL, while a different address is sent to the servlet container. The rewriting is performed either by:

1. using a servlet filter such as urlrewrite.

or

2. with mod_rewrite in Apache. I use Apache as a proxy server in order to perform transparent and efficient gzip-ing on the fly, as described in one of my previous blog posts, so I only had to enable mod_rewrite and I was ready to go.

Only minor syntax differences exist between the regular expressions in the filter and the Apache module. I was able to switch seamlessly between the two, as I use the servlet filter in the development environment and Apache proxying in production.
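For the curious, there is no magic behind such a filter: match the friendly URL with a regular expression and forward internally to the real Tapestry URL. A hand-rolled sketch of the idea (the real urlrewrite filter is driven by an XML rule file instead, and the URL patterns below are invented – copy the exact service URLs from your own generated links):

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;

// Toy rewriting filter: /b2b/offers/SearchOffers.html -> the Tapestry page service URL.
public class FriendlyUrlFilter implements Filter {
    private static final Pattern FRIENDLY = Pattern.compile("/offers/([A-Za-z]+)\\.html$");

    public void init(FilterConfig config) {
    }

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        Matcher matcher = FRIENDLY.matcher(request.getRequestURI());
        if (matcher.find()) {
            // The bot sees the static-looking URL; Tapestry sees its usual service URL
            // (illustrative target – adjust it to what your application actually expects).
            String target = "/b2b?service=page&page=" + matcher.group(1);
            request.getRequestDispatcher(target).forward(req, res);
        } else {
            chain.doFilter(req, res);
        }
    }

    public void destroy() {
    }
}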

The devil is in the details

Now we’re sure that the dynamic pages from the public area will be searchable once the Google bot crawls them. The problem is: all pages of a single category will have the same title – for instance “Company details” for all the pages containing… umm, company details. And when you have thousands of companies in the database, that makes a helluva lot of pages with the same title! Besides, keywords contained in the page title play an important role in the ranking for those keywords. The conclusion: make your page titles as customised as possible – include not only the page type, but also relevant data from the page content; in our case, the company name and, why not, the city where the business is located. This is easy with Tapestry:

<html jwcid="@Shell" title="ognl:page.pageTitle">

and then define a customized

public String getPageTitle();

in all the Page classes (with perhaps an abstract getPageTitle in the base page class, assuming you have one defined in the project, which you normally should).
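For example, a details page can build its title from its own properties – a minimal sketch, with hypothetical property names:

import org.apache.tapestry.html.BasePage;

// Hypothetical details page; companyName/companyCity are page properties
// populated from the listener or the page specification as usual.
public abstract class CompanyDetailsPage extends BasePage {

    public abstract String getCompanyName();

    public abstract String getCompanyCity();

    public String getPageTitle() {
        return "Company details - " + getCompanyName() + " (" + getCompanyCity() + ")";
    }
}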

The same reasoning applies to the page keywords and description meta tags, as they are taken into account by most search engines. Use them, make them dynamic and insert content-relevant data. Don’t rely only on generic keywords, as the competition on these is huge: a bit of keyword long tail can do wonders. Don’t overdo it and don’t try to trick Google, as you may get some nasty surprises in the process. And if you can afford it, get some SEO consulting for the keyword and title content.

There’s another rather obsolete but nevertheless important HTML tag: H1. Who on Earth needs H1 when you’ve got CSS and can name your headings according to a coherent naming scheme? Well, apparently Google needs H1 tags, reads H1 tags and uses the content of H1 tags to compute search relevancy. So make sure to restyle H1 in your CSS and use it inside the page content. People seem to believe it has something to do with a sort of HTML semantics…

That’s it, at least from a basic technical point of view. For a longer discussion about SEO, read Roger Johansson’s Basics of search engine optimisation, as well as the massive amount of freely available literature on the web concerning this subject. Just… search for it.

*SEO = Search Engine Optimization.

**Names changed to protect the innocents.

***Supposedly, Tapestry 3.1, currently in alpha, has much friendlier URLs than 3.0. However, don’t use an alpha API on a production site.

Written by Adrian

September 25th, 2005 at 8:21 pm

Posted in AndEverythingElse


JCS: the good, the bad and the undocumented

one comment

Java Caching System is one of the mainstream open-source and free Java caches*, along with OSCache, EHCache and JBossCache. Choosing JCS could be the subject of an article in itself, since this API has a vastly undeserved reputation of being a buggy, slow cache. Exactly this reputation motivated the development of EHCache, which is in fact a fork of JCS. But JCS has evolved a lot lately and is now a perfectly valid alternative for production; it still has a few occasional bugs, but nothing really bothersome. I've recently had the interesting experience of cleaning up and tuning a website powered by JCS. This dynamic Java-based site gets a healthy traffic of 0.6-1.2 million hits/day, with 12,000-25,000 unique visitors daily, and caching has greatly improved its performance. This article is a collection of tips and best practices not (yet?) covered by the official JCS documentation.
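For readers who haven't touched JCS yet, day-to-day usage is pleasantly boring: a cache region per type of data, plus put/get/remove. A minimal sketch against the 1.x API (the region name and values are made up; the region itself must be declared in cache.ccf):

import org.apache.jcs.JCS;
import org.apache.jcs.access.exception.CacheException;

public class CompanyCacheExample {
    public static void main(String[] args) throws CacheException {
        // obtain the access object for the "companies" region defined in cache.ccf
        JCS cache = JCS.getInstance("companies");

        cache.put("acme", "Acme Corp, Lockport");   // cache a value under a key
        String cached = (String) cache.get("acme"); // returns null on a miss
        System.out.println(cached);

        cache.remove("acme");                       // explicit invalidation
    }
}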

Where to download JCS

This is usually the first question when one wants to use JCS. Since JCS is not a 'real' Jakarta project, but a module of the Turbine framework, there is no download link available on the main site. If you search on Google, you'll see this question has popped up many times on different mailing lists and blogs, and it usually gets two kinds of answers, both IMHO wrong:

  • download the source of Turbine and you'll find JCS in the dependencies. No, you won't, because Turbine is built with Maven, which is supposed to automagically download all the needed dependencies and bring them to you on a silver platter. Meaning: tons of useless jars hidden somewhere in the murky depths of wherever Maven thinks is a nice install location. Uhh.
  • build it from scratch. Another sadistic piece of advice, given that JCS is also built with Maven. So you'll not only need to check out the sources from CVS, but also install Maven. Then try to build JCS. And eventually give up. In my case, for instance, I installed the monster^H^H^H^H^H^H wonderful build tool, then ran 'maven jar'. Instead of the expected result [you know, building the jar!], Maven performed a series of operations like running unit tests, washing teeth, cooking a turkey. Well, I suppose that's what it was doing, because I couldn't read the huge gobs of text scrolling quickly across the screen. At the end, it failed miserably, with no logical explanation (too many explanations being the modern equivalent of unexplained). So I gave up. Again.

Fortunately, some kind souls at Jakarta (think of these developers as a sort of secret congregation) provide clandestine 'demavenized' binary builds of the latest code in obscure places; for JCS, the location is here. I used the last 1.1 build without problems for a few weeks and I strongly recommend it.

Using the auxiliary disk cache

There's a common misconception that one doesn't need no stinkin' disk cache. Even on the Hibernate site, the example JCS configuration has the auxiliary disk cache commented out. Maybe this comes from the fact that the JCS disk cache used to suffer from a memory leak (not true any more) or from the simplistic reasoning that disk access is inherently slower than memory access. It surely is, but at the same time it's probably much faster than some of the database queries that could benefit from caching.

Also, it is interesting to note that an incorrectly sized 'memory' cache will make the Java process overflow from main memory into swap. So you'll use the disk anyway, only in an un-optimized manner!

I wouldn't advise activating the auxiliary disk cache without limiting its size; otherwise, the cache file will grow indefinitely. Controlling the cache size is done with two parameters (MaxKeySize and OptimizeAtRemoveCount), for example:

jcs.auxiliary.DC.attributes.MaxKeySize=2500
jcs.auxiliary.DC.attributes.OptimizeAtRemoveCount=2500

MaxKeySize alone is not enough, since it only limits the number of keys pointing to values in the disk cache; in fact, removing a value from the disk cache only removes its key. The second parameter (OptimizeAtRemoveCount) tells the cache to recreate the file after a certain number of 'removes'. The new cache file keeps only the values corresponding to the remaining keys, thus cleaning out all obsolete values, and of course replaces the old cache file. The disk cache size and the remove count are, of course, subject to tuning in your own environment.

Tuning the memory shrinker

Although one of the JCS authors says that the shrinker “is rarely necessary”, it can come in handy, especially in memory-constrained environments or for really big caches. With one caveat: be careful to specify the MaxSpoolPerRun parameter (undocumented yet, but discussed on the mailing list), otherwise the shrinking process might lead to spikes in CPU usage. I am using the shrinker like this:

jcs.default.cacheattributes.UseMemoryShrinker=true
jcs.default.cacheattributes.ShrinkerIntervalSeconds=3600
jcs.default.cacheattributes.MaxSpoolPerRun=300

YMMV.

Cache control via servlet

Again undocumented, but people seem to know about it. The servlet class is org.apache.jcs.admin.servlet.JCSAdminServlet, but do not expect it to work out of the box! This servlet uses Velocity, thus you'll need to:

  • initialize Velocity before trying to access the servlet (or initialize it lazily, but then you'll have to modify the servlet source) – see the sketch after this list
  • copy the templates into the Velocity template location. The templates (JCSAdminServletDefault.vm and JCSAdminServletRegionDetail.vm) are not in the jar (bug? feature?), so you'll have to retrieve them from the CVS repository. For the moment, they are at this location.
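The simplest way I know to get that initialization done is a small context listener, registered in web.xml, so Velocity is ready before the admin servlet is first hit. A sketch only – the properties depend entirely on your setup:

import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import org.apache.velocity.app.Velocity;

// Runs once at webapp startup; register it in web.xml as a <listener>.
public class VelocityStartupListener implements ServletContextListener {

    public void contextInitialized(ServletContextEvent event) {
        try {
            // point the file loader at wherever you copied the .vm templates
            Velocity.setProperty("file.resource.loader.path",
                    event.getServletContext().getRealPath("/WEB-INF/velocity"));
            Velocity.init();
        } catch (Exception e) {
            throw new RuntimeException("Velocity initialization failed", e);
        }
    }

    public void contextDestroyed(ServletContextEvent event) {
    }
}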

These are my findings. I would have really appreciated having these few pieces of information before starting the cache tuning. If anybody finds this article useful and/or thinks it needs to be completed, write a comment, send an email, wave your hands. I'll try to come up with more details.

*For a complete list, see the corresponding section at Java-source.

Written by Adrian

February 17th, 2005 at 9:30 pm

Posted in Tools


XML descriptors in PHP frameworks – considered harmful

2 comments

No, I am not a seasoned PHP programmer and I do not intend to become one. But we do live in a harsh economy where all IT projects are worth considering, hence my occasional incursions into the world of PHP-driven websites.

I am not new to PHP either, but – coming from the Java world – I immediately felt the need for a serious MVC framework. Nobody wants to reinvent the wheel each time a new website is built. Just run the obvious “PHP MVC framework” query on Google and the result pages will be dominated by four open-source projects:

  • PHPMVC is probably the oldest project and implements a Model 2 front controller
  • Ambivalence declares itself a simple port of the Java Maverick project
  • Eocene is a “simple and easy to use OO web development framework for PHP and ASP.NET”, implementing MVC and a front controller
  • Phrame is a Texas Tech University project released under the LGPL, heavily inspired by Struts.

The choice is not easy. There are no examples of industrial-quality sites built with any of these frameworks (some may say there are no examples of industrial-quality sites built with PHP at all, but let's ignore these nasty people for now).

There are no serious comparisons of the four frameworks, either feature-wise or performance-wise. In the tradition of open-source projects, the documentation is rather scarce and the examples are “hello world”-isms. Yes, I am a bloody bastard for pointing out these aspects – since the authors are not paid to release these projects – and perhaps I could contribute some documentation myself. However, when under an aggressive schedule I find it easier to write my own framework than to understand other people's code and document it thoroughly.
That said, I have a nice hint for you. The first three frameworks use XML files for controller initialization (call it a “sitemap”, a “descriptor” or whatever; it's just a plain dumb XML file). So you can safely ignore them for a production environment.

Because the “controller” is nothing more than a glorified state machine, the succession of states and transitions (or “actions” or whatever) has to be persisted somewhere. XML is probably a nice choice for Java frameworks, where the files are parsed once and the application server keeps a pool of controller objects in memory.

But PHP is share-nothing: no object survives in memory between requests. The only way of keeping state is via the filesystem or the database, usually based on an ad-hoc generated unique key kept in the session cookie. Moreover, PHP's native serialization is really only comfortable with primitive variables (and arrays of them); a complex object such as the controller cannot be persisted easily, so it has to be read from XML and fully rebuilt. Unlike in Java application servers, objects cannot be shared between multiple sessions, so pooling is not an option. Thus, in PHP, the XML approach is highly inadvisable, since it means the XML files are parsed for every single page viewed on the site. Although PHP's parser is James Clark's Expat, one of the fastest parsers around (written in C), the resulting DOM still has to be walked in order to create the controller object (which becomes more and more complex as the site grows). That is heavy overhead, no matter how you look at it.

There are a few reasons why you might want XML in a web framework, but they do NOT apply to PHP apps. A quick myth list:

  • it's “human-readable”. Come on, PHP is stored in plain readable ASCII files, and even if you use Zend products to compile and encrypt your code, why on earth would you allow the controller state machine to be read and modified on the deployment site?
  • it's easier to modify than code. This is probably true for Java and its complex frameworks, but the equivalent PHP code is significantly simpler than the Java version.
  • it can be generated automatically from code by tools such as XDoclet or via an IDE. Only if you're writing Java, because PHP has no such tools.

This means that the only serious candidate (among those considered here) for a PHP MVC framework is Phrame, which stores the sitemap as a multi-dimensional hashmap. So either go with Phrame or, for small sites (fewer than ~50 screens), you'll be better off writing your own mini-framework, with a state machine implemented as a hashed array of arrays and some session persistence in the database. I chose to serialize and persist an array of primitive variables, using PHPSESSID as the primary key to retrieve and unserialize the array, all coupled with a simple "aging" mechanism for those users with the nasty habit of leaving the site without logging out first.
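In PHP that sitemap is literally an array of arrays; just to make the structure concrete, here is the same idea rendered in Java, with made-up screens and actions – the whole "controller" boils down to a lookup of (current screen, action) -> next screen:

import java.util.HashMap;
import java.util.Map;

public class SitemapSketch {
    public static void main(String[] args) {
        Map<String, Map<String, String>> sitemap = new HashMap<String, Map<String, String>>();

        Map<String, String> home = new HashMap<String, String>();
        home.put("search", "SearchResults");
        home.put("login", "Dashboard");
        sitemap.put("Home", home);

        Map<String, String> results = new HashMap<String, String>();
        results.put("details", "CompanyDetails");
        results.put("back", "Home");
        sitemap.put("SearchResults", results);

        // dispatching a request is a double lookup – no XML parsing on every hit
        String next = sitemap.get("Home").get("search");
        System.out.println(next); // prints SearchResults
    }
}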

Finally, a last word of advice: use PEAR! This often-overlooked library of PHP classes includes a few man-years of quality work. You'll get a generic database connection layer (PEAR DB) along with automatic generation of model classes mapped onto the database schema (DB_DataObject), a plethora of HTML tools (tables, forms, menus) and several templating systems to choose from. All in a nice, easy to install and upgrade package.

Don't put a heavy burden on your upgrade cycle by using heterogeneous packages downloaded from different sites on the web; just use PEAR.

Or simply ignore the PHP offer and wait patiently for your next Java project. Vacations are almost over.

Written by Adrian

October 29th, 2004 at 8:53 am

Posted in Tools


(Undocumented?) HTTP-based REST API in Jira

leave a comment

While a REST API is mentioned in the Top 10 Reasons to use Jira, I can hardly find any reference to such an external API in the documentation (I mean, besides XML-RPC and the upcoming SOAP support) and even Google can't clear up the issue. But I can confirm that it is possible to fully(?) automate Jira using plain HTTP requests.

Recently, I was asked to programmatically log a user into Jira and open a browser with the results of an advanced search. This is part of a really slick integration between one of my employer's products and a JIRA server – mainly helping our testers avoid duplicate bug reports and also providing insightful statistics for the QA department.

I started doing it the hard way, trying to obtain and pass JSESSIONID and other useless stuff between my application and the browser, until I realized that Jira can be fully controlled through HTTP in a REST-like manner. Let me explain. Normally, if you are not logged into Jira and launch a browser with a carefully crafted search request (well, you know how to reverse engineer POSTs into GETs, don't you?), then a graceful invitation to log in is all you'll ever get. But if you append “&os_username=” + USER + “&os_password=” + PASS to the end of your request – bang! – not only do you get the desired search results, but you are automagically logged into Jira for the rest of the browsing session. Yes, yes, yes: here I come, integration! A couple of hours later, I was able to programmatically insert bug reports, extract bug details, compute statistics and open customized advanced searches.
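A stripped-down illustration of the trick in Java – the host and search parameters below are placeholders (reverse-engineer the real ones from your own Jira's issue navigator); the os_username/os_password pair is the part that matters:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class JiraSearchDemo {
    public static void main(String[] args) throws Exception {
        String user = "qa.bot";
        String pass = "secret";

        // Placeholder search URL – craft the real one from your Jira's search form.
        String search = "http://jira.example.com/secure/IssueNavigator.jspa?reset=true"
                + "&summary=" + URLEncoder.encode("crash on login", "UTF-8");

        // The trick: append the credentials and Jira logs you in for the session.
        String url = search
                + "&os_username=" + URLEncoder.encode(user, "UTF-8")
                + "&os_password=" + URLEncoder.encode(pass, "UTF-8");

        HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection();
        BufferedReader in = new BufferedReader(new InputStreamReader(connection.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line); // the search results page, already authenticated
        }
        in.close();
    }
}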

To quote a classic: Jira! Jira! Jira! Docs would be nice, though.

PS I'm testing this on a Jira Professional 2.6.1-#65. YMMV.

Written by Adrian

September 17th, 2004 at 4:45 pm

Posted in Tools
