Netuality

Taming the big, bad, nasty websites

Archive for the ‘Google’ tag

SEO eye for the Tapestry guy

leave a comment

One of my previous customers has a Jakarta Tapestry [3.0.x] based site. The site is subscription-based, but it also has a public area – if you browse each and every link you should be able to view few thousand of [dynamically generated] pages. No SEO* consulting was involved in building the site. To put it simply : I’ve got some specs and HTML templates: developed, deployed, bugfixed and hasta la vista…

More than 6 months later [!], the site is still alive, which is good, but it doesn’t really spot impressive traffic figures and growth. Basically, all the traffic it gathers seems to come from existing subscribers and paid ads, very low level of organic and almost zero traffic from major engines such as Google (although it was submitted to quite a lot of engines and directories).
Lo and behold, there must be something really nasty going on since a quick query on Google with site:www.javaguysite.com** gives only one freaking result : the home page. What means: Google has indexed ONLY the entry page – same thing happens with all the other major search indexes. And guess what : nobody is going to find the content if it isn’t even indexed by search engines.

Making friends with your public URLs

The problem : Tapestry URLs are too ugly for search engines. Looking at the source of my navigation menu, I found little beasts such as

http://www.javaguysite.com/b2b?service=direct/1/Home/$Border.$Header.$LoggedMenu.$CheckedMenu$1.$DirectLink&sp=SSearchOffers#a

For a Tapestry programmer it is simple direct link from inside a component embedded in other components, but for search engine bots it is an overly complex link to a dynamic page, which will NOT be followed. Thus, if you want these little buggers to crawl all over your site and index all the pages, make’em think it’s a simple static site such as :

http://www.javaguysite.com/b2b/bla/bla/bla/SearchOffers.html

In SEO consultants slang, it’s called “friendly URLs”***.

You don’t have to make all your links friendlier. For instance, no need to touch the pages available only to subscribers as they’ll never be available for public searching. In the public area, make friendly URLs only to access those pages containing relevant content.
The method is called URL rewriting. Rewriting means that the web server is transforming the request URL, behind the scenes, using regular expressions, in a totally transparent manner. Thus, the client browser or the bot “thinks” it reaches a certain URL, however a different address is sent to the servlet container. The rewriting is performed either by:

1. using a servlet filter such as urlrewrite.

or

2. with mod_rewrite in Apache. I do use Apache as a proxy server, in order to perform transparent and efficient gzip-ing on the fly as described in one of my previous blog posts. Now, I only had to add the mod_rewrite filter and I’m ready to go.

Only minor syntax differences exists between the regular expressions in the filter and the Apache module. I was able to seamlessly switch between the two, as I use the servlet filter in development environment and we did Apache proxying in production.

The devil is in the details

Now we’re sure that dynamic pages from the public area will be searchable after the Google bot crawls them. Problem is : all pages of a single category will have the same title. Like for instance “Company details” for all the pages containing … umm, company details. And when you have thousands of companies in the database, that makes helluva lot of pages with the same title ! Besides, keywords contained in the page title play an important role in the search position for the specific keyword. The conclusion: make your page titles as customised as possible: put not only the page type, but also relevant data from the page content – in our case, the company name and whynot the city where the business is located. This is easy with Tapestry:

<html jwcid="@Shell" title="ognl:page.pageTitle">

and then define a customized

public String getPageTitle();

in all the Page classes (with maybe an abstract getPageTitle in the base page class, supposing you have one defined in the project, which one normally should).

The same type of reasoning applies for page keywords and description metatags, as they are taken into account by most of the search engines. Use them, make them dynamic and insert content-relevant data. Don’t just rely on generic keywords as the competition is huge on these: a bit of keyword long tail can do wonders. Don’t overdo it and don’t try to trick Google as you may have some nasty surprises in the process. And if you can afford, have some SEO consulting for the keywords and titles content.

There’s another rather obsolete nevertheless important HTML tag : the H1. Who on Earth needs H1 when you got CSS and can name your headings according to a coherent naming schema ? Well, apparently Google needs H1 tags, reads H1 tags and uses the content of H1 tags to compute search relevancy. So make sure to redefine H1 in your CSS and use it inside page content. People seem to believe it has something to do with a sort of HTML semantics …

That’s it, at least from a basic technical point of view. For a longer discussion about SEO read Roger Johansson’s Basics of search engine optimisation as well as the massive amounts of freely available literature available on the web concerning this subject. Just … search for it.

*SEO = Search Engine Optimization.

**Names changed to protect the innocents.

***Supposedly, Tapestry 3.1, currently in alpha stage, has its URLs way friendlier than 3.0. However, don’t use an alpha API on a production site.

Written by Adrian

September 25th, 2005 at 8:21 pm

Posted in AndEverythingElse

Tagged with , , , ,

JCS: the good, the bad and the undocumented

one comment

Java Caching System is one of the mainstream opensource and free Java caches*, along with OSCache, EHCache and JbossCache. Choosing JCS may be the subject of an article by itself, since this API has a vastly undeserved reputation of being a buggy, slow cache. Exactly this reputation has motivated the development of EHCache, which is in fact a fork of JCS. But, JCS has evolved a lot lately and is now a perfectly valid alternative for production; it still has a few occasional bugs, but nothing really bothersome. I've recently had this interesting experience of cleaning up and tuning a website powered by JCS. This dynamic Java-based site is exposed to a healthy traffic of 0.6-1.2 Mhits/day, with 12.000-25.000 unique visitors daily, and caching has greatly improved its performance. This article is a collection of tips and best practices not comprised (yet?) in the official JCS documentation.

Where to download JCS

This is usually the first question when one wants to use JCS. Since JCS is not a 'real' Jakarta project, but a module of the Turbine framework, there is no downloading link available on the main site. If you search on Google, this question has popped many times on different mail lists or blogs and it usually has two kinds of answers, both IMHO wrong:

  • download the source of Turbine and you'll find JCS in the dependencies. No, you won't, because Turbine is build with Maven, which is supposed to automagically download all the needed dependencies and bring them to you on a silver plate. Meaning: tons of useless jars hidden somehwere in the murky depths of wherever Maven thinks is a nice install location. Uhh.
  • build it from scratch. Another sadistic advice, given that JCS is also build with Maven. So you'll not only need to checkout the sources from CVS, but also install Maven. Then try to build JCS. And eventually give up. Like for instance in my case, I installed the monster^H^H^H^H^H^H wonderful build tool, then ran 'maven jar'. Instead of the expected result [you know, building the jar !] Maven performed a series of operations like running unit tests, washing teeth, cooking a turkey. Well, I suppose it was doing this, because I couldn't read the huge gobs of text running quickly on the screen. At the end, it miserably failed, with no logical explanations (too many explanations is the modern equivalent of unexplained). So I gave up. Again.

Fortunately, some kind souls at Jakarta (think of these developers as of a sort of secret congregation) provide clandestine latest 'demavenized' binary builds in obscure places; for JCS, the
location is here. I used the last 1.1 build without problems for a few weeks and I strongly recommend it.

Using the auxiliary disk cache

There's a common misconception that one doesn't need no stinkin' disk cache. Even on Hibernate site the example JCS configurations has the auxiliary disk cache commented out. Maybe this comes from the fact that JCS disk cache suffered from a memory leak (not true any more) or from the simplistic reasoning that disk access is inherently slower than memory access. Well it surely is, but at the same time it's probably much faster than some of the database queries, which could benefit from caching.

Also, it is interesting to note that incorrectly dimensioned 'memory' caches will make the Java process overflow from main memory to the swap disk. So you'll use the disk anyway, only in an un-optimized manner !

I wouldn't advise you to activate the auxiliary cache on disk without limiting its size, otherwise, the cache file would grow indefinitely. Controlling cache size is done by 2 parameters (MaxKeySize and OptimizeAtRemoveCount) example:

jcs.auxiliary.DC.attributes.MaxKeySize=2500
jcs.auxiliary.DC.attributes.OptimizeAtRemoveCount=2500

Only MaxKeySize is not enough, since it will only limit the number of keys pointing to values which are in disk cache. In fact, removing a value from the disk cache will only remove its key. But, the second (OptimizeAtRemoveCount) parameter will tell the cache to recreate a new file after a certain number of 'removes'. This new cache file will keep only the cached values corresponding to the remaining keys, thus cleaning all obsolete values, and of course will replace the old cache file. The size of disk cache and the remove count is of course subject of tuning in your own environment.

Tuning the memory shrinker

Although one of the JCS authors specify that the shrinker “is rarely necessary”, it might come handy especially in memory constrained environments or for really big caches. With one exception: be careful and specify the MaxSpoolPerRun parameter (undocumented yet, but discussed on the mailing list) otherwise the shrinking process might lead to spikes in CPU usage. I am using the shrinker like that:

jcs.default.cacheattributes.UseMemoryShrinker=true
jcs.default.cacheattributes.ShrinkerIntervalSeconds=3600
jcs.default.cacheattributes.MaxSpoolPerRun=300

YMMV.

Cache control via servlet

Again, undocumented, but people seem to know about it. The servlet class is org.apache.jcs.admin.servlet.JCSAdminServlet but do not expect it to work out of the box ! This servlet uses Velocity thus you'll need to :

  • initialize Velocity before trying to access the servlet (or lazy, but you'll have to modify the servlet source)
  • copy the templates into the Velocity template location. The templates (JCSAdminServletDefault.vm and JCSAdminServletRegionDetail.vm) are not (bug ? feature ?) in the jar, so you'll have to retrieve them from the CVS repository. For the moment, they are at this location.

These are my findings. I would have really appreciated to have these few pieces of info before starting the cache tuning. If anybody thinks this article is useful and/or needs to be completed, write a comment, send an email, wave hands. I'll try to come up with more details.

*For a complete list, see the corresponding section at Java-source.

Written by Adrian

February 17th, 2005 at 9:30 pm

Posted in Tools

Tagged with , , , ,

XML descriptors in PHP frameworks – considered harmful

2 comments

No, I am not a seasoned PHP programmer and I do not intend to become one. But we do live in a harsh economy where all IT projects are worth considering, thus my occasional incursions in the world of of PHP-driven websites.

I am not new to PHP either, but – coming from a Java world – immediately felt the need of a serious MVC framework.
Nobody wants to reinvent the wheel each time a new website is built. Just launch the obvious “PHP MVC framework” on Google and the results pages will be dominated by four open-source projects :

  • PHPMVC is probably the oldest project and implements a model 2 front controller/li>
  • Ambivalence declares itself as a simple port of
    Java Maverick project
  • Eocene a “simple and easy to use OO web development framework for PHP and ASP.NET”,implementing MVC and front controller
  • Phrame is a Texas Tech University project released as LGPL, heavily inspired by Struts.

The choice is not easy. There are no examples of industrial-quality sites built with either of these frameworks.
(some may say there are no examples of industrial quality sites built with PHP but let's ignore these nasty people for now).

There are no serious comparisons of the four frameworks, neither feature-wise nor performance-wise.
In the tradition of open-source projects, the documentation is rather scarce and examples are “helloworld”-isms.
Yes I am a bloody bastard for pointing out these aspects – since the authors are not paid to release these projects – and perhaps I could contribute myself with some documentation. However, when under an aggressive schedule I feel it's easier to write my own framework instead of understanding other people's code and document it thoroughly.
However, I have a nice hint for you. The first three frameworks are using XML files for controller initialization (call it “sitemap”, “descriptor” or otherwise; but it's just a plain dumb XML file). So you should safely ignore them in a production environment.

Because, the “controller” is nothing more than a glorified state machine. The succession of states and transitions (or “actions” or whatever) should be persisted somewhere. XML is probably a nice choice for Java frameworks, where the files are parsed and the application server keeps in memory a pool of controller objects.

But: PHP sessions are stateless. The only way of keeping state is via filesystem or database, usually based on an ad-hoc generated unique key, which is kept in the session cookie. More: PHP allows native serialization only for primitive variables; a complex object such as the controller can not be persisted easily, so it has to be retrieved from XML and fully rebuilt. Unlike in Java appservers, objects cannot be shared between multiple session, thus pooling is not an option. Thus, in PHP, the XML approach is highly un-recommended, since this means that the XML files are parsed for each page that is viewed on the site. Although PHP's parser is James Clarks's Expat, one of the fastest parsers right now (written in C), note that the DOM object must be browsed in order to create the controller object (which is becoming more and more complex as the site grows). This is called heavy overhead, no matter how you look at it.

There are a few reasons about why you need XML in a web framework, however this does NOT apply to PHP apps. Myth quicklist:

  • it's “human-readable”. Come on, PHP is stored is ASCII readable files and even if you use Zend products to compile and encrypt your code, why on earth would you allow readability and modification of the controller state machine on the deployment site ?
  • easier to modify than in code. This is probably true for Java and complex frameworks, but in PHP is significantly simpler than Java.
  • automatically generated from code by tools such as Xdoclet or via an IDE. If you're writing it in Java, because PHP does not have such tools.

This means that the only serious candidate (between these considered here) for a PHP MVC framework is Phrame, which stores the sitemap as a multi-dimensional hashmap. Thus, you should either consider Phrame or (for small < 50 screens) sites you'll be better off writing your own mini-framework, with a state machine implemented as a hashed array of arrays and some session persistence in the database. I chose to serialize and persist an array containing primitive variables, using PHPSESSID as the primary key in order to retrieve and unserialize the array, all coupled with a simple "aging" mechanism for these users with the nasty habit of leaving the site without performing logout first.

Finally a last world of advice : use PEAR ! This often overlooked library of PHP classes includes a few man-years of quality work. You'll get a generic database connection layer (PEAR-DB) along with automatic generation of model classes mapped on the database schema (DB_DataObjects), a plethora of HTML tools (tables, forms, menus) and some templating systems to choose from. All in a nice easy to install and upgrade package.

Don't put a heavy burden on your upgrade cycle using heterogenous packages downloaded from different sites on the web, just use PEAR.

Or simply ignore the PHP offer and wait patiently for your next Java project. Vacations are almost over.

Written by Adrian

October 29th, 2004 at 8:53 am

Posted in Tools

Tagged with , , , ,

(Undocumented?) HTTP-based REST API in Jira

leave a comment

While the REST API is mentioned in the Top 10 Reasons to use Jira, I can hardly see any reference to such an external API in the documentation (I mean, besides XML-RPC and upcoming SOAP support) and even Google can't clear the issue. But I can confirm you, it is possible to fully(?) automate Jira using HTTP requests.

Recently, I was asked to programatically login a user into Jira and open a browser with the results of an advanced search. This is part of a really slick integration between one of my employer's products and a JIRA server – mainly helping our testers to avoid bug duplication and also providing insightful statistics for QA dept.

I've started doing it the hard way, trying to obtain and pass JSESSIONID and other useless stuff between my application and the browser, until I realized that Jira can be fully controlled through HTTP in a REST-like manner. Let me explain. Normally, if you are not logged into Jira and launch a browser with a carefully crafted search request (well, you know how to reverse enginner POSTs into GETs don't you ?) – then a graceful invitation to log in is everything you'll ever obtain. But, if you add at the end of your request “&os_username=” + USER + “&os_password=”+ PASS bang ! not only you obtain the desired search results, but you are automagically logged into Jira for the rest of the browsing session. Yes, yes yes : here I come, integration ! A couple of hours later, I am able to programatically insert bug reports, extract bug details, compute statistics and open customized advanced searches.

To quote a classic : Jira ! Jira ! Jira !. Docs would be nice, though.

PS I'm testing this on a Jira Professional 2.6.1-#65. YMMV.

Written by Adrian

September 17th, 2004 at 4:45 pm

Posted in Tools

Tagged with , ,

… and a few lesser known Java tools

leave a comment

Very very busy lately, but I'd like to share some knowledge about a few useful Java OSS gems that were not easy to find. Mr. Google, please 'index this':

1. Aspirin is a self-contained SMTP server (send only) written in Java, open-sourced and free. It simplifies configuration and deployment by allowing your app server to send emails without passing through an external SMTP server. The project is heavily inspired from Apache James code (thus its licencing terms). The few problems I see right now are : possible performance issues when sending big volumes of mail, behavior still erratic (sometimes sending fails without plausible reason), failure reports which do not provide reasons of failure. However, the thingie works pretty well and is a big time saver because, well, configuration is not the most pleasant part of a complex server.

2. If you produce a lot of reports and want to send them automatically on a remote printing server you may use JIPSI (quickstart in English, but site in German) which implements CUPS as a Java Print Service API. This little beauty was found by one of my coworkers and the 'report guys' seem to be making good use of it.

3. You're in for some serious processing on OpenOffice documents using the freely available DTD's (downloadable from the OOo CVS server) ? Then hold your horses ! I've tried to make sensible use of them and failed abruptly. Let's just say that those DTDs are a big pain in the a**: to begin with, no tool is able to transform them into a schema. I've tried XmlSpy and a few other exotic softwares, without success. Even basic stuff like parsing with a validating parser does not work. So much for the usefulness of open standards. Eventually, I have ended up by using the excellent Writer2Latex. Don't be fooled by the name, you may do all sorts of conversions with it, including Writer to XHTML, which I was interested in. You can even write your own plugin to boot some exotic formats, because Writer2Latex is built around a modified version of XMerge. Officially, XMerge is the solution for visualizing documents on 'Small Devices', but it really is a fancy plugin-based document converter. Most probably (too lazy to check the sources), SAX-based with a nonvalidating parser. Go figure.

4. The Eclipse download site has now links to a BitTorrent tracker. I just used it succesfully in order to download RC3 at a reasonable download rate (anyway, being on wifi right now I wasn't expecting blazing speed). I found interesting that all the other peers were using Azureus, a torrent client written in Java+SWT. Azureus is a fantastic source of knowledge, choke full of tips and techniques for writing professional-looking and very responsive SWT apps. But not only : Azureus is also a great example about how to write a plugin-ready app, which performs automatic updates from the net. Not bad, at all.

Written by Adrian

June 21st, 2004 at 8:14 pm

Posted in Tools

Tagged with , , ,