Netuality

Taming the big bad websites

Archive for September, 2005

SEO eye for the Tapestry guy


One of my previous customers has a Jakarta Tapestry [3.0.x] based site. The site is subscription-based, but it also has a public area – if you browse each and every link you should be able to view a few thousand [dynamically generated] pages. No SEO* consulting was involved in building the site. To put it simply: I got some specs and HTML templates, then developed, deployed, bugfixed and hasta la vista…

More than 6 months later [!], the site is still alive, which is good, but it doesn’t really sport impressive traffic figures or growth. Basically, all the traffic it gathers seems to come from existing subscribers and paid ads, with a very low level of organic traffic and almost none from major engines such as Google (although it was submitted to quite a lot of engines and directories).
Lo and behold, there must be something really nasty going on, since a quick query on Google with site:www.javaguysite.com** gives only one freaking result: the home page. Which means Google has indexed ONLY the entry page – the same thing happens with all the other major search indexes. And guess what: nobody is going to find the content if it isn’t even indexed by search engines.

Making friends with your public URLs

The problem: Tapestry URLs are too ugly for search engines. Looking at the source of my navigation menu, I found little beasts such as

http://www.javaguysite.com/b2b?service=direct/1/Home/$Border.$Header.$LoggedMenu.$CheckedMenu$1.$DirectLink&sp=SSearchOffers#a

For a Tapestry programmer it is a simple direct link from inside a component embedded in other components, but for search engine bots it is an overly complex link to a dynamic page, which will NOT be followed. Thus, if you want these little buggers to crawl all over your site and index all the pages, make ’em think it’s a simple static site such as:

http://www.javaguysite.com/b2b/bla/bla/bla/SearchOffers.html

In SEO consultants’ slang, these are called “friendly URLs”***.

You don’t have to make all your links friendlier. For instance, there’s no need to touch the pages available only to subscribers, as they’ll never be available for public searching. In the public area, create friendly URLs only for the pages containing relevant content.
The method is called URL rewriting. Rewriting means that the web server transforms the request URL behind the scenes, using regular expressions, in a totally transparent manner. Thus, the client browser or the bot “thinks” it reaches a certain URL, while a different address is sent to the servlet container. The rewriting is performed either by:

1. using a servlet filter such as urlrewrite.

or

2. with mod_rewrite in Apache. I already use Apache as a proxy server, in order to perform transparent and efficient gzip-ing on the fly as described in one of my previous blog posts, so I only had to add the mod_rewrite rules and I was ready to go – see the sketch below.
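
For illustration only, a minimal mod_rewrite sketch along these lines could map a friendly URL back to the real dynamic one. The path, page name pattern and target query string here are hypothetical – substitute the actual URLs your Tapestry application answers to:

# Hypothetical sketch: rewrite the friendly, static-looking URL into the real
# dynamic Tapestry URL; the rewritten request is then handled by the existing
# proxy configuration. Adjust pattern and target to match your own application.
RewriteEngine On
RewriteRule ^/b2b/bla/bla/bla/([A-Za-z0-9]+)\.html$ /b2b?service=page/$1 [PT,L]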

Only minor syntax differences exist between the regular expressions in the filter and the Apache module. I was able to switch seamlessly between the two, as I use the servlet filter in the development environment and Apache proxying in production.

The devil is in the details

Now we’re sure that the dynamic pages from the public area will be searchable once the Google bot crawls them. Problem is: all pages of a single category will have the same title. Like, for instance, “Company details” for all the pages containing … umm, company details. And when you have thousands of companies in the database, that makes a helluva lot of pages with the same title! Besides, keywords contained in the page title play an important role in the search position for the specific keyword. The conclusion: make your page titles as customised as possible: put in not only the page type, but also relevant data from the page content – in our case, the company name and why not the city where the business is located. This is easy with Tapestry:

<html jwcid="@Shell" title="ognl:page.pageTitle">

and then define a customized

public String getPageTitle();

in all the Page classes (with maybe an abstract getPageTitle in the base page class, supposing you have one defined in the project, as one normally should).
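
As a rough sketch, a page class along these lines would do – the class and property names are invented for illustration, and the abstract getCompany() accessor is assumed to be backed by a Tapestry page property holding the displayed company:

// Hypothetical example of a Tapestry 3 page class exposing a customised title.
public abstract class CompanyDetailsPage extends org.apache.tapestry.html.BasePage {

    // Abstract accessor for the 'company' page property; Tapestry enhances the
    // class at runtime and supplies the implementation. Company is an assumed
    // domain class with getName() and getCity().
    public abstract Company getCompany();

    // Referenced from the template as title="ognl:page.pageTitle".
    public String getPageTitle() {
        Company company = getCompany();
        // Page type plus content-relevant data: company name and city.
        return "Company details - " + company.getName() + " (" + company.getCity() + ")";
    }
}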

The same type of reasoning applies to the page keywords and description meta tags, as they are taken into account by most search engines. Use them, make them dynamic and insert content-relevant data. Don’t just rely on generic keywords, as the competition on these is huge: a bit of keyword long tail can do wonders. Don’t overdo it and don’t try to trick Google, as you may have some nasty surprises in the process. And if you can afford it, get some SEO consulting for the keywords and title content.
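
Whatever mechanism you use to emit them, the rendered head of each page should end up containing something along these lines (company name and wording invented for illustration):

<!-- Hypothetical rendered output: every page carries its own, content-derived metadata -->
<title>Company details - Acme Widgets SRL (Bucharest)</title>
<meta name="description" content="Contact details and current B2B offers of Acme Widgets SRL, Bucharest"/>
<meta name="keywords" content="Acme Widgets SRL, widgets, Bucharest, B2B offers"/>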

There’s another rather old-fashioned but nevertheless important HTML tag: H1. Who on Earth needs H1 when you’ve got CSS and can name your headings according to a coherent naming schema? Well, apparently Google needs H1 tags, reads H1 tags and uses the content of H1 tags to compute search relevancy. So make sure to redefine H1 in your CSS and use it inside the page content. People seem to believe it has something to do with a sort of HTML semantics …

That’s it, at least from a basic technical point of view. For a longer discussion about SEO, read Roger Johansson’s Basics of search engine optimisation, as well as the massive amount of literature freely available on the web concerning this subject. Just … search for it.

*SEO = Search Engine Optimization.

**Names changed to protect the innocent.

***Supposedly, Tapestry 3.1, currently in alpha stage, has URLs that are way friendlier than 3.0’s. However, don’t use an alpha API on a production site.

Written by Adrian

September 25th, 2005 at 8:21 pm

Posted in AndEverythingElse


Aggregating webservers logs for an Apache cluster


One of the ways of scaling a heavy-traffic LAMP web application is to transform the server into a cluster of servers. Some may opt for the easy path of an overpriced appliance load balancer, but the most daring [and budget-restrained] will go for free software solutions such as pound or haproxy.

Although excellent performers, these free balancers lack many features found in their commercial counterparts. One of the most embarrassing misses is the lack of flexibility in producing decent access logs. Both pound (LogLevel 4) and haproxy (option httplog) can generate Apache-like logs in their logfiles or the syslog, but neither offers the level of customization found in Apache. Basically, you're left with using the logs from the cluster nodes. These logs present a couple of problems:

  • the originating IP is always the internal IP of the balancer
  • there is one log per node, while log analysis tools usually expect a single log file per report

The first problem is relatively easy to solve. Start by activating the X-Forwarded-For header in the balancing software: for instance, by configuring haproxy with option forwardfor. A relatively unknown Apache module called mod_rpaf will then take care of the tedious task of extracting the remote IP from the X-Forwarded-For header and copying it into the remote address field of the Apache logs. For Debian Linux fans, it's nice to note that libapache-mod-rpaf is available via apt. A minimal configuration sketch follows below.
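
Something along these lines should do – the balancer address is a placeholder, and the exact directive names may differ slightly between mod_rpaf versions:

# haproxy side (in the listen/backend section): append the client IP to each request
option forwardfor

# Apache side (on every cluster node), assuming the Debian package has already
# loaded the module: replace the balancer's internal IP in the logs with the
# address taken from X-Forwarded-For. 10.0.0.1 is a placeholder for the
# balancer's internal address.
RPAFenable On
RPAFproxy_ips 10.0.0.1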

Now that you have N realistic Apache weblogs, one per cluster node, you just have to concatenate them and put them in a form understandable by your log analysis tools. Simply cat-ing them into a big file won't cut it [arf], because new records will appear in different regions of the file instead of being appended chronologically to its tail. The easiest solution in that case is to sort these logs. Although I am aware of the vague possibility of sorting on the Apache datetime field, even taking the locale into account, I confess my profound inability to find the right combination of parameters. Instead, I chose to add a custom field to the Apache log, using the following log format:


LogFormat "%h %V %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Cookie}i\" %c %T \"%{%Y%m%d%H%M%S}t\"" combined

where %{%Y%m%d%H%M%S}t is a projection of the request datetime into an easily sortable integer, for instance 20050925120000 – the equivalent of 25 Sep 2005 12:00:00. Now, using the quote as a field separator in the Apache log format, it is easy to sort on this custom field [the 10th]:

sort -b -t '"' -T /opt -k 10 /logpath/access?.log > /logpath/full.log

And there you are, having this nice huge log file to munch on. On a standard P4 with 1GB of RAM it takes less than a minute to obtain a 2GB log file…

In case the web traffic is really big and the log analysis process impacts the existing web activity, use a separate machine instead of overloading one of the cluster nodes. For automated transfer of the log files, generate ssh keys on all the cluster nodes to allow passwordless login from the web analytics server into the web logfiles' owner account. Traffic between these machines is minimized by installing rsync on them and then using rsync over ssh:

rsync -e ssh -Cavz www-data@node1:/var/log/apache/access.log /logpath/access1.log
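
Putting the pieces together, a small cron-able script along these lines could pull the logs from every node and produce the sorted aggregate (hostnames, node count and paths are placeholders for this sketch):

#!/bin/sh
# Hypothetical aggregation script - hostnames and paths are placeholders.
# Pull the current access log from each cluster node over ssh...
for i in 1 2 3; do
    rsync -e ssh -Cavz www-data@node$i:/var/log/apache/access.log /logpath/access$i.log
done
# ...then merge them by sorting on the custom datetime field
# (the 10th quote-delimited field, see the LogFormat above).
sort -b -t '"' -T /opt -k 10 /logpath/access?.log > /logpath/full.log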

Now you know all the steps required to fully automate the log aggregation and its processing. One may ask why all the fuss, when in fact a simple subscription to an ASP-style web analytics provider should suffice. Yes, that's true, however… The cluster that I've recently configured with this procedure gets a few million hits per week. Yes, we're talking about page hits. At this level of traffic, the cost of a web analytics service starts at $10,000/year. That's certainly a nice amount of money, which will instead let you buy your own analytics tool [such as, for instance, Urchin v5] and still keep some cash from the first year. Some might say that these commercial tools have their own load balancer analysis techniques. Sure, but it all comes at a cost. In the case of Urchin, you just saved $695/node and earned some bragging rights with your mates. Relax and enjoy.

PS: Yes, we're talking about a LAMP solution serving millions of page hits, not J2EE… Maybe I'll get into details on another occasion, assuming somebody is interested. Leave a comment, send a mail or something.

Written by Adrian

September 25th, 2005 at 8:20 pm

Posted in Tools
