Netuality

Taming the big bad websites

Archive for the ‘web’ tag

My presentation at Wurbe#5

Wurbe is an informal web developers' meetup group from Bucharest, Romania. Meeting #5 focused on automated testing (unit testing, TDD, BDD and more). This is my presentation:

Written by Adrian

January 22nd, 2008 at 2:01 pm

Java Persistence with Hibernate – the book, my review

You have to know that I’ve tried. Honestly, I did. I hoped to be able to read each and every page of “Java Persistence with Hibernate” (the revised edition of “Hibernate in Action”), by Christian Bauer and Gavin King. But I gave up before reading a third of it, and then continued by reading only some sections. First, because I know Hibernate: I’ve used it in all the Java projects I’ve been involved with over the last 5 years or so. Second, because the content carried over from the first edition is more than familiar to me. And third, because this second edition is a massive 800+ page book (double the number of pages in the first edition). At that rate, the fourth edition will be sold together with some sort of transportation device, because a mere human will not be able to carry that amount of paper. How did this happen?

Hibernate is the perfect example of a successful Java open-source project. Initially started as a free alternative to commercial object-relational mapping tools, it quickly became mainstream. Lots of Java developers around the world use Hibernate for the data layer of their projects. It’s very comfortable: just set some attributes or ask for a business object instance, and Hibernate does all the ugly SQL for you. As a developer, you are then comfortably protected from that nasty relational database, and gently swim in a sea of nicely bound objects. Right? No, not exactly. Each object-relational mapping tool has its own ways of being handled efficiently, and this is where books like “Java Persistence with Hibernate” come into play. This book teaches you how to work with Hibernate through a “real-world” example: the CaveatEmptor online auction application.

Since the first edition of the book was written, lots of things have happened in the Hibernate world, and you can see their impact in “Java Persistence with Hibernate”. Most important is the release of the 3.x version line, with its various improvements and new features: code annotations used as mapping descriptors, package naming reorganization inside the API, and above all the standardization under the umbrella of JPA (the Java Persistence API) for smooth integration with EJB 3 inside Java EE 5 servers. And this is a little bit funny. Yesterday, Hibernate was the main proof that it is possible to build industrial-quality projects in a “J2EE-less” environment; today Hibernate has put on a suit and a tie, joined the ranks of JBoss, then Red Hat, and lures unsuspecting Java developers towards the wonderful and (sometimes) expensive world of Java EE 5 application servers. Which is not necessarily a bad move for the Hibernate API, but it’s proof that in order to thrive as an open-source project, you need so much more than a SourceForge account and some passion …

Enough with that, let’s take a look at the book’s content. Some 75% of it is in fact the content of the first edition, updated and completed. You learn what object-relational mapping is, the advantages, the quirks, the recommended way of developing with Hibernate. For better understanding, single chapters from the initial book were expanded into two, sometimes more, chapters. The “unit of work” is now called “a conversation”, and you get a whole new chapter (11) about conversations, which is in fact pretty good material on session and transaction management. Christian and Gavin have done some great writing about concurrency and isolation in the relatively small 10th chapter – a must-read even if you’re not interested in Hibernate but want to understand, once and for all, those concurrent transaction behaviors everyone is talking about. The entire 13th chapter is dedicated to fetching strategies and caching, which is a must-read if you want performance and optimization from your application. There is also a good deal of EJB-, JPA- and EE 5-related material scattered across multiple chapters. And finally, a solid 50-page chapter is pimping the JSF (JavaServer Faces) compliant web development framework, JBoss Seam. I have only managed to read a few pages of this final chapter, so I cannot really comment. Note to self: play a little bit with that Seam thing.

To conclude: is this a fun book? No. Is this the perfect book to convert young open-source fanatics to the wonders of the Hibernate API? Nope. Is this a book to read cover to cover during a weekend? Not even close. Then what is it? First, it’s the best book out there about Hibernate (and there are quite a few on the market right now), maybe even the best book about ORM in Java in general. It has lots of references to EJB, JPA and EE, so it will help you easily sell a Hibernate project to management. Even if the final implementation uses Spring … And finally, it’s the best Hibernate reference money can buy. When you have an issue, open the darn index and search; there’s a 90% chance your problem will be solved. And that’s a nice accomplishment. Don’t get this book because it’s funny, or a nice read about a new, innovative open-source project. Buy it because it helps you grok ORM, write better code, and deliver quality projects.

Written by Adrian

December 17th, 2006 at 2:00 pm

Posted in Books

Programming is hard – the website

A newcomer in the world of “code snippets” sites is programmingishard.com. Although the site is a few months old, only recently has it started to gain some steam. Unlike its competitors Krugle and Koders, this is not a code search engine but an entirely tag-based, user-built snippet repository. The author has a blog at tentimesbetter.com.

To whet your appetite, here is a Python code fragment that I found on the site, for the classic inline conditional which does not exist as such in Python:

n = ['no', 'yes'][thing == 1]

Obviously it has the big disadvantage of computing both values no matter what the condition thing is, but it is very short and elegant. Simple but nice syntactic sugar.
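If you do need the branches evaluated lazily, the old and/or idiom is a workable substitute (and, for the record, Python 2.5 later added a real conditional expression). A minimal sketch, with the usual caveat that the idiom breaks when the “true” value is itself falsy:

thing = 1

# Index a two-element list with a boolean: both branches are evaluated eagerly.
n = ['no', 'yes'][thing == 1]

# Lazy alternative: 'yes' is only evaluated when the condition holds.
# Caveat: this idiom misbehaves if the "true" value is itself falsy ('' or 0).
n_lazy = (thing == 1) and 'yes' or 'no'

print(n)       # yes
print(n_lazy)  # yes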

Written by Adrian

August 2nd, 2006 at 11:07 pm

Posted in Tools

Monitoring memcached with cacti

Memcached is a clusterable cache server from Danga. Or, as they call it, a distributed memory object caching system. Well, whatever. Just note that memcached clients exist for lots of languages (Java, PHP, Python, Ruby, Perl) – the mainstream languages of the web world. A lighter version of the server was rewritten in Java by Mr. Jehiah Czebotar. Major websites such as Facebook, Slashdot, LiveJournal and Dealnews use memcached in order to scale for the huge load they’re serving. Recently, we needed to monitor the memcached servers on a high-performance web cluster serving the Planigo websites. By googling and reading the related newsgroups, I was able to find two solutions:

  • from faemalia.net, a script which is integrated with the MySQL server templates for Cacti. Uses the Perl client.
  • from dealnews.com, a dedicated memcached template for Cacti and some scripts based on the Python client. The installation is thoroughly described here.

These two solutions share the same approach – provide a specialized Cacti template. The charts drawn by these templates are based on data extracted by executing memcached client scripts. Maybe very elegant, but it could become a pain in the dorsal area. Futzing with Cacti templates was never my favorite pastime. Just try to import a template exported from a different version of Cacti and you’ll know what I mean. In my opinion, there is a simpler way, which consists of installing a memcached client on all the memcached servers, then extracting the statistical values using a script. We’ll use the technique described in one of my previous posts to expose script results as SNMP OID values, then track these values in Cacti via the existing generic mechanism. My approach has the disadvantage of requiring a memcached client on every server. However, it makes it very simple to build your own charts and data source templates, as for any generic SNMP data. All you need now is a simple script which will print the memcached statistics, one per line. I will provide one-liners for Python, which will obviously work only on machines having Python and the “tummy” client installed. This is the recipe (the default location of the Python executable on Debian is /usr/bin/python, but YMMV):

1. First, use this one-liner as an snmpd exec:

/usr/bin/python -c "import memcache; print ('%s' % [memcache.Client(['127.0.0.1:11211'], debug=0).get_stats()[0][1],]).replace(\"'\", '').replace(',', '\n').replace('[', '').replace(']', '').replace('{', '').replace('}', '')"

This will display the name of each memcached statistic along with its value, and will let you hand-pick the OIDs that you want to track. Yes, I know it could be done more simply with translate instead of multiple replace calls. Left as an exercise for the Python-aware reader.

2. After you have a complete list of OIDs, use this one-liner:

/usr/bin/python -c "import memcache; print '##'.join(memcache.Client(['127.0.0.1:11211'], debug=0).get_stats()[0][1].values()).replace('##', '\n')"

The memcached statistics will be displayed in the same order, but only their values, not their names.
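If the shell quoting in those one-liners gets unwieldy, the same thing can be written as a small standalone script and pointed to from the snmpd exec line instead. A rough sketch, assuming the python-memcached ("tummy") client and a memcached instance on 127.0.0.1:11211:

#!/usr/bin/python
# Print memcached statistics one per line, suitable as an snmpd exec script.
# Assumes the python-memcached ("tummy") client is installed.
import memcache

client = memcache.Client(['127.0.0.1:11211'], debug=0)
server, stats = client.get_stats()[0]

# One "name: value" line per statistic, like the first one-liner produces.
for name, value in stats.items():
    print('%s: %s' % (name, value))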

And this is the mandatory eye candy:



Written by Adrian

August 2nd, 2006 at 10:54 pm

Posted in Tools

Unicode in Python micro-recipe: from MySQL to webpage via Cheetah

Very easy:

  • start by adding default-character-set=utf8 to your MySQL configuration file and restarting the database server
  • apply this recipe from the ActiveState Python Cookbook (“guaranteed conversion to unicode or byte string”) – roughly sketched below
  • inside the Cheetah template, use the ReplaceNone filter:


#filter ReplaceNone
${myUnicodeString}
#end filter

in order to prevent escaping non-ASCII characters.
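For reference, the conversion recipe boils down to a Python 2 helper along these lines (a minimal sketch, not the exact ActiveState code):

def to_unicode(obj, encoding='utf-8'):
    # Always return a unicode object: pass unicode through untouched,
    # decode byte strings, and stringify everything else.
    if isinstance(obj, unicode):
        return obj
    if isinstance(obj, str):
        return obj.decode(encoding)
    return unicode(obj)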

Now. That’s better.

Written by Adrian

April 14th, 2006 at 11:42 pm

Posted in Tools

Monitor everything on your Linux servers – with SNMP and Cacti

UPDATE: Did you know there’s an official Cacti guide? Find it at Cacti 0.8 Beginner’s Guide. For more info about SNMP, take a look at Essential SNMP, Second Edition.

Two free open-source tools are running the show for network and server-activity monitoring. The oldest, and quite popular among network and system administrators, is Nagios. Nagios does not only do monitoring; it also handles event traps, escalation and notification. The younger challenger is called Cacti. Unlike Nagios, it’s written in a scripting language [PHP], so no compiling is necessary – it just runs out of the box [1]. Cacti’s problem is that – at its current version – it is missing lots of real-time features such as monitoring and notification. All these features are scheduled to be integrated in future versions of the product, but as with any open-source roadmap nothing is guaranteed. Anyway, this article focuses on Cacti integration because it’s what I am currently using.

Cacti is built upon an open-source graphing tool called RRDtool and a communication protocol, SNMP. SNMP is not exactly a developer’s cup of tea, being more of a network administrator’s tool [2]. However, a monitoring server comes in extremely handy for performance measurement and tuning, especially for complex performance behavior which can only be observed long-term: such as the impact of large caches on a web application, or the performance of long-running operations.

But is that specific variable you need to monitor available with SNMP out of the box? There is a strong chance it is. SNMP being an extensible protocol, lots of organizations have registered their own MIBs and respective implementations. Basically, a MIB is a group of unique identifiers called OIDs. An OID is a sequence of numbers separated by dots, for instance '.1.3.6.1.4.1.2021.11'; each number has a special meaning in a standard object tree – in this example, the meaning of '.1.3.6.1.4.1.2021.11' is '.iso.org.dod.internet.private.enterprises.ucdavis.systemStats'. You can even have your own MIB in the '.iso.org.dod.internet.private.enterprises' tree, by applying on this page at IANA.

Most probably you don’t really need your own MIB, no matter how ‘exotic’ your monitoring is, because:

a) it’s already there, in the huge list of existing MIBs and implementations

and

b) you are not bound to the existing official MIBs, in fact you can create your own MIB as long as you replicate it in the snmp configuration on all the servers that you want to monitor.

To take a look at existing MIBs, free tools are available on the net, IMHO the best one being MibBrowser. This multiplatform [Java] MIB browser has a free version which should be more than enough for our basic task. The screen capture shown here depicts a “Get Subtree” operation on the '.1.3.6.1.4.1.2021.11' MIB; the result is a list of single-value OIDs, such as '.1.3.6.1.4.1.2021.11.11.0', which has the alias 'ssCpuIdle.0' and the value 97 [meaning that the CPU is 97% idle]. You can see the alias by loading the corresponding MIB file [select File/Load MIB then choose 'UCD-SNMP-MIB.txt' from the list of predefined MIBs].

From the command line, you can use snmpwalk to display existing MIB values [3]:

snmpwalk -Os -c [community_name] -v 1 [hostname] .1.3.6.1.4.1.2021.11

For instance, walking the '.1.3.6.1.4.1.2021.11' OID (.iso.org.dod.internet.private.enterprises.ucdavis.systemStats) gives:

snmpwalk -v 1 -c sncq localhost .1.3.6.1.4.1.2021.11
UCD-SNMP-MIB::ssIndex.0 = INTEGER: 1
UCD-SNMP-MIB::ssErrorName.0 = STRING: systemStats
UCD-SNMP-MIB::ssSwapIn.0 = INTEGER: 0
UCD-SNMP-MIB::ssSwapOut.0 = INTEGER: 0
UCD-SNMP-MIB::ssIOSent.0 = INTEGER: 4
UCD-SNMP-MIB::ssIOReceive.0 = INTEGER: 2
UCD-SNMP-MIB::ssSysInterrupts.0 = INTEGER: 4
UCD-SNMP-MIB::ssSysContext.0 = INTEGER: 1
UCD-SNMP-MIB::ssCpuUser.0 = INTEGER: 2
UCD-SNMP-MIB::ssCpuSystem.0 = INTEGER: 1
UCD-SNMP-MIB::ssCpuIdle.0 = INTEGER: 96
UCD-SNMP-MIB::ssCpuRawUser.0 = Counter32: 17096084
UCD-SNMP-MIB::ssCpuRawNice.0 = Counter32: 24079
UCD-SNMP-MIB::ssCpuRawSystem.0 = Counter32: 6778580
UCD-SNMP-MIB::ssCpuRawIdle.0 = Counter32: 599169454
UCD-SNMP-MIB::ssCpuRawKernel.0 = Counter32: 6778580
UCD-SNMP-MIB::ssIORawSent.0 = Counter32: 998257634
UCD-SNMP-MIB::ssIORawReceived.0 = Counter32: 799700984
UCD-SNMP-MIB::ssRawInterrupts.0 = Counter32: 711143737
UCD-SNMP-MIB::ssRawContexts.0 = Counter32: 1163331309
UCD-SNMP-MIB::ssRawSwapIn.0 = Counter32: 23015
UCD-SNMP-MIB::ssRawSwapOut.0 = Counter32: 13730

Each of these values has its own significance, like for instance 'ssCpuIdle.0', which announces that the CPU is 96% idle.
In order to retrieve just a single value from the list, use its alias as a parameter to the snmpget command, for instance:

snmpget -Os -c [community_name] -v 1 [hostname] UCD-SNMP-MIB::ssCpuIdle.0

Sometimes you want to monitor something which you do not seem to find in the list of MIBs. Say, for instance, the performance of a MySQL database that you’re pounding pretty hard with your webapp [4]. The easiest way of doing this is to go through a script – SNMP implementations can take the result of any script and expose it through the protocol, line by line.

Supposing you want to keep track of the values obtained with the following script:

#!/bin/sh
/usr/bin/mysqladmin -uroot status | /usr/bin/awk '{printf("%f\n%d\n%d\n", $4/10, $6/1000, $9)}'

The mysqladmin command and a bit of simple awk magic display the following three values, each on a separate line:

  • number of opened connections / 10
  • number of queries / 1000
  • number of slow queries
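If shell and awk are not your thing, the same parsing can be done in Python. A rough sketch, assuming the usual single-line mysqladmin status output (field positions may differ between MySQL versions):

#!/usr/bin/python
# Same idea as the awk one-liner above: parse "mysqladmin status" output and
# print the three values, one per line, for use as an snmpd exec script.
# Expects output like:
#   Uptime: 3600  Threads: 5  Questions: 12345  Slow queries: 10  Opens: ...
import subprocess

output = subprocess.Popen(['/usr/bin/mysqladmin', '-uroot', 'status'],
                          stdout=subprocess.PIPE).communicate()[0]
fields = output.split()

print(float(fields[3]) / 10)   # open connections / 10
print(int(fields[5]) / 1000)   # queries / 1000
print(int(fields[8]))          # slow queries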

It is interesting to note that, while the first value is an instantaneous, gauge-like measurement, the following two are incremental, growing as long as new queries and new slow queries are recorded. Keep this in mind for later, when we track these values.

But for now, let’s see how these three values are exposed through SNMP. The first step is to tell the SNMP daemon that the script has an associated OID. This is done in the configuration file, usually located at /etc/snmp/snmpd.conf. The following line attaches the execution of the script [for example /home/user/myscript.sh] to a certain OID:

exec .1.3.6.1.4.1.111111.1 MySQLParameters /home/user/myscript.sh

The '.1.3.6.1.4.1.111111.1' OID is a branch of '.1.3.6.1.4.1' [meaning '.iso.org.dod.internet.private.enterprises']. We tried to make it look 'legitimate', but obviously you can use any sequence you want here.

After restarting the daemon, let’s interrogate MibBrowser for the freshly created OID (see the accompanying screenshot), or walk it from the command line with snmpwalk -Os -c [community_name] -v 1 [hostname] .1.3.6.1.4.1.111111.1; the result is:

enterprises.111111.1.1.1 = INTEGER: 1
enterprises.111111.1.2.1 = STRING: "MySQLParameters"
enterprises.111111.1.3.1 = STRING: "/etc/snmp/mysql_params.sh"
enterprises.111111.1.100.1 = INTEGER: 0
enterprises.111111.1.101.1 = STRING: "0.900000"
enterprises.111111.1.101.2 = STRING: "18551"
enterprises.111111.1.101.3 = STRING: "108"
enterprises.111111.1.102.1 = INTEGER: 0
enterprises.111111.1.103.1 = ""

Great! Now we have proof that it really works and that our specific values, extracted with a custom script, are visible through SNMP. Let’s go back to Cacti and see how we can make some nice charts out of them [5].

Cacti has this nice feature of defining ‘templates’ that you can reuse afterwards. My strategy is to define a data template for each one of the 3 parameters I want to chart, using the ‘Duplicate’ function applied to the ‘SNMP – Generic OID Template’.

On each duplicated data source template, you have to change the data source title, the name to display in charts, the data source type [use DERIVE for incremental counters and GAUGE for instantaneous values], the specific OID and the SNMP community. Do this for all three values.

Using the three new datasource templates, create a chart template for ‘MySQL Activity’. That’s a bit more complicated, but it boils down to the following procedure, repeated for each of the 3 data sources:

  • add a data source and associate a graph item [I always use AREA for the first graph, as a background, and LINE3 for the others, but it's just a matter of taste]
  • associate labels with current or computed values: CURRENT, AVERAGE, MAX in this example

All the rest is really fine-tuning – deciding on better colors, whether to use autoscale or a fixed scale, and so on. By now, your graph template should be ready to use.

Note that for the incremental values ['DERIVE' type data sources] I’ve used titles such as ‘Thousands of queries/5 min’ – the 5 minutes come from the Cacti poller, which is set to query for data every 5 minutes. The end result is something like this one:

On this real production chart you’ll see a few interesting patterns. For instance, at 3 o’clock in the morning there is a huge spike in all the charted parameters – indeed, a cron’ed script was causing it. From time to time, a small burst of slow queries is recorded – still under investigation. What is interesting here is that these spikes were previously undetectable on the load average chart, which looks clean and innocuous:

To conclude, SNMP is a valuable resource for server performance monitoring. Often, investigating specific parameters and displaying them in tools such as Cacti can bring interesting insights into the behavior of your servers.

Some SNMP implementations in different programming languages:

  • Java: Westhawk’s Java SNMP stack [free, with commercial support], AdventNet SNMP API [commercial, with a feature-restricted non-expiring free version], iREASONING SNMP API [commercial implementation], SNMP4J [free and feature-rich implementation - thank you Mathias for the tip]
  • PHP: client-only supported by the php-snmp extension, part of the PHP distribution [free]
  • Python: PySNMP is a Python SNMP framework, client+agents [free].
  • Ruby: client-only implementation Ruby SNMP [free]

[1] If you’re running Debian, Cacti comes with apt so it’s a breeze to install and run [apt-get install cacti].

[2] A bit out of the scope of this article: SNMP also allows writing values on remote servers, not only retrieving monitored values.

[3] Replace [hostname] with the server hostname and [community_name] with the SNMP community – the default being ‘public’. The SNMP community is a way of authenticating a client to an SNMP server; although the system can be used for pretty sophisticated stuff, most of the time servers have a read-only, passwordless community, visible only on the internal network for monitoring purposes.

[4] In fact, a commercial implementation of SNMP for MySQL does exist.

[5] The procedure described here applies to Cacti v0.8.6.c.

Written by Adrian

March 5th, 2006 at 5:27 pm

Posted in Tools

SEO eye for the Tapestry guy

One of my previous customers has a Jakarta Tapestry [3.0.x] based site. The site is subscription-based, but it also has a public area – if you browse each and every link you should be able to view a few thousand [dynamically generated] pages. No SEO* consulting was involved in building the site. To put it simply: I got some specs and HTML templates; developed, deployed, bugfixed and hasta la vista…

More than 6 months later [!], the site is still alive, which is good, but it doesn’t really sport impressive traffic figures or growth. Basically, all the traffic it gathers seems to come from existing subscribers and paid ads, with a very low level of organic traffic and almost zero visits from major engines such as Google (although it was submitted to quite a lot of engines and directories).
Lo and behold, there must be something really nasty going on, since a quick query on Google for site:www.javaguysite.com** gives only one freaking result: the home page. Which means: Google has indexed ONLY the entry page – and the same thing happens with all the other major search indexes. And guess what: nobody is going to find the content if it isn’t even indexed by the search engines.

Making friends with your public URLs

The problem: Tapestry URLs are too ugly for search engines. Looking at the source of my navigation menu, I found little beasts such as

http://www.javaguysite.com/b2b?service=direct/1/Home/$Border.$Header.$LoggedMenu.$CheckedMenu$1.$DirectLink&sp=SSearchOffers#a

For a Tapestry programmer it is a simple direct link from inside a component embedded in other components, but for search engine bots it is an overly complex link to a dynamic page, which will NOT be followed. Thus, if you want these little buggers to crawl all over your site and index all the pages, make them think it’s a simple static site, such as:

http://www.javaguysite.com/b2b/bla/bla/bla/SearchOffers.html

In SEO consultants’ slang, these are called “friendly URLs”***.

You don’t have to make all your links friendlier. For instance, there is no need to touch the pages available only to subscribers, as they’ll never be available for public searching. In the public area, make friendly URLs only for those pages containing relevant content.
The method is called URL rewriting. Rewriting means that the web server transforms the request URL behind the scenes, using regular expressions, in a totally transparent manner. Thus, the client browser or the bot “thinks” it reaches a certain URL, while a different address is sent to the servlet container. The rewriting is performed either by:

1. using a servlet filter such as urlrewrite.

or

2. with mod_rewrite in Apache. I already use Apache as a proxy server, in order to perform transparent and efficient gzip-ing on the fly as described in one of my previous blog posts. So I only had to add the mod_rewrite rules and I was ready to go.

Only minor syntax differences exist between the regular expressions in the filter and the Apache module. I was able to seamlessly switch between the two, as I use the servlet filter in the development environment and Apache proxying in production.

The devil is in the details

Now we’re sure that dynamic pages from the public area will be searchable after the Google bot crawls them. The problem is: all pages of a single category will have the same title. Like, for instance, “Company details” for all the pages containing … umm, company details. And when you have thousands of companies in the database, that makes a helluva lot of pages with the same title! Besides, keywords contained in the page title play an important role in the search ranking for those specific keywords. The conclusion: make your page titles as customised as possible: include not only the page type, but also relevant data from the page content – in our case, the company name and, why not, the city where the business is located. This is easy with Tapestry:

<html jwcid="@Shell" title="ognl:page.pageTitle">

and then define a customized

public String getPageTitle();

in all the Page classes (with maybe an abstract getPageTitle in the base page class, supposing you have one defined in the project, which you normally should).

The same type of reasoning applies to the page keywords and description meta tags, as they are taken into account by most of the search engines. Use them, make them dynamic and insert content-relevant data. Don’t just rely on generic keywords, as the competition is huge on those: a bit of keyword long tail can do wonders. Don’t overdo it and don’t try to trick Google, as you may have some nasty surprises in the process. And if you can afford it, get some SEO consulting for the keywords and title content.

There’s another rather old-fashioned but nevertheless important HTML tag: the H1. Who on Earth needs H1 when you’ve got CSS and can name your headings according to a coherent naming scheme? Well, apparently Google needs H1 tags, reads H1 tags and uses the content of H1 tags to compute search relevancy. So make sure to redefine H1 in your CSS and use it inside page content. People seem to believe it has something to do with a sort of HTML semantics …

That’s it, at least from a basic technical point of view. For a longer discussion about SEO, read Roger Johansson’s Basics of search engine optimisation, as well as the massive amount of literature freely available on the web concerning this subject. Just … search for it.

*SEO = Search Engine Optimization.

**Names changed to protect the innocents.

***Supposedly, Tapestry 3.1, currently in alpha stage, has much friendlier URLs than 3.0. However, don’t use an alpha API on a production site.

Written by Adrian

September 25th, 2005 at 8:21 pm

Posted in AndEverythingElse

Aggregating webservers logs for an Apache cluster

One of the ways of scaling a heavy-traffic LAMP web application is to transform the server into a cluster of servers. Some may opt for the easy path of an overpriced appliance load balancer, but the most daring [and budget-constrained] will go for free software solutions such as pound or haproxy.

Although excellent performers, these free balancers have lots of missing features compared with their commercial counterparts. One of the most embarrassing misses is the lack of flexibility in producing decent access logs. Both pound (LogLevel 4) and haproxy (option httplog) can generate Apache-like logs in their logfiles or the syslog, however neither offers the level of customization found in Apache. Basically, you're left with using the logs from the cluster nodes. These logs present a couple of problems:

  • the originating IP is always the internal IP of the balancer
  • there is one log per node, while log analysis tools usually expect a single log file per report

The first problem is relatively easy to solve. Start by activating the X-Forwarded-For header in the balancing software: for instance, by configuring haproxy with option forwardfor. A relatively unknown Apache module called mod_rpaf will solve the tedious task of extracting the remote IP from the X-Forwarded-For header and copying it into the remote address field of the Apache logs. For Debian Linux fans, it's nice to note that libapache-mod-rpaf is available via apt.

Now that you have N realistic Apache weblogs, one per cluster node, you just have to concatenate them and put them in a form understandable by your log analysis tools. Simply cat-ing them into a big file won't cut it [arf], because new records will appear in different regions of the file instead of appending chronologically to its tail. The easiest solution in that case is to perform a sort on these logs. Although I am aware of the vague possibility of sorting on the Apache datetime field, even taking the locale into account, I confess my profound inability to find the right combination of parameters. Instead, I chose to add a custom field to the Apache log, using the following log format:


LogFormat "%h %V %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Cookie}i\" %c %T \"%{%Y%m%d%H%M%S}t\"" combined

where %{%Y%m%d%H%M%S}t is a standard projection of the current datetime onto an easily sortable integer, for instance 20050925120000 – the equivalent of 25 Sep 2005 12:00:00. Now, considering the double quote as a separator in the Apache log format, it is easy to sort on this custom field [the 10th]:

sort -b -t '"' -T /opt -k 10 /logpath/access?.log > /logpath/full.log

And there you are, having this nice huge log file to munch on. On a standard P4 with 1GB of RAM it takes less than a minute to obtain a 2GB log file…
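For the curious, the same field-10 sort can also be expressed as a small Python sketch (it loads everything into memory, so GNU sort remains the better tool for logs this size; the paths are the same placeholders as above):

#!/usr/bin/python
# Merge per-node Apache logs and sort them by the custom sortable timestamp,
# i.e. the 10th double-quote-delimited field of the LogFormat above.
import glob

def sort_key(line):
    # Field 10 when splitting on '"' is index 9 after the split.
    return line.split('"')[9]

lines = []
for path in glob.glob('/logpath/access?.log'):
    lines.extend(open(path))

lines.sort(key=sort_key)
open('/logpath/full.log', 'w').writelines(lines)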

In case the web traffic is really big and the log analysis process impacts the existing web activity, use a separate machine instead of overloading one of the cluster nodes. For automated transfer of the log files, generate ssh keys on all the cluster nodes for passwordless login from the web analytics server into the web logfiles' owner account. Traffic between these machines is minimized by installing rsync on them and then using rsync via ssh:

rsync -e ssh -Cavz www-data@node1:/var/log/apache/access.log /logpath/access1.log

Now you know all the steps required to fully automate the log aggregation and its processing. One may ask why all the fuss, when in fact a simple subscription to an ASP-style web analytics provider should suffice. Yes, it's true, however… The cluster that I've recently configured with this procedure has a few million hits per week. Yes, we're talking about page hits. At this level of traffic, the cost of a web analytics service starts at $10,000/year. That's certainly a nice amount of money, which will allow you to afford your own analytics tool [such as, for instance, Urchin v5] and keep some cash from the first year. Some might say that these kinds of commercial tools have their own load balancer analysis techniques. Sure, but it all comes with a cost. In the case of Urchin, you just saved $695/node and some bragging rights with your mates. Relax and enjoy.

PS: Yes, we're talking about a LAMP solution serving millions of page hits, not J2EE… Maybe I'll get into details on another occasion, assuming that somebody is interested. Leave a comment, send a mail or something.

Written by Adrian

September 25th, 2005 at 8:20 pm

Posted in Tools

JCS: the good, the bad and the undocumented

Java Caching System is one of the mainstream open-source and free Java caches*, along with OSCache, EHCache and JBossCache. Choosing JCS could be the subject of an article by itself, since this API has a vastly undeserved reputation of being a buggy, slow cache. This very reputation motivated the development of EHCache, which is in fact a fork of JCS. But JCS has evolved a lot lately and is now a perfectly valid alternative for production; it still has a few occasional bugs, but nothing really bothersome. I've recently had the interesting experience of cleaning up and tuning a website powered by JCS. This dynamic Java-based site is exposed to a healthy traffic of 0.6-1.2 million hits/day, with 12,000-25,000 unique visitors daily, and caching has greatly improved its performance. This article is a collection of tips and best practices not (yet?) included in the official JCS documentation.

Where to download JCS

This is usually the first question when one wants to use JCS. Since JCS is not a 'real' Jakarta project but a module of the Turbine framework, there is no download link available on the main site. If you search on Google, this question has popped up many times on different mailing lists and blogs, and it usually gets two kinds of answers, both IMHO wrong:

  • download the source of Turbine and you'll find JCS in the dependencies. No, you won't, because Turbine is built with Maven, which is supposed to automagically download all the needed dependencies and bring them to you on a silver plate. Meaning: tons of useless jars hidden somewhere in the murky depths of wherever Maven thinks is a nice install location. Uhh.
  • build it from scratch. Another piece of sadistic advice, given that JCS is also built with Maven. So you'll not only need to check out the sources from CVS, but also install Maven. Then try to build JCS. And eventually give up. In my case, for instance, I installed the monster^H^H^H^H^H^H wonderful build tool, then ran 'maven jar'. Instead of the expected result [you know, building the jar!], Maven performed a series of operations like running unit tests, washing teeth, cooking a turkey. Well, I suppose it was doing this, because I couldn't read the huge gobs of text scrolling quickly across the screen. At the end, it failed miserably, with no logical explanation (too many explanations being the modern equivalent of unexplained). So I gave up. Again.

Fortunately, some kind souls at Jakarta (think of these developers as a sort of secret congregation) provide clandestine, up-to-date 'demavenized' binary builds in obscure places; for JCS, the location is here. I used the latest 1.1 build without problems for a few weeks and I strongly recommend it.

Using the auxiliary disk cache

There's a common misconception that one doesn't need no stinkin' disk cache. Even on the Hibernate site, the example JCS configuration has the auxiliary disk cache commented out. Maybe this comes from the fact that the JCS disk cache used to suffer from a memory leak (not true any more), or from the simplistic reasoning that disk access is inherently slower than memory access. Well, it surely is, but at the same time it's probably much faster than some of the database queries which could benefit from caching.

Also, it is interesting to note that incorrectly dimensioned 'memory' caches will make the Java process overflow from main memory to the swap disk. So you'll use the disk anyway, only in an unoptimized manner!

I wouldn't advise you to activate the auxiliary disk cache without limiting its size; otherwise, the cache file will grow indefinitely. Cache size is controlled by two parameters (MaxKeySize and OptimizeAtRemoveCount), for example:

jcs.auxiliary.DC.attributes.MaxKeySize=2500
jcs.auxiliary.DC.attributes.OptimizeAtRemoveCount=2500

MaxKeySize alone is not enough, since it only limits the number of keys pointing to values in the disk cache. In fact, removing a value from the disk cache only removes its key. The second parameter (OptimizeAtRemoveCount) tells the cache to recreate a new file after a certain number of 'removes'. This new cache file keeps only the cached values corresponding to the remaining keys, thus cleaning out all obsolete values, and of course replaces the old cache file. The size of the disk cache and the remove count are of course subject to tuning in your own environment.

Tuning the memory shrinker

Although one of the JCS authors specifies that the shrinker “is rarely necessary”, it might come in handy, especially in memory-constrained environments or for really big caches. One caveat: be careful to specify the MaxSpoolPerRun parameter (undocumented yet, but discussed on the mailing list), otherwise the shrinking process might lead to spikes in CPU usage. I am using the shrinker like this:

jcs.default.cacheattributes.UseMemoryShrinker=true
jcs.default.cacheattributes.ShrinkerIntervalSeconds=3600
jcs.default.cacheattributes.MaxSpoolPerRun=300

YMMV.

Cache control via servlet

Again, undocumented, but people seem to know about it. The servlet class is org.apache.jcs.admin.servlet.JCSAdminServlet, but do not expect it to work out of the box! This servlet uses Velocity, thus you'll need to:

  • initialize Velocity before trying to access the servlet (or lazily, but then you'll have to modify the servlet source)
  • copy the templates into the Velocity template location. The templates (JCSAdminServletDefault.vm and JCSAdminServletRegionDetail.vm) are not (bug? feature?) in the jar, so you'll have to retrieve them from the CVS repository. For the moment, they are at this location.

These are my findings. I would have really appreciated having these few pieces of info before starting the cache tuning. If anybody thinks this article is useful and/or needs to be completed, write a comment, send an email, wave hands. I'll try to come up with more details.

*For a complete list, see the corresponding section at Java-source.

Written by Adrian

February 17th, 2005 at 9:30 pm

Posted in Tools

HTTP compression filter on servlets: good idea, wrong layer

The Servlet 2.3 specification introduced the notion of servlet filters – powerful tools, but unfortunately used in quite unimaginative ways. Take for instance this ONJava article (“Two Servlet Filters Every Web Application Should Have”) written by Jayson Falkner*, one of the coauthors of Servlets and JavaServer Pages: the J2EE Web Tier (a well-known servlets and JSP book from O’Reilly). The article has loads of trackbacks and became so popular that the filters eventually got published on JavaPerformanceTuning, along with an (otherwise very sensible and pragmatic) interview of the author. However, there is a more efficient way of performing these tasks, as indiscriminate page compression and simple time-based caching do not necessarily belong in the servlet container**. As one of the comments (on ONJava) put it: ‘good idea, wrong layer!’. Let’s see why…

There is a simple way to compress pages from any kind of site (be it Java, PHP, or Ruby on Rails), natively, in the Apache web server. The trick consists of chaining two Apache modules: mod_proxy and mod_gzip. Via mod_proxy, it becomes possible to configure a certain path on one of your virtual hosts to proxy all requests to the servlet container; then you may selectively compress pages using mod_gzip.

Suppose the two modules are compiled and loaded in the configuration, and your servlet application is located at http://local_address:8080/b2b; you want to make it visible at http://external_address/b2b. To activate the proxy, add the following two lines:

ProxyPass /b2b/ http://local_address:8080/b2b/
ProxyPassReverse /b2b/ http://local_address:8080/b2b/

You can add as many directives as you like, proxying all the servlets on the server (for instance, one of the configurations I’ve looked at has a special servlet for dynamic image generation and one for dynamic PDF document generation – their output will not be compressed, but they all had to be proxied). Time-based caching is also possible with mod_proxy, but that subject deserves a little article by itself. For the moment, we’ll stick to simple transparent proxying and compression.

Congratulations – just restart Apache and you have a running proxy. Mod_gzip is a little bit trickier. I’ve slightly adapted the configuration from the article Getting mod_gzip to compress Zope pages proxied by Apache (I haven’t been able to find anything better concerning integration with Java servlet containers) and here’s the result:

#module settings
mod_gzip_on Yes
mod_gzip_can_negotiate Yes
mod_gzip_send_vary Yes
mod_gzip_dechunk Yes
mod_gzip_add_header_count Yes
mod_gzip_minimum_file_size 512
mod_gzip_maximum_file_size	5000000
mod_gzip_maximum_inmem_size	100000
mod_gzip_temp_dir /tmp
mod_gzip_keep_workfiles No
mod_gzip_update_static No
mod_gzip_static_suffix .gz
#includes
mod_gzip_item_include mime ^text/*$
mod_gzip_item_include mime httpd/unix-directory
mod_gzip_item_include handler proxy-server
mod_gzip_item_include handler cgi-script
#excludes
mod_gzip_item_exclude reqheader  "User-agent: Mozilla/4.0[678]"
mod_gzip_item_exclude mime ^image/*$
#log settings
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" mod_gzip: %{mod_gzip_result}n In:%{mod_gzip_input_size}n Out:%{mod_gzip_output_size}n:%{mod_gzip_compression_ratio}npct." mod_gzip_info
CustomLog /var/log/apache/mod_gzip.log mod_gzip_info

A short explanation. The module is activated and allowed to negotiate (see if a static or cached file was already compressed and reuse it). The Vary header is needed for client-side caches to work correctly; dechunking eliminates the ‘Transfer-Encoding: chunked’ HTTP header and joins the page into one big packet before compressing. The header length is added for traffic measuring purposes (we’ll see the ‘right’ figures in the log). The minimum size of a file to be compressed is 512 bytes; setting a maximum is also a good idea, because a) compressing a huge file will bog down your server and b) the limit guards against infinite loops. The maximum file size to compress in memory is 100KB in my setup, but you should tune this value for optimum performance. The temporary directory is /tmp, and workfiles should be kept only if you need to debug mod_gzip. Which you don’t.

We’ll include in the files to be gzipped everything that’s of text type, directory listings and … the magic line is the one specifying that everything coming from the proxy server is eligible for compression: this is what ensures that your generated pages get compressed. And while you’re at it, why not add the CGI scripts…

The includes specified here are quite generous, so let’s now filter some of them: we’ll exclude all the images, because they SHOULD already be compressed and optimized for the web. And last but not least, we’ll decide the format of the log line to be added and the location of the compression log – it will allow us to see whether the filter is effectively running and to compute how much bandwidth we have saved.

A compelling reason to use mod_gzip is its maturity. Albeit complex, this Apache module is stable and relatively bug-free, which can hardly be said about the various compression filters found on the web. The original code from the O’Reilly article behaved incorrectly under certain circumstances (corrected later on the book’s site – I’ve tested the updated code and it works fine). I also had some issues with Amy Roh’s filter (from Sun). Amy’s compression filter can be found in a lot of places on the web (JavaWorld, Sun), but unfortunately does not set the correct ‘Content-Length’ header, thus disturbing httpunit, which in turn ‘turned 100% red’ my web test suite as soon as the compression filter was on. Argh.

For the final word, let’s compare the performance of the two solutions (servlet filter against mod_proxy+mod_gzip). I used a single machine to install both Apache and the servlet container (Jetty), plus Amy Roh’s compression filter. A mildly complex navigation scenario was recorded in TestMaker (a cool free testing tool written in Java), then played back a certain number of times (100, to be specific). The results are expressed in TPS (transactions per second): the bigger, the better. The following median values were obtained: 3.10 TPS with a direct connection to the servlet container, 2.64 TPS via the compression filter and 2.81 TPS via Apache mod_proxy+mod_gzip. That means roughly a 5% performance gap between the Apache and the filter solutions. Of course the figure is highly dependent on my test setup, the specific webapp and a lot of other parameters, however I am confident that Apache is superior in any configuration. You also have to consider that using a proxy brings some nice bonuses. For instance, Apache HTTPS virtual sites may encrypt your content in a transparent manner. Apache has very good and fast logging, so you could completely disable HTTP request logging in your servlet container. Moreover, the Apache log format is understood by a myriad of traffic analyzer tools. Load balancing is possible using mod_proxy and another remarkably useful Apache module, mod_rewrite. And as Apache runs in a completely different process, you might expect slightly better scalability on multiprocessor boxes.

Nota bene: in all the articles I’ve read on the subject of compression, there is this strange statement that compression cannot be detected client-side. Of course it can… Suppose you use Firefox (which you should, if you’re serious about web browsing!) with the Web Developer plugin (which you should, if you’re serious about web development!). As depicted in the figure, the plugin lets you “View Response Headers” (in the “Information” menu): the presence or absence of Content-Encoding: gzip is what you’re looking for. Voila! Just for kicks, look at the response headers of a few well-known sites and prepare to be surprised (try Microsoft, for instance, or Slashdot for some funny random quotes).
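If you’d rather check from a script than from the browser, here is a quick Python sketch using only the standard library (the URL is just a placeholder – point it at your own site):

#!/usr/bin/python
# Client-side compression check: send Accept-Encoding: gzip and inspect the
# Content-Encoding header of the response.
import httplib

conn = httplib.HTTPConnection('www.example.com')
conn.request('GET', '/', headers={'Accept-Encoding': 'gzip',
                                  'User-Agent': 'compression-check'})
response = conn.getresponse()
print('%s %s' % (response.status,
                 response.getheader('Content-Encoding', 'not compressed')))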

* Jayson Falkner has also authored this article (“Another Java Servlet Filter Most Web Applications Should Have”), which explains how to control the client-side cache via HTTP response headers. While the example is very simple, one can easily extend it to do more complex stuff, such as caching according to rules (for instance, caching dynamically generated documents or images according to the context). This _is_ a pragmatic example of a servlet filter.

** Unless of course – as one of the commenters explains here – you have some specific constraints that prevent you from using Apache, such as: an embedded environment, being forced to use a web server other than Apache (alternative solutions might exist for those servers, but I am not aware of them), mod_gzip being unavailable on the target platform, etc.

Written by Adrian

February 2nd, 2005 at 8:28 am

Posted in Tools
