Netuality

Taming the big bad websites

Archive for February, 2005

JCS: the good, the bad and the undocumented

one comment

Java Caching System is one of the mainstream open-source and free Java caches*, along with OSCache, EHCache and JBossCache. Choosing JCS may be the subject of an article by itself, since this API has a vastly undeserved reputation of being a buggy, slow cache. Exactly this reputation motivated the development of EHCache, which is in fact a fork of JCS. But JCS has evolved a lot lately and is now a perfectly valid alternative for production; it still has a few occasional bugs, but nothing really bothersome. I've recently had the interesting experience of cleaning up and tuning a website powered by JCS. This dynamic Java-based site is exposed to a healthy traffic of 0.6-1.2 Mhits/day, with 12,000-25,000 unique visitors daily, and caching has greatly improved its performance. This article is a collection of tips and best practices not included (yet?) in the official JCS documentation.

Where to download JCS

This is usually the first question when one wants to use JCS. Since JCS is not a 'real' Jakarta project but a module of the Turbine framework, there is no download link available on the main site. If you search on Google, you'll see this question has popped up many times on different mailing lists and blogs, and it usually gets two kinds of answers, both IMHO wrong:

  • download the source of Turbine and you'll find JCS in the dependencies. No, you won't, because Turbine is built with Maven, which is supposed to automagically download all the needed dependencies and bring them to you on a silver platter. Meaning: tons of useless jars hidden somewhere in the murky depths of wherever Maven thinks is a nice install location. Uhh.
  • build it from scratch. Another piece of sadistic advice, given that JCS is also built with Maven. So you'll not only need to check out the sources from CVS, but also install Maven. Then try to build JCS. And eventually give up. For instance, in my case I installed the monster^H^H^H^H^H^H wonderful build tool, then ran 'maven jar'. Instead of the expected result [you know, building the jar!], Maven performed a series of operations like running unit tests, brushing teeth, cooking a turkey. Well, I suppose that's what it was doing, because I couldn't read the huge gobs of text scrolling quickly down the screen. At the end, it failed miserably, with no logical explanation (too many explanations being the modern equivalent of unexplained). So I gave up. Again.

Fortunately, some kind souls at Jakarta (think of these developers as a sort of secret congregation) provide the latest clandestine 'demavenized' binary builds in obscure places; for JCS, the location is here. I used the latest 1.1 build without problems for a few weeks and I strongly recommend it.

Using the auxiliary disk cache

There's a common misconception that one doesn't need no stinkin' disk cache. Even on the Hibernate site, the example JCS configuration has the auxiliary disk cache commented out. Maybe this comes from the fact that the JCS disk cache used to suffer from a memory leak (not true any more), or from the simplistic reasoning that disk access is inherently slower than memory access. Well, it surely is, but at the same time it's probably much faster than some of the database queries which could benefit from caching.

Also, it is interesting to note that an incorrectly dimensioned 'memory' cache will make the Java process overflow from main memory to swap. So you'll use the disk anyway, only in an unoptimized manner!

I wouldn't advise activating the auxiliary disk cache without limiting its size; otherwise the cache file will grow indefinitely. Cache size is controlled by two parameters, MaxKeySize and OptimizeAtRemoveCount. For example:

jcs.auxiliary.DC.attributes.MaxKeySize=2500
jcs.auxiliary.DC.attributes.OptimizeAtRemoveCount=2500

MaxKeySize alone is not enough, since it only limits the number of keys pointing to values in the disk cache. In fact, removing a value from the disk cache only removes its key. The second parameter (OptimizeAtRemoveCount) tells the cache to recreate a new file after a certain number of 'removes'. This new cache file keeps only the cached values corresponding to the remaining keys, thus cleaning out all obsolete values, and of course replaces the old cache file. The size of the disk cache and the remove count are of course subject to tuning in your own environment.
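
To put the configuration in context, here is a minimal sketch of how application code would talk to a JCS region backed by this setup. The region name, key and value are made up for the example; it assumes a cache.ccf on the classpath where the DC auxiliary above is attached to the region.

import org.apache.jcs.JCS;
import org.apache.jcs.access.exception.CacheException;

public class ProductCacheDemo
{
	public static void main(String[] args) throws CacheException
	{
		//'products' is a hypothetical region; it inherits the defaults from cache.ccf,
		//including the DC disk auxiliary configured above
		JCS cache = JCS.getInstance("products");

		String key = "product-42";
		Object value = cache.get(key);
		if (value == null)
		{
			//cache miss: this is where the expensive database query would normally go
			value = "description of product 42";
			cache.put(key, value);
		}
		System.out.println(value);
	}
}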

Tuning the memory shrinker

Although one of the JCS authors says the shrinker “is rarely necessary”, it can come in handy, especially in memory-constrained environments or for really big caches. With one caveat: be careful to specify the MaxSpoolPerRun parameter (undocumented yet, but discussed on the mailing list), otherwise the shrinking process might lead to spikes in CPU usage. I am using the shrinker like this:

jcs.default.cacheattributes.UseMemoryShrinker=true
jcs.default.cacheattributes.ShrinkerIntervalSeconds=3600
jcs.default.cacheattributes.MaxSpoolPerRun=300

YMMV.

Cache control via servlet

Again undocumented, but people seem to know about it. The servlet class is org.apache.jcs.admin.servlet.JCSAdminServlet, but do not expect it to work out of the box! This servlet uses Velocity, so you'll need to:

  • initialize Velocity before trying to access the servlet (or do it lazily, but then you'll have to modify the servlet source) – a minimal init sketch follows this list
  • copy the templates into the Velocity template location. The templates (JCSAdminServletDefault.vm and JCSAdminServletRegionDetail.vm) are not in the jar (bug? feature?), so you'll have to retrieve them from the CVS repository. For the moment, they are at this location.
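
For reference, here is one way to initialize the Velocity singleton at webapp startup, using a Servlet 2.3 context listener. This is only a sketch under assumptions not in the original post: the template directory is hypothetical, and how your particular JCS/Velocity versions expect to be wired together may differ.

import java.util.Properties;

import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;

import org.apache.velocity.app.Velocity;

public class VelocityInitListener implements ServletContextListener
{
	public void contextInitialized(ServletContextEvent event)
	{
		try
		{
			//assumption: the two .vm templates were copied into this directory
			Properties props = new Properties();
			props.setProperty("file.resource.loader.path",
				event.getServletContext().getRealPath("/WEB-INF/templates"));
			//initialize the Velocity singleton before the admin servlet is first hit
			Velocity.init(props);
		}
		catch (Exception e)
		{
			throw new RuntimeException("Velocity initialization failed: " + e.getMessage());
		}
	}

	public void contextDestroyed(ServletContextEvent event)
	{
		//nothing to clean up
	}
}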

These are my findings. I would have really appreciated having these few pieces of info before starting the cache tuning. If anybody thinks this article is useful and/or needs to be completed, write a comment, send an email, wave hands. I'll try to come up with more details.

*For a complete list, see the corresponding section at Java-source.

Written by Adrian

February 17th, 2005 at 9:30 pm

Posted in Tools


HTTP compression filter on servlets : good idea, wrong layer

3 comments

The Servlet 2.3 specification introduced the notion of servlet filters: powerful tools, unfortunately often used in quite unimaginative ways. Let’s take for instance this ONJava article (“Two Servlet Filters Every Web Application Should Have”), written by Jayson Falkner*, one of the coauthors of Servlets and JavaServer Pages: the J2EE Web Tier (a well-known servlets and JSP book). The article has loads of trackbacks and became so popular that the filters eventually got published on JavaPerformanceTuning, along with an (otherwise very sensible and pragmatic) interview of the author. However, there is a more efficient way of performing these tasks, as indiscriminate page compression and simple time-based caching do not necessarily belong in the servlet container**. As one of the comments (on ONJava) put it: ‘good idea, wrong layer!’. Let’s see why…

There is a simple way to compress pages from any kind of site (be it Java, PHP, or Ruby on Rails) natively, in the Apache web server. The trick consists in chaining two Apache modules: mod_proxy and mod_gzip. Via mod_proxy, it becomes possible to configure a certain path on one of your virtual hosts to proxy all requests to the servlet container; then you may selectively compress pages using mod_gzip.

Suppose the two modules are compiled and loaded in the configuration, and your servlet is located at http://local_address:8080/b2b; you want to make it visible at http://external_address/b2b. To activate the proxy, add the following two lines:

ProxyPass /b2b/ http://local_address:8080/b2b/
ProxyPassReverse /b2b/ http://local_address:8080/b2b/

You can add as many directives as you like, proxying all the servlets for the server (for instance, one of the configurations I’ve looked at had a special servlet for dynamic image generation and one for dynamic PDF document generation – their output will not be compressed, but they all had to be proxied). Time-based caching is also possible with mod_proxy, but that subject deserves a little article by itself. For the moment, we’ll stick to simple transparent proxying and compression.

Congratulations: just restart Apache and you have a running proxy. Mod_gzip is a little trickier. I’ve slightly adapted the configuration from the article Getting mod_gzip to compress Zope pages proxied by Apache (I haven’t been able to find anything better concerning integration with Java servlet containers), and here’s the result:

#module settings
mod_gzip_on Yes
mod_gzip_can_negotiate Yes
mod_gzip_send_vary Yes
mod_gzip_dechunk Yes
mod_gzip_add_header_count Yes
mod_gzip_minimum_file_size 512
mod_gzip_maximum_file_size	5000000
mod_gzip_maximum_inmem_size	100000
mod_gzip_temp_dir /tmp
mod_gzip_keep_workfiles No
mod_gzip_update_static No
mod_gzip_static_suffix .gz
#includes
mod_gzip_item_include mime ^text/*$
mod_gzip_item_include mime httpd/unix-directory
mod_gzip_item_include handler proxy-server
mod_gzip_item_include handler cgi-script
#excludes
mod_gzip_item_exclude reqheader  "User-agent: Mozilla/4.0[678]"
mod_gzip_item_exclude mime ^image/*$
#log settings
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" mod_gzip: %{mod_gzip_result}n In:%{mod_gzip_input_size}n Out:%{mod_gzip_output_size}n:%{mod_gzip_compression_ratio}npct." mod_gzip_info
CustomLog /var/log/apache/mod_gzip.log mod_gzip_info

A short explanation. The module is activated and allowed to negotiate (i.e. see if a static or cached file was already compressed, and reuse it). The Vary header is needed for client-side caches to work correctly; dechunking eliminates the ‘Transfer-Encoding: chunked’ HTTP header and joins the page into one big packet before compressing. The header byte count is added to the traffic figures (so we’ll see the ‘right’ numbers in the log). The minimum size of a file to be compressed is 512 bytes; setting a maximum is also a good idea, because a) compressing a huge file will bog down your server and b) the limit guards against infinite loops. The maximum file size to compress in memory is 100KB in my setting, but you should tune this value for optimum performance. The temporary directory is /tmp, and workfiles should be kept only if you need to debug mod_gzip. Which you don’t.

We’ll include in the files to be gzipped everything of text type and directory listings, and … the magic line is the one specifying that everything coming from the proxy server is eligible for compression: this ensures the compression of your generated pages. And while you’re at it, why not add the CGI scripts…

The includes specified here are quite generous, so let’s now filter some of it: we’ll exclude all the images, because they SHOULD already be compressed and optimized for the web. And last but not least, we decide the format of the log line and the location of the compression log – it will allow us to see whether the filter is effectively running and to compute how much bandwidth we have saved.

A compelling reason to use mod_gzip is its maturity. Albeit complex, this Apache module is stable and relatively bug-free, which can hardly be said about the various compression filters found on the web. The original code from the O’Reilly article behaved incorrectly under certain circumstances (it was corrected later on the book’s site; I’ve tested the new code and it works fine). I also had some issues with Amy Roh’s filter (from Sun). Amy’s compression filter can be found in a lot of places on the web (JavaWorld, Sun), but unfortunately it does not set the correct ‘Content-Length’ header, which disturbs httpunit, which in turn turned my web test suite ‘100% red’ as soon as the compression filter was on. Argh.

For the final word, let’s compare the performance of the two solutions (servlet filter against mod_proxy+mod_gzip). I used a single machine to install both Apache and the servlet container (Jetty), plus Amy Roh’s compression filter. A mildly complex navigation scenario was recorded in TestMaker (a cool free testing tool written in Java), then played a certain number of times (100, to be more specific). The results are expressed in TPS (transactions per second): the bigger, the better. The following median values were obtained: 3.10 TPS with a direct connection to the servlet container, 2.64 TPS via the compression filter and 2.81 TPS via Apache mod_proxy+mod_gzip. That means roughly a 6% performance hit for the filter solution compared to the Apache one. Of course the figure is highly dependent on my test setup, the specific webapp and a lot of other parameters; however, I am confident that Apache is superior in any configuration. You also have to consider that using a proxy has some nice bonuses. For instance, Apache HTTPS virtual sites may encrypt your content in a transparent manner. Apache has very good and fast logging, so it’d be cool to completely disable HTTP request logging in your servlet container. Moreover, the Apache log format is understood by a myriad of traffic analyzer tools. Load balancing is possible using mod_proxy and another remarkably useful Apache module, mod_rewrite. And as Apache runs in a completely different process, you might expect slightly better scalability on multiprocessor boxes.

Nota bene: in all the articles I’ve read on the subject of compression, there is this strange statement that compression cannot be detected client-side. Of course you can do it… Suppose you use Firefox (which you should, if you’re serious about web browsing!) with the Web Developer plugin (which you should, if you’re serious about web development!). The plugin lets you “View Response Headers” (in the “Information” menu): the presence or absence of Content-Encoding: gzip is what you’re looking for. Voila! Just for kicks, look at the response headers on a few well-known sites, and prepare to be surprised (try Microsoft, for instance, or Slashdot for some funny random quotes).
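
If you prefer checking from code instead of the browser, the same information is available through the standard java.net API. A minimal sketch (the URL is a placeholder):

import java.net.HttpURLConnection;
import java.net.URL;

public class GzipCheck
{
	public static void main(String[] args) throws Exception
	{
		//placeholder URL: point it at the page you want to inspect
		URL url = new URL("http://external_address/b2b/");
		HttpURLConnection conn = (HttpURLConnection) url.openConnection();
		//advertise gzip support, exactly as a browser would
		conn.setRequestProperty("Accept-Encoding", "gzip");
		conn.connect();
		//a 'gzip' value here means the server compressed the response
		System.out.println("Content-Encoding: " + conn.getHeaderField("Content-Encoding"));
		conn.disconnect();
	}
}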

* Jayson Falkner has also authored this article (“Another Java Servlet Filter Most Web Applications Should Have”), which explains how to control the client-side cache via HTTP response headers. While the example is very simple, one can easily extend it to do more complex stuff, such as caching according to rules (for instance, caching dynamically generated documents or images according to the context). This _is_ a pragmatic example of a servlet filter.
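
For the record, the core of that technique fits in a few lines. The sketch below is not Falkner’s filter, just a minimal illustration against the Servlet 2.3 API; the filter name and the one-hour max-age are arbitrary choices:

import java.io.IOException;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;

public class ClientCacheFilter implements Filter
{
	public void init(FilterConfig config) throws ServletException
	{
		//nothing to configure in this sketch
	}

	public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
		throws IOException, ServletException
	{
		//tell the browser (and any intermediate proxy) it may keep the page for one hour
		HttpServletResponse httpResponse = (HttpServletResponse) response;
		httpResponse.setHeader("Cache-Control", "max-age=3600");
		chain.doFilter(request, response);
	}

	public void destroy()
	{
		//nothing to clean up
	}
}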

** Unless of course – as one of the commenters explains here – you have some specific constraints preventing you from using Apache, such as: an embedded environment, being forced to use a web server other than Apache (alternative solutions might exist for those servers, but I am not aware of them), mod_gzip being unavailable on the target platform, etc.

Written by Adrian

February 2nd, 2005 at 8:28 am

Posted in Tools


Using HTTPUnit ‘out of the box’

leave a comment

Recently, the HTTPUnit project reached version 1.6. While this nifty API is mainly targeted at unit testing webapps, I have also successfully used it for other purposes, such as:

HTTPUnit as a benchmarking tool

There is a plethora of web benchmarking tools out there, both freeware and commercial. However, my customer requested some testing features that I had trouble satisfying simultaneously with the existing tools:

  • the tests must run on headless systems (console mode, non GUI)
  • load testing should simulate complex and realistic user interactions with the site

AFAIK, all the testing tools that allow recording and replaying of intricate web interaction scenarios are GUI-based. And the command-line tools are also unfit for the job; take for instance Apache JMeter, which is basically a command-line tool with a Swing GUI slapped on it. While JMeter is great when it comes to bombing a server (or a cluster, for that matter) with requests, it seriously lacks features when it comes to scripting complex interactions (you'd better know your URLs and cookies by heart … well, almost).

Another problem I see with existing automated testing solutions is their error detection mechanism. The vast majority of tools scan for standard HTTP error codes such as 404 or 500 in order to find out whether a response is erroneous, but errors in complex Java apps might come as plain web pages containing stack traces and environment information (a good example is the error page in Apache Tapestry).
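
With HTTPUnit, this kind of content-based check is a one-liner inside the test. A minimal sketch, where the URL and the 'Exception' marker are just examples (use whatever fingerprint your error pages really have):

import junit.framework.TestCase;

import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebResponse;

public class ErrorPageDetectionTest extends TestCase
{
	public void testNoStackTraceOnHomePage() throws Exception
	{
		WebConversation wc = new WebConversation();
		//placeholder URL: the page under test
		WebResponse resp = wc.getResponse("http://local_address:8080/b2b/");
		//the HTTP status code alone is not enough...
		assertEquals(200, resp.getResponseCode());
		//...so also make sure no stack trace leaked into the page body
		assertEquals("page contains a stack trace", -1, resp.getText().indexOf("Exception"));
	}
}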

So eventually I had to come up with an ad-hoc solution – the basic idea was to leverage the existing HTTP unit tests for benchmarking purposes. I had to get out of the toolbox another rather under-rated open-source gem: JUnitPerf, which is in fact a couple of decorators for JUnit tests. LoadTest is the decorator I'm interested in: it allows running a test in multithreaded mode, simulating any number of concurrent users. Thus, I am able to reproduce heavy loads of complex user interaction and precisely count the number of errors. The snippet of code looks something like this:

SimpleWebTestPerf swtp = new SimpleWebTestPerf("testOfMyWebapp");
Test loadTest = new LoadTest(swtp, nbConcurrentThreads);
TestResult tr = TestRunner.run(loadTest);
int nbErr = tr.errorCount();

Now, we'll call this code with increasing values of nbConcurrentThreads and see where the errors start to appear. We might as well write the results to a log file and even create a nice PNG via JFreeChart. Alas, things become a little trickier when we want to measure bandwidth; in our case we'll have to write something rather lame in the TestCase, and it goes like this:

private long bytes = 0;

private synchronized void addBytes(long b)
{
	bytes += b;
}

/**
 * After a test was run, returns the volume of HTTP data received.
 * @return the total number of bytes accumulated via addBytes()
 */
public long getBytes()
{
	return bytes;
}

public void testProdsPerformance() throws MalformedURLException,
  IOException, SAXException
{
	[...]
	WebConversation wc = new WebConversation();
	WebResponse resp = wc.getResponse(something);
	addBytes(resp.getContentLength());
	[...]
}

Then, in the benchmarking code, we'll call swtp.getBytes() in order to find out how many bytes passed between the server and the test client. It is still unclear to me whether this value is correct when mod_gzip is activated on the server (it is not obvious whether we measure the compressed bytes that actually went over the wire or the inflated page size!?).
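
Putting the pieces together, the benchmarking driver is little more than a loop over increasing user counts. The sketch below reuses the names from the snippets above (SimpleWebTestPerf and its getBytes() method are the hypothetical test case discussed earlier; the thread counts are arbitrary):

import junit.framework.Test;
import junit.framework.TestResult;
import junit.textui.TestRunner;

import com.clarkware.junitperf.LoadTest;

public class BenchmarkDriver
{
	public static void main(String[] args)
	{
		//ramp up the number of simulated users and watch where the errors start
		int[] userCounts = {10, 20, 50, 100};
		for (int i = 0; i < userCounts.length; i++)
		{
			SimpleWebTestPerf swtp = new SimpleWebTestPerf("testProdsPerformance");
			Test loadTest = new LoadTest(swtp, userCounts[i]);
			TestResult tr = TestRunner.run(loadTest);
			//errorCount() gives the failed transactions, getBytes() the traffic volume
			System.out.println(userCounts[i] + " users: " + tr.errorCount()
					+ " errors, " + swtp.getBytes() + " bytes transferred");
		}
	}
}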

In order to measure the elapsed time, we'll do a similar (lame) trick with a private long time member and a private synchronized void addTime(long millis). Unfortunately, we do not [yet?] have a getElapsedTime() on WebResponse, so we'll have to use the good old System.currentTimeMillis() before and after the extraction of each WebResponse. Of course, this also measures the parsing time of the WebResponse, but that isn't usually a problem when you are testing a large number of concurrent users, since the parsing time is much smaller than the response time of a stressed server. But you'll need a strong machine on the client side for these tests.
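
In code, the timing trick looks roughly like this (same caveats as the byte-counting snippet; the URL is a placeholder):

import java.io.IOException;
import java.net.MalformedURLException;

import junit.framework.TestCase;

import org.xml.sax.SAXException;

import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebResponse;

public class TimedWebTest extends TestCase
{
	private long time = 0;

	private synchronized void addTime(long millis)
	{
		time += millis;
	}

	/**
	 * After a test was run, returns the cumulated response time.
	 * @return total elapsed time in milliseconds, parsing included
	 */
	public long getElapsedTime()
	{
		return time;
	}

	public void testTimedRequest() throws MalformedURLException, IOException, SAXException
	{
		WebConversation wc = new WebConversation();
		long start = System.currentTimeMillis();
		//the measured interval also includes HTTPUnit parsing the response
		WebResponse resp = wc.getResponse("http://local_address:8080/b2b/");
		addTime(System.currentTimeMillis() - start);
		assertNotNull(resp);
	}
}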

Another tip I've found: use Random in order to test different slices of data on different test runs. This way, when you run, let's say, the 20-thread test, you'll hit different data than in the previous 10-thread test. In this manner, the results will be less influenced by the tested application's cache(s). It's perfectly possible to launch LoadTest threads with delays between thread activation, which means that the Random seed can be different within each simulated client – if you're looking for even more realistic behavior.
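
For illustration, the randomization can be as simple as picking a random starting id for the batch of pages each simulated client requests. Everything below is hypothetical (the data set size, the slice size and the URL scheme):

import java.util.Random;

import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebResponse;

public class RandomSliceDemo
{
	public static void main(String[] args) throws Exception
	{
		WebConversation wc = new WebConversation();
		//the seed differs between clients if thread activations are delayed
		Random random = new Random();
		int totalProducts = 5000; //hypothetical size of the data set
		int sliceSize = 20;       //pages requested by one simulated client

		int firstId = random.nextInt(totalProducts - sliceSize);
		for (int id = firstId; id < firstId + sliceSize; id++)
		{
			//hypothetical URL scheme for the product pages
			WebResponse resp = wc.getResponse("http://local_address:8080/b2b/product?id=" + id);
			System.out.println(resp.getURL() + " -> " + resp.getContentLength() + " bytes");
		}
	}
}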

HTTPUnit as a Web integration API

Besides being a great testing tool, HTTPUnit is also a cool API for web manipulation; you can use it to perform data integration with all sorts of websites. For instance, let's log in to the public demo instance of the MantisBT bug tracking system as user 'developer' and extract the descriptions of the first three bugs in the list.

package webtests;

import java.io.IOException;
import java.net.MalformedURLException;

import org.xml.sax.SAXException;

import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebForm;
import com.meterware.httpunit.WebResponse;
import com.meterware.httpunit.WebTable;

/**
 * Simple demo class using httpunit to extract the description of
 * three most recent bugs from the MantisBT public demo,
 * logged as 'developer'.
 * @author Adrian Spinei [email protected]
 * @version $Id: $
 */
public class MantisTest
{

	public static void main(String[] args) throws MalformedURLException, IOException, SAXException
	{
		WebConversation wc = new WebConversation();
		WebResponse resp = wc.getResponse("http://mantisbt.sourceforge.net/mantis/login_page.php");
		WebForm wForm = resp.getFormWithName("login_form");
		wForm.setParameter("username", "developer");
		wForm.setParameter("password", "developer");
		//submit login, connect to the front page
		resp = wForm.submit();
		//'click' on the 'View Bugs' link
		resp = resp.getLinkWith("View Bugs").click();
		//retrieve the table containing the bug list
		//you'll have to believe me on this one, I've counted the tables !
		WebTable webTable = resp.getTables()[3];
		//the first three rows are navigation, header and a blank formatting row,
		//so the interesting data starts at the 4th row (index 3);
		//the bug description is in the last column
		System.out.println(webTable.getCellAsText(3, webTable.getColumnCount() - 1));
		System.out.println(webTable.getCellAsText(4, webTable.getColumnCount() - 1));
		System.out.println(webTable.getCellAsText(5, webTable.getColumnCount() - 1));
	}
}

The code speaks for itself: HTTPUnit is beautiful, intuitive and easy to use.


Written by Adrian

February 2nd, 2005 at 8:21 am

Posted in Tools


Sybase woes – and Jython saves the day

leave a comment

Until now, I've always had a certain respect for the Sybase database. Based on their history, I thought the product was a sort of MS SQL without the glitzy features – think Las Vegas without the lights, the cowboy boots and the Eiffel Tower. Which gives: huge crowds of fat tourists in tall, dull buildings. But you always have Celine Dion, right?*

Wrong! Sybase ASE is a large, enormous, huge piece of steaming … ummm … code. OK, OK, I'm over-reacting a bit. In fact, Sybase is a very good database – if you are still living in the 90s and the only Linux flavour you are able to manage is RedHat ASE. Otherwise, it's a huge … you know.

Let's not talk about the extreme fragility of this … this product. You never know when it will crash on you, without any specific reason. Sometimes it's the bad weather. Crash. Restart. Sometimes you flushed twice. Crash. Restart. And maybe, yes maybe, you spent more than 12 minutes on your lunch break. Crash.

Let's just talk about the JDBC drivers – latest version, downloaded from the site in a temporary moment of insanity. Man, this is cutting edge. The sharpest cutting edge you'll ever find – the most advanced JDBC drivers bar none! Except, of course, for the fact that these classes were compiled on that memorable day of 6 January 2002 [the day the last Sybase JDBC drivers were compiled]. But don't let such obscure details ruin your enthusiasm. Just download and use them and you'll be amazed at their unique features – it's the only JDBC driver that manages to put down DbVisualizer in different and innovative ways. I'm restarting the poor thing (dbvis) at least 2 or 3 times per day when working with Sybase.

Also, as a little quasi-undocumented tip, the letter d in jconn2d.jar does not mean development [drivers], as some of you would be inclined to think. In fact, it means debug, which is the abbreviation for put me in production, poor lousy bastard, and I'll start spewing reams of useless messages through all the logging APIs known to man and a few yet to be discovered, therefore instantly slowing your puny little software down to a crawl.

Ermmm … well, all this nice introduction just to humbly confess that we do have a couple of them Sybase licences floating around here, and some of our production databases are on Sybase ASE. While I can assure you that this is going to change, at least on the web backend, it's also true that the beast must be kept alive in order to keep our company up and running. And that's part of my job.

Which is of course very strange for an IT management job. But then again: when your sole DBA is overwhelmed by a horde of Sybase-specific tasks (like, for instance, the log configured to truncate at a specific threshold which naturally [for Sybase] does not truncate, suddenly throwing all the users into log suspend in the middle of peak production time) – you have to enter the damn kitchen and do some cooking!

My endeavour was to perform some simple data mining in order to migrate some reports from a legacy system [no, you don't really want to know what system]. Given my limited time, starting a mildly complex Java reporting project was out of the question, so a scripting language was the natural choice. Python was the first option – unfortunately, finding a working Sybase driver for Python is a challenge in itself. But thank God for Jython and zxJDBC! In just a few minutes I was wiring the tables for reporting. Here's a simplified code snippet which counts the 'orders' and lists those with amounts < 1000 – you can't go any simpler:

from com.ziclix.python.sql import zxJDBC 

conn = zxJDBC.connect("jdbc:sybase:Tds:myserver.mydomain.com:4110/mydb",
	"user","pass","com.sybase.jdbc2.jdbc.SybDriver")
cursor = conn.cursor() 

cursor.execute("select count(orderid) from orders")
nb_orders=cursor.fetchone()[0]
print '%.0f total orders ...'%(nb_orders,)

cursor.execute("select orderid, amount from orders where amount <1000")
oids=cursor.fetchall()
for o in oids:
	print 'Order %.0f for amount %.2f'%(o[0],o[1])

cursor.close()
conn.close()

Easy as pie. And once you've got one down, you've got 'em all. For instance, yesterday evening I wrote, in about an hour, a small Jython script which exports some data. The same export process (running about an order of magnitude slower) had previously needed a couple of days of development in FoxPro. Ah, FoxPro – a very juicy subject, but I'll keep it for my next horror story. Until then, don't forget: when you have a monster to tame and no time at all, try Jython!

*Let's suppose for a very brief instant that having Celine Dion at a certain location is a positive thing.

Written by Adrian

February 2nd, 2005 at 8:19 am

Posted in Tools
