Netuality

Taming the big, bad, nasty websites

Archive for the ‘j2ee’ tag

Java Persistence with Hibernate – the book, my review

leave a comment

You have to know that I’ve tried. Honestly, I did. I hoped to be able to read each and every page of “Java Persistence with Hibernate” (the revised edition of “Hibernate in Action”), by Christian Bauer and Gavin King. But I gave up before reading a third of it, then went on reading only selected sections. First, because I know Hibernate: I’ve used it in every Java project I’ve been involved with over the last 5 years or so. Second, because the content carried over from the first edition is more than familiar to me. And third, because this second edition is a massive 800+ page book (double the page count of the first edition). At that rate, the fourth edition will be sold together with some sort of transportation device, because a mere human will not be able to carry that amount of paper. How did this happen?

Hibernate is the perfect example of a successful Java open-source project. Initially started as a free alternative to commercial object-relational mapping tools, it quickly became mainstream. Lots of Java developers around the world use Hibernate for the data layer of their projects. It’s very comfortable: just set some attributes or ask for a business object instance, and Hibernate does all the ugly SQL for you. As a developer, you are comfortably protected from that nasty relational database, gently swimming in a sea of nicely bound objects. Right? No, not exactly. Every object-relational mapping tool has its own ways of being handled efficiently, and this is where books like “Java Persistence with Hibernate” come into play. This book teaches you how to work with Hibernate, using a “real-world” example: the CaveatEmptor online auction application.

Since the first edition of the book was written, a lot has happened in the Hibernate world, and you can see its impact in “Java Persistence with Hibernate”. Most important is the release of the 3.x version line, with its various improvements and new features: annotations used as mapping descriptors, package reorganization inside the API, but above all the standardization under the umbrella of JPA (the Java Persistence API) for smooth integration with EJB 3 inside Java EE 5 servers. And this is a little bit funny. Yesterday, Hibernate was the main proof that it is possible to build industrial-quality projects in a “J2EE-less” environment; today, Hibernate has put on a suit and a tie, joined the ranks of JBoss, then Red Hat, and lures unsuspecting Java developers towards the wonderful and (sometimes) expensive world of Java EE 5 application servers. Which is not necessarily a bad move for the Hibernate API, but it’s proof that in order to thrive as an open-source project, you need much more than a SourceForge account and some passion…

Enough with that, let’s take a look at the book’s content. Some 75% of it is in fact the content of the first edition, updated and completed. You learn what object-relational mapping is, the advantages, the quirks, the recommended way of developing with Hibernate. For better understanding, single chapters from the initial book were expanded into two, sometimes more, chapters. The “unit of work” is now called “a conversation”, and you get a whole new chapter (11) about conversations, which is in fact pretty good material on session and transaction management. Christian and Gavin have done some great writing about concurrency and isolation in the relatively small 10th chapter – a must-read even if you’re not interested in Hibernate but want to understand, once and for all, those concurrent transaction behaviors everyone is talking about. The entire 13th chapter is dedicated to fetching strategies and caching, a must-read if you want performance and optimization from your application. There is also a good deal of EJB, JPA and Java EE 5-related material scattered across multiple chapters. And finally, a solid 50-page chapter promotes the JSF (JavaServer Faces) compliant web development framework, JBoss Seam. I have only managed to read a few pages of this final chapter, so I cannot really comment. Note to self: play a little bit with that Seam thing.

To conclude: is this a fun book? No. Is this the perfect book to convert young open-source fanatics to the wonders of the Hibernate API? Nope. Is this a book to read cover to cover over a weekend? Not even close. Then what is it? First, it’s the best book out there about Hibernate (and there are quite a few on the market right now), maybe even the best book about ORM in Java in general. It has lots of references to EJB, JPA and EE, and it will help you easily sell a Hibernate project to management. Even if the final implementation uses Spring… And finally, it’s the best Hibernate reference money can buy. When you have an issue, open the darn index and search: there’s a 90% chance your problem will be solved. And that’s a nice accomplishment. Don’t get this book because it’s funny, or a nice read, or about some shiny new open-source project. Buy it because it helps you grok ORM, write better code, deliver quality projects.

Written by Adrian

December 17th, 2006 at 2:00 pm

Posted in Books


Aggregating webservers logs for an Apache cluster

one comment

One of the ways of scaling a heavy-traffic LAMP web application is to transform the server into a cluster of servers. Some may opt to take the easy path and use an overpriced appliance load balancer, but the most daring [and budget-restrained] will go for free software solutions such as pound or haproxy.

Although excellent performers, these free balancers lack many features of their commercial counterparts. One of the most embarrassing misses is the lack of flexibility in producing decent access logs. Both pound (LogLevel 4) and haproxy (option httplog) can generate Apache-like logs in their logfiles or the syslog, but neither offers the level of customization found in Apache. Basically, you're left with using the logs from the cluster nodes. These logs present a couple of problems:

  • the originating IP is always the internal IP of the balancer
  • there is one log per node, while log analysis tools usually expect a single log file per report

The first problem is relatively easy to solve. Start by activating the X-Forwarded-For header in the balancing software: for instance, configure haproxy with option forwardfor. A relatively unknown Apache module called mod_rpaf takes care of the tedious task of extracting the remote IP from the X-Forwarded-For header and copying it into the remote address field of the Apache logs. For Debian Linux fans, it's nice to note that libapache-mod-rpaf is available via apt.
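For reference, here is a minimal sketch of the relevant bits on both sides; the addresses, node names and ports are placeholders, so adapt them to your own setup:

# haproxy side: insert the X-Forwarded-For header on every request
listen webfarm 0.0.0.0:80
    mode http
    option httplog
    option forwardfor
    balance roundrobin
    server node1 10.0.0.11:80 check
    server node2 10.0.0.12:80 check

# Apache side, on each cluster node: mod_rpaf rewrites the remote address
# using the X-Forwarded-For header set by the balancer (10.0.0.1 here)
RPAFenable On
RPAFsethostname On
RPAFproxy_ips 10.0.0.1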

Now that you have N realistic Apache weblogs, one per cluster node, you just have to concatenate them into a form understandable by your log analysis tools. Simply cat-ing them into one big file won't cut it [arf], because new records will appear in different regions of the file instead of appending chronologically to its tail. The easiest solution is to sort these logs. Although I am aware of the vague possibility of sorting on the Apache datetime field, even taking the locale into account, I confess my profound inability to find the right combination of parameters. Instead, I chose to add a custom field to the Apache log, using the following log format:


LogFormat "%h %V %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Cookie}i\" %c %T \"%{%Y%m%d%H%M%S}t\"" combined

where %{%Y%m%d%H%M%S}t is a standard projection of the request datetime into an easily sortable integer, for instance 20050925120000 – the equivalent of 25 Sep 2005 12:00:00. Now, using the double quote as a field separator in the Apache log format, it is easy to sort on this custom field [the 10th]:

sort -b -t "\"" -T /opt -k 10 /logpath/access?.log > /logpath/full.log

And there you are, with a nice huge log file to munch on. On a standard P4 with 1GB of RAM, it takes less than a minute to produce a 2GB log file…

In case the web traffic is really big and the log analysis process impacts the existing web activity, use a separate machine instead of overloading one of the cluster nodes. For automated transfer of log files, generate ssh keys on all the cluster nodes for passwordless login from the web analytics server into the web logfiles' owner account. To minimize traffic between these machines, install rsync on them and then use rsync over ssh:

rsync -e ssh -Cavz www-data@node1:/var/log/apache/access.log /logpath/access1.log
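Putting it all together, the whole aggregation can be scripted; here is a minimal sketch, assuming three nodes and the paths used above [node names, node count and paths are placeholders]:

#!/bin/sh
# pull the latest access log from each cluster node
for i in 1 2 3; do
    rsync -e ssh -Cavz www-data@node$i:/var/log/apache/access.log /logpath/access$i.log
done
# sort-merge on the custom datetime field [the 10th, quote-separated]
sort -b -t "\"" -T /opt -k 10 /logpath/access?.log > /logpath/full.log

Drop something like this into a cron entry on the analytics machine and the aggregation becomes completely hands-off.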

Now you know all the steps required to fully automate the log aggregation and its processing. One may ask why all the fuss, when in fact a simple subscription to an ASP-style web analytics provider should suffice. Yes, it's true, however… The cluster that I've recently configured with this procedure gets a few million hits per week. Yes, we're talking about page hits. At this level of traffic, the cost of a web analytics service starts at $10,000/year. That's certainly a nice amount of money, which would instead let you buy your own analytics tool [such as, for instance, Urchin v5] and keep some cash from the first year. Some might say that these commercial tools have their own load-balancer analysis techniques. Sure, but it all comes at a cost. In the case of Urchin, you just saved $695/node and earned some bragging rights with your mates. Relax and enjoy.

PS: Yes, we're talking about a millions-of-page-hits LAMP solution, not J2EE… Maybe I'll get into details on another occasion, assuming that somebody is interested. Leave a comment, send a mail or something.

Written by Adrian

September 25th, 2005 at 8:20 pm

Posted in Tools


HTTP compression filter on servlets: good idea, wrong layer

3 comments

The Servlet 2.3 specification introduced the notion of servlet filters: powerful tools, unfortunately used in quite unimaginative ways. Let’s take for instance this ONJava article (“Two Servlet Filters Every Web Application Should Have”) written by Jayson Falkner*, one of the coauthors of Servlets and JavaServer Pages: the J2EE Web Tier (a well-known servlets and JSP book from O’Reilly). The article has loads of trackbacks; it became so popular that the filters eventually got published on JavaPerformanceTuning along with an (otherwise very sensible and pragmatic) interview of the author. However, there is a more efficient way of performing these tasks, as indiscriminate page compression and simple time-based caching do not necessarily belong in the servlet container**. As one of the comments (on ONJava) put it: ‘good idea, wrong layer!’. Let’s see why…

There is a simple way to compress pages from any kind of site (be it Java, PHP, or Ruby on Rails), natively, in the Apache web server. The trick consists in chaining two Apache modules: mod_proxy and mod_gzip. Via mod_proxy, it becomes possible to configure a certain path on one of your virtual hosts to proxy all requests to the servlet container; then you may selectively compress pages using mod_gzip.

Suppose that the two modules are compiled and loaded in the configuration, and your servlet application is located at http://local_address:8080/b2b. You want to make it visible at http://external_address/b2b. To activate the proxy, add the following two lines:

ProxyPass /b2b/ http://local_address:8080/b2b/
ProxyPassReverse /b2b/ http://local_address:8080/b2b/
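
While you're in the proxy configuration, one precaution worth double-checking (depending on your defaults, it may already be in place): make sure forward proxying stays disabled, otherwise you've just opened a free relay to the whole internet. ProxyPass and ProxyPassReverse work fine without it:

ProxyRequests Off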

You can add as many directives as you like, proxy-ing all the servlets of the server (for instance, one of the configurations I’ve looked at has a special servlet for dynamic image generation and one for dynamic PDF document generation – their output will not be compressed, but they all had to be proxy-ed). Time-based caching is also possible with mod_proxy, but that subject deserves a little article by itself. For the moment, we’ll stick to simple transparent proxying and compression.

Congratulations: just restart Apache and you have a running proxy. Mod_gzip is a little trickier. I’ve slightly adapted the configuration from the article Getting mod_gzip to compress Zope pages proxied by Apache (I haven’t been able to find anything better concerning integration with Java servlet containers), and here’s the result:

#module settings
mod_gzip_on Yes
mod_gzip_can_negotiate Yes
mod_gzip_send_vary Yes
mod_gzip_dechunk Yes
mod_gzip_add_header_count Yes
mod_gzip_minimum_file_size 512
mod_gzip_maximum_file_size	5000000
mod_gzip_maximum_inmem_size	100000
mod_gzip_temp_dir /tmp
mod_gzip_keep_workfiles No
mod_gzip_update_static No
mod_gzip_static_suffix .gz
#includes
mod_gzip_item_include mime ^text/*$
mod_gzip_item_include mime httpd/unix-directory
mod_gzip_item_include handler proxy-server
mod_gzip_item_include handler cgi-script
#excludes
mod_gzip_item_exclude reqheader  "User-agent: Mozilla/4.0[678]"
mod_gzip_item_exclude mime ^image/*$
#log settings
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" mod_gzip: %{mod_gzip_result}n In:%{mod_gzip_input_size}n Out:%{mod_gzip_output_size}n:%{mod_gzip_compression_ratio}npct." mod_gzip_info
CustomLog /var/log/apache/mod_gzip.log mod_gzip_info

A short explanation. The module is activated and allowed to negotiate (check whether a static or cached file was already compressed, and reuse it). The Vary header is needed for client-side caches to work; dechunking eliminates the ‘Transfer-Encoding: chunked’ HTTP header and joins the page into one big packet before compressing. The header length is added for traffic measuring purposes (we’ll see the ‘right’ figures in the log). The minimum size of a file to be compressed is 512 bytes; setting a maximum is also a good idea, because a) compressing a huge file will swamp your server and b) the limitation guards against infinite loops. The maximum file size to compress in memory is 100KB in my setting, but you should tune this value for optimum performance. The temporary directory is /tmp, and workfiles should be kept only if you need to debug mod_gzip. Which you don’t.

We’ll include in the files to be gzipped everything of text type, directory listings and… the magic line is the one specifying that everything coming from the proxy server is eligible for compression: this is what ensures the compression of your generated pages. And while you’re at it, why not add the CGI scripts…

The includes specified here are quite generous, so let’s now filter some of them out: we exclude all the images, because they SHOULD already be compressed and optimized for the web, as well as certain Mozilla/4.0x user agents (old Netscape versions that advertise gzip support but handle it badly). And last but not least, we decide the format of the log line and the location of the compression log – it will allow us to see whether the filter is effectively running, and to compute how much bandwidth we have saved.

A compelling reason to use mod_gzip is its maturity. Albeit complex, this Apache module is stable and relatively bug-free, which can hardly be said about the various compression filters found on the web. The original code from the O’Reilly article behaved incorrectly under certain circumstances (it was later corrected on the book’s site; I’ve tested the corrected code and it works fine). I also had some issues with Amy Roh’s filter (from Sun). Amy’s compression filter can be found in a lot of places on the web (JavaWorld, Sun), but unfortunately it does not set the correct ‘Content-Length’ header, thus disturbing httpunit, which in turn ‘turned 100% red’ my whole web test suite as soon as the compression filter was on. Argh.

For the final word, let’s compare the performance of the two solutions (servlet filter against mod_proxy+mod_gzip). I used a single machine to install both Apache and the servlet container (Jetty), plus Amy Roh’s compression filter. A mildly complex navigation scenario was recorded in TestMaker (a cool free testing tool written in Java), then played a certain number of times (100, to be specific). The results are expressed in TPS (transactions per second): the bigger, the better. The following median values were obtained: 3.10 TPS with a direct connection to the servlet container, 2.64 TPS via the compression filter, and 2.81 TPS via Apache mod_proxy+mod_gzip. That means roughly a 6% performance hit for the filter solution relative to the Apache one. Of course the figure is highly dependent on my test setup, the specific webapp and a lot of other parameters; however, I am confident that Apache is superior in any configuration. You also have to consider that using a proxy has some nice bonuses. For instance, Apache HTTPS virtual sites may encrypt your content in a transparent manner. Apache has very good and fast logging, so you could completely disable HTTP request logging in your servlet container. Moreover, the Apache log format is understood by a myriad of traffic analysis tools. Load balancing is possible using mod_proxy and another remarkably useful Apache module, mod_rewrite. And as Apache runs in a completely separate process, you might expect slightly better scalability on multiprocessor boxes.

Nota bene: in all the articles I’ve read on the subject of compression, there is this strange statement that compression cannot be detected client-side. Of course it can… Supposing you use Firefox (which you should, if you’re serious about web browsing!) with the Web Developer plugin (which you should, if you’re serious about web development!), the plugin lets you “View Response Headers” (in the “Information” menu): the presence or absence of Content-Encoding: gzip is what you’re looking for. Voila! Just for kicks, look at the response headers of a few well-known sites and prepare to be surprised (try Microsoft, for instance, or Slashdot for some funny random quotes).
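If you prefer the command line, a quick sketch with curl does the same job: request the page with an Accept-Encoding header and dump only the response headers (the URL below is the placeholder used earlier):

curl -s -D - -o /dev/null -H "Accept-Encoding: gzip,deflate" http://external_address/b2b/

A Content-Encoding: gzip line in the output means the compression chain is doing its job.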

* Jayson Falkner has also authored this article (“Another Java Servlet Filter Most Web Applications Should Have”), which explains how to control the client-side cache via HTTP response headers. While the example is very simple, one can easily extend it to do more complex stuff, such as caching according to rules (for instance, caching dynamically generated documents or images according to the context). This _is_ a pragmatic example of a servlet filter.

** Unless of course – as one of the commenters explains here – you have some specific constraints that prevent you from using Apache, such as: an embedded environment, being forced to use a web server other than Apache (alternative solutions might exist for those servers, but I am not aware of them), mod_gzip being unavailable on the target platform, etc.

Written by Adrian

February 2nd, 2005 at 8:28 am

Posted in Tools


JUnit Recipes – for the gourmets of Java unit testing

leave a comment

A mini-review

The book already has stellar ratings on Amazon, JavaRanch and other select places, and after reading a few chapters, the only thing I can do is add this post to the praise list.

Why only a few chapters? Well, you see, this is not exactly the type of book that you read from cover to cover; it's in fact a solid 720-page collection of JUnit best practices, the most comprehensive you'll ever find in organized, written form. So far, I have found precise answers to all my JUnit questions in it.

The book is organized in three big sections weighing some 200-250 pages each. The first one ('The building blocks') is hugely useful if you are a JUnit newbie or even an absolute beginner. It's a detailed introduction to everything you need to know in order to start using JUnit: basics, usage in mainstream IDEs, testing patterns, managing test suites, test data and troubleshooting advice. Even seasoned programmers will find useful pieces of advice. I especially liked the 5th chapter ('Working with test data'): an exhaustive description of all the possible ways of organizing your test data. To be honest, it's one of the few chapters from the first section that I've actually read.

The second part ('Testing J2EE') covers testing of XML, JDBC, EJB, web components (JSP, servlets, Velocity templates, out-of-container testing and such) and J2EE applications (security and, again, some webapp testing – pageflow, broken links, Struts navigation). I can't really pass judgment on this section, since I've read only a few subchapters (DBUnit and some related JDBC testing, as well as a few pages from the web testing chapter). But every piece of advice I got was rock solid.

I've read the third part ('More JUnit techniques') in its entirety. Under this bland title you'll find a collection of not-so-common JUnit material, such as using the GSBase addon (funny that I wasn't aware such a useful addon existed). There's also an intriguing 'Odds and ends' chapter containing some interesting recipes (here's a good one for QA freaks like me: 'verify that your test classes adhere to the basic syntax rules of JUnit' – sure, why not?).

Something I really missed is a chapter dedicated to mock object recipes. Yes, there is a quick explanation in the first section – and a reference to EasyMock or some other mocking API pops up now and then in different chapters – and there's even an essay about mocking at the end of the book. But the main mock objects dish isn't there. I would've also loved to see some automated GUI-testing recipes (Abbot, Marathon and related tools). But then again, it's a 700+ page book, so I'm probably asking too much.

To conclude, 'JUnit Recipes' is the best JUnit book I've ever come into contact with, and on my list it supersedes 'JUnit in Action'. Which, interestingly enough, was published by the same Manning almost a year ago. You can't have too many good JUnit books, can you?

Written by Adrian

November 21st, 2004 at 11:41 am

Posted in Books

Tagged with , , , , , ,

Book review – Tapestry In Action

leave a comment

My first contact with Tapestry was more than 18 months ago. Back then, I was looking for a web framework to integrate with our custom server, built on Avalon’s (now obsolete) Phoenix container*. The web interface was meant to be backoffice stuff, for simple administration tasks as well as statistics reports. Given that the data access and business logic were already developed, we were looking for something simple to plug into a no-frills servlet container such as Jetty. We managed very easily to integrate Jetty as a Phoenix service and pass data through the engine context. But just when we finally integrated Tapestry [into Jetty [inside Phoenix]] and made it display some aggregated statistics, the project funding was cut and the startup went south. But that’s another story, and a rather uninteresting one.

Meanwhile, things have changed a bit. Tapestry has become a first-class Apache Jakarta project, the Tapestry users list gets more and more crowded, and I see it used again in my day job (by Teodor Danciu, one of my coworkers and incidentally the author of JasperReports) and in some moonlighting of my own, for an older web project idea. And there is exceptional Eclipse support via the Spindle plugin. While the ‘buzzword impact’ of Tapestry on a Java developer’s CV doesn’t yet measure up to Struts, this framework has obviously gained a lot of attention lately.

So, what’s so special about it? If I had to choose only one small phrase, I’d quote Howard Lewis Ship, Tapestry’s lead developer, from the preface of his book ‘Tapestry in Action’:

The central goal of Tapestry is to make the easiest choice the correct choice.

In my opinion this is the conceptual center of gravity of the framework. Everything – from the template system, which has only the bare minimum of scripting power, through the componentized model, up to the precise, detailed error reporting (quite a unique feature in the open-source frameworks world) – gently pushes you (the developer) to Do The Right Thing. To put logic where it belongs (classes, not templates), organize repetitive code into components, ignore the HTTP plumbing and use a real, consistent MVC model in your apps (forms are readable and writable components of your pages). You don’t need to be Harry Tuttle to build a good Tapestry webapp; being a decent Java developer is enough. That’s more than I can say about Struts…

Coming from the classic JSP-based webapp world, Tapestry is a real culture shock. The most appropriate way to visualize the difference is to imagine a pure C programmer abruptly moving to C++, into the world of objects**. For a while, he will try to emulate the ‘old’ way of working, but soon enough he’ll give up and start coding his own classes. However, this C programmer will have to make some serious efforts, not necessarily because OOP is hard to learn, but in order to break his or her old habits.

“Tapestry in Action” is your exit route from the ugly world of stateless HTTP pages and spaghetti HTML intertwingled with Java code and various macros. It’s one of the best JSP detoxification pills available on the market right now.

The first part of the book (‘Using basic Tapestry components’) is nothing to brag about. It’s basically an updated and nicely organized version of the various tutorials already available on the Tapestry site, except perhaps some sections of chapter 5 (‘Form input validation’). By the way, chapter 5 is freely downloadable from the Manning site and is a perfect read if you want a glimpse of the fundamental differences between Tapestry and a classic web framework (form validation being an essential part of any dynamic site). However, if you want to get past the ‘Hangman’*** phase, you really need to dig into the next two sections of the book.

The second section, ‘Creating Tapestry components’, is less well covered by the documentation and tutorials. I’m specifically pointing here to the subsections ‘Tapestry under the hood’ (juicy details about page and component lifecycles) and ‘Advanced techniques’ (there’s even an ‘Integrating with JSP’ chapter!). While it is true that any point from this section will usually be revealed by a search of the Tapestry user list or (if you’re patient) by a kind soul answering your question on that same list, it’s nevertheless a good thing to have all the answers nicely organized on the corner of your desk.

The third and last section (‘Building a complete Tapestry application’) is a complete novelty for Tapestry fans. It’s basically a thorough description of how to build a web application (a ‘virtual library’) from scratch using Tapestry. While the JBoss-EJB combination chosen by the author is not exactly my cup of tea (I’m rather into the Jetty+PicoContainer+Hibernate stuff), I can understand the strong appeal it is supposed to have among the J2EE jocks. Anyway, given the componentized nature of Tapestry, I should be able to migrate the example relatively easily if I feel the need. The example app is contained in a hefty 1MB downloadable archive of sources, build and deployment scripts included.

To conclude, ‘Tapestry in Action’ is a great book about changing the way you develop web applications. The steep learning curve is a small price to pay for a two- or three-fold improvement in overall productivity, and this book should get you started really quickly.

* Which, AFAIK, is still used by our former customer.
** There were some posts on the Tapestry user list about a certain conceptual resemblance to Apple’s WebObjects. I can’t really pass judgment on this, because I do not know WebObjects, but the name in itself is an interesting clue.
*** ‘Hangman’ is the Tapestry ‘Petshop’ (although there is also a ‘real’ Tapestry Petshop referenced in the Wiki).

Written by Adrian

March 18th, 2004 at 2:56 pm

Posted in Books
