Netuality

Taming the big bad websites

Archive for June, 2004

Portability is for canoes and system software …


… at least that's what Mr. David Gristwood says in this (otherwise excellent) entry ('21 Rules of Thumb – How Microsoft develops its Software') on his MSDN weblog. David thinks that:


Even discounting the added development burden, with the addition of each additional platform the job of QA increases substantially. While clever QA management can minimize the burden somewhat, the complexity of multi-platform support is beyond the reach of most development organizations. Place your bets. Demand multi-platform support from your system software vendor, then build your product on the absolute fewest number of platforms possible.

What kind of 'portability' are we talking about in the context of software development at Microsoft? He is probably alluding to software being developed simultaneously for desktop and Pocket Windows, which is indeed quite a challenge for QA and for the development team. But if it's a tongue-in-cheek reference to Java's WORA, I find this entry somewhat funny. Let's – for the sake of the argument – suppose that you develop for multiple platforms and your QA team is able to thoroughly test only one of them. Basically, this means that your product is going to work OK on the main platform and have some flaws (most probably in the GUI area) on the others. How is this worse than having a product which, by design, works only on a single target platform? Hmm, is the JVM 'system software' after all?

Written by Adrian

June 27th, 2004 at 12:10 am

Posted in AndEverythingElse


Hallowed be thy tablename!


If you haven't had the opportunity to work on a really big project, naming is probably not at the top of your list of programming best practices. And you are certainly going to regret that when your project grows.

Of course, everybody, including good old Scott, knows that CUST signifies CUSTOMER and DEPT signifies DEPARTMENT. And statistically speaking, the chances for these abbreviations to mean something else are very small – as long as your domain model is also quite small. But when the number of classes in the domain is in the hundreds or even in the thousands, you'll suddenly find out that CUST may signify CUSTOMS (as in 'Customs Tax'), CUSTOMIZATION or even CUSTARD. I am working right now on the development team of an ERP for the agro-food industry and wouldn't be amazed to see such an attribute name. I've seen worse: some details of the implemented business model are a total blasphemy against human logic and common sense.

Anyway, the problem is even worse in these big projects because domain model classes are not written by hand – they are generated. While this is hardly a novelty for you (please don't laugh in the audience), it also means that analysts compose the datamodel, then classes/mappings/SQL schema/docs are generated, and finally programmers write the business logic and infrastructure integration using the generated artifacts. Names are usually propagated all along the generation chain. And when a programmer finds 'Cust' in the name of an attribute, how does she know it's a 'Customer' and not some 'Custard'? Especially when the documentation is scarce and the analyst who authored it is on a well-deserved six-month sabbatical in Antarctica.

Hence the need for standardization. This is usually done via a dictionary containing the abbreviations and their meaning(s). The rule is very simple: every word in the datamodel must be composed of abbreviations from the dictionary. Some programmers might argue that there is no need for abbreviations and that full words are OK – lovely code such as '.getSecondaryBillingAddressForService(currentBill.getBillableServicesList(i).get(currentService)).getStreet().getName()'. This is perfectly understandable; however, let's not forget that some databases have hard limits on table and column name length (30 characters for Oracle, 32 for SAP DB) and will, for instance, simply refuse to create anything longer. Which is mildly bothersome if you use a relational database*.
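
To make the rule concrete, here is a minimal sketch of such a check in Java. The class name and the dictionary content are made up for the example – the real dictionary would be loaded from wherever your analysts maintain it:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Checks that every word of a datamodel name is a known abbreviation.
public class NameChecker {

    // Made-up sample dictionary; in real life, load it from the shared file.
    private static final Set DICTIONARY = new HashSet(Arrays.asList(
            new String[] { "CUST", "DEPT", "ADDR", "BILL", "SRV" }));

    // Accepts names like "CUST_BILL_ADDR": split on '_', check every word.
    public static boolean isValid(String name) {
        String[] words = name.split("_");
        for (int i = 0; i < words.length; i++) {
            if (!DICTIONARY.contains(words[i])) {
                return false; // unknown word: reject, pedantically
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValid("CUST_BILL_ADDR")); // true
        System.out.println(isValid("CUSTARD_ADDR"));   // false
    }
}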

And the golden rules of domain model naming are:

  • Be a pedantic bastard. Don't just throw the dictionary in the wild and tell people 'yeah, pleease follow this standard'. Run automatic checks on every piece of datamodel that feeds the code generator, at each save operation if possible. I have implemented this inside an Eclipse plugin used by the project analysts: when hitting save on an entity containing invalid names, a window immediately pops up and lists the errors. Don't just display the errors – completely forbid saving as long as the entity has naming issues. This will keep the naming absolutely pristine; however, the analysts might be tempted to form a lynch mob. Do not give up.
  • Avoid synonyms, plurals, etc. This is a software product, not a grammar contest.
  • Throw some stats on the mailing list from time to time, to show how well the model is named. People will like that.

My current gig involves, among other interesting stuff, managing the naming tools in the various Java projects that we are developing. Unfortunately, the naming rules were not really enforced (they had no pedantic bastards before me?), so the domain model is only partially compliant. Hence, I'm in the midst of developing tools for automatic renaming of the model, and the new code is going to disrupt the activity for a while (thank God for autocompletion features in modern IDEs!). Things would have been much smoother if naming had been enforced from the beginning. Still, I think there is no such thing as 'too late' to put naming in order in a big project. And it'll absolutely be done, because there's very strong managerial support for this kind of task (the company's main shareholder and CEO is a former programmer himself, as well as a quality buff – 'when time permits'™).

Unfortunately, I had to allow some 'non-compliant' islands of code in the modules which are already deployed at customers. But have no false hopes: sooner or later I'm gonna get that code too. I'm a pedantic bastard, and proud of it.

* Now, if you're using a wannabe storage solution like Prevayler to store gigabytes of business data (or more!), you have much bigger problems than naming. Please stop reading this article and do something about it.

Written by Adrian

June 26th, 2004 at 11:54 pm

Posted in Process


… and a few lesser known Java tools


Very, very busy lately, but I'd like to share some knowledge about a few useful Java OSS gems that were not easy to find. Mr. Google, please 'index this':

1. Aspirin is a self-contained SMTP server (send only) written in Java, open-sourced and free. It simplifies configuration and deployment by allowing your app server to send emails without going through an external SMTP server. The project is heavily inspired by the Apache James code (hence its licensing terms). The few problems I see right now are: possible performance issues when sending big volumes of mail, still somewhat erratic behavior (sometimes sending fails for no plausible reason), and failure reports which do not state the reason of the failure. However, the thingie works pretty well and is a big time saver because, well, configuration is not the most pleasant part of a complex server. (A short usage sketch follows after this list.)

2. If you produce a lot of reports and want to send them automatically to a remote printing server, you may use JIPSI (quickstart in English, but site in German), which exposes CUPS printers through the Java Print Service API. This little beauty was found by one of my coworkers and the 'report guys' seem to be making good use of it. (Also sketched after the list.)

3. You're in for some serious processing of OpenOffice documents using the freely available DTDs (downloadable from the OOo CVS server)? Then hold your horses! I've tried to make sensible use of them and failed abruptly. Let's just say that those DTDs are a big pain in the a**: to begin with, no tool I tried was able to transform them into a schema. I've tried XmlSpy and a few other exotic tools, without success. Even basic stuff like parsing with a validating parser does not work. So much for the usefulness of open standards. I eventually ended up using the excellent Writer2Latex. Don't be fooled by the name – you can do all sorts of conversions with it, including Writer to XHTML, which is what I was interested in. You can even write your own plugin to handle exotic formats, because Writer2Latex is built around a modified version of XMerge. Officially, XMerge is the solution for visualizing documents on 'Small Devices', but it really is a fancy plugin-based document converter. Most probably (too lazy to check the sources) SAX-based with a non-validating parser. Go figure.

4. The Eclipse download site now has links to a BitTorrent tracker. I just used it successfully to download RC3 at a reasonable rate (anyway, being on wifi right now, I wasn't expecting blazing speed). I found it interesting that all the other peers were using Azureus, a torrent client written in Java+SWT. Azureus is a fantastic source of knowledge, chock-full of tips and techniques for writing professional-looking and very responsive SWT apps. And not only that: Azureus is also a great example of how to write a plugin-ready app which performs automatic updates from the net. Not bad at all.
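
For Aspirin (item 1 above), here's a minimal sketch of what sending looks like. Building the message is plain JavaMail; the actual hand-off to Aspirin at the end is left as a comment because I'm quoting its entry point from memory – check the project docs for the real class and method:

import java.util.Properties;

import javax.mail.Message;
import javax.mail.Session;
import javax.mail.internet.InternetAddress;
import javax.mail.internet.MimeMessage;

public class AspirinSketch {
    public static void main(String[] args) throws Exception {
        // No SMTP host configured in the session: Aspirin itself delivers
        // the mail directly to the recipients' MX servers.
        Session session = Session.getDefaultInstance(new Properties());

        MimeMessage msg = new MimeMessage(session);
        msg.setFrom(new InternetAddress("app@example.com"));
        msg.setRecipient(Message.RecipientType.TO,
                new InternetAddress("user@example.com"));
        msg.setSubject("Report ready");
        msg.setText("Your report has been generated.");

        // Hand the message over to Aspirin's queue. The call below is an
        // assumption (quoted from memory), not verified against the sources:
        // org.masukomi.aspirin.core.MailQue.queMail(msg);
    }
}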
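
For JIPSI (item 2), the nice part is that there is no JIPSI-specific code at all: you program against the standard javax.print API, and the assumption (which is JIPSI's whole point) is that, with its jar on the classpath, the remote CUPS printers show up in the regular service lookup. A minimal sketch for printing a PostScript report:

import java.io.FileInputStream;

import javax.print.Doc;
import javax.print.DocFlavor;
import javax.print.DocPrintJob;
import javax.print.PrintService;
import javax.print.PrintServiceLookup;
import javax.print.SimpleDoc;
import javax.print.attribute.HashPrintRequestAttributeSet;
import javax.print.attribute.PrintRequestAttributeSet;
import javax.print.attribute.standard.Copies;

public class CupsPrintSketch {
    public static void main(String[] args) throws Exception {
        PrintRequestAttributeSet attrs = new HashPrintRequestAttributeSet();
        attrs.add(new Copies(1));

        // Look up services able to print PostScript streams; with JIPSI
        // installed, the CUPS printers should be among the results.
        PrintService[] services = PrintServiceLookup.lookupPrintServices(
                DocFlavor.INPUT_STREAM.POSTSCRIPT, attrs);
        if (services.length == 0) {
            System.err.println("No suitable print service found");
            return;
        }

        Doc doc = new SimpleDoc(new FileInputStream("report.ps"),
                DocFlavor.INPUT_STREAM.POSTSCRIPT, null);
        DocPrintJob job = services[0].createPrintJob();
        job.print(doc, attrs); // returns once the document has been spooled
    }
}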

Written by Adrian

June 21st, 2004 at 8:14 pm

Posted in Tools


(Almost) distributed CVS with Eclipse


NOTE: unfortunately, this trick is unavailable again starting with Eclipse v3M9; the last version which made it possible was v3M8.



In Eclipse 3.0, in the CVS repository properties, it is possible to separate the 'read' and 'write' access locations (see picture). AFAIK, this feature was meant for developers using extssh for their CVS connections: they would use an anonymous account for updates in the 'clear', and their own accounts for commits over the encrypted link. The basic idea is that clear connections are considerably faster than encrypted ones. However, you may change the whole connection string, not only the login and protocol, thus enabling the use of two completely different repositories for 'read' and 'write' actions on CVS.
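
For illustration, the two locations could look something like this (host names invented for the example – only the protocol, user and host differ, while the repository path stays the same):

read  (update) : :pserver:anonymous@cvsmirror.local:/opt/cvsroot
write (commit) : :extssh:adrian@cvs.hq.example.com:/opt/cvsroot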

This might come in handy in situations like the one my team has recently been confronted with. The main CVS repository is located at a certain geographical distance (in the headquarters, in France) and the VPN bandwidth is nothing to brag about. Working with CVS used to be decent, but things have started to get meaner lately, mainly for three reasons:

  • both teams have grown, which means more activity on the repository;
  • the vast majority of developers integrate frequently, i.e. every little bit of functionality is committed as soon as it's stable and working. Of course, before committing there's the mandatory update, to check for consistency against the most recent codebase. This means that at least a few synchronizations will be performed by each member of the team, each day;
  • code generators. It's true that common sense dictates that nothing generated should be stored in CVS, since that may be a source of frequent conflicts and loads the repository inefficiently. However, when one routine generation (for each of the few modules) may take between 3 and 10 minutes and produces thousands of files, it becomes pretty obvious that it should not be part of the build process. The result: in the days when the model undergoes some important changes, a few subsequent versions of 3-15MB jar files get committed to the repository. The update process starts to slow down and soon part of the team is lagging, waiting for the update to download. The irony is that usually everybody is slowly and inefficiently downloading the SAME big file. And of course there is other traffic on the VPN, such as NetMeeting (but you can't really have a conversation when everybody in the team is updating the codebase and slowing the VPN to a crawl).

This boils down to having a 'read-only' local repository, a perfect mirror of the main CVS server, which will be used only for updates. Both the server and its mirror run Linux. My choice for mirroring was good old rsync. While CVSup seems to be all the rage nowadays, I headed towards rsync because a) it comes pre-installed with any decent Linux distro and b) I am using Gentoo on my home desktop, so rsync is a tool that I have learned to use [and abuse] almost daily, via Portage.

We won't lose a lot of time explaining how to set up an rsync server, since it's very well explained in this rather old but useful tutorial. There's just a small twist: since we were planning to synchronize frequently, we ran the rsync daemon standalone instead of via xinetd. The config file on the server:

root@dev /etc> more rsyncd.conf
motd file = /etc/rsyncd.motd
log file = /var/log/rsyncd.log
pid file = /var/run/rsyncd.pid
lock file = /var/run/rsync.lock

[cvs]
path = /opt/cvsroot
comment = CVS Rsync Server
uid = nobody
gid = nobody
read only = yes
list = yes
hosts allow = 10.0.10.100/255.255.255.0

is a classic read-only, anonymous (we're on a VPN, right?) mirroring setup. The next step is making an rsync service in /etc/init.d, or appending the line '/usr/bin/rsync --daemon' to /etc/rc.d/rc.local, to be sure that the daemon restarts after a reboot [especially when you are NOT administrating the server].

On the client, there are some small tricks to make it work. First one: set up your client CVS repository at the same location as on the server, '/opt/cvsroot' in our case (because you are going to synchronize CVSROOT as well, and it contains absolute paths in some of its files).

The mirroring script is something in the spirit of:

#!/bin/bash
# Mirror the main CVS repository into the local read-only copy.
# A full rsync is triggered only when CVSROOT/history has changed.

source /etc/profile
cd /opt/cvsroot || exit 1

# Bail out if a previous rsync is still running (long transfer in progress).
if pgrep -x rsync > /dev/null; then
    exit 1
fi

# Fetch the server's history file; CVS appends to it on every commit.
rsync -azv --stats 10.0.3.193::cvs/CVSROOT/history /home/cvs/history > /tmp/histo

sum1=$(sum < /home/cvs/history)
sum2=$(sum < /opt/cvsroot/CVSROOT/history)

# Identical checksums mean no commits since the last synchronization.
if [ "$sum1" = "$sum2" ]; then
    date > /tmp/empty
    exit 0
fi

# Full mirror; --delete keeps files removed on the server removed here too.
date > /tmp/started
cat /dev/null > /tmp/ended
rsync -azv --stats --delete --force 10.0.3.193::cvs /opt/cvsroot > /tmp/status
date > /tmp/ended


This (badly written) script is heavily inspired by one found on a Debian user mailing list:

  • it exits if a long-running rsync process is already active;
  • it ensures that the full rsync is triggered only by changes in CVSROOT/history. This is a neat trick which minimizes server activity when you are syncing frequently, as we will;
  • it outputs stuff into some files in the /tmp directory. This has a double purpose. The first is avoiding root mailbox pollution (since we're 'cron-ing' it, the output would otherwise show up as a few hundred mails each day). The second is providing data for a web page on the client machine which shows the synchronization status at a glance. AFAIK rsync will not leave incomplete files behind, but the usual sound advice is to run your CVS update between two successive synchronizations.

The small 'synchronization status' page (left as an exercise for the reader :) ) just prints the dates and the last few lines of the output files; this is a dumbed-down sample:

CVS synchronization started at Mon Feb 23 09:45:18 EET 2004
, ended at Mon Feb 23 09:46:00 EET 2004

Last empty synchronization recorded at Mon Feb 23 09:40:05 EET 2004

No knowledge about recent deleted files

Last synchronized
_/modules/postpreparation/tools/
_/modules/postpreparation/tools/velocity/
_/gateway/
_/src/mailer/
_/src/old/
_/src/tools/
_/src/tools/Attic/
_/tools/
CVSROOT/history
wrote 60767 bytes read 471522 bytes 13142.94 bytes/sec
Last added files
_/ihm/AlignementChamps.java,v

The only step left is to create a file in cron.d containing the line:

0-59/5 7-23 * * * cvs your_mirroring_script.sh

(meaning a synchronization every 5 minutes from 07:00 to 23:00, run as the 'cvs' user) and you may start enjoying blazingly fast CVS updates in Eclipse.

Written by Adrian

June 2nd, 2004 at 1:44 pm

Posted in Tools
