Netuality

Taming the big, bad, nasty websites

Archive for the ‘linux’ tag

Monitoring memcached with cacti

3 comments

Memcached is a clusterable cache server from Danga. Or, as they call it, a distributed memory object caching system. Well, whatever. Just note that memcached clients exist for lots of languages (Java, PHP, Python, Ruby, Perl) – the mainstream languages of the web world. A lighter version of the server was rewritten in Java by Mr. Jehiah Czebotar. Major websites such as Facebook, Slashdot, LiveJournal and Dealnews use memcached in order to scale to the huge load they're serving. Recently, we needed to monitor the memcached servers on a high-performance web cluster serving the Planigo websites. By googling and reading the related newsgroups, I was able to find two solutions:

  • from faemalia.net, a script integrated with the MySQL server templates for Cacti; it uses the Perl client.
  • from dealnews.com, a dedicated memcached template for Cacti and some scripts based on the Python client. The installation is thoroughly described here.

These two solutions take the same approach – provide a specialized Cacti template. The charts drawn by these templates are based on data extracted by executing memcached client scripts. Maybe very elegant, but it could become a pain in the dorsal area. Futzing with Cacti templates was never my favorite pastime. Just try to import a template exported from a different version of Cacti and you'll know what I mean. In my opinion, there is a simpler way, which consists of installing a memcached client on all the memcached servers, then extracting the statistical values using a script. We'll use the technique described in one of my previous posts to expose script results as SNMP OID values, then track these values in Cacti via the existing generic mechanism. My approach has the disadvantage of requiring a memcached client on all the servers. However, it makes it very simple to build your own charts and data source templates, as for any generic SNMP data. All you need now is a simple script which will print the memcached statistics, one per line. I will provide one-liners for Python, which will obviously work only on machines with Python and the "tummy" python-memcached client installed. This is the recipe (the default location of the Python executable on Debian is /usr/bin/python, but YMMV):

1. First, use this one-liner as an snmpd exec:

/usr/bin/python -c "import memcache; print ('%s' % [memcache.Client(['127.0.0.1:11211'], debug=0).get_stats()[0][1],]).replace(\"'\", '').replace(',', '\n').replace('[', '').replace(']', '').replace('{', '').replace('}', '')"

This will display the name of each memcached statistic along with its value, and will let you hand-pick the OIDs that you want to track. Yes, I know it could be done more simply with translate instead of multiple replace calls – left as an exercise for the Python-aware reader.
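If one-liners make your eyes bleed, here is a minimal, more readable sketch of the same step [it assumes the same "tummy" python-memcached client; the script name is up to you]:

#!/usr/bin/python
# Dump every memcached statistic as "name: value", one per line.
import memcache

stats = memcache.Client(['127.0.0.1:11211'], debug=0).get_stats()[0][1]
for name, value in stats.items():
    print '%s: %s' % (name, value)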

2. After you have the complete list of OIDs, use this one-liner:

/usr/bin/python -c "import memcache; print '##'.join(memcache.Client(['127.0.0.1:11211'], debug=0).get_stats()[0][1].values()).replace('##', '\n')"

The memcached statistics will be displayed in the same order as before, but only their values, not their names.
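Wrapped in a small script, that second one-liner plugs straight into snmpd via an exec line, exactly like the technique from my earlier SNMP post [the OID and the script path below are placeholders of my own choosing]:

exec .1.3.6.1.4.1.111111.2 MemcachedStats /etc/snmp/memcached_stats.sh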

And this is the mandatory eye candy:



Written by Adrian

August 2nd, 2006 at 10:54 pm

Posted in Tools


Monitoring Windows servers – with SNMP

5 comments

My previous article focused on Linux monitoring. Often, though, you'll have at least a few Windows machines in your datacenter. SQL Server is one of the best excuses these days to get a Microsoft machine into your server room – and you know what, it's a decent database, at least for medium-sized companies like the one I'm working for right now.

It is less known, but yes, you can have SNMP support out of the box with Windows 2000 and XP – it doesn't need to be the Server flavor [obviously it works the same on 2003 Server]:

  1. Invoke the Control Panel.
  2. Double click the Add/Remove Programs icon.
  3. Select Add/Remove Windows Components. The Windows Component Wizard is displayed.
  4. Check the Management and Monitoring Tools box.
  5. Click the Details button.
  6. Check the Simple Network Management Protocol box and click OK, then Next. You may have to reboot the machine.

After the service is installed, it has to be configured. Here's how:

  1. Invoke the Control Panel.
  2. Double click the Administrative Tools icon.
  3. Double click the Services icon.
  4. Select SNMP Service.
  5. Choose the Security tab.
  6. Add whatever community name is used on your network. Chances are that on an internal LAN the default public community works out of the box.
  7. For a sensitive server, you may want to fiddle a little bit with the IP restriction settings, for instance allowing SNMP communication only with the monitoring machine.
  8. Click OK then restart the service.
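To check that the agent answers, a quick walk of the standard system subtree from your monitoring machine should do [replace the community and hostname with yours]:

snmpwalk -Os -c public -v 1 [windows_hostname] system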

The next step is Cacti integration. Unfortunately, there is no Windows-specific device profile in Cacti, so if you have lots of Windows machines you'll have to define your own – or take a generic SNMP-enabled host and use it as a scaffold for each device configuration.

Out of the graphs and data sources already defined in Cacti [I am using 0.8.6c], only two work with Windows SNMP agents: processes and interface traffic values.

It's a good start, but if you are serious about monitoring, you need to dig a little deeper. Once again, the MIB browser comes to save the day. It's very simple: search the Windows machine for any .mib files you can find, copy them to your workstation, load them into the MIB browser and make some recursive walks (Get Subtree on the root of the MIB). This way, I was able to find some interesting OIDs on the Windows machine. For instance, .1.3.6.1.2.1.25.3.3.1.2.1 through .1.3.6.1.2.1.25.3.3.1.2.4 are the OIDs for CPU load on each of the 4 virtual CPUs [it's a dual Xeon with HT].

Memory-related OIDs for my configuration are listed below. In the HOST-RESOURCES MIB, .1.3.6.1.2.1.25.2.3.1.5 is the total size of a storage entry and .1.3.6.1.2.1.25.2.3.1.6 is the used amount; the final digit is the row index in the hrStorage table, so it may differ on your machine:

  • .1.3.6.1.2.1.25.2.3.1.5.6 – Total physical memory
  • .1.3.6.1.2.1.25.2.3.1.6.6 – Used physical memory
  • .1.3.6.1.2.1.25.2.3.1.5.7 – Total virtual memory ["virtual" = "swap" in Windows lingo]
  • .1.3.6.1.2.1.25.2.3.1.6.7 – Used virtual memory

Here's a neat memory chart for a Windows machine. Notice that the values are in "blocks", which in my case means 64 KB allocation units. The total physical memory is 4 GB – that is, 65536 blocks of 64 KB.
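If you want to double-check the block size on your own machine, the allocation unit is itself readable over SNMP – hrStorageAllocationUnits sits at .1.3.6.1.2.1.25.2.3.1.4 in the same table [same caveat about the trailing row index]:

snmpget -Os -c public -v 1 [windows_hostname] .1.3.6.1.2.1.25.2.3.1.4.6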

Most hardware manufacturers do offer SNMP agents for their hardware, as well as the corresponding .mib files. In my case, I was able to install an agent to monitor an LSI MegaRAID controller. Here is a chart of the number of disk operations per second:

In one of my next articles, we’ll take a look together at the way you can export “non-standard” data over SNMP from Windows, in the same manner we did on Linux, using custom scripts. Till then, have an excellent week.

Written by Adrian

May 12th, 2006 at 6:52 pm

Posted in Tools


Monitor everything on your Linux servers – with SNMP and Cacti

5 comments

Two free open-source tools are running the show in network and server-activity monitoring. The older one, quite popular among network and system administrators, is Nagios. Nagios does not only do monitoring, but also event traps, escalation and notification. The younger challenger is called Cacti. Unlike Nagios, it's written in a scripting language [PHP], so no compiling is necessary – it just runs out of the box [1]. Cacti's problem is that, at its current version, it is missing lots of real-time features such as monitoring and notification. All these features are scheduled to be integrated in future versions of the product but, as with any open-source roadmap, nothing is guaranteed. Anyway, this article focuses on Cacti integration because it's what I am currently using.

Cacti is built upon an open-source graphing tool, RRDtool [by the author of MRTG], and a communication protocol, SNMP. SNMP is not exactly a developer's cup of tea, being more of a network administrator's tool [2]. However, a monitoring server comes in extremely handy for performance measurement and tuning, especially for complex performance behavior which can only be benchmarked long-term: the impact of large caches on a web application, say, or the performance of long-running operations.

But is that specific variable you need to monitor available with SNMP out of the box? There is a strong chance it is. SNMP being an extensible protocol, lots of organizations have registered their own MIBs and the respective implementations. Basically, a MIB is a group of unique identifiers called OIDs. An OID is a sequence of numbers separated by dots, for instance '.1.3.6.1.4.1.2021.11'; each number has a special meaning in a standard object tree – in this example, '.1.3.6.1.4.1.2021.11' means '.iso.org.dod.internet.private.enterprises.ucdavis.systemStats'. You can even have your own MIB in the '.iso.org.dod.internet.private.enterprises' tree, by applying on this page at IANA.
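If you have the net-snmp tools installed, snmptranslate will do this numeric-to-symbolic mapping for you:

snmptranslate -Of .1.3.6.1.4.1.2021.11

which prints '.iso.org.dod.internet.private.enterprises.ucdavis.systemStats'.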

Most probably you don’t really need your own MIB, no matter how ‘exotic’ your monitoring is, because:

a) it’s already there, in the huge list of existing MIBs and implementations

and

b) you are not bound to the existing official MIBs; in fact, you can create your own MIB as long as you replicate it in the SNMP configuration of all the servers that you want to monitor.

To take a look at existing MIBs, free tools are available on the net; IMHO the best one is MibBrowser. This multiplatform [Java] MIB browser has a free version which should be more than enough for our basic task. The screen capture shown here depicts a "Get Subtree" operation on the '.1.3.6.1.4.1.2021.11' MIB; the result is a list of single-value OIDs, such as '.1.3.6.1.4.1.2021.11.11.0', which has the alias 'ssCpuIdle.0' and value 97 [meaning that the CPU is 97% idle]. You can see the alias by loading the corresponding MIB file [select File/Load MIB, then choose 'UCD-SNMP-MIB.txt' from the list of predefined MIBs].

From the command line, in order to display existing MIB values, you can use snmpwalk [3]. For instance, a walk on the '.1.3.6.1.4.1.2021.11' OID [.iso.org.dod.internet.private.enterprises.ucdavis.systemStats]:

snmpwalk -v 1 -c [community_name] [hostname] .1.3.6.1.4.1.2021.11

and the result is:
UCD-SNMP-MIB::ssIndex.0 = INTEGER: 1
UCD-SNMP-MIB::ssErrorName.0 = STRING: systemStats
UCD-SNMP-MIB::ssSwapIn.0 = INTEGER: 0
UCD-SNMP-MIB::ssSwapOut.0 = INTEGER: 0
UCD-SNMP-MIB::ssIOSent.0 = INTEGER: 4
UCD-SNMP-MIB::ssIOReceive.0 = INTEGER: 2
UCD-SNMP-MIB::ssSysInterrupts.0 = INTEGER: 4
UCD-SNMP-MIB::ssSysContext.0 = INTEGER: 1
UCD-SNMP-MIB::ssCpuUser.0 = INTEGER: 2
UCD-SNMP-MIB::ssCpuSystem.0 = INTEGER: 1
UCD-SNMP-MIB::ssCpuIdle.0 = INTEGER: 96
UCD-SNMP-MIB::ssCpuRawUser.0 = Counter32: 17096084
UCD-SNMP-MIB::ssCpuRawNice.0 = Counter32: 24079
UCD-SNMP-MIB::ssCpuRawSystem.0 = Counter32: 6778580
UCD-SNMP-MIB::ssCpuRawIdle.0 = Counter32: 599169454
UCD-SNMP-MIB::ssCpuRawKernel.0 = Counter32: 6778580
UCD-SNMP-MIB::ssIORawSent.0 = Counter32: 998257634
UCD-SNMP-MIB::ssIORawReceived.0 = Counter32: 799700984
UCD-SNMP-MIB::ssRawInterrupts.0 = Counter32: 711143737
UCD-SNMP-MIB::ssRawContexts.0 = Counter32: 1163331309
UCD-SNMP-MIB::ssRawSwapIn.0 = Counter32: 23015
UCD-SNMP-MIB::ssRawSwapOut.0 = Counter32: 13730

Each of these values has its own significance; for instance, 'ssCpuIdle.0' announces that the CPU is 96% idle.
In order to retrieve a single value from the list, use its alias as a parameter to the snmpget command, for instance:

snmpget -Os -c [community_name] -v 1 [hostname] UCD-SNMP-MIB::ssCpuIdle.0
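Cacti's generic SNMP data sources are defined by numeric OIDs rather than aliases; ssCpuIdle.0 from the walk above is equivalent to:

snmpget -Os -c [community_name] -v 1 [hostname] .1.3.6.1.4.1.2021.11.11.0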

Sometimes you want to monitor something which you cannot seem to find in any list of MIBs – say, the performance of a MySQL database that you're pounding pretty hard with your webapp [4]. The easiest way of doing this is to go through a script: SNMP implementations can take the result of any script and expose it through the protocol, line by line.

Supposing you want to keep track of the values obtained with the following script:

#!/bin/sh
/usr/bin/mysqladmin -uroot status | /usr/bin/awk '{printf("%f\n%d\n%d\n",$4/10,$6/1000,$9)}'

The mysqladmin command and a bit of simple awk magic display the following three values, each on a separate line:

  • number of opened connections / 10
  • number of queries / 1000
  • number of slow queries
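For reference, mysqladmin status prints a single line of the following shape [numbers invented for the example]; with awk's default whitespace splitting, the Threads count is field $4, the Questions count is $6 and the slow-query count is $9:

Uptime: 362460 Threads: 9 Questions: 18551230 Slow queries: 108 Opens: 1200 Flush tables: 1 Open tables: 64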

It is interesting to note that, while the first value is an instantaneous, gauge-like reading, the following two are incremental, growing as long as new queries and new slow queries are recorded. We'll keep this in mind for later, when we track these values.

But for now, let's see how these three values are exposed through SNMP. The first step is to tell the SNMP daemon that the script has an associated OID. This is done in the configuration file, usually located at /etc/snmp/snmpd.conf. The following line attaches the execution of the script [for example /home/user/myscript.sh] to a certain OID:

exec .1.3.6.1.4.1.111111.1 MySQLParameters /home/user/myscript.sh

The '.1.3.6.1.4.1.111111.1' OID is a branch of '.1.3.6.1.4.1' [meaning '.iso.org.dod.internet.private.enterprises']. We tried to make it look 'legitimate', but obviously you can use any sequence you want here.

After restarting the daemon, let's interrogate the freshly created OID – in MibBrowser, see the following image, or from the command line:

snmpwalk -Os -c [community_name] -v 1 [hostname] .1.3.6.1.4.1.111111.1

The result is:

enterprises.111111.1.1.1 = INTEGER: 1
enterprises.111111.1.2.1 = STRING: "MySQLParameters"
enterprises.111111.1.3.1 = STRING: "/etc/snmp/mysql_params.sh"
enterprises.111111.1.100.1 = INTEGER: 0
enterprises.111111.1.101.1 = STRING: "0.900000"
enterprises.111111.1.101.2 = STRING: "18551"
enterprises.111111.1.101.3 = STRING: "108"
enterprises.111111.1.102.1 = INTEGER: 0
enterprises.111111.1.103.1 = ""
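If you'd rather script this check than click through MibBrowser, here is a quick sketch – it assumes net-snmp's snmpget is on the path, a 'public' community and the OID layout shown above, so adapt to taste:

#!/usr/bin/python
# Read the three MySQL values exposed above through snmpget.
import commands

BASE = '.1.3.6.1.4.1.111111.1.101'
LABELS = ['connections/10', 'queries/1000', 'slow queries']
for i, label in enumerate(LABELS):
    # -Oqv prints only the value, without the OID and type prefix
    value = commands.getoutput(
        'snmpget -Oqv -c public -v 1 localhost %s.%d' % (BASE, i + 1))
    print '%s = %s' % (label, value)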

Great! Now we have proof that it really works: our specific values, extracted with a custom script, are visible through SNMP. Let's go back to Cacti and see how we can make some nice charts out of them [5].

Cacti has this nice feature of defining 'templates' that you can reuse afterwards. My strategy is to define a data template for each of the 3 parameters I want to chart, using the 'Duplicate' function applied to the 'SNMP – Generic OID Template'.

On the duplicated data source template, you have to change the data source title, the name to display in charts, the data source type [use DERIVE for incremental counters and GAUGE for instantaneous values], the specific OID and the SNMP community. Do this for all three values.

Using the three new datasource templates, create a chart template for ‘MySQL Activity’. That’s a bit more complicated, but it boils down to the following procedure, repeated for each of the 3 data sources:

  • add a data source and associate a graph item [I always use AREA for the first one, as a background, and LINE3 for the others, but it's just a matter of taste]
  • associate labels with current or computed values: CURRENT, AVERAGE, MAX in this example

All the rest is really fine-tuning – deciding on better colors, whether to use autoscale or a fixed scale, and so on. By now, your graph template should be ready to use.

Note that for the incremental values ['DERIVE' type data sources] I've used titles such as 'Thousands of queries/5 min' – the 5 minutes come from the Cacti poller, which is set to query for data every 5 minutes. The end result is something like this:

On this real production chart you'll notice a few interesting patterns. For instance, at 3 o'clock in the morning there is a huge spike in all the charted parameters – indeed, a cron'ed script was provoking it. From time to time, a small burst of slow queries is recorded – still under investigation. What is interesting is that these spikes were previously undetectable on the load average chart, which looks clean and innocuous:

To conclude, SNMP is a valuable resource for server performance monitoring. Investigating specific parameters and displaying them in tools such as Cacti can often bring interesting insights into the behavior of your servers.

Some SNMP implementations in different programming languages:

  • Java: Westhawk's Java SNMP stack [free, with commercial support], AdventNet SNMP API [commercial, with a feature-restricted, non-expiring free version], iREASONING SNMP API [commercial implementation], SNMP4J [free and feature-rich implementation – thank you Mathias for the tip]
  • PHP: client-only supported by the php-snmp extension, part of the PHP distribution [free]
  • Python: PySNMP is a Python SNMP framework, client+agents [free].
  • Ruby: client-only implementation Ruby SNMP [free]

[1] If you're running Debian, Cacti comes with apt, so it's a breeze to install and run [apt-get install cacti].

[2] A bit out of the scope of this article: SNMP also allows writing values on remote servers, not only retrieving monitored values.

[3] Replace [hostname] with the server hostname and [community_name] with the SNMP community – the default being 'public'. The SNMP community is a way of authenticating a client to an SNMP server; although the system can be used for pretty sophisticated stuff, most of the time servers have a read-only, passwordless community, visible only on the internal network for monitoring purposes.

[4] In fact, a commercial implementation of SNMP for MySQL does exist.

[5] The procedure described here applies to Cacti v0.8.6c.

Written by Adrian

March 5th, 2006 at 5:27 pm

Posted in Tools


Aggregating webservers logs for an Apache cluster

one comment

One of the ways of scaling a heavy-traffic LAMP web application is to transform the single server into a cluster of servers. Some may opt to walk the easy path by using an overpriced appliance load balancer, but the most daring [and budget-constrained] will go for free software solutions such as pound or haproxy.

Although excellent performers, these free balancers lack lots of features when compared with their commercial counterparts. One of the most embarrassing misses is the lack of flexibility in producing decent access logs. Both pound (LogLevel 4) and haproxy (option httplog) can generate Apache-like logs in their logfiles or the syslog, but neither offers the level of customization found in Apache. Basically, you're left with using the logs from the cluster nodes. These logs present a couple of problems:

  • the originating IP is always the internal IP of the balancer
  • there is one log per node, while log analysis tools usually expect a single log file per report

The first problem is relatively easy to solve. Start by activating the X-Forwarded-For header in the balancing software: for instance, configure haproxy with option forwardfor. A relatively unknown Apache module called mod_rpaf will then take care of the tedious task of extracting the remote IP from the X-Forwarded-For header and copying it into the remote-address field of the Apache logs. For Debian Linux fans, it's nice to note that libapache-mod-rpaf is available via apt.

Now that you have N realistic Apache weblogs, one per cluster node, you just have to concatenate them into a form understandable by your log analysis tools. Simply cat-ing them into a big file won't cut it [arf], because new records would appear in different regions of the file instead of appending chronologically to its tail. The easiest solution is to sort these logs. Although I am aware of the vague possibility of sorting on the Apache datetime field, even taking the locale into account, I confess my profound inability to find the right combination of parameters. Instead, I chose to add a custom field to the Apache log, using the following log format:


LogFormat "%h %V %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Cookie}i\" %c %T \"%{%Y%m%d%H%M%S}t\"" combined

where %{%Y%m%d%H%M%S}t is a projection of the current datetime into an easily sortable integer – for instance 20050925120000, the equivalent of 25 Sep 2005 12:00:00. Now, taking the quote as a field separator in this log format, it is easy to sort on this custom field [the 10th quote-delimited one]:

sort -b -t "\"" -T /opt -k 10 /logpath/access?.log > /logpath/full.log
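Since each node's log is already in chronological order, GNU sort's merge mode should produce the same result much more cheaply – a hypothetical variant of the command above:

sort -m -b -t "\"" -T /opt -k 10 /logpath/access?.log > /logpath/full.log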

And there you are, with a nice huge log file to munch on. On a standard P4 with 1 GB of RAM, it takes less than a minute to obtain a 2 GB log file…

In case the web traffic is really big and the log analysis impacts the existing web activity, use a separate machine instead of overloading one of the cluster nodes. For automated transfer of the log files, generate SSH keys so the web-analytics server can log into the web-logfile owner account on each cluster node without a password. To minimize the traffic between these machines, install rsync on them and run it over ssh:

rsync -e ssh -Cavz www-data@node1:/var/log/apache/access.log /logpath/access1.log
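Pulling from all the nodes is then a one-line loop away [node names and count are placeholders, matching the access?.log glob used above]:

for i in 1 2 3 4; do rsync -e ssh -Cavz www-data@node$i:/var/log/apache/access.log /logpath/access$i.log; done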

Now you know all the steps required to fully automate the log aggregation and its processing. One may ask why all the fuss, when a simple subscription to an ASP-style web analytics provider should suffice. True, however… the cluster that I've recently configured with this procedure serves a few million hits per week – and yes, we're talking about page hits. At this level of traffic, the cost of a web analytics service starts at $10,000/year. That's a nice amount of money, which would instead allow you to afford your own analytics tool [such as, for instance, Urchin v5] and keep some cash from the first year. Some might say that these commercial tools have their own load-balancer analysis techniques. Sure, but it all comes at a cost. In the case of Urchin, you just saved $695/node – and earned some bragging rights with your mates. Relax and enjoy.

PS: Yes, we're talking about millions of page hits on a LAMP solution, not J2EE… Maybe I'll get into details on another occasion, assuming somebody is interested. Leave a comment, send a mail or something.

Written by Adrian

September 25th, 2005 at 8:20 pm

Posted in Tools


Sybase woes – and Jython saves the day

leave a comment

Until now, I've always had a certain respect for the Sybase database. Based on their history, I thought the product was a sort of MS SQL Server without the glitzy features – think Las Vegas without the lights, the cowboy boots and the Eiffel Tower. Which gives: huge crowds of fat tourists in tall, dull buildings. But you always have Celine Dion, right?*

Wrong! Sybase ASE is a large, enormous, huge piece of steaming… ummm… code. Ok, ok, I'm over-reacting a bit. In fact, Sybase is a very good database – if you are still living in the 90s and the only Linux flavour you are able to manage is RedHat ASE. Otherwise, it's a huge… you know.

Let's not talk about the extreme fragility of this… this product. You never know when it will crash on you for no specific reason. Sometimes it's the bad weather. Crash. Restart. Sometimes you flushed twice. Crash. Restart. And maybe, yes maybe, you spent more than 12 minutes on your lunch break. Crash.

Let's just talk about the JDBC drivers – latest version, downloaded from the site in a temporary moment of insanity. Man, this is cutting edge. The sharpest cutting edge you'll ever find – the most advanced JDBC drivers bar none! Except, of course, for the fact that these classes were compiled on that memorable day of 6 January 2002 [the day the last Sybase JDBC drivers were built]. But don't let such obscure details ruin your enthusiasm. Just download and use them and you'll be amazed by their unique features – it's the only JDBC driver which manages to put down DbVisualizer in different and innovative ways. I'm restarting the poor thing (dbvis) at least 2 or 3 times per day when working with Sybase.

Also, as a little quasi-undocumented tip: the letter d in jconn2d.jar does not mean development [drivers], as some of you would be inclined to think. In fact, it means debug, which is the abbreviation for 'put me in production, poor lousy bastard, and I'll start spewing reams of useless messages through all the logging APIs known to man and a few yet to be discovered, thereby instantly slowing your puny little software to a crawl'.

Ermmm… well, all this nice introduction just to humbly confess that we do have a couple of them Sybase licences floating around here, and some of our production databases are on Sybase ASE. While I can assure you that this is going to change, at least on the web backend, it's also true that the beast must be kept alive in order to keep our company up and running. And that's part of my job.

Which is of course very strange for an IT management job. But then again: when your sole DBA is overwhelmed by a horde of Sybase-specific tasks (like, for instance, the log configured to truncate at a specific threshold which naturally [for Sybase] does not truncate, suddenly throwing all the users into log suspend in the middle of peak production time) – you have to get into the damn kitchen and do some cooking!

My endeavour was to perform some simple data mining in order to migrate some reports from a legacy system [no, you don't really want to know which system]. Given my limited time, starting a mildly complex Java reporting project was out of the question, so a scripting language was the natural choice. Python was the first option – unfortunately, finding a working Sybase driver for Python is a challenge in itself. But thank God for Jython and zxJDBC! In just a few minutes I was wiring the tables for reporting. Here's a fake code snippet which lists the 'orders' with amounts < 1000 – you can't go any simpler:

from com.ziclix.python.sql import zxJDBC

# Connect through Sybase's jConnect JDBC driver
conn = zxJDBC.connect("jdbc:sybase:Tds:myserver.mydomain.com:4110/mydb",\
	"user","pass","com.sybase.jdbc2.jdbc.SybDriver")
cursor = conn.cursor()

# Total number of orders
cursor.execute("select count(orderid) from orders")
nb_orders=cursor.fetchone()[0]
print '%.0f total orders ...'%(nb_orders,)

# Every order under the 1000 threshold
cursor.execute("select orderid, amount from orders where amount <1000")
oids=cursor.fetchall()
for o in oids:
	print 'Order %.0f for amount %.2f'%(o[0],o[1])

cursor.close()
conn.close()
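To run it, the jConnect jar just has to be on the classpath – something along these lines [jar and script names are illustrative]:

CLASSPATH=jconn2.jar jython orders_report.py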

Easy as pie. And once you've got one down, you've got 'em all. For instance, yesterday evening I wrote, in about an hour, a small Jython script which exports some data. The same export process (running about an order of magnitude slower) had previously needed a couple of days of development in FoxPro. Ah, FoxPro – a very juicy subject, but I'll keep it for my next horror story. Until then, don't forget: when you have a monster to tame and no time at all, try Jython!

*Let's suppose for a very brief instant that having Celine Dion at a certain location is a positive thing.

Written by Adrian

February 2nd, 2005 at 8:19 am

Posted in Tools
