Monitor your stuff: Nagios
Nagios is a system monitoring tool that runs on Unix/Linux platforms, but will monitor practically anything – Unix, Windows, databases, environmental/HVAC, you name it. Nagios is good, and works well – but it isn’t easy to configure.
In previous assignments, I used Big Brother to monitor my systems and networks. At that time (a few years ago now…) Big Brother was open source. Big Brother was good, fairly easy to configure, and worked well. At some point the rights to Big Brother were acquired by a commercial firm (Quest Software, Inc.), which closed the project. I soured a bit on Big Brother after that.
At another time we used the supposedly “industrial strength” Foglight and Spotlight by Quest. These were billed as “enterprise” monitoring software packages. Spotlight was pretty cool. It had lots of pretty lights to impress pointy-haired bosses with. Foglight was not particularly impressive. First of all, it ran on top of an Access database. Yup, nothing says “enterprise” more than having an Access backend. We found the configuration of monitored client systems to be cumbersome and not flexible enough to give us what we wanted in every situation. Did I say these cost thousands of dollars to purchase, and thousands of dollars to keep in maintenance? Yes, thousands. So with Foglight we had less than what we had with Big Brother, and paid for it. Paid a lot for it. They say that later versions of Foglight are “New! Improved!” Well, maybe they are. I’m not buying it.
In my new position there was no monitoring infrastructure. With 2 people and dozens of (undocumented) systems and servers, we had no idea which servers were up or down. Customers would tell us when things were broken.
I needed a monitoring tool that was:
- Free, both as in money and as in speech
- Capable of monitoring any type of system (Unix, Windows, Oracle, etc.)
- Rapid to install, rapid to deploy
After some research, I found Nagios. Nagios seemed almost like a spiritual heir to Big Brother. For someone who grew up with Big Brother, Nagios has a familiar feel to it. The default install was straightforward (at least on CentOS 4). The install instructions were accurate and good.
When it comes to telling Nagios about the clients to be monitored, though, there are so many options that it was complicated almost to the point of incomprehensibility. I got a few of the server-side tests to work, but many of the more useful tests run on the client (invoked on demand by the Nagios server). In addition, there are non-default options in Nagios that give more information, as well as add-on packages to extend Nagios (by adding the graphing of trends, say). I didn’t have time to figure it all out from scratch.
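To give a flavor of what one of those tests looks like: a Nagios check is just a small program that prints one line of status and exits with 0 for OK, 1 for WARNING, 2 for CRITICAL, or 3 for UNKNOWN. Below is a minimal sketch in Python of a disk-usage check. The function names and the 80/90 percent thresholds are made up for illustration; this is not one of the stock plugins.

```python
import shutil

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(percent_used, warn=80.0, crit=90.0):
    """Map a disk usage percentage to an exit code (thresholds are illustrative)."""
    if percent_used >= crit:
        return CRITICAL
    if percent_used >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (exit_code, status_line) for the filesystem holding `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # Anything we cannot measure is UNKNOWN, not OK.
        return UNKNOWN, f"DISK UNKNOWN - cannot stat {path}: {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero size"
    percent = 100.0 * usage.used / usage.total
    code = classify(percent, warn, crit)
    return code, f"DISK {LABELS[code]} - {path} is {percent:.1f}% full"


def main(argv):
    """Print the status line and return the exit code for the given path."""
    path = argv[1] if len(argv) > 1 else "/"
    code, line = check_disk(path)
    print(line)
    return code
```

To run it as an actual check, the file would end with `import sys` and `sys.exit(main(sys.argv))`, so that the exit code reaches whatever invoked it.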
Time for a big fat book.
I was in a hurry, so I headed down to the local Barnes and Noble, and lo and behold I found Nagios System and Network Monitoring by Wolfgang Barth. (Which was kind of amazing, since the computer section has shrunk over time to maybe 20% of its former size.) For me this was a great book. Barth explains Nagios starting from entry-level concepts and configuration, and then works his way up to the more complicated material. I just followed along, and after a while (a long while) I had my monitoring set up in a useful fashion. I now monitor about 100 servers, which host about 400 different monitored services. Not a huge installation, but not a trivial one either. I’d say I put in about 80 hours of configuration and scripting to get this going. Now, I don’t spend much time on it. Basically, when a host is added or removed, you have to tweak the configuration to reflect the changes to your topology.
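That tweak is mostly a matter of adding or deleting a `define host` block, so it lends itself to scripting. Here is a minimal sketch, assuming you keep your hosts in a simple name-to-address mapping and that your configuration has a `generic-host` template to inherit from (the sample configuration I have seen ships one, but check yours). The host names and addresses are invented.

```python
def render_host(name, address, template="generic-host"):
    """Render one Nagios host object definition as text."""
    lines = [
        "define host {",
        f"    use        {template}",   # template to inherit defaults from (assumed to exist)
        f"    host_name  {name}",
        f"    address    {address}",
        "}",
    ]
    return "\n".join(lines)


def render_hosts(hosts):
    """Render all hosts, sorted by name so the output file diffs cleanly."""
    blocks = [render_host(name, addr) for name, addr in sorted(hosts.items())]
    return "\n\n".join(blocks) + "\n"


# Hypothetical inventory; in practice this might come from a spreadsheet or CMDB.
EXAMPLE_HOSTS = {
    "web01": "192.0.2.10",
    "db01": "192.0.2.20",
}

example_config = render_hosts(EXAMPLE_HOSTS)
```

Writing `example_config` to a file that the main Nagios configuration includes, and then reloading Nagios, would pick up the change. Service definitions could be generated the same way.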
One thing that I suggest is that you always install a pair of Nagios servers. One will be the primary, the main one that watches everything else. The second one only watches the primary. If you don’t do this, then when the primary fails, there’s no one left to report the failure or any subsequent failure of another system. Your monitoring just silently goes off line. Not good.
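The check the second server runs against the primary can be very simple. As a rough sketch, assuming the primary serves its web interface on a TCP port you can reach (port 80 here is an assumption), the secondary could just try to connect. Note that this only proves something is listening on that port; it does not prove the Nagios daemon itself is healthy, so treat it as a starting point.

```python
import socket

OK, CRITICAL = 0, 2  # Nagios plugin exit codes


def check_primary(host, port=80, timeout=5.0):
    """Return (exit_code, status_line) based on whether host:port accepts a TCP connection."""
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return OK, f"PRIMARY OK - {host}:{port} accepts connections"
    except OSError as exc:
        return CRITICAL, f"PRIMARY CRITICAL - {host}:{port} unreachable: {exc}"
```

It is worth having the secondary send its alerts by a route that does not depend on the primary, or the warning may never arrive.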
A wise person once said that “Linux is only free if your time has no value”. That’s certainly true about Nagios. On the other hand, I had no budget for monitoring, so all I had was time. And like I said, in the past when we bought Foglight we spent something like $25,000 on the initial install, and then thousands, every year, forever, for maintenance. So 80 hours of labor doesn’t seem too bad.
Nagios works better anyway.