Wed, 10 Nov 2010
Zabbix GUIs and Automation
In the T-DOSE Zabbix talk, which I'm happy to say was both well presented and showed some interesting features, I got called out for a quote I made on Twitter (which just goes to show - you never know where what you said is going to show up and haunt you) about the relevance, and I'd say overemphasis, of the GUI to the Zabbix monitoring system - and to monitoring systems in general. Rather than argue with the speaker (for the record, I hate it when the audience does that) I thought I'd note my objections here instead.
My monitoring system of choice is Nagios. It's starting to get a little long in the tooth (where can I add a new host on the fly?) but it's survived this long because it got a lot of things right. Including its loose coupling and the fact it can read directories of config files. Zabbix, and to a degree Hyperic (which has a command line interface that only a satanist could love), are GUI focused tools (let's ignore auto-detect for now). To add a host you click around. To add another host, you click around. To add a new check you click around. To add a new group you click around. To... you get the idea.
Now that might have been fine a couple of years ago (well, not really) and it's an easy, intuitive way to add new config (well, not in Hyperic's case), but it bothers me on two levels: firstly, I like the Unix approach where nearly everything is a file; secondly, I no longer have a handful of hosts, I have hundreds to thousands of the things - all spread around multiple production sites for all the usual reasons (load distribution, geographical locality, resilience and so on) - and I generate all the configs for those from my Puppet modelling.
Using the right Puppet modules my servers know about my clients, my clients can aggregate their services and everything stays in sync. While Zabbix allows you to define templates and associate checks using groups (which Nagios can also do), that's the wrong level for me. My servers have lots of traits I need applied to them - monitoring, trending, logging and so on - and I want to define that once and have the artifacts actioned where they're needed, not have to work around half an API or click through a GUI. To be honest, the fact that the API seems like such an afterthought bothers me (possibly unreasonably so), as I think it shows a community with different needs to mine.
And now on to using the GUI to actually display information. From the presentation I understand that the Zabbix team are moving in the direction I consider correct - anything you can do from the GUI you'll be able to do from the API. Your monitoring system (and your trending systems) are too important to only access in the way other people think you should. The information in them needs to be presented in different ways to different audiences - and here too I think Nagios (with a little help from MK Livestatus and Nagvis) is currently doing an OK job. It's extensible, I have a full query language for retrieving monitoring state and I can convey the information on a screen that highlights my information in the way I like - without making people use the still very unloved Nagios CGIs. Forcing me through a single GUI with no comprehensive API access (other than raw SQL) is a losing bet for me.
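As a rough illustration of what that query language buys you, here's the kind of one-liner I mean - a sketch only, with the Livestatus socket path assumed rather than known (it depends on how the broker module is configured):

    # list the currently critical services straight from the Livestatus socket
    printf 'GET services\nColumns: host_name description plugin_output\nFilter: state = 2\n' \
        | unixcat /var/lib/nagios/rw/live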
Hopefully some of that explains my issues with monitoring in general and Zabbix (as it is at the moment) in particular. You may not agree - but with the tool chain I have these days, I think a nicer interface without full API access is a bug, not a feature.
Thu, 03 Jun 2010
Obese Provisioning - Antipattern
One antipattern I'm seeing with increasing frequency is that of obese (or fat, or bloated) system provisioning. It seems as common in people that are just getting used to having an automated provisioning system and are enthusiastic about its power as it is in longer term users who have added layer on layer of cruft to their host builder.
The basic problem is that of adding too much work and intelligence to the actual provisioning stage. Large postrun sections or after_install command blocks should be a warning sign and point to tasks that may well be better off inside a system like Puppet or Chef. It's a seductive problem because it's an easy way to add additional functionality to a host, especially when it allows you to avoid thinking about applying or modifying a general role; even more so if it's one that's already in use on other hosts. Adding a single line in a kickstart or preseed file is quicker, requires no long term thinking and is immediately available.
Unfortunately, by going down this path you end up with a lot of one-off host modifications, additional behaviour that is nearly (but not quite) common across hosts, and a build process that's difficult to refactor. A tight coupling between these two stages can make trivial tasks unwieldy and, in some cases, forces extra work to remove or modify the change for day-to-day operation once the build has completed.
A good provisioning system should do the bare minimum required to get a machine built. It should be lean, do as little as possible and prepare the host to run its configuration management system. Everything else should be managed from inside that.
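As a sketch of what "bare minimum" means in practice, here's a Red Hat style kickstart %post that does nothing but hand the box over to Puppet - the package name, server name and config layout are illustrative, not a recommendation:

    %post
    # install the configuration management agent and point it at the master;
    # everything beyond this happens inside Puppet, not the provisioning system
    yum -y install puppet
    cat >> /etc/puppet/puppet.conf <<'EOF'
    [agent]
        server = puppet.example.com
    EOF
    chkconfig puppet on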
Tue, 12 Jan 2010
Spreadsheets Vs Post-It Notes
I'm a fan of documentation; over the years I've ended up supporting more than one business critical system that came with next to no documentation at all.
The only downside - and I've been bitten by a couple of things like this over the last week - is the case of the spreadsheet vs the post-it note: if you have a lovely, well formatted and information-dense spreadsheet that says "A is 1", and when you get to the server room the switch has a post-it, in bad scrawl, that says "B is 2", which do you believe?
Fri, 28 Aug 2009
Find and replace interview question
We've recently been searching for a junior sysadmin to join the team (and I'm very happy to say we've now succeeded) so as part of my day to day tasks I had to come up with a dozen simple questions to weed out the people that have never used anything but webmin (and there is a surprising number of them out there). One of the questions seemed to cause a lot of trouble in the general sense and tripped up the few who even made an attempt -
"How would you change all occurrences of 10.23.34.10 to 10.23.34.101 in a text file?
While most of the candidates failed this one on account of skipping the question, we had a couple make attempts (oddly, all using the vim replace syntax). While I was hoping for a little bit of sed or perl, a solution's a solution. Unfortunately their answers all had edge cases.
First was the obvious trick in the question. All the answers (apart from one) came back with something like %s/10.23.34.10/10.23.34.101/. First up we have the missing boundary test: while that regex is fine for 10.23.34.10 it breaks quite badly on 10.23.34.101 (it replaces it with 10.23.34.1011). Secondly, the dots are unescaped, which means it's not only going to match the IP but also strings such as 10x23y34z10, since each unescaped dot matches any character.
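The sort of answer I was fishing for is only a line of sed or perl - a sketch, with GNU sed assumed for the \b word boundaries and in-place editing, and the filename made up:

    # escape the dots and anchor the match so 10.23.34.101 is left alone
    sed -i 's/\b10\.23\.34\.10\b/10.23.34.101/g' hosts.txt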
I was actually really surprised at how few of the admins could write an awk / perl / shell script and how many paid very little attention to what the regex actually matched. Still, while the question seemed easy enough, it had enough awkwardness to enable further discussion about regexes, data checking and the evil of manual changes. And it showed that everybody likes vim :)
Thu, 27 Aug 2009
Large uptimes - a wonderful problem to have
When it comes to the list of problems 'our uptimes are too high' isn't normally in the top five that sysadmins dread.
While having a lengthy uptime used to be a boasting point, it can also hide technical issues - such as kernel upgrades you've applied but never booted into (unless you're running something special like Ksplice), confidence gaps in high availability systems (when was the last time you did a failover?) and a general worry that what's running on a host now may not be what comes back when it's restarted.
The solution? Embrace the occasional controlled reboot and exercise those HA systems. After all, any machine that can't be rebooted without the customers noticing is a strong candidate for being a single point of failure.
Wed, 19 Aug 2009
Testing A Production DNS Re-point
We recently consolidated a number of websites used by one of our brands back down to a sensible number (sensible being one). Which, while only a single action point on an email, turned out to be a large amount of DNS and apache vhost wrangling. In order to give myself a safety net, and an easy to check todo list, I decided to invest ten minutes in writing a small test script.
Despite all my best intentions, and my experiments with testing DNS using Cucumber and RSpec, when the issue came up for real - and on a short deadline - I fell back to old habits and reached for Perl. Net::DNS is an excellently documented module, and little utility modules like File::Slurp and URI::Title are perfect for quick tasks like this. The full test_dns_repoints.pl code, which is only 47 lines, is something I can see myself using again and again as we repeat this kind of work.
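The Perl script itself isn't reproduced here, but the underlying idea is nothing more than "resolve each name and compare it to what I expect" - something like this shell sketch, with the hostnames and addresses invented:

    # check each hostname resolves to the address we expect after the re-point
    while read name expected; do
        actual=$(dig +short "$name" A | head -n 1)
        if [ "$actual" = "$expected" ]; then
            echo "OK   $name -> $actual"
        else
            echo "FAIL $name -> ${actual:-no answer} (wanted $expected)"
        fi
    done <<'EOF'
    www.example.com 192.0.2.10
    shop.example.com 192.0.2.10
    EOF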
It's amazing how much a little effort can gain you when it comes to testing your infrastructure. Having that safety net, especially when you're a one-man team, is a very reassuring feeling, and it's one I'm trying to introduce into more areas - even one-off jobs.
Fri, 03 Jul 2009
By Puppet or Package
At work we both build our own packages and use Puppet to manage our servers. While the developers package up their work, in the systems team we've moved more towards deploying programs and their dependencies via Puppet.
While it seems easier, and quicker, to do the pushing that way, at least for scripts, you lose the ability to track what's responsible for putting each file on the system. I'm probably already modelling the more complex parts of what would be in a package (such as services and cronjobs) in the module and thanks to Puppet I'm probably doing it in quite a portable way. Is this actually better than using packages? It's certainly easier than building complex packages but it quickly gets awkward when you start needing to deploy compiled binaries or apps with lots of moving parts.
For now my rule seems to be: use Puppet for small, non-compiled apps and package up anything with lots of dependencies or that needs to be compiled. How do you deploy your infrastructure scripts and supporting artifacts?
Wed, 03 Jun 2009
It's been Critical for how long?
Nagios has a wonderful 'duration' column in its web interface that's always bemused me. At what point does a check being in a warning, or even worse, a critical state stop being a problem worthy of head space and start being normal operating procedure?
Checks can stay in an extended broken state for many reasons but they all seem to be symptoms of a larger problem. If it's a small thing then are you getting enough time to do housekeeping? If it's a big thing do you have enough business buy in to keep things running optimally? Are you monitoring the wrong thing? Is there even anything you can do to fix it? If not then maybe Nagios isn't the best place to put the monitoring, maybe a status report is a better place.
Wed, 04 Feb 2009
Nagios check_http flaps
We recently had an odd one where the Nagios check_http check, which was both checking for the presence of a string in the response and that the page loaded in a certain time frame, went from reporting a 'CRITICAL - string not found' to a 'HTTP WARNING: HTTP/1.1 200 OK'. My first thought, as this was a site pending migration, was that the URL had moved to a slower machine with the fixes released to it. Alas, it's seldom that obvious.
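For context, the check was along these lines - a reconstruction rather than the real command, with the page, string and thresholds invented:

    # warn if the page takes more than 5s, go critical after 10s or if the string is missing
    check_http -H www.example.com -u /somepage -s "Expected content" -w 5 -c 10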
It turns out that somewhere in the Nagios check a slow page that exceeds the -w option's threshold overrides the fact that the string is missing, even though that's a warn replacing a crit. Bah.
Tue, 03 Feb 2009
Splitting Syslogs by Facility
Logs are a wonderful thing. If done correctly they point out the source of all errors, show you what's running slow and contain useful information on how your system is running. At every place I've ever worked they've been busy, full of odd one-offs and too often overlooked.
I'm going to be doing a fair bit of log processing next week so expect lots of little toolchain scripts like syslog-splitter.pl to be checked in to git and mentioned here.
syslog-splitter takes a logfile as an argument and breaks it in to many smaller units, one file per facility (each containing all the lines for that facility from the logfile), to make it easier to process. I seem to invoke it followed by wc -l out/* | sort -nr when on new machines to work out where I need to invest some time. Over the next week or so I'll come back to the topic and show how I'm reducing the noise to help me find the important lines.
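The script itself lives in git, but the core of it is simple enough to sketch in a few lines of shell and awk - this assumes the facility has been written into each line as the first field (an rsyslog or syslog-ng template can do that), which won't be true of a default syslog format:

    # split a logfile into one file per facility, assuming the facility is the first field
    mkdir -p out
    awk '{ fac = $1; gsub(/[^A-Za-z0-9._-]/, "_", fac); print > ("out/" fac) }' messages
    wc -l out/* | sort -nr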
Mon, 19 Jan 2009
My Pet Puppet Hate - Adding New Types
Now that Chef is out and about, people who accepted Puppet as the massive improvement over all the existing host configuration managers will probably be casting a wary eye its way.
I've got a little too much in Puppet at my current position to look at moving for a while yet, but now the competition is rising it's time to get my boot in and point out what, for me, is the worst part of Puppet: how difficult it is to add new types.
One of the greatest strengths of tools like Nagios and Munin is the community-provided tools. Nagios has a decent selection of plugins out of the box, but a quick google or a check of NagiosExchange shows dozens of additions (including some of my own Nagios Plugins).
With Puppet on the other hand, once you reach the point where you want to write custom types it all gets very heavy, very quickly. The biggest issue, to me at least, is that the level of abstraction feels wrong. Adding a simple type that adds a line to a config file (such as /etc/sysctl.conf) should be an easy task, but the lack of documentation and the different approaches taken by the existing types (which seem to have been written at very different times and feel quite different) make them awkward to crib from. If instead there was a simple type where you changed, for example, the filename and the separator, then a lot of custom types would be within reach of less Ruby-skilled users.
On the flip-side my current hope is the Augeas type. It understands a lot of config files, provides consistent access to add and append to them and can be wrapped in defines.
Mon, 12 Jan 2009
It's been a while since I've been involved in pen testing but the above quote from Ivan is perfect and its meaning all too often overlooked. When you invest the time in something like pen testing or performance tuning you should always come away with a list of actionable tasks.
By doing this you ensure the work wasn't pointless (or, if it was, you avoid repeating the mistake) and you have something you can present to stakeholders to get buy-in for next time. It's also easier to automate some of the scut work if you have a solid list of tasks and outcomes.
On the flip side it's also worth considering how actionable some of your other automated processes are. Does every Nagios error have a solution to resolve it? Do actions emerge from your graphs or do they just add background noise?
Tue, 12 Aug 2008
You've gathered the requirements, written the code, debugged it, received the new requirements, rewritten the code, got more change requests, reached a 'compromise' with QA (and hidden the bodies) and now you want to have the sysadmins do the release.
Don't be like everyone else - when it comes to releases too many people fail at the last mile and make obvious mistakes. In an attempt to save myself some pain (and have something to point co-workers at) here are some of the software release principles that I hold dear.
Out of hours releases will have adequate support
Or as I like to think of it - "out of hours releases will hurt you as much as they will me. And a little bit more." If the release is important enough to require me in the office late at night or over the weekend then it's important enough to have development support and a manager present Just In Case. It'll also force people to be a little more considerate of my time and availability.
No live release will happen after 4pm (at the latest)
There's nothing quite as frustrating as getting two thirds of the way through a live release, hitting a problem or needing clarification of something that the staging environment didn't pick up (yes, I know it should have. Let's fix it for next time) and discovering it's 6pm and everyone else is already on the tube or in the pub.
You then have the pleasure of either backing the release out (if you actually can) and explaining why you killed the scheduled release or hanging around with half an upgraded system waiting for someone to get your voice mails and call you back. Which is even less likely to happen if you ignore...
No release will happen on the day before a non-work day.
Or the day the lead developer goes on holiday.
"We're nearly done. Can you get $dev to have a look at this line in the application error log please?" "Actually he's in Peru for the next three weeks. I'll get someone else who's never seen the system before to confirm that everything's fine." Apart from the obvious sign this is a made up conversation (application error logs that contain information - HAH!) I've been bitten by this a number of times. It always seems to end with some other poor developer with a postage stamp of hand over notes looking sheepishly at me while explaining that the log line could never happen.
You'll provide me with a list of what's changed
When you're developing you should maintain decent change logging above and beyond 'commit -m'. I'd like the world to agree that commit messages are for developers and release notes are for sysadmins; let's pretend I'm not paranoid enough to read the commits list anyway.
If you're using one of the agile methodologies that uses stories for everything then feel free to put the story number in the notes to provide some background. However, I still expect a one sentence summary of each change.
If you don't have a decent, and comprehensive, list of changes expect me to get... inquisitive about undocumented changes. I will diff the packages (better version of this coming soon) or the source code (and if I ever get the time to look at SQL::Translator I'll be aggressively double checking your schemas). If you don't mention it I can't prepare (and add monitoring) for it, QA can't test all the new paths and I'll make a point of it in the release retrospective meeting.
It should be possible to stagger releases across machines
I'm a fan of the one, few, many approach to software releases. I want the ability to roll out the new system in chunks. I should be able to break off a couple of web servers so I have a warm standby just in case something goes wrong. I know this gets difficult once you involve databases, but it's still a goal that should be considered - especially with read only copies of the data, replication slaves and data snapshots in the tool belt.
So in closing, I'm a demanding bugger. However, just like my Cron Commandments post, it's nice to have this list somewhere online to point people at.
Sat, 21 Apr 2007
Here's one that has been bothering me. Suppose you have a recurring problem that your "autonomic solution" can handle every time it occurs without anyone knowing. At what point does the fact there is a treatable issue propagate up to a real person?
While an automatic "fix and tell me later" approach helps change your work from fire fighting to planned tasks, what classifies a temporary problem as being important enough to warrant investigation? It's hard enough to justify preventive maintenance with the current systems; if it fixes itself then you may never be given the time to investigate further.
If a problem fixes itself before anyone notices or a sysadmin can look at it, is it a problem?
Handling Requests: Three Simple Rules
I'm a sysadmin; half my working life seems to be spent handling other people's requests (which is why I'm trying to move over to infrastructure work - where I can hopefully concentrate on something for three whole minutes). While chatting with a junior admin at a tech talk in the week, the following three tips came up:
Use a ticketing system. This one comes up a lot but it's true: never dropping someone's request is well worth the time spent setting it up.
Customers sending requests to individuals is a BAD THING. People go on holiday, they get dragged in to meetings, they work on projects. Which of those do you think someone who's been waiting for a request will accept as an excuse? None of them. And telling them that it's their own fault is a great way of annoying them even more - even if it is true. Training your users to reply to all (so follow-ups also get tagged by the ticketing system) and not to send "just a quick question" mails to their favourite sysadmin helps you keep an eye on the workload while ensuring that things can't fall through the cracks. Even if it's an often repeated uphill struggle.
There is a caveat to this one. If you've got the resources it's often helpful to assign a sysadmin to a new employee for their first couple of days. Asking those awkward new starter questions is a lot easier face to face than on a mailing list of who knows how many. Any requests can then be added in to the ticketing system while the sysadmin is present, showing the starter how to use it, and that the admins actually pay attention to and process tickets. Nothing beats a good first impression.
Lastly, people have an expectation of how long something should take. If you break this unwritten rule, even for a good reason, they'll notice and it'll be used against you at some future point. While it's not ideal for concentration, quickly completing short tasks like password changes can make a huge difference in how your team is perceived.
Tue, 17 Apr 2007
No one likes a whinger - The systems fight back
After my little whine I logged in to do my last checks for the evening to discover that one of our webservers had died due to a hard drive going bang, our production environment Nagios box had lost one of its network connections and a chunk of our SAN kit was complaining about power issues. It turns out that most of these were due to a power surge that killed a network switch and three of the racks' power strips. On the very plus side, no one outside of the systems team noticed. Resilience is a wonderful thing when you get it right.
Woke up this morning, checked the Nagioses (Nagii?) and found out that one of our other products' database servers had gone boom (my fellow sysadmins were fixing that one) and the failover had mostly worked. No interesting logs, no hardware problems and a three hour gap in syslog (and only syslog) to help explain the outage.
What have I learned? That the production servers read my blog. And they hate me.
Tue, 27 Mar 2007
Bonded | Teamed Network Interface Challenge
Here is another one for the sysadmins in the audience:
... many of your servers have multiple network ports in the back?
... many of them have bonding (teaming for the Windows people) enabled?
... do you know when one interface goes down if the machine stays connected?
... long does it take for you to be notified?
... do you know if they start flapping?
... many have their bonded interfaces plugged in to different switches?
... do you know if someone mistakenly plugs both in to one switch? (A quick check for a few of these is sketched below.)
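For the link state questions at least, the Linux bonding driver will tell you most of what you need - a quick sketch, assuming Linux bonding and the usual /proc interface:

    # per-slave link state and failure counts for every bonded interface
    for bond in /proc/net/bonding/*; do
        echo "== $bond =="
        grep -E 'Slave Interface|MII Status|Link Failure Count' "$bond"
    done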
I've got a fun week ahead.
Mon, 26 Mar 2007
Monolithic Config Files Considered Harmful^WAwkward to Manage
This came up in conversation with a developer at the Google OpenSource Jam so I thought I'd mention it while it is fresh in my mind (update: at which point I forgot to move it to the published directory. Doh). Breaking up config files isn't done just to annoy people, it's done to make automated and mass management easier.
A solid practical example is the Debian Apache configs. Historically most distros (and too many current ones) used a single config file for Apache. While this made interactive editing easier by presenting all the options in a single place (and in a sensible order) it made it very hard for the package management software (or automation aficionados) to add a module or virtual host without some hairy scripting. Removing settings when a package is removed or updating a small chunk of the config in an upgrade is even more painful; as for preserving local changes - haha.
By breaking the config out in to a number of files and directories and combining them at run time, adding a new vhost or module config becomes just a file drop and possibly a symlink (often used when the configs are order dependent). This is easier for third party packages to perform and makes provisioning additional apps a lot easier.
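In practice that means adding a site is a couple of commands rather than an edit in the middle of a thousand-line file - the paths and names below are illustrative:

    # drop the vhost fragment into sites-available and symlink it into sites-enabled
    cat > /etc/apache2/sites-available/shop.example.com <<'EOF'
    <VirtualHost *:80>
        ServerName shop.example.com
        DocumentRoot /var/www/shop
    </VirtualHost>
    EOF
    a2ensite shop.example.com
    /etc/init.d/apache2 reload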
So what's the main downside? Debugging. An "Error on line 50" is harder to track when line 50 could be in one of twelve files. But with a little forethought debug messages can be updated to show all the useful information. So next time you're writing an app of many parts please spare a thought for the sysadmins and make it easily manageable.
Sat, 10 Mar 2007
Importance Levels - A Simple Example
When you're first introduced to an environment you'll have the ever fun task of working out which machines should get the most time; and that order seldom matches which machines actually need the most attention. To help me prioritise I've worked out a simple importance rating system to show where I spend my time.
Below is a simplified version. I use it to assign a single importance number to each machine, and then I allocate a certain amount of time each day to work on the issues, requests and improvements I've got in my todo list for that level. When I've run out of time I move down a number and start working on anything related to machines rated at that importance. The amount of time I put aside for each level decreases as I work towards one.
5: Customer facing systems that generate revenue.
This is my no brainer. Pretty much everything is secondary to keeping the money coming in.
Examples: customer database, webservers and databases related to customer payments.
4: Internal Money Makers and Customer Visible Systems.
I normally put customer facing systems that don't make money in this bracket. An online presence and a reputation for availability have been important to most of the companies I've worked at. It sounds horrible, but it's a lot easier to save face and beg forgiveness after a five-day internal outage than a one-day external one; well, sorta. If you're a blogger-watched company then this is even more important.
I also put internal money makers at importance 4. "Cash is king" should be true in all departments, including those where Sysadmins dwell. I've only ever had simultaneous problems with both types of importance four systems a couple of times. Each time had circumstances that made the priorities clear.
Examples: corporate website, company blog, invoicing systems, time-sheets at month end.
3: Systems that stop a number of staff working.
I typically put machines that don't directly contribute to the bottom line but are required for the company to continue in this bracket. A short outage of any machines at this level can be survived for a little while but it'll slow a lot of staff down, cause frustration and (after a while) cause major damage.
Examples: internal request tracking/ticketing, bug tracking, build machines, version control
2: Systems that hinder small numbers of staff.
This is another level that I use to cover two types of machine. The first type slows or hinders a number of staff but can be lived without. You can think of these as convenience or favour systems that make tasks easier or more pleasant. You'll often get a disproportionate amount of queries when one of these goes down. This is a good sign and shows you understand what your users care about.
When I'm asked to help with desktop support I lump single user problems here. Although it's frustrating to have a single person unable to work it's often not as bad as any of the higher levels. I put a lot of special cases and caveats here (sales people on presentation days, QA engineers before a release) but the most sensible workaround is to separate desktop and sysadmin roles. You can typically hire desktop support staff cheaper than a sysadmin and give them the opportunity to train with the sysadmins when things are quiet.
Examples: Web front-end to a version control server, centralised log shares for debugging, departmental wikis, individual laptops or desktops.
1: Play / scratch machines that no one really cares about.
Not much lives in this level, if no one cares if it's up or not then you should seriously consider turning it off. The smaller and simpler your environment the easier it'll be to manage.
Examples: sysadmin "play" lab environment, company jukebox
And now some warnings - these categories are (obviously) not perfect. The ratings are host-centric (but can be pulled up a level and applied to clusters or groups of identical machines) and they don't factor in office politics (some systems are loved by certain members of management and should be treated like one of their loved ones).
It's also worth noting that some systems rise in importance at certain times; examples are month end batch reports, time-sheet systems when invoices are due etc. It shouldn't be too hard to work out most of these (typically cyclical) requirements after speaking to the other staff. Asking about their requirements is always a good way to help build bridges and show you do understand that the systems are there for a reason; to be used.
The Cron Commandments - part 1
Although it's a rare Unix machine that doesn't run at least a couple of custom cronjobs it's an even more special snowflake that does them properly. Below are some of the more common problems I've seen and my thoughts on them.
Always use a script, never a bare command line.
A parenthesis wrapped command-line in a crontab sends shivers down my spine. Nothing says "I didn't really think this through" and "I've done the bare minimum to make it work" in quite the same way.
Don't shout about success
A cronjob that completes successfully shouldn't post anything to stdout or stderr. Most developers have no idea how annoying it is to get a single line email every minute proclaiming all's well. It also trains people to delete messages with certain subject lines without reading them, which'll catch you out when a real failure finally does need attention.
Caveat 1: Logging that the script finished, and adding some timing information, can often be useful. It's good to have an audit trail of what actually ran and how long it took. By logging to syslog you gain the benefits of centralised logs (you are centralising your log files, right?) and, because it's passive, the sysadmin doesn't get notified about expected completions unless she looks for them.
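A sketch of that passive approach - the tag and the work itself are placeholders:

    # record completion and runtime in syslog instead of emailing it to everyone
    start=$(date +%s)
    do_the_actual_work          # placeholder for the real job
    logger -t nightly-job "completed OK in $(( $(date +%s) - start )) seconds"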
Debug information should be an option
A script invoked via cron has a different environment from one run at the command line, so it'll work (and break) in different ways - which you'll want to see. It should be possible to enable additional debugging without making any changes to the script itself: a command-line flag or environment variable should be enough to trigger the extra output. Often all you'll get is an email with the error and the debug information, so make sure you can diagnose from your own output.
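One simple way to do that - a sketch, with the flag and variable names as examples only:

    #!/bin/sh
    # enable extra output with "-d" or DEBUG=1 in the environment, stay silent otherwise
    [ "$1" = "-d" ] && DEBUG=1
    debug() { [ -n "$DEBUG" ] && echo "DEBUG: $*" >&2 ; return 0 ; }

    debug "PATH is $PATH"
    debug "running as $(id -un)"
    # ... the real work goes here ...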
Beware overrunning jobs
Almost all your cronjobs should check that another instance isn't already running and exit if one is - after logging the issue. I've lost count of the number of difficult-to-diagnose bugs caused by a cronjob starting, taking longer to finish than the interval between runs, and then having the next run pile in behind it. This often causes deadlocks, resource conflicts, maxed out database connections and corrupted data. Some very simple cronjobs don't need this, but when in doubt put it in. And log the fact; this can help pick up growth trends ("it took 2 minutes until we added the extra users").
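The check itself doesn't have to be clever - a sketch using flock(1), with the lock file and tag names made up:

    # refuse to start if the previous run hasn't finished, and log the fact
    (
        flock -n 9 || { logger -t nightly-job "previous run still going, exiting"; exit 1; }
        # ... the real work goes here ...
    ) 9>/var/lock/nightly-job.lock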
Beware /dev/null redirects in crontabs
Any cronjob that redirects stdout, stderr or (worse) both to /dev/null is going to cause you headaches and will need some attention. People typically add these when something is wrong and they lack either the skill or the time to fix it. The presence of these redirects shows a lack of confidence in the script and should be treated as a red flag. On the plus side, they point you at the scripts most in need of attention.
Avoid running as root
As in most things, using root is bad. Try writing your cronjobs so they can run under a non-privileged user, with a little sudo sprinkled in if you need it. It'll save you a lot of hassle when something goes wrong and the script tries to eat your file system.
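In crontab terms that looks something like this - the user, script path, privileged command and matching sudoers entry are all assumptions:

    # /etc/cron.d/nightly-backup - runs as the unprivileged "backup" user
    15 2 * * * backup /usr/local/bin/nightly-backup.sh

    # inside the script, only the single step that needs root goes through sudo
    sudo /usr/bin/rsync -a /var/lib/important/ /mnt/backups/important/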
And to close, a couple of quick points: test your cronjobs from cron, not just interactively. /etc/ is often backed up while /var/spool/cron/crontabs/ is often missed, so think about your deployment locations. Make sure your admins know about any cronjobs your packages add. And finally, if you generate your crontabs, always add a newline at the end.
If you at least know why you're breaking some of these rules (and they better be good reasons) then you'll be a good few steps above most developers I've worked with. And we'll get on a lot better.