Wed, 13 Sep 2006
Failover Pairs - A short Rant
Let's cover the basics: if you've got two machines working as an
identical failover pair then THEY SHOULD BE IDENTICAL. Adding services,
hell, adding nearly anything, to only one of them is a mistake. You've
now created a bias towards the one you need running, and you can no
longer assume they'll both do the same thing in the same situation,
which defeats the whole point of having them. This might seem obvious,
but the number of people who break this simple rule never fails to make
that pretty little vein in my neck dance.
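One rough way to catch a lopsided pair is to diff an inventory from each node. This is only a sketch: it assumes you can already dump each machine's packages (or services) into a name-to-version mapping, and `pair_drift` and the sample data are made up for illustration.

```python
def pair_drift(node_a, node_b):
    """Compare two {name: version} inventories from a failover pair.

    Returns names found on only one node and names whose versions differ.
    How the inventories are collected is left open (an assumption here).
    """
    names_a, names_b = set(node_a), set(node_b)
    return {
        "only_a": sorted(names_a - names_b),
        "only_b": sorted(names_b - names_a),
        "mismatched": sorted(
            name for name in names_a & names_b if node_a[name] != node_b[name]
        ),
    }


if __name__ == "__main__":
    # Hypothetical inventories, not taken from any real machine.
    primary = {"apache": "2.0", "postfix": "2.3", "cron": "3.0"}
    standby = {"apache": "2.0", "postfix": "2.4", "ntpd": "4.2"}
    print(pair_drift(primary, standby))
```

Anything other than three empty lists means the pair is no longer a pair.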
Now we'll discuss testing the failover. You should run regular failover tests that are scheduled and signed off. It might be difficult to get permission for a test when everything is working. This is typically because people don't have enough confidence in the technology, people and process - often accompanied by uncertainty about the length and impact of the outage. In true chicken-and-egg style, you can only get that confidence by (successfully) performing the test and measuring the impact. You should have a staging setup that'll let you perform the test as many times as you need to get it down pat. And then a couple more times just to be certain before you perform it in production.
This also solves one of the related problems: things that happen rarely don't get tested or explained, and the documentation drifts out of sync with reality. You should have a set of machines in staging that the new guys can play with, and these should be tested (with the documentation) on a set schedule.
An untested failover pair is a working machine and a hope - nothing more.
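Measuring the impact can be as simple as probing the service throughout the test and working out the longest gap afterwards. A minimal sketch, assuming you've already logged probes as (timestamp in seconds, succeeded) pairs; the probing itself and the name `longest_outage` are my own inventions here.

```python
def longest_outage(samples):
    """Return the longest outage, in seconds, seen in a probe log.

    samples: list of (timestamp_seconds, ok) tuples sorted by time.
    An outage runs from the first failed probe to the next successful one;
    if the service is still down at the end, it runs to the last probe.
    """
    longest = 0.0
    down_since = None
    for timestamp, ok in samples:
        if not ok and down_since is None:
            down_since = timestamp  # outage starts
        elif ok and down_since is not None:
            longest = max(longest, timestamp - down_since)  # outage ends
            down_since = None
    if down_since is not None:
        longest = max(longest, samples[-1][0] - down_since)
    return longest


if __name__ == "__main__":
    # Hypothetical one-second probes taken during a staging failover.
    log = [(0, True), (1, False), (2, False), (3, True), (4, True)]
    print(longest_outage(log))
```

A number from staging, repeated a few times, is a much easier thing to get signed off than "it should be quick".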
Posted: 2006/09/13 23:33 | /misctech | Permanent link to this entry | This entry and same date
Provisioning a Fresh Server Install
Once a machine has settled into a rack, how long does it take you to
turn it into a working server?
How many of these steps are automated? The longer you can go without making manual changes the more comfortable you can be that the machine's running as it's supposed to be.
What little tweaks do people make once the machine is up? How do you know they've been done correctly on each machine? Do you have a small bundle of configuration checks for local modifications? What happens if they get nuked? Do you notice or do they just drift further out of sync with the baseline deployment (and each other)? Do you use an integrity checker on all machines looking for unauthorised changes?
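A full integrity checker does far more than this, but the core idea fits in a few lines: record a hash for each file you care about, then compare later. This is a sketch only; `snapshot` and `drifted` are hypothetical names and the file list is whatever you decide counts as your baseline.

```python
import hashlib
from pathlib import Path


def snapshot(paths):
    """Map each path to the SHA-256 of its contents, or None if it is missing."""
    result = {}
    for p in paths:
        path = Path(p)
        if path.is_file():
            result[str(path)] = hashlib.sha256(path.read_bytes()).hexdigest()
        else:
            result[str(path)] = None  # the file got nuked (or never existed)
    return result


def drifted(baseline, current):
    """Return the baseline paths whose hash no longer matches."""
    return sorted(p for p in baseline if current.get(p) != baseline[p])
```

Take the baseline right after the build, store it somewhere the machine itself can't quietly rewrite, and rerun `snapshot` on a schedule; whatever `drifted` returns is your list of local tweaks.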
How long does the complete process take from start to finish? How does this fit in with your MTTR numbers? If it takes an hour to build a server and you've got an MTTR of 30 minutes on a critical mail server then you've got problems.
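The arithmetic is trivial, which is exactly why it's worth writing down and rerunning whenever the build changes. A sketch with made-up step names and timings that add up to the hour in the example above:

```python
def rebuild_fits_mttr(step_minutes, mttr_minutes):
    """Sum per-step build times and check them against the MTTR target.

    Returns (total_minutes, fits), where fits is True when the whole
    rebuild completes within the target.
    """
    total = sum(step_minutes.values())
    return total, total <= mttr_minutes


if __name__ == "__main__":
    # Hypothetical breakdown of a one-hour build; measure your own steps.
    steps = {"os_install": 35, "configuration": 15, "data_restore": 10}
    total, fits = rebuild_fits_mttr(steps, mttr_minutes=30)
    print(f"rebuild takes {total} min, fits MTTR: {fits}")
```

Breaking the total into steps also shows you where automation would buy back the most time.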
Do you need to manually add new machines to other, external, systems and/or processes? Nagios for monitoring? DNS? Documentation on your intranet? How do you keep these in sync and how often are they audited?
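An audit here boils down to comparing one list you trust against every other list. The sketch below assumes you can export the host names from each system somehow (that part is not shown), and `audit_registrations` is a name I've invented for the example.

```python
def audit_registrations(inventory, systems):
    """Compare a trusted host inventory against external systems.

    inventory: iterable of host names that should exist.
    systems: {system_name: iterable of host names that system knows about}.
    Returns, per system, hosts that are missing and hosts that are stale
    (still registered but no longer in the inventory).
    """
    known = set(inventory)
    report = {}
    for name, hosts in systems.items():
        registered = set(hosts)
        report[name] = {
            "missing": sorted(known - registered),
            "stale": sorted(registered - known),
        }
    return report


if __name__ == "__main__":
    # Hypothetical host names and exports.
    inventory = ["mail1", "mail2", "web1"]
    systems = {"nagios": ["mail1", "web1", "oldbox"], "dns": ["mail1", "mail2", "web1"]}
    print(audit_registrations(inventory, systems))
```

Run it from cron and mail yourself the non-empty results, and "how often are they audited?" has an answer.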
Why is it not as easy as just plugging the thing in anymore? :)
Posted: 2006/09/13 07:45 | /misctech | Permanent link to this entry | This entry and same date

