It's been Critical for how long?

Nagios has a wonderful ‘duration’ column in its web interface that’s always bemused me. At what point does a check being in a warning or, even worse, a critical state stop being a problem worthy of head space and start being normal operating procedure?

Checks can stay in an extended broken state for many reasons, but they all seem to be symptoms of a larger problem. If it’s a small thing, are you getting enough time to do housekeeping? If it’s a big thing, do you have enough business buy-in to keep things running optimally? Are you monitoring the wrong thing? Is there even anything you can do to fix it? If not, then maybe Nagios isn’t the best place to put the monitoring; maybe a status report is a better place.