Monitoring alerts and customer satisfaction surveys

Closing the loop on a monitoring alert is traditionally something that happens implicitly when the dashboard returns to its idyllic green state, the text message reports a well deserved “Service: OK” or, in more extreme cases, the incident review is over and actions have been assigned. This however assumes the alert is working well and that the operator understands why it woke them up and the value their involvement brings. In more fluid environments alerts can be incorrect, issues that do not require immediate attention or, in the worst case, ghost calls that mysteriously correct themselves just after you’ve woken up enough to find your MFA device. I’ve been pondering how to determine the value of some of the alerts that trigger, and I’m considering adding an additional step to one of the monitoring deployments to capture some extra information from the operators about the interactions themselves.

There should be an additional feeling of responsibility when you add checks and alerts that potentially disturb people outside of working hours, especially when they should be asleep. Every disturbance should be a last resort, and any concerns should be captured as close to the event as possible, with a tracked way for the responder to raise any issues or complaints. While a Slack channel is often the first place to see this kind of feedback, it’s not a great way to gather semi-structured comments, so I want to try a small variant on a customer satisfaction survey using a Google Form like this one.

Google Form with questions from above example

As a first pass I’m going to iterate on one of the following simple question banks.

Did you take action to resolve this issue?

  * Yes (optional comments below)

  * No, it fixed itself

  * Free form Comment text box

It’s quick, simple and passes the 3am test. On the other hand I think there is more value in being a little more specific and trying the wordier version:

Was this alert due to an actual issue?

  * Yes, and it required human intervention

  * Yes, and it resolved itself

  * No, it was a false positive and required human intervention

  * No, it was a false positive and resolved itself

  * Free form Comment text box

It feels churlish to add a “Give your alert 1-5 stars” question to either of those, but I suspect it would make it easier to drill down into the biggest pain points. From a technical perspective these feedback forms will need to gather some useful metadata and store it alongside the feedback results to provide context for discussion and review. That means the alert details themselves, such as hosts, services, error messages and other data from the monitoring system, plus anything relevant from the alerting system, such as the date and time, responder name and any escalation paths it went through.
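As a rough sketch of how that metadata could travel with the request, the notification step could build a pre-filled Google Form link from the alert payload so the responder only has to answer the questions rather than re-type any context. The form ID and entry IDs below are placeholders you would lift from the form’s “Get pre-filled link” option, and the alert field names are assumptions about what the monitoring system hands over.

```python
# A minimal sketch, not tied to any specific monitoring system.
from urllib.parse import urlencode

# Placeholder form ID and hypothetical entry IDs taken from the form's
# "Get pre-filled link" option.
FORM_URL = "https://docs.google.com/forms/d/e/FORM_ID/viewform"
FIELD_IDS = {
    "host": "entry.1000001",
    "service": "entry.1000002",
    "responder": "entry.1000003",
    "triggered_at": "entry.1000004",
}

def feedback_link(alert: dict) -> str:
    """Build a pre-filled survey link from the alert/notification payload."""
    params = {FIELD_IDS[key]: str(alert.get(key, "")) for key in FIELD_IDS}
    params["usp"] = "pp_url"  # marks the values as pre-filled answers
    return f"{FORM_URL}?{urlencode(params)}"

# Example: include the link in the page or Slack message sent to the responder.
print(feedback_link({
    "host": "web-03",
    "service": "nginx",
    "responder": "oncall",
    "triggered_at": "2023-04-01T03:12:00Z",
}))
```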

Having this additional process could quickly become yet another burden for the on-call responders, so its deployment will need to be carefully planned. On quieter monitoring systems it could be enabled for all out of hours alerts and left to run until you have a decent picture of what’s alerting and how valuable people consider those alerts to be. On more active monitoring platforms, starting with only the out of hours alerts, or a smaller window of a few hours once a week, should provide enough seed data to begin investigating. Considering how bespoke most monitoring and alerting configurations are, I’m unsure how reusable a solution for this kind of feedback collection could be, but most providers have some form of incident state feedback mechanism and most monitoring systems have a way to run remediation actions or events on state change, so the approach should be generally applicable even if the implementation is very organisation specific.
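To keep the burden down, the same hook can decide whether to offer the survey at all. Here’s a minimal sketch, assuming the notification pipeline can be wrapped and that “out of hours” simply means outside a 09:00 to 17:00 weekday window; the alert dictionary and its fields are hypothetical.

```python
from datetime import datetime

def out_of_hours(ts: datetime) -> bool:
    """True if the alert fired outside a 09:00-17:00 weekday working window."""
    if ts.weekday() >= 5:  # Saturday or Sunday
        return True
    return not (9 <= ts.hour < 17)

def notification_text(alert: dict, survey_url: str) -> str:
    """Append the survey link only when the alert is an out of hours page."""
    body = f"{alert['service']} on {alert['host']}: {alert['state']}"
    if out_of_hours(alert["triggered_at"]):
        body += f"\nHow was this page? {survey_url}"
    return body

print(notification_text(
    {"service": "nginx", "host": "web-03", "state": "CRITICAL",
     "triggered_at": datetime(2023, 4, 1, 3, 12)},
    "https://docs.google.com/forms/d/e/FORM_ID/viewform",
))
```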

Will it help whittle down the out of hours alerts that should be tickets? Can the data be used to pinpoint flaky parts of the systems with flapping checks? I don’t know, but I’d rather invest time during working hours investigating it than have to deal with it out of hours.
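As a first stab at answering those questions, something like the sketch below could rank the noisiest checks from a CSV export of the form responses. The column names are assumptions based on the second question bank above.

```python
import csv
from collections import Counter

def noisiest_checks(path: str, top: int = 10) -> list[tuple[str, int]]:
    """Count responses marking an alert as a false positive, per host/service."""
    false_positives = Counter()
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh):
            # Both "No, ..." options in the question bank indicate a false positive.
            if row["Was this alert due to an actual issue?"].startswith("No"):
                false_positives[f"{row['Host']}/{row['Service']}"] += 1
    return false_positives.most_common(top)

for check, count in noisiest_checks("alert-feedback.csv"):
    print(f"{count:3d}  {check}")
```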