Mutterings about MTTR…

The most meme-able scene from “The Princess Bride” is a great joke about MTTR.

Mutter mutter… MTTR MTTR… ahem.

I used to think that MTTR (Mean Time To Resolution) was a decent metric for how we were tracking with incidents, but my thinking has shifted the more I dig into things, because it isn’t what it claims to be. If incident durations followed a normal distribution it would make some sense, but they tend towards a log-normal distribution instead.

An image of a normal distribution in which a mean has meaning.

Because of that skew, the mean is pulled higher by outlier incidents, which creates a misleading impression: the majority of your incidents will resolve far more quickly than the mean suggests.

Incident volume in a given timeframe is also generally low enough that massive shifts in MTTR can occur with very few data points.

A log-normal distribution. This is what your incident data (TTR) probably actually looks like.
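To make that concrete, here’s a rough sketch (simulated numbers, not real incident data) of what happens when times-to-resolution are log-normally distributed: the mean sits well above the median, and with a small sample a single outlier drags it further still.

```python
import random
import statistics

random.seed(42)
# 50 simulated times-to-resolution, in minutes, drawn from a log-normal distribution
ttr_minutes = [random.lognormvariate(3.5, 1.0) for _ in range(50)]

print(f"mean TTR:   {statistics.mean(ttr_minutes):.0f} min")    # pulled up by the long tail
print(f"median TTR: {statistics.median(ttr_minutes):.0f} min")  # closer to a "typical" incident

# And with a sample this small, one more outlier moves the mean a long way
ttr_minutes.append(3 * 7 * 24 * 60)  # a single three-week incident
print(f"mean with one outlier: {statistics.mean(ttr_minutes):.0f} min")
```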

It also doesn’t factor in impact. A three-week non-critical incident will drag your mean around far more than five critical sub-30-minute incidents.
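Running the numbers on that scenario (purely illustrative durations):

```python
# One three-week non-critical incident against five sub-30-minute critical ones.
critical = [28, 25, 22, 18, 27]       # minutes
non_critical = [3 * 7 * 24 * 60]      # one three-week incident, in minutes

print(sum(critical) / len(critical))                       # 24.0 minutes
print(sum(critical + non_critical) / (len(critical) + 1))  # 5060.0 minutes: duration dominates, not impact
```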

There’s also an argument to be made about the accuracy of severity levels. Even if you curate the data down to, say, only SEV 1 and SEV 2 incidents, reducing your dataset makes it more volatile, and that assumes you are incredibly consistent in assigning severity in the first place.

Because TTR is unevenly distributed and volatile, its mean pulls focus towards the seemingly big incidents and away from smaller, quick, high-impact wins, because duration and severity get conflated. Having a single quantitative value seems nice for a report or a performance measure, but if the focus is on that value you aren’t necessarily seeing the other factors at play.

It also doesn’t capture modes of partial failure well, i.e. when the system is functioning but in a degraded state. It focuses on the times when the system becomes so brittle that it fails, without necessarily identifying the signals precipitating those failures, and it can encourage quick fixes over making systems more resilient (recovery time pressure).

A terrible graph representing a total systems outage. Smiles good, frowns bad.

Supporting graceful degradation by handling partial failure modes may prolong the time it takes a system to recover, but overall yield a better experience for the users and integrations that interact with it.

Another terrible graph representing graceful degradation and recovering. Smiles still good, frowns still bad.
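As a purely hypothetical sketch of what handling a partial failure mode can look like in code (the service and function names are made up, not from any particular system), a dependency failure degrades to stale cached data instead of taking the whole response down:

```python
# Hypothetical sketch of graceful degradation: if the upstream recommendations
# service fails, serve stale cached results (clearly marked as degraded)
# rather than failing the whole page.
import time

CACHE = {}  # item_id -> (timestamp, recommendations)

def fetch_recommendations_live(item_id):
    """Stand-in for a call to an upstream service that may fail."""
    raise TimeoutError("upstream recommendations service timed out")

def get_recommendations(item_id, max_stale_seconds=3600):
    try:
        recs = fetch_recommendations_live(item_id)
        CACHE[item_id] = (time.time(), recs)
        return {"recommendations": recs, "degraded": False}
    except (TimeoutError, ConnectionError):
        cached = CACHE.get(item_id)
        if cached and time.time() - cached[0] < max_stale_seconds:
            # Partial failure mode: stale but useful, and the page still renders.
            return {"recommendations": cached[1], "degraded": True}
        # Last resort: an empty-but-valid response instead of an error page.
        return {"recommendations": [], "degraded": True}

print(get_recommendations("item-42"))
# -> {'recommendations': [], 'degraded': True} on a cold cache
```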

The other thing I’ve come to realise is that MTTR gets really dicey as a measure for these partial failure states.

Where do you start the MTTR clock ticking?

There’s definitely something going on, even in the graph above, but as the saying goes: “if a partial failure mode occurs in the middle of the night, and no customers are online to see it, did it even exist?”
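A toy example with made-up timestamps shows how much the answer matters: the same incident yields very different “TTRs” depending on whether the clock starts at the first degraded signal, the first customer report, or the page.

```python
from datetime import datetime

resolved       = datetime(2024, 3, 8, 4, 55)
first_degraded = datetime(2024, 3, 8, 1, 10)  # error rate starts creeping up
first_customer = datetime(2024, 3, 8, 3, 20)  # first support ticket
paged          = datetime(2024, 3, 8, 3, 45)  # on-call actually woken up

for label, start in [("from first degraded signal", first_degraded),
                     ("from first customer report", first_customer),
                     ("from page", paged)]:
    print(f"TTR {label}: {(resolved - start).total_seconds() / 60:.0f} min")
```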

Because MTTR is so rooted in the idea of the system being either available or unavailable, it can put us in a very binary headspace. Complex systems aren’t binary. If we think about them that way, we fall into the trap of implementing them with that mindset.

Distributed systems possess a lot more nuance, and need a lot more nuance given to them. Ask anyone who has observed magical cascade failures roll through connected systems.

This is also where we start to think about SLAs. We’ve probably got some nice SLAs, all those 9’s looking fine in what we promise to our customers. But SLAs don’t really account for degraded states well either. An SLA might be violated by a degraded state as a system recovers, and yet still deliver a better experience than a total outage, especially if we root the SLA in a single metric.
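As a sketch with invented numbers, here’s the same week scored two ways: a binary up/down availability figure that never notices a degraded hour, and a request-level success rate that does.

```python
# Hypothetical traffic: per-minute (total requests, successful requests).
minutes = [
    *[(1000, 1000)] * 10000,  # healthy minutes
    *[(1000,  700)] *    60,  # one degraded hour: 30% of requests failing
]

# Binary view: the service never went fully "down", so availability looks perfect.
binary_availability = sum(1 for total, ok in minutes if ok > 0) / len(minutes)

# Request-level view: the degraded hour actually cost real users.
success_rate = sum(ok for _, ok in minutes) / sum(total for total, _ in minutes)

print(f"binary availability: {binary_availability:.4%}")  # 100.0000%
print(f"request success SLI: {success_rate:.4%}")         # ~99.82%
```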

Graceful degradation generally requires more complexity to implement than not, demanding more control and monitoring, which can discourage us from supporting partial failure modes even though it may minimise impact and reduce total outages.
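A bare-bones circuit breaker (a hypothetical sketch, not any particular library) gives a feel for the kind of extra control machinery involved:

```python
import time

class CircuitBreaker:
    """Minimal sketch: stop calling a flaky dependency after repeated failures."""

    def __init__(self, failure_threshold=5, reset_after_seconds=30):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (calls allowed)

    def call(self, fn, fallback):
        # While the circuit is open, skip the dependency entirely.
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after_seconds:
                return fallback()
            self.opened_at = None  # half-open: give the dependency another chance
            self.failures = 0
        try:
            result = fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            return fallback()
```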

Doing so might actually make MTTR go up in the short term, especially if we are introducing partial failure handling to an existing system. In part this can be because we don’t necessarily know how close our system is to its boundary, the point at which it becomes brittle.

There’s a great case study where Uber actually broke their systems by introducing monitoring to warn them if they were breaking their systems, because the monitoring itself carried overhead that pushed the system outside its competency envelope.

Graceful degradation and adaptability, however, are key to helping a system function at its boundary.

Improving those factors may hurt MTTR…

This matters less, when you realise, as Inigo Montoya so astutely observed:

“I do not think it means what you think it means.”

That’s it… that’s the joke. I’m here all week.

If you are wondering, “OK, so what number do we replace this with?”, well, maybe not just one. Maybe many, maybe a composite of other measures.

Maybe, technically, we don’t replace it all.

Incident response is about how we as humans respond to incidents. MTTR says nothing about how we deal with surprises, how we manage them, and how we respond to them. That is far more complex and messy than a single metric can ever quantify.

Maybe, just maybe better measures of our success as practitioners aren’t just one thing, but many:

  • How our systems handle partial failures, and maintain quality in degraded states so that customer experience does not suffer significantly.

  • How predictable we can make system behaviour in times of increased pressure.

  • How well we can articulate and understand systems behaviour during incidents.

  • How well we identify unknown system dependencies or interactions.

  • The quality of the stories we can construct around incidents and the quality of the timelines we can build for them.

  • How well we communicate with each other during incidents, and how well we communicate externally.

  • How we make decisions when faced with uncertainty.

  • How well we work across departments and organisational boundaries during incidents.

  • Our ability to anticipate similar incidents, and also to foresee failures we have not yet encountered (requisite imagination).

  • How well we align business and technical priorities in our work.

  • How well we improve our documentation over time (run books, technical notes etc).

  • Our level of confidence in the stability of systems post-recovery.

  • How well we learn to balance tactical and strategic improvements.

  • How well we share knowledge within and outside our own teams.

Taking a more qualitative approach acknowledges that systems are much more than machines and the code that runs on them; they are built on people too.

