What is MTTR?

How fast you go from "it's down" to "it's back". The only metric that matters at 3 AM.

MTTR = Mean Time To Recovery

The clock starts when something breaks. It stops when it works again. Everything in between is your MTTR. The goal: make that gap as small as humanly possible.

MTTR = Total downtime Γ· Number of incidents

The four stages of every incident

πŸ””

Detection

Someone (or something) notices it's broken

πŸ”

Diagnosis

Figure out what went wrong and where

πŸ”§

Fix

Actually fix the damn thing

βœ…

Verification

Confirm it's really back, not just pretending

Why should you care?

  • β€’Every minute of downtime = lost money and angry users refreshing your status page
  • β€’99.9% uptime SLA sounds generous until you do the math β€” that's 8.7 hours of allowed downtime per year
  • β€’Industry average MTTR is 4+ hours. DORA's "elite" teams do it in under 30 minutes. That's the gap.

Why most people's MTTR is terrible

The laptop dance

3 AM alert β†’ find laptop β†’ boot up β†’ VPN β†’ SSH β†’ wait, which server was it again?

Log spelunking

Which service? Which log file? What am I even looking for?

Fear of making it worse

One typo in a manual command and now you have two incidents

How mttrly cuts each stage

StageWithout mttrlyWith mttrly
DetectionWait for user complaintAlert hits your phone instantly
DiagnosisSSH β†’ grep β†’ scroll β†’ guess/logs --errors from the couch
FixType commands, pray for no typosOne tap. Confirmed. Done.
VerificationOpen dashboards, refresh, squint/status right after. Green? Go back to sleep.

Result: MTTR from hours β†’ minutes

What about Grafana and Datadog?

Great tools for seeing that something broke. Beautiful charts, clever alerts. But when that alert fires at 3 AM β€” you still need to open your laptop, SSH in, and fix it manually. mttrly picks up where they leave off: alert β†’ fix from phone β†’ back to sleep.

Industry Standards & Research

Site Reliability Engineering: How Google Runs Production Systems β†—

Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy Β· Book (Free Online)

The foundational SRE text defining MTTR, monitoring, alerting, and incident response practices used by Google and adopted industry-wide.

DORA State of DevOps Reports β†—

DevOps Research and Assessment (Google Cloud) Β· Annual Research

Industry benchmarks for MTTR, deployment frequency, change failure rate, and lead time. Elite teams achieve MTTR under 1 hour.

Accelerate: The Science of Lean Software and DevOps

Nicole Forsgren, Jez Humble, Gene Kim Β· Book

Research-backed evidence that MTTR is one of the four key metrics predicting software delivery performance and organizational outcomes.

PagerDuty Incident Response Documentation β†—

PagerDuty Β· Open-source Guide

Free incident response handbook covering on-call practices, severity classification, communication during incidents, and post-mortem processes.

Try it on your next incident

Free tier. Setup in 2 minutes. No credit card.