What is MTTR?
How fast you go from "it's down" to "it's back". The only metric that matters at 3 AM.
MTTR = Mean Time To Recovery
The clock starts when something breaks. It stops when it works again. Everything in between is your MTTR. The goal: make that gap as small as humanly possible.
MTTR = Total downtime Γ· Number of incidentsThe four stages of every incident
Detection
Someone (or something) notices it's broken
Diagnosis
Figure out what went wrong and where
Fix
Actually fix the damn thing
Verification
Confirm it's really back, not just pretending
Why should you care?
- β’Every minute of downtime = lost money and angry users refreshing your status page
- β’99.9% uptime SLA sounds generous until you do the math β that's 8.7 hours of allowed downtime per year
- β’Industry average MTTR is 4+ hours. DORA's "elite" teams do it in under 30 minutes. That's the gap.
Why most people's MTTR is terrible
The laptop dance
3 AM alert β find laptop β boot up β VPN β SSH β wait, which server was it again?
Log spelunking
Which service? Which log file? What am I even looking for?
Fear of making it worse
One typo in a manual command and now you have two incidents
How mttrly cuts each stage
| Stage | Without mttrly | With mttrly |
|---|---|---|
| Detection | Wait for user complaint | Alert hits your phone instantly |
| Diagnosis | SSH β grep β scroll β guess | /logs --errors from the couch |
| Fix | Type commands, pray for no typos | One tap. Confirmed. Done. |
| Verification | Open dashboards, refresh, squint | /status right after. Green? Go back to sleep. |
Result: MTTR from hours β minutes
What about Grafana and Datadog?
Great tools for seeing that something broke. Beautiful charts, clever alerts. But when that alert fires at 3 AM β you still need to open your laptop, SSH in, and fix it manually. mttrly picks up where they leave off: alert β fix from phone β back to sleep.
Industry Standards & Research
Site Reliability Engineering: How Google Runs Production Systems β
Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy Β· Book (Free Online)
The foundational SRE text defining MTTR, monitoring, alerting, and incident response practices used by Google and adopted industry-wide.
DORA State of DevOps Reports β
DevOps Research and Assessment (Google Cloud) Β· Annual Research
Industry benchmarks for MTTR, deployment frequency, change failure rate, and lead time. Elite teams achieve MTTR under 1 hour.
Accelerate: The Science of Lean Software and DevOps
Nicole Forsgren, Jez Humble, Gene Kim Β· Book
Research-backed evidence that MTTR is one of the four key metrics predicting software delivery performance and organizational outcomes.
PagerDuty Incident Response Documentation β
PagerDuty Β· Open-source Guide
Free incident response handbook covering on-call practices, severity classification, communication during incidents, and post-mortem processes.
Try it on your next incident
Free tier. Setup in 2 minutes. No credit card.