What is MTTR?

How fast you go from "it's down" to "it's back". The only metric that matters at 3 AM.

MTTR = Mean Time To Recovery

The clock starts when something breaks. It stops when it works again. Everything in between is your MTTR. The goal: make that gap as small as humanly possible.

MTTR = Total downtime ÷ Number of incidents

The four stages of every incident

🔔

Detection

Someone (or something) notices it's broken

🔍

Diagnosis

Figure out what went wrong and where

🔧

Fix

Actually fix the damn thing

Verification

Confirm it's really back, not just pretending

Why should you care?

  • Every minute of downtime = lost money and angry users refreshing your status page
  • 99.9% uptime SLA sounds generous until you do the math — that's 8.7 hours of allowed downtime per year
  • Industry average MTTR is 4+ hours. DORA's "elite" teams do it in under 30 minutes. That's the gap.

Why most people's MTTR is terrible

The laptop dance

3 AM alert → find laptop → boot up → VPN → SSH → wait, which server was it again?

Log spelunking

Which service? Which log file? What am I even looking for?

Fear of making it worse

One typo in a manual command and now you have two incidents

How mttrly cuts each stage

StageWithout mttrlyWith mttrly
DetectionWait for user complaintAlert hits your phone instantly
DiagnosisSSH → grep → scroll → guess/logs --errors from the couch
FixType commands, pray for no typosOne tap. Confirmed. Done.
VerificationOpen dashboards, refresh, squint/status right after. Green? Go back to sleep.

Result: MTTR from hours → minutes

What about Grafana and Datadog?

Great tools for seeing that something broke. Beautiful charts, clever alerts. But when that alert fires at 3 AM — you still need to open your laptop, SSH in, and fix it manually. mttrly picks up where they leave off: alert → fix from phone → back to sleep.

Industry Standards & Research

Site Reliability Engineering: How Google Runs Production Systems

Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy · Book (Free Online)

The foundational SRE text defining MTTR, monitoring, alerting, and incident response practices used by Google and adopted industry-wide.

DORA State of DevOps Reports

DevOps Research and Assessment (Google Cloud) · Annual Research

Industry benchmarks for MTTR, deployment frequency, change failure rate, and lead time. Elite teams achieve MTTR under 1 hour.

Accelerate: The Science of Lean Software and DevOps

Nicole Forsgren, Jez Humble, Gene Kim · Book

Research-backed evidence that MTTR is one of the four key metrics predicting software delivery performance and organizational outcomes.

PagerDuty Incident Response Documentation

PagerDuty · Open-source Guide

Free incident response handbook covering on-call practices, severity classification, communication during incidents, and post-mortem processes.

Try it on your next incident

Free tier. Setup in 2 minutes. No credit card.