mttrly for On-Call Engineers

Respond to incidents from anywhere

PagerDuty woke you up. Now what? With mttrly, you can diagnose and fix issues before even getting out of bed.

🚨 3AM PagerDuty: High Error Rate

Woken up by an alert. You need to diagnose and fix the issue without leaving bed. An interactive Bro Terminal session walks through it in four steps:

⏰ 3AM Alert (PagerDuty: errors spiking) → 🔍 Quick Check (CPU OK, disk OK, RAM 94%) → 🎯 Root Cause (memory leak from the 1am deploy) → ⏮️ Rollback (revert → restart → healthy)
Before

Wake up, stumble to desk, wait for VPN, SSH, grep logs, diagnose... 15+ minutes.

Traditional on-call:
- Wake up fully
- Get laptop
- VPN connect (slow at 3am)
- SSH into server
- Run diagnostics
- Read logs
- Make decision
- Execute fix

MTTR: 15-30 minutes

After

Phone in hand → ask "what's wrong" → tap rollback → back to sleep. 2 minutes.

With mttrly:
- Open Telegram (5 sec)
- Ask what's wrong (10 sec)
- Review diagnosis (30 sec)
- Choose rollback (5 sec)
- Confirm (5 sec)
- Verify fixed (10 sec)

MTTR: 2 minutes

The Problem

  • ✗ Need a laptop to respond to alerts
  • ✗ VPN connects slowly at 3am
  • ✗ Simple fixes take 15+ minutes
  • ✗ Can't leave the house during on-call

The Solution

Get alerts in your messenger, check logs, restart services, run playbooks, all from your phone. MTTR drops from hours to minutes.

The Pain of On-Call

You're on-call this week. That means: laptop always charged, hotspot always ready, can't go anywhere without connectivity. A 3am alert means stumbling to your desk, waiting for VPN to connect, typing commands with bleary eyes. Simple fixes take 15+ minutes because of setup time.

Why MTTR Matters

Mean Time To Resolution directly impacts your users and your SLA. Every minute of downtime is lost revenue, frustrated customers, and stress on your team. The industry average MTTR is 4+ hours. Companies with mobile incident response tools cut that to under 30 minutes.
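MTTR is simple arithmetic: total resolution time divided by the number of incidents. A quick sketch with illustrative durations (not real data) shows how the two workflows compare:

```python
# MTTR (Mean Time To Resolution) = total resolution time / number of incidents.
# Durations in minutes; the values below are illustrative, not measured data.
incidents_before = [45, 30, 60, 25, 40]   # laptop + VPN + SSH workflow
incidents_after = [2, 4, 3, 2, 5]         # phone-based workflow

mttr_before = sum(incidents_before) / len(incidents_before)
mttr_after = sum(incidents_after) / len(incidents_after)

print(f"MTTR before: {mttr_before:.0f} min")  # MTTR before: 40 min
print(f"MTTR after: {mttr_after:.0f} min")    # MTTR after: 3 min
```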

The mttrly On-Call Workflow

1

Alert arrives

PagerDuty/OpsGenie triggers. mttrly also sends an alert to your messenger with initial context.
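A relay like this can be sketched as a small webhook receiver that summarizes the alert and forwards it via the Telegram Bot API. Everything here is illustrative (payload field names assume a PagerDuty-v3-style event; the token, chat ID, and port are placeholders), not mttrly's actual implementation:

```python
# Sketch of an alert relay: receive a PagerDuty-style webhook, forward a
# summary to Telegram. All names and payload fields are illustrative.
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

BOT_TOKEN = "<telegram-bot-token>"  # placeholder
CHAT_ID = "<on-call-chat-id>"       # placeholder

def format_alert(payload):
    # Assumes a PagerDuty-v3-style shape: {"event": {"data": {"title": ...}}}.
    title = payload.get("event", {}).get("data", {}).get("title", "unknown alert")
    return f"🚨 {title}"

def send_to_telegram(text):
    # Telegram Bot API sendMessage: POST /bot<token>/sendMessage.
    url = f"https://api.telegram.org/bot{BOT_TOKEN}/sendMessage"
    body = json.dumps({"chat_id": CHAT_ID, "text": text}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)

class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        payload = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        send_to_telegram(format_alert(payload))
        self.send_response(200)
        self.end_headers()

# To run the receiver:
#   HTTPServer(("", 8080), AlertHandler).serve_forever()
```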

2

Quick diagnosis

You: "what's wrong?" → Bro runs the HighLatency diagnostic → CPU 23% (normal), Disk 45% (normal), RAM 94% (HIGH) → node.js process at 3.2GB → 127 heap warnings → correlates with a deploy 2 hours ago. Diagnosis complete in 15 seconds.
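The core of a diagnostic like this is a threshold check over current metrics. A minimal sketch, assuming illustrative metric names and limits (not mttrly's actual rules):

```python
# Sketch of a "what's wrong?" threshold check. Metric names and limits are
# illustrative assumptions, not mttrly's actual diagnostic rules.
def diagnose(metrics, thresholds=None):
    """Label each metric normal or HIGH against its alert threshold."""
    thresholds = thresholds or {"cpu_pct": 80, "disk_pct": 85, "ram_pct": 90}
    findings = []
    for name, value in metrics.items():
        limit = thresholds.get(name)
        if limit is not None and value >= limit:
            findings.append(f"{name} {value}% (HIGH, limit {limit}%)")
        else:
            findings.append(f"{name} {value}% (normal)")
    return findings

# Matches the incident above: CPU and disk normal, RAM at 94%.
for line in diagnose({"cpu_pct": 23, "disk_pct": 45, "ram_pct": 94}):
    print(line)
```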

3

Execute fix

Standard fixes become one-tap: /restart nginx, /run clear-cache, /deploy hotfix. Confirmation required for safety.
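The safety pattern here is an allow-list plus a two-step confirm. A minimal sketch of how that could work, with command names mirroring the examples above (the allow-list, confirmation flow, and shell steps are illustrative assumptions):

```python
# Sketch of confirmation-gated one-tap fixes. The allow-list and two-step
# confirm flow are illustrative, not mttrly's actual implementation.
SAFE_COMMANDS = {
    "restart nginx": ["systemctl", "restart", "nginx"],
    "run clear-cache": ["sh", "/opt/playbooks/clear-cache.sh"],  # hypothetical path
}

pending = {}  # chat_id -> command awaiting confirmation

def request_fix(chat_id, command):
    """Stage an allow-listed command; nothing runs until confirmed."""
    if command not in SAFE_COMMANDS:
        return f"Unknown command: {command}"
    pending[chat_id] = command
    return f"Run '{command}'? Reply CONFIRM to execute."

def confirm_fix(chat_id, reply):
    """Execute only on an explicit CONFIRM; any other reply cancels."""
    command = pending.pop(chat_id, None)
    if command is None or reply != "CONFIRM":
        return "Nothing to run."
    # A real bot would shell out here, e.g. subprocess.run(SAFE_COMMANDS[command]).
    return f"Executing: {' '.join(SAFE_COMMANDS[command])}"
```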

4

Verify resolution

/status confirms services are healthy. Update the incident. Back to sleep.

Playbooks for Common Incidents

Pre-configure runbooks as mttrly playbooks. High memory? /run memory-cleanup kills memory hogs. Disk full? /run disk-cleanup clears logs and temp files. Database slow? /run db-vacuum runs maintenance. Your tribal knowledge becomes one-tap automation.
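A playbook registry can be as simple as a name-to-steps mapping. The playbook names below mirror the examples above; the shell steps are illustrative assumptions, not mttrly's shipped runbooks:

```python
# Sketch of a playbook registry: each /run <name> resolves to an ordered list
# of shell steps. Steps are illustrative examples, not shipped runbooks.
PLAYBOOKS = {
    "memory-cleanup": [
        "ps aux --sort=-%mem | head -5",     # show top memory consumers
        "systemctl restart app.service",     # restart the leaking service (hypothetical unit)
    ],
    "disk-cleanup": [
        "journalctl --vacuum-size=200M",     # trim old journal logs
        "rm -rf /tmp/app-cache/*",           # clear temp files (hypothetical path)
    ],
    "db-vacuum": [
        "psql -c 'VACUUM ANALYZE;'",         # routine Postgres maintenance
    ],
}

def playbook_steps(name):
    """Resolve a /run <name> command to its shell steps (empty if unknown)."""
    return PLAYBOOKS.get(name, [])
```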

"Our average response time dropped from 45 minutes to 4 minutes after adopting mttrly. The on-call engineer can acknowledge and fix most incidents without waking up fully."
— Sarah, SRE Lead at a fintech startup

Example: 3am incident response

🚨 PagerDuty: High error rate on prod-api-01
You: /logs prod-api-01 --errors
Found 847 errors in last 5min: "Redis connection timeout"
You: /restart prod-api-01 redis
✅ Redis restarted. Error rate dropping.
Total incident time: 2 minutes (without leaving bed)