mttrly vs Monitoring Tools
Monitoring tells you something is wrong. mttrly lets you fix it.
Grafana shows beautiful dashboards. Datadog sends smart alerts. But when the alert fires, you still need to SSH in and fix things. mttrly closes the loop.
Alert: CPU High at 99%
What happens next?
WITH MONITORING ONLY (Grafana/Datadog)
1. Alert arrives: "CPU 99% on prod-server"
2. Open laptop
3. Open Grafana dashboard
4. See graph: yes, CPU is pegged at 99%
5. Now what?
6. SSH into server
7. Run: top
8. Find culprit: node.js eating 98% CPU
9. Decide: restart or investigate?
10. Run: systemctl restart app
11. Go back to Grafana
12. Wait for graph to update
13. Verify: CPU back to normal
14. Update incident ticket
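Steps 7 and 8 of the manual flow (run top, find the culprit) are the easiest part to script. Here is a minimal sketch in Python, assuming a Unix-like host where `ps -eo pid,pcpu,comm` is available. The function and type names are hypothetical, and this is an illustration of the manual step, not how mttrly runs its diagnostics. On Linux, `ps` reports CPU usage averaged over the process lifetime, so the number can lag behind what `top` shows.

```python
from __future__ import annotations

import subprocess
from typing import NamedTuple


class ProcessUsage(NamedTuple):
    """One row of `ps -eo pid,pcpu,comm` (hypothetical helper type)."""
    pid: int
    cpu_percent: float
    command: str


def parse_ps_output(ps_output: str) -> list[ProcessUsage]:
    """Turn the text printed by `ps -eo pid,pcpu,comm` into records."""
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)  # pid, cpu, then the rest as the command
        if len(parts) != 3:
            continue  # ignore malformed lines
        pid, cpu, command = parts
        rows.append(ProcessUsage(int(pid), float(cpu), command.strip()))
    return rows


def top_cpu_process(ps_output: str) -> ProcessUsage | None:
    """Return the process with the highest CPU share, or None if there are no rows."""
    return max(parse_ps_output(ps_output), key=lambda r: r.cpu_percent, default=None)


def current_top_cpu_process() -> ProcessUsage | None:
    """Run ps on the local machine and report the heaviest process."""
    result = subprocess.run(
        ["ps", "-eo", "pid,pcpu,comm"],
        capture_output=True, text=True, check=True,
    )
    return top_cpu_process(result.stdout)
```

Calling `current_top_cpu_process()` on the affected server returns the PID, CPU share, and command name of the heaviest process. The restart-or-investigate decision in step 9 still has to be made by a person.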
WITH MONITORING + MTTRLY (Recommended)
1. Alert from Grafana: "CPU 99%"
2. Open mttrly in Telegram (you already have it)
3. You: "why is CPU high?"
4. Bro: [runs diagnostic]
   "node.js API process using 98% CPU.
   Memory usage also high: 1.8GB/2GB.
   Checking logs... Found 47 errors in last 5 min.
   Started after deploy 2 hours ago.
   Diagnosis: Likely infinite loop or memory leak.
   Options:
   1. Restart service (quick fix)
   2. Roll back deploy (safer)
   3. Show recent commit diff"
5. You: tap "Restart service"
6. Bro: "Restarted. CPU back to 12%. Monitoring..."
7. Done
The difference: observation vs action
Two Different Jobs
MONITORING TOOLS excel at:
- Collecting metrics from everything
- Visualizing trends over time
- Correlating events across systems
- Historical analysis
- Alerting when thresholds are crossed
- Team dashboards
- Capacity planning
They answer: "What's happening? What happened?"
MTTRLY excels at:
- Taking action on alerts
- Quick diagnosis from mobile
- Fixing without a laptop
- Common operations as button taps
- Reducing MTTR
- Emergency response
It answers: "How do I fix it? Right now?"
Best setup: Use both. Monitoring for visibility. mttrly for action.
| Feature | mttrly | Monitoring Tools |
|---|---|---|
| Primary purpose | Incident response | Observability |
| Take action | Yes (restart, deploy) | No (alert only) |
| Setup complexity | 2 minutes | Hours to days |
| Cost | Free tier, $39/mo Pro | $50-500+/month |
| Mobile app | Telegram (you already have it) | Separate app needed |
| Mobile action | Full control | View-only |
Grafana + Prometheus
Pros
- Powerful visualization
- Open source
- Highly customizable
- Great for trends and analysis
- Free to self-host
Cons
- Complex setup (days to weeks)
- Requires infrastructure (servers, storage)
- Alerting requires Alertmanager setup
- No action capability
- Mobile app is view-only
Grafana shows you the dashboard. mttrly lets you act on what you see.
Example workflow:
1. Grafana alert: Disk 90% full
2. mttrly: /run disk-cleanup
3. Done in 30 seconds
Grafana gives context. mttrly gives action.
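The "Disk 90% full" condition in the example workflow is a plain threshold check. Here is a minimal Python sketch of that check using the standard library's `shutil.disk_usage`. The function names and the 90% default are illustrative assumptions taken from the example, not Grafana's alert rule syntax or mttrly's `disk-cleanup` playbook.

```python
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Percentage of the filesystem holding `path` that is in use.

    The value can differ slightly from what `df` prints, because some
    filesystems reserve blocks that `df` leaves out of its calculation.
    """
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def needs_cleanup(percent_used: float, threshold: float = 90.0) -> bool:
    """True when usage has reached the alert threshold (assumed: 90%)."""
    return percent_used >= threshold


if __name__ == "__main__":
    percent = disk_usage_percent("/")
    print(f"/ is {percent:.1f}% full, cleanup needed: {needs_cleanup(percent)}")
```

Keeping the measurement and the threshold decision in separate functions lets the same check run against any mount point or limit.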
Datadog
Pros
- Easy setup (agent install)
- Great APM (application performance monitoring)
- Smart alerting with ML
- Extensive integrations
- Mobile app
Cons
- Expensive ($15-30/host/month)
- Can't take actions from mobile app
- View-only interface
- Cost scales with infrastructure
Datadog detects issues with precision. mttrly resolves them with speed.
Example workflow:
1. Datadog: "Memory leak detected in api service"
2. mttrly: "restart api service"
3. Back to normal
Together: Detection + Resolution = Low MTTR
New Relic
Pros
- Full observability platform
- AI insights (anomaly detection)
- Good free tier (100GB/month)
- Distributed tracing
Cons
- Complex, confusing pricing
- Steep learning curve
- Actions require external tools
- Limited mobile app
New Relic tells you what's wrong with AI precision. mttrly gives you power to fix it from anywhere.
Closing the Incident Loop
The Complete Incident Response Stack
DETECT (Monitoring)
Grafana/Datadog/New Relic:
- Collect metrics
- Detect anomalies
- Send smart alerts
DIAGNOSE (mttrly)
Quick mobile diagnosis:
- Check server health
- View relevant logs
- Identify root cause
- 30-90 seconds
RESOLVE (mttrly)
Take action:
- Restart services
- Run playbooks
- Roll back deploys
- 1-2 minutes
VERIFY (Monitoring)
Confirm resolution:
- Check dashboards
- Metrics back to normal
- Incident closed
Total MTTR: 3-5 minutes instead of 15-30 minutes
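The stage timings above can be added up as a sanity check on the total. The sketch below uses only the two durations the page states (30-90 seconds for DIAGNOSE, 1-2 minutes for RESOLVE). No durations are given for DETECT and VERIFY, so the gap to the quoted 3-5 minutes is derived here rather than stated in the source. The names are illustrative.

```python
from datetime import timedelta

# (fastest, slowest) per stage. Only these two stages have durations
# stated on the page; DETECT and VERIFY are not quantified there.
STAGES = {
    "diagnose": (timedelta(seconds=30), timedelta(seconds=90)),
    "resolve": (timedelta(minutes=1), timedelta(minutes=2)),
}

# The total quoted on the page.
QUOTED_TOTAL = (timedelta(minutes=3), timedelta(minutes=5))


def total_range(stages):
    """Sum the fastest and the slowest durations across all stages."""
    fastest = sum((low for low, _ in stages.values()), timedelta())
    slowest = sum((high for _, high in stages.values()), timedelta())
    return fastest, slowest


fastest, slowest = total_range(STAGES)
print(fastest, slowest)  # 0:01:30 0:03:30

# Time left over for detection and verification under the quoted total.
print(QUOTED_TOTAL[0] - fastest, QUOTED_TOTAL[1] - slowest)  # 0:01:30 0:01:30
```

Under these figures, diagnosis and resolution account for 1.5 to 3.5 minutes, which leaves roughly 1.5 minutes of the quoted total for alert delivery and verification.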
Monitoring tools excel at detection and observability. mttrly excels at action and resolution. Best setup: Use both. Monitoring for visibility. mttrly for action.