From monitoring to autonomous DevOps.
Three tiers. One goal: server confidence.
Watchdog Mode
Monitoring, alerts, and button-tap actions. No AI, no surprises — just solid, predictable automation.
For DevOps, SRE, and teams who want control
- Status monitoring — CPU, RAM, disk, services — always visible
- Crash alerts — Know instantly when something breaks
- Auto-restart — Apps come back automatically after failure
- Docker management — Container lifecycle from chat
- Triggers — If X happens, do Y. You define the rules.
- Playbooks — Custom scripts run with one tap
- Dashboard — Web UI for multi-server overview
- Audit logs — Full history of every action
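The "if X happens, do Y" trigger model can be pictured as a list of condition/action pairs checked against each metrics sample. The sketch below is illustrative only: `Trigger`, `evaluate`, the metric keys, and the playbook names are hypothetical and are not the product's actual rule format.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass
class Trigger:
    name: str
    condition: Callable[[Metrics], bool]  # the "if X happens" part
    action: str                           # the "do Y" part, e.g. a playbook name


def evaluate(triggers: List[Trigger], sample: Metrics) -> List[str]:
    """Return the actions whose conditions match this metrics sample."""
    return [t.action for t in triggers if t.condition(sample)]


# Hypothetical rules: metric keys and playbook names are made up for illustration.
RULES = [
    Trigger("disk-nearly-full", lambda m: m.get("disk_pct", 0) > 90, "cleanup_logs"),
    Trigger("memory-pressure", lambda m: m.get("ram_pct", 0) > 95, "restart_app"),
]

if __name__ == "__main__":
    print(evaluate(RULES, {"disk_pct": 93.0, "ram_pct": 40.0}))
```

Keeping the condition and the action separate means the rules stay declarative: the operator defines what to watch and which playbook to run, and the evaluation loop never changes.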
Deployment Bro
Talk to your servers in plain English. Bro understands context, diagnoses problems, and executes solutions.
For vibe coders and builders who ship fast
- Natural language — "Why is my app slow?" gets a real answer
- Context-aware — Bro remembers your setup and history
- Smart diagnosis — 9 recipes — reads logs, finds root cause, explains simply
- Git deploys — "Deploy my changes" — done, with auto-rollback
- Investigation Mode — Session bypass for AI clients — one 2FA per session
- Anomaly dashboard — 9 detectors with auto-mute and rate limiting
- 35+ MCP tools — Full access from Claude Code, Cursor, Codex
- BYOK — Use your own OpenAI/Anthropic API key
- Everything in Watchdog — All autopilot features included
LAYER 4: Investigation Mode (for AI clients)
Session-wide approval bypass for AI assistants debugging an incident.
Problem: Per-action 2FA breaks AI flow. A debug session is 5–10 exec actions, each requiring a push notification and an approval. AI assistants (Claude Code, Cursor, Codex) lose context between approvals, and operators get approval fatigue.
Solution: Open an investigation, pass one 2FA gate to acquire a bypass, then run exec actions for the duration of the session. The bypass closes automatically after inactivity (30 min by default), at the hard cap (2h), or when the action limit is reached (20 by default).
- ✓ Server-scoped: a bypass on one server doesn't apply to another
- ✓ Atomic action counter: racing callers can't exceed max_actions
- ✓ Allowlisted ops only: execute_command and execute_script. write_file, repair_install, and update_agent still require per-command approval
- ✓ Tunable per acquisition: inactivity 1–120 min, max actions 1–100
- ✓ Hard cap is non-tunable (2h): prevents accidentally opening a bypass for 24h
Designed specifically for the Claude Code / Cursor / Codex workflow. Available on Deployment Bro and above.
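The bypass rules described above (server scope, allowlist, inactivity timeout, hard cap, atomic action counter) can be sketched as a small session object. This is a minimal illustration under stated assumptions, not the product's implementation: the `BypassSession` class, its `try_consume` method, and the injectable `clock` parameter are hypothetical names. Only the limits and operation names come from the description above.

```python
import threading
import time

ALLOWED_OPS = {"execute_command", "execute_script"}  # allowlisted ops from the list above
HARD_CAP_S = 2 * 60 * 60                              # 2h hard cap, not tunable


class BypassSession:
    """Hypothetical sketch of a server-scoped approval bypass."""

    def __init__(self, server_id, inactivity_min=30, max_actions=20,
                 clock=time.monotonic):
        if not 1 <= inactivity_min <= 120:
            raise ValueError("inactivity must be 1-120 minutes")
        if not 1 <= max_actions <= 100:
            raise ValueError("max_actions must be 1-100")
        self.server_id = server_id
        self._inactivity_s = inactivity_min * 60
        self._max_actions = max_actions
        self._clock = clock
        self._opened = self._last_used = clock()
        self._used = 0
        self._lock = threading.Lock()

    def try_consume(self, server_id, op):
        """Return True if this action may skip per-action 2FA."""
        # Server scope and allowlist: anything else falls back to per-command approval.
        if server_id != self.server_id or op not in ALLOWED_OPS:
            return False
        with self._lock:  # check-and-increment under one lock, so racers can't overshoot
            now = self._clock()
            if now - self._opened >= HARD_CAP_S:
                return False  # hard cap reached
            if now - self._last_used >= self._inactivity_s:
                return False  # idle for too long
            if self._used >= self._max_actions:
                return False  # action budget spent
            self._used += 1
            self._last_used = now
            return True


if __name__ == "__main__":
    session = BypassSession("web-1", max_actions=2)
    for op in ("execute_command", "write_file", "execute_script", "execute_command"):
        print(op, session.try_consume("web-1", op))
```

Doing the limit checks and the counter increment inside a single lock is what makes the counter atomic in this sketch: two concurrent callers cannot both observe the last free slot.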
Anomaly Management
Proactive monitoring with auto-mute, rate limiting, and a dashboard you actually use.
Detectors that ship today
- systemd_pm2_mismatch: Service running under PM2 but no matching systemd unit (or vice versa)
- port_service_mismatch: Listening port but no matching service definition
- kernel_error: dmesg errors (OOM kills, hardware faults, segfaults)
- stale_backups: Backup jobs not running on schedule
- systemd_unit_drift: Unit file changed since last known-good state
- incident_spike: Sudden cluster of incidents on a server
- deploy_failure_cluster: Multiple deploys failing in a window
- investigation_churn: Same problem investigated repeatedly without resolution
- agent_update_stale: Agent version is behind
How it works in practice
- ✓ FP-budget auto-mute: repeated false positives on a rule auto-mute it
- ✓ Notification rate limiter: no 3 AM page-storms
- ✓ Acknowledge proactive events: mark as accepted or false positive
- ✓ Anomalies dashboard: review, ack, and mute from one screen
- ✓ Morning brief: proactive findings + recent knowledge in one digest
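To make the first two items concrete, here is a rough sketch of how a false-positive budget and a notification rate limiter could fit together. Everything here is assumed for illustration: the `AnomalyNotifier` class, its method names, and the default thresholds (3 false positives, 5 notifications per hour) are hypothetical, since the page does not state the real values.

```python
import time
from collections import defaultdict, deque


class AnomalyNotifier:
    """Hypothetical sketch: FP-budget auto-mute plus a sliding-window rate limit."""

    def __init__(self, fp_budget=3, max_per_window=5, window_s=3600,
                 clock=time.monotonic):
        self._fp_budget = fp_budget      # assumed threshold, not a documented value
        self._max = max_per_window       # assumed limit, not a documented value
        self._window_s = window_s
        self._clock = clock
        self._fp_counts = defaultdict(int)
        self._sent = deque()             # timestamps of recent notifications
        self.muted = set()

    def mark_false_positive(self, rule):
        """Record an operator's false-positive ack; mute the rule once the budget is spent."""
        self._fp_counts[rule] += 1
        if self._fp_counts[rule] >= self._fp_budget:
            self.muted.add(rule)

    def should_notify(self, rule):
        """Return True if a notification for this rule may be sent right now."""
        if rule in self.muted:
            return False
        now = self._clock()
        # Drop timestamps that have aged out of the sliding window.
        while self._sent and now - self._sent[0] >= self._window_s:
            self._sent.popleft()
        if len(self._sent) >= self._max:
            return False  # rate limit reached: suppress instead of paging
        self._sent.append(now)
        return True


if __name__ == "__main__":
    notifier = AnomalyNotifier(fp_budget=2)
    notifier.mark_false_positive("stale_backups")
    notifier.mark_false_positive("stale_backups")
    print(notifier.should_notify("stale_backups"), notifier.should_notify("kernel_error"))
```

In this sketch the rate limit is global across rules, which is one way to prevent a page-storm when several detectors fire at once. A per-rule or per-server limit would be an equally plausible design.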
Quick comparison
| Capability | Watchdog | Deployment Bro | Deployment Crew |
|---|---|---|---|
| How you interact | Buttons & /commands | Plain English | Plain English |
| Servers | 1 | 3 (+$15/extra) | 9 (+$15/extra) |
| Learning curve | Zero | Zero | Zero |
| Auto-restart | ✓ | ✓ | ✓ |
| Alerts | ✓ | ✓ | ✓ |
| Dashboard | — | ✓ | ✓ |
| Natural language | — | ✓ | ✓ |
| Smart diagnosis (9 recipes) | — | ✓ | ✓ |
| Investigation Mode | — | ✓ | ✓ |
| Anomaly dashboard | — | ✓ | ✓ |
| Proactive monitoring | — | ✓ | ✓ |
| Auto-fix | Auto-restart | Simple issues | Simple issues |
| Git deploys | — | ✓ | ✓ |
| Deploy pipeline | — | ✓ | ✓ + webhooks (soon) |
| Multi-messenger | — | — | Coming Soon |
| Team access | — | — | Coming Soon |
| Provisioning | — | ✓ | ✓ |
| Audit retention | 7 days | 30 days | 90 days (soon) |
| MCP tools | 6 read-only | 35+ | 35+ |
| Support | Community | Priority (24h) | |