From monitoring to autonomous DevOps.

Three tiers. One goal: server confidence.

Reliable autopilot

Watchdog Mode

Monitoring, alerts, and button-tap actions. No AI, no surprises — just solid, predictable automation.

For DevOps, SRE, and teams who want control

  • Status monitoring: CPU, RAM, disk, and services, always visible
  • Crash alerts: Know instantly when something breaks
  • Auto-restart: Apps come back automatically after failure
  • Docker management: Container lifecycle from chat
  • Triggers: If X happens, do Y. You define the rules.
  • Playbooks: Custom scripts run with one tap
  • Dashboard: Web UI for multi-server overview
  • Audit logs: Full history of every action

AI-Powered
Your AI companion

Deployment Bro

Talk to your servers in plain English. Bro understands context, diagnoses problems, and executes solutions.

For vibe coders and builders who ship fast

  • Natural language: "Why is my app slow?" gets a real answer
  • Context-aware: Bro remembers your setup and history
  • Smart diagnosis: 9 recipes that read logs, find the root cause, and explain it simply
  • Git deploys: "Deploy my changes" and it's done, with auto-rollback
  • Investigation Mode: Session bypass for AI clients, one 2FA per session
  • Anomaly dashboard: 9 detectors with auto-mute and rate limiting
  • 35+ MCP tools: Full access from Claude Code, Cursor, Codex
  • BYOK: Use your own OpenAI/Anthropic API key
  • Everything in Watchdog: All autopilot features included

LAYER 4: Investigation Mode (for AI clients)

Session-wide approval bypass for AI assistants debugging an incident.

Problem: Per-action 2FA breaks AI flow. A debug session is typically 5–10 exec actions, each requiring a push notification and a manual approval. AI assistants (Claude Code, Cursor, Codex) lose context between approvals, and operators get approval fatigue.

Solution: Open an investigation, pass one 2FA gate to acquire a bypass, then run exec actions for the duration of the session. The bypass auto-closes after inactivity (30 min by default), at the hard cap (2h), or when the action limit is reached (20 by default).

  • Server-scoped — bypass on one server doesn't apply to another
  • Atomic action counter — racing callers can't exceed max_actions
  • Allowlisted ops only — execute_command + execute_script. write_file, repair_install, update_agent stay per-command
  • Tunable per acquisition — inactivity 1–120 min, max actions 1–100
  • Hard cap is non-tunable (2h) — prevents accidentally opening bypass for 24h
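
The rules above can be sketched as a small state machine. This is only an illustrative model, not the product's implementation: the class name `InvestigationBypass`, the method `try_consume`, and the injectable `clock` parameter are all hypothetical; only the limits (30 min / 2h / 20 actions, tunable ranges, allowlisted ops) come from the text.

```python
import threading
import time
from typing import Callable

HARD_CAP_SECONDS = 2 * 60 * 60  # non-tunable 2h cap from the text
ALLOWLISTED_OPS = frozenset({"execute_command", "execute_script"})


class InvestigationBypass:
    """Hypothetical model of one bypass acquired after a single 2FA approval."""

    def __init__(self, server_id: str, inactivity_minutes: int = 30,
                 max_actions: int = 20,
                 clock: Callable[[], float] = time.monotonic) -> None:
        # Tunable ranges taken from the text: 1-120 min, 1-100 actions.
        if not 1 <= inactivity_minutes <= 120:
            raise ValueError("inactivity_minutes must be in 1..120")
        if not 1 <= max_actions <= 100:
            raise ValueError("max_actions must be in 1..100")
        self.server_id = server_id
        self._inactivity_seconds = inactivity_minutes * 60
        self._max_actions = max_actions
        self._clock = clock
        self._opened_at = clock()
        self._last_used = self._opened_at
        self._actions_used = 0
        self._lock = threading.Lock()

    def try_consume(self, server_id: str, op: str) -> bool:
        """Return True if this action may skip per-action 2FA."""
        # Server-scoped and allowlisted ops only; everything else
        # falls back to per-command approval.
        if server_id != self.server_id or op not in ALLOWLISTED_OPS:
            return False
        # Check-and-increment under one lock so racing callers
        # cannot push the counter past max_actions.
        with self._lock:
            now = self._clock()
            if now - self._opened_at >= HARD_CAP_SECONDS:
                return False
            if now - self._last_used >= self._inactivity_seconds:
                return False
            if self._actions_used >= self._max_actions:
                return False
            self._actions_used += 1
            self._last_used = now
            return True
```

In this sketch a `False` result means "fall back to the normal per-action 2FA prompt", so a closed or exhausted bypass degrades to the default behavior rather than blocking the operator.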

Designed specifically for the Claude Code / Cursor / Codex workflow. Available on Deployment Bro and above.

Anomaly Management

Proactive monitoring with auto-mute, rate limiting, and a dashboard you actually use.

Detectors that ship today

systemd_pm2_mismatch: Service running under PM2 but no matching systemd unit (or vice versa)
port_service_mismatch: Listening port but no matching service definition
kernel_error: dmesg errors (OOM kills, hardware faults, segfaults)
stale_backups: Backup jobs not running on schedule
systemd_unit_drift: Unit file changed since last known-good state
incident_spike: Sudden cluster of incidents on a server
deploy_failure_cluster: Multiple deploys failing in a window
investigation_churn: Same problem investigated repeatedly without resolution
agent_update_stale: Agent version is behind
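
To make one of these concrete, a detector like incident_spike could be built on a sliding time window. The sketch below is an assumption about how such a check might work, not the shipped logic: the class name, the 10-minute window, and the threshold of 5 are all hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class IncidentSpikeDetector:
    """Hypothetical sliding-window spike check for one server."""

    def __init__(self, window_seconds: float = 600.0, threshold: int = 5) -> None:
        self.window_seconds = window_seconds  # assumed value
        self.threshold = threshold            # assumed value
        self._timestamps: deque = deque()

    def record(self, timestamp: float) -> bool:
        """Record one incident; return True if the window now holds a spike."""
        self._timestamps.append(timestamp)
        # Drop incidents that have aged out of the window.
        cutoff = timestamp - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold
```

The same windowed-count shape would also fit deploy_failure_cluster, with failed deploys recorded instead of incidents.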

How it works in practice

  • FP-budget auto-mute — repeated false positives on a rule auto-mute it
  • Notification rate limiter — no 3 AM page-storms
  • Acknowledge proactive events — mark as accepted or false positive
  • Anomalies dashboard — review, ack, and mute from one screen
  • Morning brief — proactive findings + recent knowledge in one digest
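
The first three items can be pictured together as a gate in front of the notifier. This is a minimal sketch under stated assumptions: `AnomalyNotifier`, `fp_budget`, `max_per_window`, and `window_seconds` are hypothetical names and defaults, since the page does not give the actual budget or rate-limit values.

```python
from collections import defaultdict, deque


class AnomalyNotifier:
    """Hypothetical gate combining FP-budget auto-mute and rate limiting."""

    def __init__(self, fp_budget: int = 3, max_per_window: int = 5,
                 window_seconds: float = 3600.0) -> None:
        self.fp_budget = fp_budget            # assumed value
        self.max_per_window = max_per_window  # assumed value
        self.window_seconds = window_seconds  # assumed value
        self._fp_counts = defaultdict(int)
        self._muted = set()
        self._sent = deque()

    def acknowledge(self, rule: str, false_positive: bool) -> None:
        """Record an operator ack; repeated false positives mute the rule."""
        if false_positive:
            self._fp_counts[rule] += 1
            if self._fp_counts[rule] >= self.fp_budget:
                self._muted.add(rule)

    def should_notify(self, rule: str, now: float) -> bool:
        """Return True if a notification for this rule may be sent now."""
        if rule in self._muted:
            return False
        # Forget notifications that fell out of the rate-limit window.
        cutoff = now - self.window_seconds
        while self._sent and self._sent[0] <= cutoff:
            self._sent.popleft()
        if len(self._sent) >= self.max_per_window:
            return False
        self._sent.append(now)
        return True
```

Acknowledging an event as accepted (not a false positive) leaves the budget untouched in this sketch, so only rules that keep misfiring end up muted.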

Quick comparison

Capability | Watchdog | Deployment Bro | Deployment Crew
How you interact | Buttons & /commands | Plain English | Plain English
Servers | 1 | 3 (+$15/extra) | 9 (+$15/extra)
Learning curve | Zero | Zero | Zero
Auto-restart
Alerts
Dashboard
Natural language
Smart diagnosis (9 recipes)
Investigation Mode
Anomaly dashboard
Proactive monitoring
Auto-fix | Auto-restart | Simple issues | Simple issues
Git deploys
Deploy pipeline: ✓ + webhooks (soon)
Multi-messenger: Coming Soon
Team access: Coming Soon
Provisioning
Audit retention | 7 days | 30 days | 90 days (soon)
MCP tools | 6 read-only | 35+ | 35+
Support | Community | Email | Priority (24h)
