# Monitoring Runbook

## Overview

This runbook covers monitoring setup and alert response for the Agent Orchestrator.

## Key Metrics

### System Health

| Metric | Description | Warning | Critical |
|--------|-------------|---------|----------|
| `system_health` | Overall health | degraded | unhealthy |
| `uptime_seconds` | Time since last start (low values indicate a recent restart) | < 3600 | < 60 |
| `api_response_time_ms` | API latency | > 500 | > 2000 |

### Agent Metrics

| Metric | Description | Warning | Critical |
|--------|-------------|---------|----------|
| `agents_active` | Active agents | < 2 | 0 |
| `agent_health_check_failures` | Consecutive failures | > 2 | > 5 |
| `agent_stuck_duration_seconds` | Time stuck | > 300 | > 600 |

### Task Metrics

| Metric | Description | Warning | Critical |
|--------|-------------|---------|----------|
| `tasks_pending` | Queue size | > 20 | > 50 |
| `task_failure_rate` | Share of tasks that failed | > 10% | > 25% |
| `task_avg_duration_seconds` | Average time to completion | > 300 | > 600 |

### Budget Metrics

| Metric | Description | Warning | Critical |
|--------|-------------|---------|----------|
| `budget_used_percent` | Daily budget used | > 75% | > 90% |
| `cost_per_task_usd` | Average cost per task | > $1 | > $5 |
| `rate_limit_hits` | Rate limit events | > 5/hr | > 20/hr |
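
The thresholds in these tables can be applied mechanically. The following is a minimal sketch, not the orchestrator's own alerting code: the `Threshold` class and `classify` function are hypothetical names, and only a few numeric metrics from the tables are included.

```python
import operator
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Warning/critical bounds for one numeric metric (hypothetical helper)."""
    warning: float
    critical: float
    higher_is_worse: bool = True  # False for metrics such as uptime_seconds


# Values copied from the tables above; percentages are plain numbers.
THRESHOLDS = {
    "api_response_time_ms": Threshold(500, 2000),
    "tasks_pending": Threshold(20, 50),
    "task_failure_rate": Threshold(10, 25),
    "budget_used_percent": Threshold(75, 90),
    "uptime_seconds": Threshold(3600, 60, higher_is_worse=False),
}


def classify(metric: str, value: float) -> str:
    """Return "ok", "warning", or "critical" for a single reading."""
    t = THRESHOLDS[metric]
    breached = operator.gt if t.higher_is_worse else operator.lt
    if breached(value, t.critical):
        return "critical"
    if breached(value, t.warning):
        return "warning"
    return "ok"
```

The comparisons are strict, matching the `>` and `<` signs in the tables, so a reading exactly on a threshold stays at the lower severity.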

## Monitoring Setup

### 1. Enable Metrics Endpoint

```yaml
# config.yaml
observability:
  metrics:
    enabled: true
    port: 9090
    path: /metrics
```
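
Once the endpoint is enabled, its output can be read with a few lines of Python. This is a rough sketch that assumes the endpoint emits simple `name value` samples in the Prometheus text format; `parse_metrics` is a hypothetical helper, and it does not handle samples that carry a trailing timestamp.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse simple Prometheus text-format samples into a name -> value map.

    Comment lines (# HELP, # TYPE) and blank lines are skipped. Labels, if
    present, are kept as part of the key. Trailing timestamps are not handled.
    """
    samples: dict[str, float] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last whitespace-separated token on the line.
        name, _, value = line.rpartition(" ")
        samples[name.strip()] = float(value)
    return samples
```

For anything beyond a quick check, a maintained parser such as the one shipped with the official Prometheus Python client is the safer choice.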

### 2. Configure Alerts

```yaml
# config.yaml
observability:
  alerts:
    slack:
      enabled: true
      webhook_url: ${SLACK_WEBHOOK_URL}
      channel: "#agent-alerts"

    email:
      enabled: true
      smtp_host: smtp.example.com
      from: alerts@example.com
      to:
        - ops@example.com

    webhook:
      enabled: true
      url: ${ALERT_WEBHOOK_URL}
```
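
To verify the Slack webhook independently of the orchestrator, a test message can be posted by hand. The sketch below relies only on the fact that Slack incoming webhooks accept a JSON body with a `text` field; the function names and the message layout are assumptions, not the orchestrator's actual alert format.

```python
import json
import urllib.request


def build_slack_payload(metric: str, value: float, severity: str) -> dict:
    """Build a minimal incoming-webhook body (message layout is an assumption)."""
    return {"text": f"[{severity.upper()}] {metric} = {value}"}


def send_to_slack(payload: dict, webhook_url: str, timeout: float = 5.0) -> int:
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Example (requires SLACK_WEBHOOK_URL to be set in the environment):
#   import os
#   payload = build_slack_payload("tasks_pending", 55, "critical")
#   send_to_slack(payload, os.environ["SLACK_WEBHOOK_URL"])
```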

### 3. Prometheus Configuration

```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'agent-orchestrator'
    static_configs:
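      # Port must match observability.metrics.port. Prometheus itself also
      # defaults to 9090, so change one of them if both run on the same host.
      - targets: ['localhost:9090']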
    scrape_interval: 15s
```

### 4. Grafana Dashboard

Import the provided dashboard from `ops/dashboards/orchestrator.json`.

## Alert Response

### Agent Health Check Failure

**Symptoms:**
- Agent showing as unhealthy
- Health check returning errors

**Response:**
```bash
# 1. Check agent status
curl http://localhost:8080/api/agents/<agent_id>

# 2. Check tmux session (detach with Ctrl-b d to leave the agent running)
tmux list-sessions | grep <agent_id>
tmux attach -t <agent_id>

# 3. Check agent logs (Ctrl-C to stop following)
tail -f logs/agents/<agent_id>.log

# 4. Attempt soft recovery
python -m agent_orchestrator agent prompt <agent_id> "Please respond with current status"

# 5. If still failing, restart agent
python -m agent_orchestrator agent restart <agent_id>
```

### High Task Failure Rate

**Symptoms:**
- Many tasks failing
- Error rate exceeding threshold

**Response:**
```bash
# 1. Check recent failures
curl "http://localhost:8080/api/tasks?status=failed&limit=10"

# 2. Analyze failure patterns
python -m agent_orchestrator tasks analyze-failures --since 1h

# 3. Check for common errors
grep "ERROR" logs/orchestrator.log | tail -n 20

# 4. If failures cluster on one agent, investigate that agent.
#    If they are spread across agents, check external dependencies.
```
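
The decision in step 4 can be made less subjective by counting failures per agent. This is an illustrative sketch: the `status` and `agent_id` field names are assumptions about the task records returned by the API, and the 60% cutoff is an arbitrary starting point.

```python
from collections import Counter
from typing import Optional


def failure_rate(tasks: list[dict]) -> float:
    """Percentage of tasks whose status is "failed" (0.0 for an empty list)."""
    if not tasks:
        return 0.0
    failed = sum(1 for task in tasks if task["status"] == "failed")
    return 100.0 * failed / len(tasks)


def dominant_agent(tasks: list[dict], share: float = 0.6) -> Optional[str]:
    """Return the agent behind at least `share` of all failures, else None.

    A non-None result suggests an agent-specific problem; None suggests the
    failures are systematic.
    """
    failures = Counter(
        task["agent_id"] for task in tasks if task["status"] == "failed"
    )
    total = sum(failures.values())
    if total == 0:
        return None
    agent, count = failures.most_common(1)[0]
    return agent if count / total >= share else None
```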

### Budget Exhaustion

**Symptoms:**
- Budget alerts firing
- Tasks being blocked

**Response:**
```bash
# 1. Check current usage
curl http://localhost:8080/api/costs

# 2. Identify top consumers
curl http://localhost:8080/api/costs/by-agent

# 3. Check for inefficiencies
curl http://localhost:8080/api/costs/recommendations

# 4. Options:
#    a. Increase budget temporarily
#    b. Pause non-critical agents
#    c. Wait for daily reset
```
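
When weighing the options in step 4, it helps to know how much headroom is left. The arithmetic can be sketched as follows; both function names are hypothetical, and the inputs would come from the cost endpoints above.

```python
def budget_used_percent(spent_usd: float, daily_budget_usd: float) -> float:
    """Share of the daily budget already spent, as a percentage."""
    if daily_budget_usd <= 0:
        raise ValueError("daily_budget_usd must be positive")
    return 100.0 * spent_usd / daily_budget_usd


def tasks_remaining(
    spent_usd: float, daily_budget_usd: float, cost_per_task_usd: float
) -> int:
    """Whole tasks that still fit in today's budget at the current average cost."""
    if cost_per_task_usd <= 0:
        raise ValueError("cost_per_task_usd must be positive")
    remaining = max(daily_budget_usd - spent_usd, 0.0)
    return int(remaining // cost_per_task_usd)
```

For example, with $45 spent of a $50 budget and an average cost of $1.25 per task, usage is at 90% and four more tasks fit before the budget is exhausted.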

### Rate Limit Hit

**Symptoms:**
- Rate limit alerts
- Agents showing "exhausted"

**Response:**
```bash
# 1. Check rate limit status
curl http://localhost:8080/api/rate-limits

# 2. Identify affected agents
curl http://localhost:8080/api/rate-limits/alerts

# 3. Check subscription tier
curl http://localhost:8080/api/subscriptions/<agent_id>

# 4. Options:
#    a. Wait for rate limit reset
#    b. Route to different agent
#    c. Upgrade subscription tier
```

### System Unhealthy

**Symptoms:**
- Health check returning unhealthy
- Multiple components failing

**Response:**
```bash
# 1. Check overall health
curl http://localhost:8080/api/health

# 2. Check system stats
curl http://localhost:8080/api/stats

# 3. Check process status (the [a] keeps grep from matching itself)
ps aux | grep '[a]gent_orchestrator'

# 4. Check database
python -m agent_orchestrator db status

# 5. Check disk space
df -h

# 6. Check memory
free -h

# 7. If critical, consider restart
python -m agent_orchestrator restart --graceful
```

## Log Locations

| Log | Location | Purpose |
|-----|----------|---------|
| Main | `logs/orchestrator.log` | Core orchestrator |
| API | `logs/api.log` | REST API |
| Agents | `logs/agents/<id>.log` | Per-agent logs |
| Tracing | `logs/traces/` | Session traces |
| Alerts | `logs/alerts.log` | Alert history |

## Health Check Endpoints

```bash
# System health
curl http://localhost:8080/api/health

# Detailed stats
curl http://localhost:8080/api/stats

# Agent health
curl http://localhost:8080/api/agents/<id>/health

# Database health
curl http://localhost:8080/api/health/db

# Metrics (Prometheus format)
curl http://localhost:9090/metrics
```
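
The same checks can be scripted for a quick triage pass. The sketch below assumes the API endpoints return JSON, which this runbook does not confirm; `http_get_json` and `check_endpoints` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Callable

BASE_URL = "http://localhost:8080"
ENDPOINTS = ["/api/health", "/api/stats", "/api/health/db"]


def http_get_json(url: str, timeout: float = 5.0) -> dict:
    """Fetch a URL and decode the body as JSON (assumed response format)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def check_endpoints(fetch: Callable[[str], dict] = http_get_json) -> dict[str, bool]:
    """Map each endpoint path to True if it answered, False otherwise."""
    results: dict[str, bool] = {}
    for path in ENDPOINTS:
        try:
            fetch(BASE_URL + path)
            results[path] = True
        except (OSError, ValueError):
            # OSError covers connection and HTTP errors raised by urllib;
            # ValueError covers bodies that are not valid JSON.
            results[path] = False
    return results
```

Passing the fetch function as a parameter keeps the check testable without a running orchestrator.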

## Escalation

| Severity | Response Time | Escalate To |
|----------|---------------|-------------|
| INFO | Next business day | Ops team |
| WARNING | 4 hours | Ops lead |
| ERROR | 1 hour | On-call engineer |
| CRITICAL | 15 minutes | On-call + Team lead |
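
To turn the table into a concrete deadline for an alert, the response windows can be encoded directly. This is a small sketch with hypothetical names; INFO is left out because "next business day" depends on a team calendar.

```python
from datetime import datetime, timedelta

# Response windows and contacts copied from the escalation table above.
ESCALATION = {
    "WARNING": (timedelta(hours=4), "Ops lead"),
    "ERROR": (timedelta(hours=1), "On-call engineer"),
    "CRITICAL": (timedelta(minutes=15), "On-call + Team lead"),
}


def response_deadline(severity: str, fired_at: datetime) -> tuple[datetime, str]:
    """Return (latest response time, who to escalate to) for an alert."""
    window, contact = ESCALATION[severity.upper()]
    return fired_at + window, contact
```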

## Runbook Links

- [Agent Recovery](./agent-recovery.md)
- [Deployment](./deployment.md)
- [Incident Response](./incident-response.md)
- [Backup/Restore](./backup-restore.md)
