# Operations Guide

This document provides operational guidance for running the Agent Orchestration system in production.

## Table of Contents

1. [System Overview](#system-overview)
2. [Configuration](#configuration)
3. [Monitoring](#monitoring)
4. [Troubleshooting](#troubleshooting)
5. [Scaling](#scaling)
6. [Maintenance](#maintenance)
7. [Appendix](#appendix)

---

## System Overview

The Agent Orchestration system manages multiple AI coding agents with:
- Risk-based autonomy levels
- Budget and cost controls
- Human approval workflows
- Merge gate protection
- Comprehensive observability

### Architecture Components

```
┌─────────────────────────────────────────────────────────────┐
│                    Agent Orchestrator                        │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │   Agents    │  │   Budget    │  │   Merge     │         │
│  │  Registry   │  │  Controls   │  │   Gate      │         │
│  └─────────────┘  └─────────────┘  └─────────────┘         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │  Approval   │  │ Reliability │  │Observability│         │
│  │  Workflows  │  │    Layer    │  │    Stack    │         │
│  └─────────────┘  └─────────────┘  └─────────────┘         │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │  SQLite DB  │  │  MCP Server │  │  External   │         │
│  │  (State)    │  │  (Tools)    │  │  LLM APIs   │         │
│  └─────────────┘  └─────────────┘  └─────────────┘         │
└─────────────────────────────────────────────────────────────┘
```

---

## Configuration

### Environment Variables

```bash
# Core Settings
ORCHESTRATOR_DB_PATH=data/orchestrator.db
ORCHESTRATOR_LOG_LEVEL=INFO
ORCHESTRATOR_LOG_FORMAT=json

# LLM Provider API Keys
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
GOOGLE_API_KEY=...

# Budget Controls
DEFAULT_DAILY_BUDGET_USD=10.00
DEFAULT_MONTHLY_BUDGET_USD=200.00
BUDGET_ALERT_THRESHOLD=0.8

# Rate Limiting
RATE_LIMIT_REQUESTS_PER_MINUTE=60
RATE_LIMIT_TOKENS_PER_MINUTE=100000

# Approval Settings
APPROVAL_TIMEOUT_SECONDS=300
REQUIRE_APPROVAL_FOR_RISK_LEVEL=high

# Merge Gate
PROTECTED_BRANCHES=main,master,production
REQUIRE_TESTS_PASS=true
REQUIRE_APPROVALS=1
```
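
As an illustration of how these variables might be consumed, here is a minimal Python sketch. It is not the orchestrator's actual loader: the `Settings` dataclass and `load_settings` function are hypothetical names, and only a subset of the variables is covered. The defaults mirror the values listed above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Tuple


@dataclass(frozen=True)
class Settings:
    """Hypothetical typed view over a subset of the variables listed above."""
    db_path: str
    log_level: str
    daily_budget_usd: float
    monthly_budget_usd: float
    budget_alert_threshold: float
    approval_timeout_seconds: int
    protected_branches: Tuple[str, ...]
    require_tests_pass: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from `env`, falling back to the defaults shown in this guide."""
    threshold = float(env.get("BUDGET_ALERT_THRESHOLD", "0.8"))
    if not 0.0 < threshold <= 1.0:
        raise ValueError("BUDGET_ALERT_THRESHOLD must be in (0, 1]")
    branches = env.get("PROTECTED_BRANCHES", "main,master,production")
    return Settings(
        db_path=env.get("ORCHESTRATOR_DB_PATH", "data/orchestrator.db"),
        log_level=env.get("ORCHESTRATOR_LOG_LEVEL", "INFO").upper(),
        daily_budget_usd=float(env.get("DEFAULT_DAILY_BUDGET_USD", "10.00")),
        monthly_budget_usd=float(env.get("DEFAULT_MONTHLY_BUDGET_USD", "200.00")),
        budget_alert_threshold=threshold,
        approval_timeout_seconds=int(env.get("APPROVAL_TIMEOUT_SECONDS", "300")),
        # Split the comma-separated list and drop empty entries.
        protected_branches=tuple(b.strip() for b in branches.split(",") if b.strip()),
        require_tests_pass=env.get("REQUIRE_TESTS_PASS", "true").lower() == "true",
    )
```

Validating values at startup, as the threshold check does here, surfaces a bad deployment immediately instead of at the first budget alert.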

### Interactive Authentication (Preferred)

CLI agents (Claude Code, Gemini CLI, Codex CLI) should authenticate via their interactive/web flows. The provider API keys listed under Environment Variables are intended only for specialized tasks that need programmatic API access.

Operational guidance:

- Authenticate each CLI once before starting the orchestrator.
- Avoid setting API keys unless a task explicitly requires API access.
- If a CLI session expires or hits a limit, re-authenticate and reroute tasks to another CLI agent.

See `ops/runbooks/cli-authentication.md` for the step-by-step procedure.

### Configuration Files

#### Rate Limiting (`config/rate_limits.yaml`)

```yaml
providers:
  anthropic:
    requests_per_minute: 60
    tokens_per_minute: 100000
    concurrent_requests: 5

  openai:
    requests_per_minute: 60
    tokens_per_minute: 90000
    concurrent_requests: 10

  google:
    requests_per_minute: 60
    tokens_per_minute: 120000
    concurrent_requests: 5

backoff:
  initial_delay: 1.0
  max_delay: 60.0
  exponential_base: 2.0
```
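
To make the `backoff` settings concrete, the sketch below computes the delay schedule they imply under the usual exponential backoff formula, `min(max_delay, initial_delay * exponential_base ** attempt)`. The function name is hypothetical, and the optional jitter is an addition of this sketch that the configuration above does not specify.

```python
import random
from typing import List, Optional


def backoff_delay(attempt: int, initial_delay: float = 1.0, max_delay: float = 60.0,
                  exponential_base: float = 2.0,
                  rng: Optional[random.Random] = None) -> float:
    """Delay in seconds before retry number `attempt` (0-based), capped at max_delay.

    Passing `rng` spreads retries uniformly over [0, delay] (full jitter).
    """
    if attempt < 0:
        raise ValueError("attempt must be >= 0")
    delay = min(max_delay, initial_delay * exponential_base ** attempt)
    return rng.uniform(0.0, delay) if rng is not None else delay


schedule: List[float] = [backoff_delay(n) for n in range(8)]
print(schedule)  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

With the values above, the cap is reached on the seventh retry, so raising `max_delay` only affects sustained outages.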

#### Agent Budgets (`config/budgets.yaml`)

```yaml
agents:
  claude-code:
    daily_limit_usd: 5.00
    monthly_limit_usd: 100.00
    max_tokens_per_request: 100000

  gemini-cli:
    daily_limit_usd: 3.00
    monthly_limit_usd: 60.00
    max_tokens_per_request: 50000

alerts:
  warning_threshold: 0.8
  critical_threshold: 0.95
  notification_channels:
    - slack
    - email
```
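
The alert thresholds are fractions of the applicable limit. A minimal sketch of the comparison follows, assuming an alert fires when usage is at or above a threshold; whether the real check is inclusive is not stated here, and `alert_level` is a hypothetical name.

```python
from typing import Optional


def alert_level(spent_usd: float, limit_usd: float,
                warning_threshold: float = 0.8,
                critical_threshold: float = 0.95) -> Optional[str]:
    """Return "critical", "warning", or None for a given spend against a limit."""
    if limit_usd <= 0:
        raise ValueError("limit_usd must be positive")
    usage = spent_usd / limit_usd
    # Check the higher threshold first so critical takes precedence.
    if usage >= critical_threshold:
        return "critical"
    if usage >= warning_threshold:
        return "warning"
    return None


# claude-code has a 5.00 USD daily limit in the example above.
print(alert_level(3.50, 5.00))  # None (70% used)
print(alert_level(4.25, 5.00))  # warning (85% used)
print(alert_level(4.90, 5.00))  # critical (98% used)
```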

### Session Limits and Rebalancing

To favor interactive CLI sessions:

- Track per-agent usage in the orchestrator metrics and logs.
- When a CLI agent approaches its session or weekly limit, mark it as unavailable and route tasks to another CLI agent.
- Reserve API-based agents for specialized tasks that require programmatic access.
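
The rebalancing rule above can be sketched as a small selection function. Everything here is illustrative: the `CliAgent` record, the `pick_agent` name, and the 0.9 cutoff for "approaching its limit" are assumptions, not the orchestrator's actual routing logic.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class CliAgent:
    """Hypothetical record of a CLI agent's usage within its current limit window."""
    name: str
    used: int    # requests consumed in the current window
    limit: int   # session or weekly cap for that window
    authenticated: bool = True

    @property
    def usage_ratio(self) -> float:
        # Treat an agent with no usable limit as fully consumed.
        return self.used / self.limit if self.limit > 0 else 1.0


def pick_agent(agents: Iterable[CliAgent], cutoff: float = 0.9) -> Optional[CliAgent]:
    """Return the least-used authenticated agent below `cutoff`, or None.

    A None result is the point where a task could fall back to an API-based agent.
    """
    available = [a for a in agents if a.authenticated and a.usage_ratio < cutoff]
    return min(available, key=lambda a: a.usage_ratio, default=None)


fleet = [
    CliAgent("claude-code", used=95, limit=100),
    CliAgent("gemini-cli", used=40, limit=100),
    CliAgent("codex-cli", used=10, limit=100, authenticated=False),
]
print(pick_agent(fleet).name)  # gemini-cli
```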

---

## Monitoring

### Health Checks

The system exposes health check endpoints:

```bash
# Basic health check
curl http://localhost:8080/health

# Detailed health status
curl http://localhost:8080/health/detailed
```

Response format:
```json
{
  "status": "healthy",
  "components": {
    "database": "healthy",
    "mcp_server": "healthy",
    "agents": {
      "active": 3,
      "stuck": 0,
      "failed": 0
    }
  },
  "uptime_seconds": 86400
}
```
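
For scripted checks, a response in the format above can be reduced to a list of problems. The following is a minimal sketch that assumes only the fields shown in the sample; `health_problems` is a hypothetical helper, not part of the orchestrator.

```python
import json
from typing import Any, Dict, List


def health_problems(payload: Dict[str, Any]) -> List[str]:
    """List human-readable problems found in a health response."""
    problems: List[str] = []
    if payload.get("status") != "healthy":
        problems.append(f"overall status is {payload.get('status')!r}")
    components = payload.get("components", {})
    # String-valued components report their own status directly.
    for name, value in components.items():
        if isinstance(value, str) and value != "healthy":
            problems.append(f"component {name} is {value!r}")
    # The agents component carries counters instead of a status string.
    agents = components.get("agents", {})
    for key in ("stuck", "failed"):
        if agents.get(key, 0) > 0:
            problems.append(f"{agents[key]} agent(s) {key}")
    return problems


sample = json.loads("""
{"status": "healthy",
 "components": {"database": "healthy", "mcp_server": "healthy",
                "agents": {"active": 3, "stuck": 0, "failed": 0}},
 "uptime_seconds": 86400}
""")
print(health_problems(sample))  # []
```

An empty list means every check passed, which makes the function easy to use as a gate in deployment scripts.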

### Key Metrics

| Metric | Description | Alert Threshold |
|--------|-------------|-----------------|
| `agent_task_duration_seconds` | Time to complete tasks | > 300s warning |
| `agent_error_rate` | Errors per minute | > 5/min critical |
| `budget_usage_percent` | Budget consumption | > 80% warning |
| `rate_limit_hits` | Rate limit encounters | > 10/min warning |
| `circuit_breaker_state` | Provider health | OPEN = critical |
| `approval_queue_size` | Pending approvals | > 10 warning |
| `merge_lock_duration` | Lock hold time | > 600s warning |

### Log Analysis

#### Log Format (JSON)

```json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "INFO",
  "logger": "agent_orchestrator.agents",
  "message": "Task completed",
  "agent_id": "claude-code-1",
  "task_id": "task-123",
  "duration_ms": 1500,
  "tokens_used": 5000,
  "cost_usd": 0.05
}
```

#### Common Log Queries

```bash
# Find all errors in the last hour (the -d flag requires GNU date)
jq -c --arg since "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  'select(.level == "ERROR" and .timestamp > $since)' /var/log/orchestrator.log

# Track budget usage by agent
jq -c 'select(.message == "Budget updated") | {agent: .agent_id, usage: .usage_percent}' \
  /var/log/orchestrator.log

# Monitor rate limiting
jq -s 'map(select(.message == "Rate limited")) | group_by(.provider) | map({provider: .[0].provider, count: length})' \
  /var/log/orchestrator.log
```
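
When the analysis outgrows one-liners, the same log file can be processed in Python. This sketch assumes one JSON object per line with the field names from the sample entry above and timestamps in the `YYYY-MM-DDTHH:MM:SSZ` form; the function names are hypothetical.

```python
import json
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, Iterator, List


def parse_lines(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per valid JSON log line, skipping blank or malformed lines."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record


def recent_errors(records: Iterable[dict], now: datetime,
                  window: timedelta = timedelta(hours=1)) -> List[dict]:
    """Return ERROR records whose timestamp falls within `window` before `now`."""
    cutoff = now - window
    matches = []
    for r in records:
        if r.get("level") != "ERROR" or "timestamp" not in r:
            continue
        ts = datetime.strptime(r["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
        if ts.replace(tzinfo=timezone.utc) >= cutoff:
            matches.append(r)
    return matches


def cost_by_agent(records: Iterable[dict]) -> Dict[str, float]:
    """Sum cost_usd per agent_id across all records that carry both fields."""
    totals: Dict[str, float] = defaultdict(float)
    for r in records:
        if "agent_id" in r and "cost_usd" in r:
            totals[r["agent_id"]] += float(r["cost_usd"])
    return dict(totals)
```

A typical call opens `/var/log/orchestrator.log`, passes the file object to `parse_lines`, and hands the resulting records to either function with `datetime.now(timezone.utc)` as `now`.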

### Alerting Rules

#### Stuck Agent Detection

```yaml
alert: StuckAgent
expr: agent_last_activity_seconds > 600
for: 5m
labels:
  severity: warning
annotations:
  summary: "Agent {{ $labels.agent_id }} appears stuck"
  description: "No activity for {{ $value }} seconds"
```

#### Budget Exceeded

```yaml
alert: BudgetExceeded
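# Assumes the metric is exported as a 0-1 ratio, which humanizePercentage expects.
# If it is exported as 0-100, compare against 95 and drop humanizePercentage.
expr: budget_usage_percent > 0.95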
for: 1m
labels:
  severity: critical
annotations:
  summary: "Budget exceeded for {{ $labels.agent_id }}"
  description: "Usage at {{ $value | humanizePercentage }}"
```

#### Circuit Breaker Open

```yaml
alert: CircuitBreakerOpen
expr: circuit_breaker_state == 2  # OPEN state
for: 2m
labels:
  severity: critical
annotations:
  summary: "Circuit breaker open for {{ $labels.provider }}"
  description: "Provider may be experiencing issues"
```

---

## Troubleshooting

### Common Issues

#### 1. Agent Not Responding

**Symptoms:**
- Task stuck in "in_progress" state
- No recent log entries for agent
- Health check shows agent as unhealthy

**Diagnosis:**
```bash
# Check agent status
curl http://localhost:8080/api/agents/{agent_id}/status

# View recent logs
grep "agent_id.*{agent_id}" /var/log/orchestrator.log | tail -50

# Check for stuck work
curl http://localhost:8080/api/agents/{agent_id}/in_flight
```

**Resolution:**
1. Check if agent is waiting for approval
2. Verify LLM API connectivity
3. Check rate limit status
4. Force stop if necessary:
   ```bash
   curl -X POST http://localhost:8080/api/agents/{agent_id}/stop
   ```

#### 2. Rate Limit Errors

**Symptoms:**
- `RateLimitError` in logs
- Tasks failing with 429 errors
- Backoff delays increasing

**Diagnosis:**
```bash
# Check rate limit state
curl http://localhost:8080/api/providers/{provider}/rate_limit

# View backoff status
grep "Rate limited" /var/log/orchestrator.log | tail -20
```

**Resolution:**
1. Reduce concurrent requests
2. Increase backoff delays
3. Enable fallback models
4. Contact provider for limit increase

#### 3. Budget Exceeded

**Symptoms:**
- Tasks rejected with "budget exceeded"
- Agents paused automatically
- Alert notifications triggered

**Diagnosis:**
```bash
# Check budget status
curl http://localhost:8080/api/budgets/{agent_id}

# View usage history
curl "http://localhost:8080/api/budgets/{agent_id}/history?days=7"
```

**Resolution:**
1. Review usage patterns
2. Increase budget limits if justified
3. Reset daily budget (with approval):
   ```bash
   curl -X POST http://localhost:8080/api/budgets/{agent_id}/reset
   ```

#### 4. Merge Lock Stuck

**Symptoms:**
- Branch locked for extended period
- Other agents waiting for merge
- Timeout errors in merge operations

**Diagnosis:**
```bash
# Check lock status
curl http://localhost:8080/api/merge/locks

# View lock holder
curl http://localhost:8080/api/merge/branches/{branch}/lock
```

**Resolution:**
1. Check if merge is still in progress
2. Verify lock holder agent is active
3. Force release lock if necessary:
   ```bash
   curl -X DELETE "http://localhost:8080/api/merge/branches/{branch}/lock?force=true"
   ```
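
Because a forced release can interrupt a merge that is still running, the first two resolution steps are worth encoding as a guard. The sketch below is one way to express that decision. The `MergeLock` fields are a hypothetical shape for the lock record, and the 600 second default comes from the `merge_lock_duration` warning threshold in the metrics table.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class MergeLock:
    """Hypothetical shape of a lock record; the real API response may differ."""
    branch: str
    holder_agent_id: str
    acquired_at: float  # Unix timestamp in seconds


def should_force_release(lock: MergeLock, active_agents: Set[str], now: float,
                         max_hold_seconds: float = 600.0) -> bool:
    """True only when the lock is overdue AND its holder is no longer active."""
    overdue = now - lock.acquired_at > max_hold_seconds
    holder_gone = lock.holder_agent_id not in active_agents
    return overdue and holder_gone


lock = MergeLock(branch="main", holder_agent_id="claude-code-1", acquired_at=1000.0)
print(should_force_release(lock, {"gemini-cli-1"}, now=2000.0))   # True
print(should_force_release(lock, {"claude-code-1"}, now=2000.0))  # False
```

Requiring both conditions keeps a slow but healthy merge from being interrupted just because it crossed the warning threshold.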

#### 5. Circuit Breaker Open

**Symptoms:**
- Requests failing immediately
- "Circuit breaker open" in logs
- Provider marked as unavailable

**Diagnosis:**
```bash
# Check circuit breaker status
curl http://localhost:8080/api/providers/{provider}/circuit_breaker

# View failure history
grep "circuit_breaker" /var/log/orchestrator.log | tail -30
```

**Resolution:**
1. Check provider status page
2. Wait for recovery timeout
3. Force half-open to test:
   ```bash
   curl -X POST http://localhost:8080/api/providers/{provider}/circuit_breaker/test
   ```
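
The states referred to above follow the standard circuit breaker pattern: closed (normal), open (failing fast), and half-open (probing after the recovery timeout). The following minimal sketch shows those transitions. The class, the failure threshold, and the timeout value are illustrative assumptions, not the orchestrator's implementation.

```python
import time
from typing import Callable

CLOSED, HALF_OPEN, OPEN = "closed", "half_open", "open"


class CircuitBreaker:
    """Minimal three-state breaker; thresholds here are illustrative assumptions."""

    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = 0.0
        self._state = CLOSED

    @property
    def state(self) -> str:
        # Once the recovery timeout has elapsed, an open breaker becomes
        # half-open and lets probe requests through.
        if self._state == OPEN and self._clock() - self._opened_at >= self.recovery_timeout:
            self._state = HALF_OPEN
        return self._state

    def allow_request(self) -> bool:
        return self.state != OPEN

    def record_success(self) -> None:
        # Any success closes the breaker and clears the failure count.
        self._failures = 0
        self._state = CLOSED

    def record_failure(self) -> None:
        self._failures += 1
        # A failed probe in half-open reopens immediately.
        if self.state == HALF_OPEN or self._failures >= self.failure_threshold:
            self._state = OPEN
            self._opened_at = self._clock()
```

Injecting the clock keeps the recovery timeout testable without sleeping, which is useful when rehearsing this runbook in a staging environment.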

### Emergency Procedures

#### Force Shutdown

```bash
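# List matching processes first: -f matches the full command line, so an
# open `tail -f /var/log/orchestrator.log` would also be signalled
pgrep -af orchestrator

# Graceful shutdown (waits for in-flight work)
kill -SIGTERM $(pgrep -f orchestrator)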

# Check shutdown progress
tail -f /var/log/orchestrator.log | grep shutdown

# Force immediate shutdown (if graceful hangs)
kill -SIGKILL $(pgrep -f orchestrator)
```

#### Emergency Budget Override

```bash
# Temporarily increase budget (requires admin)
curl -X POST http://localhost:8080/api/admin/budgets/override \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"agent_id": "claude-code-1", "override_usd": 50.00, "duration_hours": 4}'
```

#### Bypass Approval (Emergency Only)

```bash
# Bypass pending approval (requires emergency admin)
curl -X POST http://localhost:8080/api/admin/approvals/{approval_id}/bypass \
  -H "Authorization: Bearer $EMERGENCY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"reason": "Production incident requires immediate action"}'
```

---

## Scaling

### Horizontal Scaling

The system can scale horizontally with the following considerations:

#### Database
- Use PostgreSQL for multi-instance deployments
- Enable connection pooling
- Configure read replicas for observability queries

#### State Management
- Use Redis for distributed locking
- Configure merge lock to use distributed backend
- Enable session affinity for agent assignments

#### Load Balancing

```nginx
upstream orchestrator {
    least_conn;
    server orchestrator-1:8080;
    server orchestrator-2:8080;
    server orchestrator-3:8080;
}
```

### Vertical Scaling

| Resource | Minimum | Recommended | High Load |
|----------|---------|-------------|-----------|
| CPU | 2 cores | 4 cores | 8 cores |
| Memory | 4 GB | 8 GB | 16 GB |
| Disk | 20 GB | 50 GB | 100 GB |

### Performance Tuning

```yaml
# Tune for high throughput
performance:
  # Database connections
  db_pool_size: 20
  db_max_overflow: 40

  # Async workers
  worker_pool_size: 100

  # Caching
  cache_ttl_seconds: 60
  cache_max_entries: 10000

  # Rate limiting
  rate_limit_check_interval: 0.1
```

---

## Maintenance

### Routine Tasks

#### Daily
- [ ] Review error logs
- [ ] Check budget usage across agents
- [ ] Verify all health checks passing
- [ ] Review pending approvals

#### Weekly
- [ ] Audit trail review
- [ ] Rate limit analysis
- [ ] Budget optimization review
- [ ] Clean up old task records

#### Monthly
- [ ] Database maintenance (VACUUM, ANALYZE)
- [ ] Log rotation and archival
- [ ] Security audit of MCP tools
- [ ] Provider API key rotation

### Database Maintenance

```bash
# Adjust the path if ORCHESTRATOR_DB_PATH points elsewhere

# Clean old records first (audit events > 90 days, task history > 30 days)
sqlite3 /var/data/orchestrator.db "DELETE FROM audit_events WHERE timestamp < datetime('now', '-90 days');"
sqlite3 /var/data/orchestrator.db "DELETE FROM task_history WHERE completed_at < datetime('now', '-30 days');"

# Then refresh statistics and reclaim the freed space
sqlite3 /var/data/orchestrator.db "ANALYZE;"
sqlite3 /var/data/orchestrator.db "VACUUM;"
```
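
If the cleanup is scheduled from code instead of cron, the same statements can be run through Python's built-in `sqlite3` module. This is a sketch under the assumption that the `audit_events` and `task_history` tables and their timestamp columns match the commands above; `prune_and_compact` is a hypothetical name.

```python
import sqlite3
from typing import Dict


def prune_and_compact(conn: sqlite3.Connection, audit_days: int = 90,
                      task_days: int = 30) -> Dict[str, int]:
    """Delete expired rows in one transaction, then ANALYZE and VACUUM.

    Returns the number of rows removed from each table.
    """
    with conn:  # commits on success, rolls back if either DELETE fails
        audit = conn.execute(
            "DELETE FROM audit_events WHERE timestamp < datetime('now', ?)",
            (f"-{audit_days} days",),
        ).rowcount
        tasks = conn.execute(
            "DELETE FROM task_history WHERE completed_at < datetime('now', ?)",
            (f"-{task_days} days",),
        ).rowcount
    conn.execute("ANALYZE")
    conn.execute("VACUUM")  # must run outside a transaction
    return {"audit_events": audit, "task_history": tasks}
```

Deleting before `VACUUM` matters: `VACUUM` rebuilds the file, so running it first would leave the space freed by the deletes unreclaimed until the next run.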

### Backup and Recovery

```bash
# Backup database (plain copy; only safe while the orchestrator is stopped)
cp /var/data/orchestrator.db /var/backups/orchestrator-$(date +%Y%m%d).db

# Online backup (safe while the orchestrator is running)
sqlite3 /var/data/orchestrator.db ".backup /var/backups/orchestrator-$(date +%Y%m%d).db"

# Verify backup
sqlite3 /var/backups/orchestrator-$(date +%Y%m%d).db "PRAGMA integrity_check;"

# Restore from backup (stop the orchestrator first)
cp /var/backups/orchestrator-20240115.db /var/data/orchestrator.db
```
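
The backup and verification steps can also be combined in Python, whose `sqlite3.Connection.backup` method (Python 3.7+) uses SQLite's online backup API. The sketch below is illustrative; `backup_database` is a hypothetical helper and the paths are supplied by the caller.

```python
import sqlite3
from pathlib import Path


def backup_database(source: Path, dest: Path) -> None:
    """Copy a SQLite database to `dest` and verify the copy's integrity."""
    if not source.is_file():
        # sqlite3.connect would otherwise silently create an empty database.
        raise FileNotFoundError(source)
    dest.parent.mkdir(parents=True, exist_ok=True)
    src = sqlite3.connect(str(source))
    dst = sqlite3.connect(str(dest))
    try:
        src.backup(dst)  # online backup API; safe while the source is in use
        result = dst.execute("PRAGMA integrity_check").fetchone()[0]
    finally:
        dst.close()
        src.close()
    if result != "ok":
        dest.unlink()
        raise RuntimeError(f"backup failed integrity check: {result}")
```

Removing a copy that fails the check prevents a corrupt file from later being mistaken for a usable backup.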

### Upgrade Procedures

1. **Prepare**
   ```bash
   # Notify users of maintenance window
   # Stop accepting new tasks
   curl -X POST http://localhost:8080/api/admin/maintenance/start
   ```

2. **Backup**
   ```bash
   # Full backup before upgrade
   ./scripts/backup.sh
   ```

3. **Upgrade**
   ```bash
   # Stop current instance
   systemctl stop orchestrator

   # Deploy new version
   ./scripts/deploy.sh

   # Run migrations
   ./scripts/migrate.sh

   # Start new version
   systemctl start orchestrator
   ```

4. **Verify**
   ```bash
   # Check health
   curl http://localhost:8080/health

   # Run smoke tests
   ./scripts/smoke-test.sh

   # End maintenance
   curl -X POST http://localhost:8080/api/admin/maintenance/end
   ```

---

## Appendix

### Error Codes

| Code | Meaning | Action |
|------|---------|--------|
| `E001` | Budget exceeded | Review usage, increase if needed |
| `E002` | Rate limited | Wait for backoff, reduce load |
| `E003` | Approval timeout | Re-request or escalate |
| `E004` | Merge conflict | Resolve conflicts manually |
| `E005` | Circuit breaker open | Wait for provider recovery |
| `E006` | Agent stuck | Force stop and restart |
| `E007` | Authentication failed | Re-authenticate the CLI or check API keys |
| `E008` | MCP tool denied | Review tool permissions |

### Support Contacts

- **On-call**: oncall@example.com
- **Escalation**: platform-team@example.com
- **Security Issues**: security@example.com

### References

- [Implementation Plan](./IMPLEMENTATION_PLAN.md)
- [API Documentation](./API.md)
- [Security Guide](./SECURITY.md)
