# Agent Recovery Runbook

## Purpose

This runbook describes how to recover from common agent failure scenarios.

## Prerequisites

- Access to the orchestrator logs
- tmux installed (for workspace recovery)
- git installed (for worktree recovery)

## Scenario 1: Stuck Agent

### Symptoms
- Agent produces no output for an extended period
- Token burn rate is high, but the task makes no progress
- Health check reports `is_stuck: true`

### Diagnosis
```bash
# Check agent status
tmux capture-pane -t <agent-id> -p -S -100

# Check health samples
sqlite3 data/orchestrator.db "SELECT * FROM health_samples WHERE agent_id='<agent-id>' ORDER BY sampled_at DESC LIMIT 5;"
```
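
To tell a stalled pane from a merely slow one, it can help to compare two snapshots taken some seconds apart. The following is a minimal sketch: the helper name `pane_unchanged`, the `AGENT_ID` variable, and the 30-second interval are illustrative assumptions, not part of the orchestrator.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the orchestrator): succeed (exit 0) when
# two pane snapshots are byte-for-byte identical, i.e. no new output appeared.
pane_unchanged() {
  [ "$1" = "$2" ]
}

# Example usage (requires a running tmux session named after the agent):
#   before=$(tmux capture-pane -t "$AGENT_ID" -p -S -100)
#   sleep 30   # assumed interval; tune to the agent's normal cadence
#   after=$(tmux capture-pane -t "$AGENT_ID" -p -S -100)
#   pane_unchanged "$before" "$after" && echo "no new output in 30s"
```

An unchanged pane is only a hint; the `is_stuck` flag from the health check remains the primary signal.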

### Resolution

1. **Try auto-unstick prompt** (orchestrator does this automatically). The control loop sends a nudge prompt like:
   ```
   You seem stuck. Try a different approach or ask for help.
   ```

2. **Manual intervention** (if auto-unstick fails):
   ```bash
   # Attach to the agent's tmux session
   tmux attach -t <agent-id>

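   # Send interrupt: press Ctrl+C inside the attached session,
   # or send it from another shell without attaching:
   tmux send-keys -t <agent-id> C-c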

   # Or kill the session entirely
   tmux kill-session -t <agent-id>
   ```

3. **Record the stuck pattern** (for future learning):
   - Add to `/ops/patterns/error-fixes.md`
   - Include: error pattern, resolution, prevention
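
The manual intervention in step 2 can be scripted so that the session is killed only when an interrupt has no visible effect. This is a sketch under stated assumptions: the function name `unstick_agent`, the default grace period, and the "pane output unchanged means the interrupt failed" heuristic are all hypothetical and should be adapted to how the agent actually behaves.

```bash
#!/usr/bin/env bash
# Hypothetical helper: send Ctrl+C to an agent's tmux session and kill the
# session only if the pane output is unchanged after a grace period.
unstick_agent() {
  local session="$1"
  local grace="${2:-5}"   # assumed grace period in seconds
  local before after

  # Nothing to do if the session is already gone.
  if ! tmux has-session -t "$session" 2>/dev/null; then
    echo "no-session"
    return 1
  fi

  before=$(tmux capture-pane -t "$session" -p)
  tmux send-keys -t "$session" C-c      # interrupt without attaching
  sleep "$grace"
  after=$(tmux capture-pane -t "$session" -p)

  if [ "$before" = "$after" ]; then
    # The interrupt produced no visible change: escalate.
    tmux kill-session -t "$session"
    echo "killed"
  else
    echo "interrupted"
  fi
}

# Example usage:
#   unstick_agent "$AGENT_ID" 10
```

Killing the session discards the agent's in-progress state, so the function escalates only after the gentler interrupt has been tried.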

## Scenario 2: Crashed Workspace

### Symptoms
- tmux session doesn't exist
- git worktree is corrupted

### Resolution

```bash
# List existing worktrees
git worktree list

# Remove corrupted worktree
git worktree remove agent-<agent-id> --force

# Clean up stale worktree metadata if the directory is already gone
git worktree prune

# Recreate workspace
./scripts/create_worktrees.sh -a <agent-id>
```
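
Before removing anything, it may be worth confirming which of the two symptoms actually applies. The sketch below assumes worktree directories are named `agent-<agent-id>`, as in the commands above; the helper name `worktree_registered` and the `AGENT_ID` variable are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `git worktree list --porcelain` on stdin and
# succeed if a worktree whose directory is named agent-<id> is registered.
worktree_registered() {
  local id="$1" key path
  while read -r key path; do
    # Porcelain output has one "worktree <absolute-path>" line per worktree.
    if [ "$key" = "worktree" ] && [ "${path##*/}" = "agent-$id" ]; then
      return 0
    fi
  done
  return 1
}

# Example usage:
#   tmux has-session -t "$AGENT_ID" 2>/dev/null || echo "tmux session missing"
#   git worktree list --porcelain | worktree_registered "$AGENT_ID" \
#     || echo "worktree not registered"
```

Parsing the `--porcelain` format rather than the default listing keeps the check stable across git versions.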

## Scenario 3: Budget Exceeded

### Symptoms
- Tasks rejected with "budget exceeded"
- Agent cannot route new tasks

### Resolution

1. **Check current spend**:
   ```bash
   sqlite3 data/orchestrator.db "SELECT * FROM usage_daily WHERE date=date('now');"
   ```

2. **Wait for daily reset** (happens at midnight UTC)

3. **Or increase budget** (requires approval):
   - Update `BUDGETS` in config
   - Log the change in decisions/
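
When deciding between waiting and requesting a budget increase, it helps to know how long the wait would be. A minimal sketch, assuming the reset really happens at 00:00 UTC as stated above; the function name is hypothetical.

```bash
#!/usr/bin/env bash
# Print the number of seconds from a given Unix timestamp (default: now) to
# the next midnight UTC. Unix time has exactly 86400 seconds per day, so the
# remainder modulo 86400 is the time elapsed since 00:00 UTC.
seconds_until_utc_midnight() {
  local now="${1:-$(date +%s)}"
  echo $(( 86400 - now % 86400 ))
}

# Example usage:
#   secs=$(seconds_until_utc_midnight)
#   echo "budget resets in $(( secs / 3600 ))h $(( secs % 3600 / 60 ))m"
```

Called exactly at midnight, the function reports a full day (86400), i.e. the time to the following reset.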

## Scenario 4: Secret Leak Detection

### Symptoms
- Memory Write Gate blocks a patch with "secrets detected"
- Librarian quarantines a memory item

### Resolution

1. **DO NOT approve the patch**

2. **Identify the source**:
   ```bash
   # Check recent task outputs (-l lists matching files without echoing the secret)
   grep -rl "sk-" data/logs/
   grep -rli "password" data/logs/
   ```

3. **Remove from any cached memory**:
   ```bash
   # Purge from working memory
   # Delete any temp files
   rm -rf data/working_memory/<task-id>/
   ```

4. **Rotate the exposed secret**

5. **Add pattern to SecretRedactor** if not already covered
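
When identifying the source in step 2, printing matching lines would copy the secret into the terminal scrollback. The sketch below reports only the location of each hit. The helper name `locate_secrets` and both patterns are illustrative assumptions; they should be aligned with whatever `SecretRedactor` actually covers.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print "path:line" for each line that looks like it
# contains a secret, without printing the line's content.
# The two patterns are assumptions for illustration only.
locate_secrets() {
  local dir="$1"
  grep -rnEi -e 'sk-[A-Za-z0-9_-]{8,}' -e 'password' -- "$dir" \
    | cut -d: -f1,2   # keep path and line number, drop the matched text
}

# Example usage:
#   locate_secrets data/logs/
```

Paths that themselves contain a colon would be truncated by `cut`; for such layouts a null-separated approach would be needed instead.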

## Prevention

- Always run `SecretRedactor.redact()` on agent output
- Use `SecretFileGuard` to prevent writing secret files
- Review memory patches before approval
- Run regular maintenance scans for leaked secrets
