# Agent Orchestration Research Document

**Date:** January 15, 2026
**Version:** 4.0 (Implementation-Ready with Full Control Systems)
**Purpose:** Design a multi-agent orchestration system with unified usage tracking, intelligent routing, risk-based autonomy, and robust state management

---

## Table of Contents

1. [Project Requirements](#project-requirements)
2. [Critical Concept: Two Control Planes](#critical-concept-two-control-planes)
3. [The Adapter Layer (Key Architecture Decision)](#the-adapter-layer-key-architecture-decision)
4. [Project Journal Protocol (Context Handoff)](#project-journal-protocol-context-handoff)
5. [Orchestrator Persistence (SQLite from Day 1)](#orchestrator-persistence-sqlite-from-day-1)
6. [Unified Usage/Limits Ledger](#unified-usagelimits-ledger)
7. [MCP/Tool Budget Controls](#mcptool-budget-controls)
8. [Task Routing: Best CLI for the Job](#task-routing-best-cli-for-the-job)
9. [Control Loop: Monitoring Agent Health](#control-loop-monitoring-agent-health)
10. [Risk Gate: Four-Tier Autonomy](#risk-gate-four-tier-autonomy)
11. [Human Interrupt Interface](#human-interrupt-interface)
12. [Secret Handling Subsystem](#secret-handling-subsystem)
13. [Control Plane A: CLI/Workspace Agents](#control-plane-a-cliworkspace-agents)
14. [Control Plane B: API/Orchestration Agents](#control-plane-b-apiorchestration-agents)
15. [Observability Strategy](#observability-strategy)
16. [Merge Gate Mechanics](#merge-gate-mechanics)
17. [Implementation-Ready Architecture](#implementation-ready-architecture)
18. [Critical Gotchas](#critical-gotchas)
19. [Production CLI + Orchestration Stack](#production-cli--orchestration-stack)
20. [Phased Implementation Plan](#phased-implementation-plan)
21. [References](#references)

---

## Project Requirements

### Core Objectives

- Python-initiated regular prompts to check status and review work
- Plan and assign new work tasks to agents
- Multi-model support (Claude Code, Gemini, Codex, other AI models)
- Clear role assignment for agents to limit costs
- Track Claude Code usage limits
- Direct CLI collaboration with models
- Real-time visibility into agent work

### Key Criteria

| Requirement | Priority | Notes |
|-------------|----------|-------|
| Multi-model orchestration | High | Claude, Gemini, Codex, custom models |
| Role-based agent assignment | High | Limit costs, specialize tasks |
| CLI interaction | High | Direct user collaboration |
| Agent visibility/dashboard | High | See work in real-time |
| Cost tracking | High | Token usage, API costs |
| Python-based | Medium | Scheduled prompts, automation |
| Open source | Medium | Customizable, self-hosted |

---

## Critical Concept: Two Control Planes

**Key insight:** CLI agents and API agents are fundamentally different "control planes" and your orchestrator must handle both.

### Control Plane A: CLI (Interactive Coding Tools)

| Agent | Description | Best For |
|-------|-------------|----------|
| **Claude Code** | Local CLI experience, agentic coding | Workspaces, human-in-the-loop |
| **Gemini CLI** | Open-source terminal agent with ReAct loop + MCP | Terminal tool behavior |
| **Codex CLI** | Local terminal tool, API-backed | Quick code generation |

**Characteristics:**
- Workspaces and terminal sessions
- Git branches and file system access
- Human-in-the-loop interaction
- Best for: real repo edits, tests, git commits

### Control Plane B: API (Programmatic Orchestration)

| Agent | Description | Best For |
|-------|-------------|----------|
| **Claude Agent SDK** | "Claude Code as a library" - same agent loop + tools | Orchestration, automation |
| **OpenAI Agents SDK** | Handoffs, sessions, guardrails, usage reporting | Codex/GPT programmatic access |
| **Gemini API / Vertex** | Direct API calls | Scriptable model calls |

**Characteristics:**
- Fast, cheap, scriptable
- Programmatic control
- Easy cost tracking
- Best for: automation, scheduled tasks, pipelines

---

## The Adapter Layer (Key Architecture Decision)

**This is the key to a clean implementation.** Define one interface that the orchestrator talks to:

### Interface Design

```python
from abc import ABC, abstractmethod
from typing import AsyncIterator
from dataclasses import dataclass
from datetime import datetime
from enum import Enum

class RiskLevel(Enum):
    LOW = "low"           # Auto-edit allowed
    MEDIUM = "medium"     # Edits OK, commands need approval
    HIGH = "high"         # Suggest-only + ask user
    CRITICAL = "critical" # Auto-reject, never allow

@dataclass
class AgentResponse:
    content: str
    tokens_used: int
    cost: float
    model: str
    metadata: dict
    artifacts: 'TaskArtifacts'

@dataclass
class TaskArtifacts:
    """Required outputs from every task"""
    diff_summary: str
    tests_run: list[str]
    test_results: dict  # {"passed": int, "failed": int, "skipped": int}
    risk_items_encountered: list[str]
    next_action_recommendation: str
    files_modified: list[str]
    commits_created: list[str]

@dataclass
class UsageStats:
    tokens_input: int
    tokens_output: int
    total_cost: float
    requests_count: int
    errors_count: int
    last_activity: datetime

@dataclass
class StatusPacket:
    """Required output at end of every agent run"""
    agent_id: str
    task_id: str
    timestamp: datetime
    status: str  # "completed", "blocked", "failed", "needs_approval"
    progress_summary: str
    artifacts: TaskArtifacts
    state_changes: dict  # What changed in project_state.json
    blockers: list[str]
    next_steps: list[str]

class BaseAdapter(ABC):
    """Base interface for all agent adapters"""

    @abstractmethod
    async def execute(self, task: str, context: dict) -> AgentResponse:
        """Execute a task and return response"""
        pass

    @abstractmethod
    async def stream(self, task: str, context: dict) -> AsyncIterator[str]:
        """Stream response for real-time output"""
        pass

    @abstractmethod
    def get_usage(self) -> UsageStats:
        """Return usage statistics"""
        pass

    @abstractmethod
    def is_healthy(self) -> bool:
        """Check if agent is responsive"""
        pass

    @abstractmethod
    def write_status_packet(self) -> StatusPacket:
        """Write status packet at end of run (REQUIRED)"""
        pass


class LLMAdapter(BaseAdapter):
    """Adapter for API-based models (fast, cheap, scriptable)"""
    pass


class CLIAgentAdapter(BaseAdapter):
    """Adapter for CLI workspace agents (repo edits, tests, commits)"""
    pass
```

---

## Project Journal Protocol (Context Handoff)

**This solves "Gemini overwrote Claude's decision" more reliably than hoping agents notice diffs.**

### Hard Rule: Every Agent Run Must

1. **Read** `project_state.json` before starting
2. **Write back** a structured `StatusPacket` at end
3. **Append** to `agent_journal.md` with decision rationale

### project_state.json (Machine-Readable)

```json
{
  "version": "1.0",
  "last_updated": "2026-01-15T14:30:00Z",
  "updated_by": "claude_code_agent_1",

  "current_objectives": [
    {
      "id": "obj-001",
      "description": "Implement user authentication",
      "status": "in_progress",
      "assigned_agent": "claude_code",
      "priority": 1
    }
  ],

  "active_branches": {
    "claude_code": "feature/auth-claude",
    "gemini_cli": "feature/docs-gemini",
    "codex": "feature/tests-codex"
  },

  "active_worktrees": {
    "claude_code": "../agent-claude-workspace",
    "gemini_cli": "../agent-gemini-workspace",
    "codex": "../agent-codex-workspace"
  },

  "constraints": [
    "No changes to production database schema without approval",
    "All new endpoints must have tests",
    "Use existing auth patterns from /src/auth/"
  ],

  "decisions_made": [
    {
      "id": "dec-001",
      "timestamp": "2026-01-15T10:00:00Z",
      "agent": "claude_code",
      "decision": "Use JWT for authentication instead of sessions",
      "rationale": "Better for API-first architecture, matches existing patterns",
      "reversible": true
    }
  ],

  "open_risks": [
    {
      "id": "risk-001",
      "description": "Auth migration may break existing sessions",
      "severity": "medium",
      "mitigation": "Add backwards-compatible session check"
    }
  ],

  "definition_of_done": [
    "All tests pass",
    "Code reviewed by reviewer agent",
    "No critical security issues",
    "Documentation updated",
    "PR approved"
  ],

  "blocked_items": [],

  "shared_context": {
    "api_patterns": "See /docs/api-patterns.md",
    "test_conventions": "pytest with fixtures in /tests/conftest.py",
    "deployment_notes": "Staging deploys automatically on PR merge"
  }
}
```
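
Since every run starts by reading this file, it may help to reject a malformed state before the agent acts on it. The following is a minimal sketch, assuming the top-level keys shown in the example are the required ones; `REQUIRED_KEYS` and `validate_state` are hypothetical names, not part of any tool.

```python
import json

# Assumption: these top-level keys, taken from the example file, are mandatory
REQUIRED_KEYS = {
    "version", "last_updated", "updated_by", "current_objectives",
    "active_branches", "constraints", "decisions_made", "open_risks",
}


def validate_state(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the state is usable."""
    try:
        state = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(state, dict):
        return ["top-level value must be an object"]
    missing = sorted(REQUIRED_KEYS - state.keys())
    return [f"missing key: {key}" for key in missing]
```

An orchestrator could refuse to start a run, or escalate to the user, whenever the returned list is non-empty.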

### agent_journal.md (Human-Readable)

```markdown
# Agent Journal

## 2026-01-15

### 14:30 - Claude Code Agent
**Task:** Implement JWT authentication
**Decision:** Use `python-jose` library for JWT handling
**Rationale:** Already in requirements.txt, well-maintained, matches team's existing usage
**Diff:** [feature/auth-claude@abc123](link)
**Next:** Need to add refresh token logic

### 14:45 - Gemini CLI Agent
**Task:** Update API documentation
**Decision:** Added OpenAPI annotations to new auth endpoints
**Note:** Noticed Claude's JWT implementation - updated docs to match
**Diff:** [feature/docs-gemini@def456](link)

### 15:00 - Codex Agent
**Task:** Generate auth tests
**Decision:** Created 12 test cases covering JWT validation edge cases
**Tests:** All passing
**Diff:** [feature/tests-codex@ghi789](link)
```

### Write Contract Implementation

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# StatusPacket is the dataclass defined in the adapter interface section


def _utc_now() -> str:
    """UTC timestamp in the 'Z' format used by project_state.json"""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


class ProjectJournal:
    def __init__(self, project_root: Path):
        self.state_file = project_root / "project_state.json"
        self.journal_file = project_root / "agent_journal.md"

    def read_state(self) -> dict:
        """MUST be called at start of every agent run"""
        if self.state_file.exists():
            return json.loads(self.state_file.read_text(encoding="utf-8"))
        return self._create_initial_state()

    def _create_initial_state(self) -> dict:
        """Empty state with the same top-level keys as the example file"""
        return {
            "version": "1.0", "last_updated": _utc_now(), "updated_by": None,
            "current_objectives": [], "active_branches": {}, "active_worktrees": {},
            "constraints": [], "decisions_made": [], "open_risks": [],
            "definition_of_done": [], "blocked_items": [], "shared_context": {},
        }

    def write_status(self, packet: StatusPacket) -> None:
        """MUST be called at end of every agent run"""
        # Update machine-readable state
        state = self.read_state()
        state["last_updated"] = _utc_now()
        state["updated_by"] = packet.agent_id

        # Apply state changes from packet; keys the state does not already
        # have are ignored, so agents cannot add arbitrary top-level fields
        for key, value in packet.state_changes.items():
            if key in state:
                state[key] = value

        self.state_file.write_text(json.dumps(state, indent=2), encoding="utf-8")

        # Append to human-readable journal
        self._append_journal(packet)

    def _append_journal(self, packet: StatusPacket) -> None:
        """Append entry to agent_journal.md"""
        entry = f"""
### {packet.timestamp.strftime('%H:%M')} - {packet.agent_id}
**Task:** {packet.progress_summary}
**Status:** {packet.status}
**Files Modified:** {', '.join(packet.artifacts.files_modified) or 'None'}
**Tests:** {packet.artifacts.test_results}
**Next Steps:** {', '.join(packet.next_steps) or 'None'}
**Blockers:** {', '.join(packet.blockers) or 'None'}

"""
        with open(self.journal_file, 'a', encoding="utf-8") as f:
            f.write(entry)

    def record_decision(self, agent_id: str, decision: str, rationale: str) -> None:
        """Record a significant decision"""
        state = self.read_state()
        decisions = state.setdefault("decisions_made", [])
        decisions.append({
            "id": f"dec-{len(decisions) + 1:03d}",
            "timestamp": _utc_now(),
            "agent": agent_id,
            "decision": decision,
            "rationale": rationale,
            "reversible": True
        })
        self.state_file.write_text(json.dumps(state, indent=2), encoding="utf-8")
```
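
Several agents may read `project_state.json` while another is writing it, and a plain `write_text` can expose a half-written file. One common mitigation is to write to a temporary file and rename it over the target. The sketch below assumes the temp file lives on the same filesystem as the state file; `write_state_atomically` is a hypothetical helper name.

```python
import json
import os
import tempfile
from pathlib import Path


def write_state_atomically(state_file: Path, state: dict) -> None:
    """Write JSON to a temp file in the same directory, then rename it into place."""
    fd, tmp_name = tempfile.mkstemp(dir=state_file.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(state, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes are on disk before the rename
        os.replace(tmp_name, state_file)  # atomic when source and target share a filesystem
    except BaseException:
        # Clean up the temp file if anything failed before the rename
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

This protects readers from torn files. It does not prevent lost updates when two agents read, modify, and write at the same time; that would need a lock or a single writer such as the orchestrator.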

---

## Orchestrator Persistence (SQLite from Day 1)

**Persist orchestrator state in SQLite rather than JSON files: it makes restart recovery and analytics straightforward.**

### Schema

```sql
-- Core tables for orchestrator state

CREATE TABLE IF NOT EXISTS agents (
    id TEXT PRIMARY KEY,
    tool TEXT NOT NULL,  -- 'claude_code', 'gemini_cli', 'codex', etc.
    worktree_path TEXT,
    branch TEXT,
    status TEXT DEFAULT 'idle',  -- 'idle', 'running', 'stuck', 'paused'
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE IF NOT EXISTS tasks (
    id TEXT PRIMARY KEY,
    description TEXT NOT NULL,
    task_type TEXT NOT NULL,  -- matches TaskType enum
    priority INTEGER DEFAULT 0,
    status TEXT DEFAULT 'pending',  -- 'pending', 'assigned', 'running', 'completed', 'failed', 'blocked'
    assigned_agent_id TEXT REFERENCES agents(id),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    started_at TIMESTAMP,
    completed_at TIMESTAMP
);

CREATE TABLE IF NOT EXISTS runs (
    id TEXT PRIMARY KEY,
    agent_id TEXT NOT NULL REFERENCES agents(id),
    task_id TEXT REFERENCES tasks(id),
    started_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    ended_at TIMESTAMP,
    outcome TEXT,  -- 'success', 'failure', 'blocked', 'timeout'
    tokens_input INTEGER DEFAULT 0,
    tokens_output INTEGER DEFAULT 0,
    cost_usd REAL DEFAULT 0.0,
    error_message TEXT,
    status_packet_json TEXT  -- Full StatusPacket as JSON
);

CREATE TABLE IF NOT EXISTS health_samples (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    agent_id TEXT NOT NULL REFERENCES agents(id),
    sampled_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    last_stdout_at TIMESTAMP,
    last_file_change_at TIMESTAMP,
    token_burn_rate REAL,  -- tokens/min
    error_count INTEGER DEFAULT 0,
    consecutive_same_error INTEGER DEFAULT 0,
    pending_permission_prompt BOOLEAN DEFAULT FALSE,
    edit_revert_cycles INTEGER DEFAULT 0,
    is_stuck BOOLEAN DEFAULT FALSE,
    stuck_reason TEXT
);

CREATE TABLE IF NOT EXISTS approvals (
    id TEXT PRIMARY KEY,
    agent_id TEXT NOT NULL REFERENCES agents(id),
    run_id TEXT REFERENCES runs(id),
    action_type TEXT NOT NULL,  -- 'file_edit', 'command', 'merge', 'deploy'
    target TEXT NOT NULL,  -- file path or command
    risk_level TEXT NOT NULL,  -- 'low', 'medium', 'high', 'critical'
    status TEXT DEFAULT 'pending',  -- 'pending', 'approved', 'rejected', 'timeout'
    requested_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    decided_at TIMESTAMP,
    decided_by TEXT,  -- 'user', 'auto', 'timeout'
    decision_notes TEXT
);

CREATE TABLE IF NOT EXISTS usage_daily (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    date DATE NOT NULL,
    agent_id TEXT NOT NULL REFERENCES agents(id),
    tokens_input INTEGER DEFAULT 0,
    tokens_output INTEGER DEFAULT 0,
    cost_usd REAL DEFAULT 0.0,
    requests_count INTEGER DEFAULT 0,
    errors_count INTEGER DEFAULT 0,
    UNIQUE(date, agent_id)
);

CREATE TABLE IF NOT EXISTS mcp_tool_usage (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    agent_id TEXT NOT NULL REFERENCES agents(id),
    run_id TEXT REFERENCES runs(id),
    tool_name TEXT NOT NULL,
    mcp_server TEXT NOT NULL,
    called_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    tokens_used INTEGER DEFAULT 0,
    duration_ms INTEGER,
    success BOOLEAN DEFAULT TRUE,
    error_message TEXT
);

-- Indexes for common queries
CREATE INDEX IF NOT EXISTS idx_runs_agent ON runs(agent_id, started_at);
CREATE INDEX IF NOT EXISTS idx_health_agent ON health_samples(agent_id, sampled_at);
CREATE INDEX IF NOT EXISTS idx_approvals_pending ON approvals(status) WHERE status = 'pending';
CREATE INDEX IF NOT EXISTS idx_usage_date ON usage_daily(date, agent_id);
```
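
Restart recovery with this schema can be as simple as closing out runs that have no `ended_at`. The following is a minimal sketch, assuming that an interrupted run should be recorded with the `'timeout'` outcome; `recover_interrupted_runs` and the error message text are hypothetical.

```python
import sqlite3


def recover_interrupted_runs(conn: sqlite3.Connection) -> list[str]:
    """Close out runs left open by a crash and return their ids."""
    rows = conn.execute(
        "SELECT id FROM runs WHERE ended_at IS NULL ORDER BY started_at"
    ).fetchall()
    run_ids = [row[0] for row in rows]
    # Assumption: an interrupted run is treated as a timeout
    conn.execute(
        """
        UPDATE runs
        SET ended_at = CURRENT_TIMESTAMP,
            outcome = 'timeout',
            error_message = 'orchestrator restarted during run'
        WHERE ended_at IS NULL
        """
    )
    conn.commit()
    return run_ids
```

The orchestrator could call this once at startup and requeue the tasks attached to the returned run ids.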

### Python Data Access Layer

```python
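import json
import sqlite3
from contextlib import contextmanager
from dataclasses import asdict
from datetime import datetime, date
from pathlib import Path

# SCHEMA_SQL is assumed to hold the DDL from the Schema section as a single string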

class OrchestratorDB:
    def __init__(self, db_path: Path):
        self.db_path = db_path
        self._init_schema()

    @contextmanager
    def connection(self):
        conn = sqlite3.connect(self.db_path)
        conn.row_factory = sqlite3.Row
        try:
            yield conn
            conn.commit()
        except Exception:
            conn.rollback()
            raise
        finally:
            conn.close()

    def _init_schema(self):
        """Initialize database schema"""
        with self.connection() as conn:
            conn.executescript(SCHEMA_SQL)  # SQL from above

    def record_run(self, agent_id: str, task_id: str) -> str:
        """Start recording a new run"""
        # Microsecond resolution avoids id collisions when an agent starts two runs in the same second
        run_id = f"run-{datetime.now().strftime('%Y%m%d%H%M%S%f')}-{agent_id}"
        with self.connection() as conn:
            conn.execute(
                "INSERT INTO runs (id, agent_id, task_id) VALUES (?, ?, ?)",
                (run_id, agent_id, task_id)
            )
        return run_id

    def complete_run(self, run_id: str, outcome: str, packet: StatusPacket,
                     usage: UsageStats | None = None):
        """Complete a run with its status packet and, if known, its usage"""
        with self.connection() as conn:
            conn.execute("""
                UPDATE runs SET
                    ended_at = CURRENT_TIMESTAMP,
                    outcome = ?,
                    tokens_input = ?,
                    tokens_output = ?,
                    cost_usd = ?,
                    status_packet_json = ?
                WHERE id = ?
            """, (
                outcome,
                # TaskArtifacts carries no token counts, so usage comes from adapter.get_usage()
                usage.tokens_input if usage else 0,
                usage.tokens_output if usage else 0,
                usage.total_cost if usage else 0.0,
                # default=str serializes the datetime fields inside StatusPacket
                json.dumps(asdict(packet), default=str),
                run_id
            ))

    def record_health_sample(self, health: 'AgentHealthCheck'):
        """Record a health sample"""
        is_stuck, reason = health.is_stuck()
        with self.connection() as conn:
            conn.execute("""
                INSERT INTO health_samples (
                    agent_id, last_stdout_at, last_file_change_at,
                    token_burn_rate, error_count, consecutive_same_error,
                    pending_permission_prompt, edit_revert_cycles,
                    is_stuck, stuck_reason
                ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
            """, (
                health.agent_id,
                health.last_stdout,
                health.last_file_change,
                health.token_burn_rate,
                health.error_count,
                health.consecutive_same_error,
                health.pending_permission_prompt,
                health.edit_revert_cycles,
                is_stuck,
                reason
            ))

    def get_daily_usage(self, agent_id: str, target_date: date | None = None) -> dict:
        """Get usage for a specific day"""
        target_date = target_date or date.today()
        with self.connection() as conn:
            # Pass the date as ISO text: sqlite3's implicit date adapters are deprecated since Python 3.12
            row = conn.execute("""
                SELECT * FROM usage_daily
                WHERE agent_id = ? AND date = ?
            """, (agent_id, target_date.isoformat())).fetchone()
            return dict(row) if row else {
                "tokens_input": 0,
                "tokens_output": 0,
                "cost_usd": 0.0,
                "requests_count": 0,
                "errors_count": 0
            }

    def get_pending_approvals(self) -> list[dict]:
        """Get all pending approval requests"""
        with self.connection() as conn:
            rows = conn.execute("""
                SELECT * FROM approvals
                WHERE status = 'pending'
                ORDER BY requested_at
            """).fetchall()
            return [dict(row) for row in rows]
```
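
The `usage_daily` table has a `UNIQUE(date, agent_id)` constraint, which suits an upsert that accumulates each run into one row per agent per day. The sketch below assumes SQLite 3.24 or newer (for `ON CONFLICT ... DO UPDATE`) and counts each call as one request; `add_daily_usage` is a hypothetical helper, not part of the class above.

```python
import sqlite3
from datetime import date
from typing import Optional


def add_daily_usage(conn: sqlite3.Connection, agent_id: str, tokens_input: int,
                    tokens_output: int, cost_usd: float,
                    day: Optional[date] = None) -> None:
    """Add one run's usage to the agent's row for the given day."""
    day_text = (day or date.today()).isoformat()
    conn.execute(
        """
        INSERT INTO usage_daily
            (date, agent_id, tokens_input, tokens_output, cost_usd, requests_count)
        VALUES (?, ?, ?, ?, ?, 1)
        ON CONFLICT(date, agent_id) DO UPDATE SET
            tokens_input = tokens_input + excluded.tokens_input,
            tokens_output = tokens_output + excluded.tokens_output,
            cost_usd = cost_usd + excluded.cost_usd,
            requests_count = requests_count + 1
        """,
        (day_text, agent_id, tokens_input, tokens_output, cost_usd),
    )
    conn.commit()
```

Calling this when a run completes keeps `get_daily_usage` cheap, since it reads a single pre-aggregated row.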

---

## Unified Usage/Limits Ledger

You need two things to control costs:
1. **A unified usage/limits ledger** (tokens, cost, plan/quota, rate limits)
2. **A control loop** that watches agents and decides: auto-prompt vs escalate

### Per-Run Usage (When Tools Expose It)

#### Claude Code (Headless JSON Output)

```bash
# Run headless with structured JSON output
claude -p "Implement feature X" --output-format json

# Or stream for real-time
claude -p "Implement feature X" --output-format stream-json
```

**⚠️ CRITICAL: Headless Mode Does Not Persist**

Anthropic's guidance states that headless mode is **per-session**: you must trigger it each time, and no agent state persists between invocations.

**Implications for your design:**
- State MUST be externalized (project_state.json + SQLite)
- Context MUST be re-injected at each invocation
- Session continuity is YOUR responsibility, not Claude Code's

```python
import json
import subprocess
from pathlib import Path

def run_claude_headless(prompt: str, context: dict) -> dict:
    """Run Claude Code headlessly and capture usage metadata

    NOTE: Each invocation is a fresh session. State must be
    passed in via context and captured in response.
    """
    # Inject project state and caller-supplied context into the prompt
    state = ProjectJournal(Path(".")).read_state()
    full_prompt = "\n".join([
        "## Current Project State",
        json.dumps(state, indent=2),
        "",
        "## Additional Context",
        json.dumps(context, indent=2, default=str),
        "",
        "## Your Task",
        prompt,
        "",
        "## Requirements",
        "- Read and respect decisions in project_state.json",
        "- Output a status packet at the end of your work",
    ])

    result = subprocess.run(
        ["claude", "-p", full_prompt, "--output-format", "json"],
        capture_output=True,
        text=True
    )
    if result.returncode != 0:
        raise RuntimeError(f"claude exited with {result.returncode}: {result.stderr.strip()}")
    response = json.loads(result.stdout)

    # Extract usage from response metadata. Field names can differ between
    # CLI versions, so every lookup falls back to a default.
    usage_block = response.get("usage", {})
    usage = {
        "tokens_input": usage_block.get("input_tokens", 0),
        "tokens_output": usage_block.get("output_tokens", 0),
        # Cost is expected as top-level total_cost_usd; fall back to usage.cost
        "cost": response.get("total_cost_usd", usage_block.get("cost", 0.0)),
        "session_id": response.get("session_id"),
    }
    return usage
```

#### OpenAI Agents SDK (Built-in Tracking)

```python
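from agents import Agent, Runner  # installed with: pip install openai-agents

coder = Agent(name="coder", instructions="You write and fix Python code.")

# Runner.run is a coroutine, so call it from inside an async function
result = await Runner.run(coder, "Implement feature X")

# Token usage is aggregated per run. The SDK reports tokens, not dollars,
# so cost has to be derived from your own price table.
usage = result.context_wrapper.usage
print(f"Requests: {usage.requests}")
print(f"Tokens: {usage.total_tokens} (in={usage.input_tokens}, out={usage.output_tokens})")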
```
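
Where a tool reports token counts but no dollar figure, cost can be estimated from a price table kept alongside the budgets. This is a minimal sketch; `ModelPrice`, `PRICES`, `estimate_cost`, the model name, and the prices are all placeholders rather than real values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """USD per million tokens. The values used below are placeholders."""
    input_per_mtok: float
    output_per_mtok: float


# Hypothetical price table: fill in from your provider's current pricing page
PRICES = {"example-model": ModelPrice(input_per_mtok=2.0, output_per_mtok=8.0)}


def estimate_cost(model: str, tokens_input: int, tokens_output: int) -> float:
    """Estimate the USD cost of one run from its token counts."""
    price = PRICES[model]
    return (tokens_input * price.input_per_mtok
            + tokens_output * price.output_per_mtok) / 1_000_000
```

Keeping prices in configuration rather than code makes it easier to update them without touching the orchestrator.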

### Plan Limits / Subscription Quotas

**Reality:** "Claude Max vs Pro" / "Codex via ChatGPT plan" / "Gemini free tier" often won't expose a single canonical "tokens remaining" API.

**Solution:** Treat plan limits as **policy inputs**:

```python
@dataclass
class AgentBudget:
    """Policy-based budget configuration"""
    agent_name: str
    daily_token_limit: int
    daily_cost_limit: float  # USD
    rate_limit_rpm: int      # Requests per minute

# Configure budgets as policy
BUDGETS = {
    "claude_code": AgentBudget("claude_code", 500_000, 50.0, 60),
    "gemini_cli": AgentBudget("gemini_cli", 1_000_000, 0.0, 60),  # Free tier
    "codex": AgentBudget("codex", 200_000, 20.0, 30),
}
```

**Budget enforcement:**
- **Configured budgets:** $ / day or tokens / day
- **Observed burn:** usage measured from per-run tracking
- **Rate limits:** treat errors/429s as signals to back off and reschedule
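
Backing off on rate-limit errors is commonly done with exponential backoff plus jitter. The sketch below uses the "full jitter" variant; `backoff_delay`, `call_with_backoff`, and the `is_rate_limited` predicate are hypothetical names, and how a 429 surfaces depends on the adapter in use.

```python
import random
import time
from typing import Callable, Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    rng = rng or random.Random()
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def call_with_backoff(call: Callable, is_rate_limited: Callable[[Exception], bool],
                      max_attempts: int = 5, sleep: Callable[[float], None] = time.sleep):
    """Retry call() while is_rate_limited(exc) is true; re-raise anything else."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            # Give up on non-rate-limit errors and on the final attempt
            if not is_rate_limited(exc) or attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt))
```

For long waits, the orchestrator may prefer to reschedule the task on another agent instead of sleeping.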

### CLI-Wide Auditing (The "One Dashboard" Solution)

Two tools that work specifically for Claude Code + Gemini CLI + Codex CLI:

#### Token Audit (Local Ledger + MCP Query)

**Repository:** https://github.com/littlebearapps/token-audit

**Features:**
- Reads local session logs
- Live TUI dashboard
- Daily summaries
- Cross-agent cost/token auditing
- Can be queried via MCP
- **Diagnoses token spikes per platform/tooling**

```bash
# Install via PyPI
pip install token-audit

# Run TUI
token-audit

# Query via MCP (from your orchestrator)
token-audit --mcp-query "usage_last_24h"
```

**Use it as your local source-of-truth ledger**: it offers easy export plus MCP queries for orchestrator integration.

#### AI Observer (Real-Time Dashboard)

**Repository:** https://github.com/tobilg/ai-observer

**Features:**
- Token usage, costs, latency, errors
- Session activity across all CLI tools
- Real-time visibility + alerts
- Stuck agent detection
- Error spike detection

```bash
# Deploy
./ai-observer --port 3000
```

### Recommended Usage Tracking Stack

| Layer | Tool | Purpose |
|-------|------|---------|
| **Local Ledger** | Token Audit | Source of truth, MCP queryable, daily reports |
| **Real-Time Dashboard** | AI Observer | Visibility, alerts, stuck agent detection |
| **API Tracking** | Provider SDKs | Per-run usage (`run.usage`, JSON output) |
| **Persistence** | SQLite | Restart recovery, analytics, audit trail |
| **Enforcement** | Custom Policy | Budget limits, rate limiting, backoff |

---

## MCP/Tool Budget Controls

**Token bloat is real.** Issues filed against Claude Code report cases where MCP tool descriptions alone add a large token overhead at the start of a session.

Since Gemini CLI and Claude Code are MCP-heavy, implement explicit controls:

### Per-MCP-Server Budgets

```python
@dataclass
class MCPServerBudget:
    """Budget constraints for an MCP server"""
    server_name: str
    daily_token_limit: int
    max_calls_per_minute: int
    max_calls_per_run: int
    enabled: bool = True

MCP_BUDGETS = {
    "filesystem": MCPServerBudget("filesystem", 100_000, 60, 50, True),
    "github": MCPServerBudget("github", 200_000, 30, 20, True),
    "search": MCPServerBudget("search", 500_000, 10, 5, True),  # Expensive
    "database": MCPServerBudget("database", 50_000, 20, 10, True),
}
```
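
A limit such as `max_calls_per_minute` needs a small rate limiter in front of each server. One possible approach is a sliding window over recent call times, sketched below; `SlidingWindowLimiter` is a hypothetical class, and the injectable `clock` exists only to make it testable.

```python
import time
from collections import deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most max_calls within any window_seconds interval."""

    def __init__(self, max_calls: int, window_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock
        self._calls = deque()  # timestamps of accepted calls, oldest first

    def try_acquire(self) -> bool:
        """Record a call and return True, or return False if the window is full."""
        now = self.clock()
        # Drop timestamps that have aged out of the window
        while self._calls and now - self._calls[0] >= self.window_seconds:
            self._calls.popleft()
        if len(self._calls) >= self.max_calls:
            return False
        self._calls.append(now)
        return True
```

One limiter per server, created from `MCPServerBudget.max_calls_per_minute`, would be enough for a single-process orchestrator.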

### Per-Tool Budgets

```python
@dataclass
class ToolBudget:
    """Budget constraints for a specific tool within an MCP server"""
    tool_name: str
    server_name: str
    daily_token_limit: int
    max_calls_per_minute: int
    warning_threshold: float  # 0.0-1.0, warn when this % of budget used

TOOL_BUDGETS = {
    "web_search": ToolBudget("web_search", "search", 300_000, 5, 0.8),
    "read_file": ToolBudget("read_file", "filesystem", 50_000, 30, 0.9),
    "write_file": ToolBudget("write_file", "filesystem", 30_000, 20, 0.7),
    "run_command": ToolBudget("run_command", "filesystem", 20_000, 10, 0.5),
}
```
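
The `warning_threshold` field implies three states for a tool: under the threshold, past it, and over the limit. A minimal sketch of that classification follows; `budget_status` and its return strings are hypothetical.

```python
def budget_status(tokens_used: int, daily_token_limit: int,
                  warning_threshold: float) -> str:
    """Classify usage as 'ok', 'warn', or 'exceeded' against a daily limit."""
    if daily_token_limit <= 0:
        return "exceeded"  # a zero or negative limit leaves no budget at all
    fraction = tokens_used / daily_token_limit
    if fraction >= 1.0:
        return "exceeded"
    if fraction >= warning_threshold:
        return "warn"
    return "ok"
```

A "warn" result is a natural trigger for a dashboard alert, while "exceeded" should block further calls to that tool for the day.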

### Tool Registry Allowlist

**Only approved MCP servers in production:**

```python
class MCPRegistry:
    """Registry of approved MCP servers and tools"""

    # Production allowlist
    ALLOWED_SERVERS = {
        "filesystem",
        "github",
        "search",
    }

    # Explicitly blocked (even if installed)
    BLOCKED_SERVERS = {
        "untrusted_server",
        "experimental_tools",
    }

    # Tools requiring explicit approval per-use
    APPROVAL_REQUIRED_TOOLS = {
        "run_command",
        "write_file",
        "delete_file",
        "git_push",
    }

    def is_allowed(self, server: str, tool: str) -> tuple[bool, str]:
        """Check if server/tool combination is allowed"""
        if server in self.BLOCKED_SERVERS:
            return False, f"Server {server} is blocked"

        if server not in self.ALLOWED_SERVERS:
            return False, f"Server {server} not in allowlist"

        if tool in self.APPROVAL_REQUIRED_TOOLS:
            return False, f"Tool {tool} requires approval"

        return True, "allowed"
```

### MCP Usage Tracking

```python
class MCPUsageTracker:
    """Track MCP tool usage against budgets"""

    def __init__(self, db: OrchestratorDB):
        self.db = db

    def record_tool_call(
        self,
        agent_id: str,
        run_id: str,
        server: str,
        tool: str,
        tokens: int,
        duration_ms: int,
        success: bool
    ):
        """Record a tool call"""
        with self.db.connection() as conn:
            conn.execute("""
                INSERT INTO mcp_tool_usage
                (agent_id, run_id, tool_name, mcp_server, tokens_used, duration_ms, success)
                VALUES (?, ?, ?, ?, ?, ?, ?)
            """, (agent_id, run_id, tool, server, tokens, duration_ms, success))

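    def check_budget(self, server: str, tool: str | None = None) -> tuple[bool, str]:
        """Check if we're within budget"""
        budget = MCP_BUDGETS.get(server)
        if not budget:
            return True, "no_budget_set"

        # The server limit applies to all of its tools, so do not filter by tool here
        server_usage = self._get_today_usage(server)
        if server_usage >= budget.daily_token_limit:
            return False, f"daily_limit_exceeded:{server_usage}/{budget.daily_token_limit}"

        if tool:
            tool_budget = TOOL_BUDGETS.get(tool)
            if tool_budget and self._get_today_usage(server, tool) >= tool_budget.daily_token_limit:
                return False, f"tool_limit_exceeded:{tool}"

        return True, "within_budget"

    def _get_today_usage(self, server: str, tool: str | None = None) -> int:
        """Sum tokens used today (UTC, matching CURRENT_TIMESTAMP) for a server or one of its tools"""
        query = ("SELECT COALESCE(SUM(tokens_used), 0) FROM mcp_tool_usage "
                 "WHERE mcp_server = ? AND date(called_at) = date('now')")
        params = [server]
        if tool:
            query += " AND tool_name = ?"
            params.append(tool)
        with self.db.connection() as conn:
            return conn.execute(query, params).fetchone()[0]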
```

---

## Task Routing: Best CLI for the Job

### Default Routing Strategy

A starting point based on the commonly reported strengths of each tool; tune it against your own results:

| Task Type | Best Tool | Why |
|-----------|-----------|-----|
| **Complex multi-file changes** | Claude Code | More careful, asks clarifying questions, maintains coherence |
| **CI/CD + automation** | Codex CLI | Clear approval modes (Suggest / Auto-Edit / Full Auto) |
| **Huge-context tasks** | Gemini CLI | 1M token context + Search grounding for current best practices |
| **Quick code generation** | API (any) | Fast, cheap, scriptable |
| **Interactive review** | Claude Code | Human-in-the-loop friendly |

### Router Implementation

```python
from enum import Enum
from dataclasses import dataclass

class TaskType(Enum):
    MULTI_FILE_REFACTOR = "multi_file_refactor"
    CI_CD_AUTOMATION = "ci_cd_automation"
    LARGE_CONTEXT_ANALYSIS = "large_context_analysis"
    QUICK_GENERATION = "quick_generation"
    INTERACTIVE_REVIEW = "interactive_review"
    TEST_GENERATION = "test_generation"
    DOCUMENTATION = "documentation"

# Default routing table
TASK_ROUTING = {
    TaskType.MULTI_FILE_REFACTOR: ["claude_code", "gemini_cli", "codex"],
    TaskType.CI_CD_AUTOMATION: ["codex", "claude_code", "gemini_cli"],
    TaskType.LARGE_CONTEXT_ANALYSIS: ["gemini_cli", "claude_code", "codex"],
    TaskType.QUICK_GENERATION: ["claude_api", "openai_api", "gemini_api"],
    TaskType.INTERACTIVE_REVIEW: ["claude_code", "gemini_cli"],
    TaskType.TEST_GENERATION: ["codex", "claude_code"],
    TaskType.DOCUMENTATION: ["gemini_cli", "claude_code"],  # Search grounding helps
}

class QuotaExceededError(RuntimeError):
    """Raised when no preferred agent has remaining budget"""


class TaskRouter:
    def __init__(self, adapters: dict, budgets: dict, db: OrchestratorDB):
        self.adapters = adapters
        self.budgets = budgets
        self.db = db

    def route(self, task_type: TaskType) -> str:
        """Route task to best available agent based on preference + quota"""
        preferences = TASK_ROUTING.get(task_type, ["claude_code"])

        for agent_name in preferences:
            if self._has_quota(agent_name):
                return agent_name

        # All preferred agents at quota - fall back or queue
        raise QuotaExceededError(f"No agents available for {task_type}")

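    def _has_quota(self, agent_name: str) -> bool:
        """Check if agent has remaining budget"""
        budget = self.budgets.get(agent_name)
        if budget is None:
            # No policy configured: refuse rather than spend without a limit
            return False
        usage = self.db.get_daily_usage(agent_name)

        # A cost limit of 0.0 marks a free tier, so only the token cap applies there
        if budget.daily_cost_limit > 0 and usage["cost_usd"] >= budget.daily_cost_limit:
            return False
        if usage["tokens_input"] + usage["tokens_output"] >= budget.daily_token_limit:
            return False
        return True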
```

---

## Control Loop: Monitoring Agent Health

The control loop watches agents (progress/health/risk) and decides: **auto-prompt vs escalate to user**.

### Progress Signals to Track

Track each agent instance with heartbeat + progress signals:

| Signal | What to Track | Healthy | Warning |
|--------|---------------|---------|---------|
| **Git diff** | Lines changed in last N minutes | > 0 | 0 for > 10min |
| **Test runs** | Tests executed / logs produced | Regular | None for > 15min |
| **Token burn rate** | Tokens per minute | Steady | Spike or zero |
| **Error rate** | Tool-call failures / permission denials | < 5% | > 20% |
| **Idle but expensive** | High tokens, no diff | N/A | ⚠️ Flag |
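
The git diff signal can be sampled from each agent's worktree with `git diff --shortstat`. The sketch below is one possible approach; `parse_shortstat` and `lines_changed` are hypothetical helpers, and the command only sees uncommitted changes, so "lines changed in the last N minutes" would come from comparing successive samples.

```python
import re
import subprocess


def parse_shortstat(output: str) -> int:
    """Return insertions + deletions from `git diff --shortstat` output (0 if empty)."""
    insertions = re.search(r"(\d+) insertion", output)
    deletions = re.search(r"(\d+) deletion", output)
    return sum(int(m.group(1)) for m in (insertions, deletions) if m)


def lines_changed(worktree: str) -> int:
    """Count uncommitted changed lines in the given worktree."""
    result = subprocess.run(
        ["git", "-C", worktree, "diff", "--shortstat"],
        capture_output=True, text=True, check=True,
    )
    return parse_shortstat(result.stdout)
```

A sample that stays constant across several checks while tokens keep burning matches the "idle but expensive" row above.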

### Stuck Agent Heuristics

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class AgentHealthCheck:
    agent_id: str
    last_stdout: datetime
    last_file_change: datetime
    token_burn_rate: float  # tokens/min
    error_count: int
    consecutive_same_error: int
    pending_permission_prompt: bool
    edit_revert_cycles: int  # edit → revert → edit pattern
    auto_prompt_attempts: int = 0  # Track retry attempts

    def is_stuck(self) -> tuple[bool, str]:
        """Detect if agent is stuck and return reason"""
        now = datetime.now()

        # No output + no file changes + usage increasing
        if (now - self.last_stdout > timedelta(minutes=10) and
            now - self.last_file_change > timedelta(minutes=10) and
            self.token_burn_rate > 0):
            return True, "idle_but_burning_tokens"

        # Repeating same error > K times
        if self.consecutive_same_error > 3:
            return True, "repeated_error_loop"

        # Edit oscillation (edit → revert → edit); test progress is not checked here
        if self.edit_revert_cycles > 2:
            return True, "edit_oscillation"

        # Permission prompt pending: escalate at once (no wait time is tracked here)
        if self.pending_permission_prompt:
            return True, "awaiting_human_decision"

        return False, "healthy"
```
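
The `edit_revert_cycles` field has to be computed somewhere. One possible approach, sketched here as an assumption rather than the document's method, is to hash a file's content after every agent edit and count how often it returns to the state it had two versions earlier. `content_hash` and `count_edit_revert_cycles` are hypothetical helpers.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a file's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def count_edit_revert_cycles(hashes: list[str]) -> int:
    """Count versions that are identical to the version two steps earlier.

    `hashes` is the chronological list of content hashes for one file.
    A sequence A, B, A counts as one cycle; A, B, A, B counts as two.
    """
    cycles = 0
    for i in range(2, len(hashes)):
        if hashes[i] == hashes[i - 2] and hashes[i] != hashes[i - 1]:
            cycles += 1
    return cycles
```

The result can be fed straight into `AgentHealthCheck.edit_revert_cycles` for the file the agent touched most recently.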

### Control Loop Implementation

```python
import asyncio
from enum import Enum

class ControlAction(Enum):
    CONTINUE = "continue"
    AUTO_PROMPT = "auto_prompt"
    ESCALATE = "escalate"
    PAUSE = "pause"
    TERMINATE = "terminate"

class AgentControlLoop:
    def __init__(self, agents: dict, db: OrchestratorDB, check_interval: int = 60):
        self.agents = agents
        self.db = db
        self.check_interval = check_interval
        self.health_checks = {}

    async def run(self):
        """Main control loop - runs continuously"""
        while True:
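            for agent_id, adapter in self.agents.items():
                health = self._sample_health(agent_id, adapter)
                # Carry the retry counter across samples so escalation can trigger,
                # and keep the latest sample for _auto_prompt_agent to read
                previous = self.health_checks.get(agent_id)
                if previous is not None:
                    health.auto_prompt_attempts = previous.auto_prompt_attempts
                self.health_checks[agent_id] = health
                self.db.record_health_sample(health)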

                action = await self._evaluate_agent(agent_id, health)
                await self._execute_action(agent_id, action)

            await asyncio.sleep(self.check_interval)

    def _sample_health(self, agent_id: str, adapter) -> AgentHealthCheck:
        """Sample current health metrics for an agent"""
        # Implementation would gather actual metrics from:
        # - git diff --stat
        # - test runner output
        # - token usage from adapter.get_usage()
        # - error logs
        raise NotImplementedError("metric collection is not wired up yet")

    async def _evaluate_agent(self, agent_id: str, health: AgentHealthCheck) -> ControlAction:
        """Evaluate agent health and determine action"""
        is_stuck, reason = health.is_stuck()

        if not is_stuck:
            return ControlAction.CONTINUE

        # Determine action based on stuck reason
        if reason == "idle_but_burning_tokens":
            return ControlAction.AUTO_PROMPT

        if reason == "repeated_error_loop":
            if health.auto_prompt_attempts < 2:
                return ControlAction.AUTO_PROMPT
            return ControlAction.ESCALATE

        if reason == "edit_oscillation":
            return ControlAction.ESCALATE

        if reason == "awaiting_human_decision":
            return ControlAction.ESCALATE

        return ControlAction.CONTINUE

    async def _execute_action(self, agent_id: str, action: ControlAction):
        """Execute the determined action"""
        if action == ControlAction.AUTO_PROMPT:
            await self._auto_prompt_agent(agent_id)
        elif action == ControlAction.ESCALATE:
            await self._escalate_to_user(agent_id)
        elif action == ControlAction.PAUSE:
            await self._pause_agent(agent_id)
        elif action == ControlAction.TERMINATE:
            await self._terminate_agent(agent_id)

    async def _auto_prompt_agent(self, agent_id: str):
        """Feed context back to stuck agent"""
        health = self.health_checks[agent_id]
        health.auto_prompt_attempts += 1

        context = {
            "recent_errors": self._get_recent_errors(agent_id),
            "git_diff": self._get_git_diff(agent_id),
            "test_output": self._get_test_output(agent_id),
        }

        prompt = self._generate_unstick_prompt(health, context)
        await self.agents[agent_id].execute(prompt, context)

    def _generate_unstick_prompt(self, health: AgentHealthCheck, context: dict) -> str:
        """Generate prompt to help agent get unstuck"""
        if health.consecutive_same_error > 0 and context.get("recent_errors"):
            return f"""
            You've encountered this error {health.consecutive_same_error} times:
            {context['recent_errors'][-1]}

            Please try a different approach. Here's the current state:
            - Git diff: {context['git_diff']}
            - Test output: {context['test_output']}
            """

        return "Please provide a status update on your current progress."
```
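
`_sample_health` is left as a stub above. As one piece of it, the "lines changed" signal can be derived from `git diff --shortstat`. The parser below is a minimal sketch; `parse_shortstat` is a hypothetical helper name.

```python
import re


def parse_shortstat(output: str) -> int:
    """Return insertions + deletions from `git diff --shortstat` output.

    Example input: " 3 files changed, 12 insertions(+), 4 deletions(-)".
    An empty string (no changes) yields 0.
    """
    insertions = re.search(r"(\d+) insertions?\(\+\)", output)
    deletions = re.search(r"(\d+) deletions?\(-\)", output)
    total = 0
    if insertions:
        total += int(insertions.group(1))
    if deletions:
        total += int(deletions.group(1))
    return total
```

In the control loop the input would come from something like `subprocess.run(["git", "diff", "--shortstat"], cwd=workspace, capture_output=True, text=True).stdout`, where `workspace` is the agent's worktree path.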

---

## Risk Gate: Four-Tier Autonomy

### Risk Levels

| Level | Behavior | Example Actions |
|-------|----------|-----------------|
| **LOW** | Auto-allowed | Read files, run tests, format code |
| **MEDIUM** | Edits OK, commands need approval | Package installs, git push, curl |
| **HIGH** | Suggest-only + ask user | Auth changes, config files, migrations |
| **CRITICAL** | Auto-reject, NEVER allow | Force push to protected branches, terraform destroy, curl\|sh |

### When to Auto-Prompt (Self-Heal)

| Scenario | Action | Example |
|----------|--------|---------|
| Missing small details | Auto-feed context | File path, function name |
| Needs more context | Feed logs/diffs back | Test output, error messages |
| Safe retries | Auto-retry | Rerun tests, rebase branch |
| Status check | Auto-prompt | "What's your current progress?" |

### When to Require User Approval

| Scenario | Risk Level | Examples |
|----------|------------|--------|
| **Destructive actions** | HIGH | Deletes, migrations, production deploys |
| **Security-sensitive** | HIGH | Auth, permissions, crypto, secrets |
| **High-impact config** | HIGH | CI/CD, infra-as-code, domain policies |
| **Unclear decisions** | MEDIUM | Business rules, UX choices, preferences |
| **External commands** | MEDIUM | Network calls, package installs |

### CRITICAL Auto-Reject (NEVER Allow)

```python
# These patterns are NEVER allowed, even with approval
CRITICAL_BLOCKLIST = [
    # Force operations on protected branches
    # (lookaheads so the flag and the branch name can appear in either order)
    r"git\s+push\b(?=.*(?:--force|\s-f\b))(?=.*(?:main|master|prod))",
    r"git\s+reset\s+--hard.*(main|master|prod)",

    # Exfiltration patterns
    r"scp\s+.*@(?!localhost)",  # scp to non-localhost
    r"rsync\s+.*@(?!localhost)",
    r"curl.*\|\s*(sh|bash)",    # curl pipe to shell
    r"wget.*\|\s*(sh|bash)",

    # Destructive infrastructure
    r"terraform\s+destroy",
    r"kubectl\s+delete\s+(namespace|ns)",
    r"drop\s+database",
    r"rm\s+-rf\s+/",

    # Credential exposure
    r"echo.*\$.*(_KEY|_SECRET|_TOKEN|_PASSWORD)",
    r"cat\s+.*\.(pem|key|env)",
    r"curl.*-H.*Authorization.*Bearer",

    # Dangerous permissions
    r"chmod\s+(777|666)",
    r"chown\s+-R\s+.*:\s*/",
]
```
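
To illustrate how the blocklist is applied, and where it stops helping, here is a minimal sketch that uses three of the patterns above. `SAMPLE_BLOCKLIST` and `first_blocked_pattern` are hypothetical names; the patterns are repeated only so the example runs on its own.

```python
import re
from typing import Optional

# Three patterns copied from the blocklist above so this example is self-contained
SAMPLE_BLOCKLIST = [
    r"curl.*\|\s*(sh|bash)",
    r"terraform\s+destroy",
    r"chmod\s+(777|666)",
]


def first_blocked_pattern(command: str) -> Optional[str]:
    """Return the first blocklist pattern the command matches, or None."""
    for pattern in SAMPLE_BLOCKLIST:
        if re.search(pattern, command, re.IGNORECASE):
            return pattern
    return None


# Direct matches are caught
assert first_blocked_pattern("curl https://example.com/install.sh | sh") is not None
assert first_blocked_pattern("terraform   destroy -auto-approve") is not None

# A trivial rewrite (download first, run second) is not caught
assert first_blocked_pattern("curl https://example.com/i.sh -o i.sh && sh i.sh") is None
```

The last assertion is the important one: a regex blocklist is a backstop against obvious mistakes, not a sandbox, so it should sit alongside the approval flow rather than replace it.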

### Risk-Based Autonomy Policy

```python
from enum import Enum
from dataclasses import dataclass
from datetime import datetime  # used for approval IDs below
import re

class RiskLevel(Enum):
    LOW = "low"           # Auto-edit allowed
    MEDIUM = "medium"     # Edits OK, commands need approval
    HIGH = "high"         # Suggest-only + ask user
    CRITICAL = "critical" # Auto-reject, NEVER allow

@dataclass
class RiskPolicy:
    """Risk classification for agent actions"""

    # File patterns by risk level
    HIGH_RISK_FILE_PATTERNS = [
        r"\.env",
        r"secrets?\..*",
        r"credentials?\..*",
        r".*\.pem$",
        r".*\.key$",
        r"docker-compose\.ya?ml$",
        r"Dockerfile$",
        r"\.github/workflows/.*",
        r"terraform/.*",
        r"k8s/.*",
        r"migrations?/.*",
    ]

    MEDIUM_RISK_FILE_PATTERNS = [
        r"package\.json$",
        r"requirements\.txt$",
        r"pyproject\.toml$",
        r".*config.*\..*$",
    ]

    # Command patterns by risk level
    HIGH_RISK_COMMAND_PATTERNS = [
        r"rm\s+-rf",
        r"drop\s+table",
        r"delete\s+from",
        r"kubectl\s+delete",
        r"git\s+push.*--force",
        r"chmod\s+777",
    ]

    MEDIUM_RISK_COMMAND_PATTERNS = [
        r"npm\s+install",
        r"pip\s+install",
        r"docker\s+run",
        r"git\s+push",
        r"curl",
        r"wget",
    ]

    @classmethod
    def classify_file(cls, file_path: str) -> RiskLevel:
        """Classify risk level for file modification"""
        for pattern in cls.HIGH_RISK_FILE_PATTERNS:
            if re.search(pattern, file_path, re.IGNORECASE):
                return RiskLevel.HIGH

        for pattern in cls.MEDIUM_RISK_FILE_PATTERNS:
            if re.search(pattern, file_path, re.IGNORECASE):
                return RiskLevel.MEDIUM

        return RiskLevel.LOW

    @classmethod
    def classify_command(cls, command: str) -> RiskLevel:
        """Classify risk level for command execution"""
        # Check CRITICAL blocklist first
        for pattern in CRITICAL_BLOCKLIST:
            if re.search(pattern, command, re.IGNORECASE):
                return RiskLevel.CRITICAL

        for pattern in cls.HIGH_RISK_COMMAND_PATTERNS:
            if re.search(pattern, command, re.IGNORECASE):
                return RiskLevel.HIGH

        for pattern in cls.MEDIUM_RISK_COMMAND_PATTERNS:
            if re.search(pattern, command, re.IGNORECASE):
                return RiskLevel.MEDIUM

        return RiskLevel.LOW


class AutonomyGate:
    """Gate that enforces risk-based autonomy"""

    def __init__(self, db: OrchestratorDB, default_level: RiskLevel = RiskLevel.MEDIUM):
        self.db = db
        self.default_level = default_level
        self.interrupt_handler = None  # Set by HumanInterruptInterface

    async def check_action(self, agent_id: str, action_type: str, target: str) -> tuple[bool, str]:
        """Check if action is allowed or needs approval

        Returns: (allowed: bool, reason: str)
        """
        if action_type == "file_edit":
            risk = RiskPolicy.classify_file(target)
        elif action_type == "command":
            risk = RiskPolicy.classify_command(target)
        else:
            risk = self.default_level

        # CRITICAL = auto-reject
        if risk == RiskLevel.CRITICAL:
            self._log_blocked(agent_id, action_type, target, risk)
            return False, f"CRITICAL_BLOCKED: {target}"

        # LOW = auto-allowed
        if risk == RiskLevel.LOW:
            return True, "auto_allowed"

        # MEDIUM = edits OK, commands need approval
        if risk == RiskLevel.MEDIUM:
            if action_type == "file_edit":
                return True, "medium_edit_allowed"
            else:
                return await self._request_approval(agent_id, action_type, target, risk)

        # HIGH = always ask
        if risk == RiskLevel.HIGH:
            return await self._request_approval(agent_id, action_type, target, risk)

        return False, "unknown_risk_level"

    def _log_blocked(self, agent_id: str, action_type: str, target: str, risk: RiskLevel):
        """Log a blocked action"""
        with self.db.connection() as conn:
            conn.execute("""
                INSERT INTO approvals
                (id, agent_id, action_type, target, risk_level, status, decided_by)
                VALUES (?, ?, ?, ?, ?, 'rejected', 'auto_critical')
            """, (
                # %f adds microseconds so two blocks in the same second get distinct IDs
                f"approval-{datetime.now().strftime('%Y%m%d%H%M%S%f')}",
                agent_id, action_type, target, risk.value
            ))

    async def _request_approval(
        self,
        agent_id: str,
        action_type: str,
        target: str,
        risk: RiskLevel
    ) -> tuple[bool, str]:
        """Request user approval for risky action"""
        # %f adds microseconds so two requests in the same second get distinct IDs
        approval_id = f"approval-{datetime.now().strftime('%Y%m%d%H%M%S%f')}"

        # Record pending approval
        with self.db.connection() as conn:
            conn.execute("""
                INSERT INTO approvals
                (id, agent_id, action_type, target, risk_level, status)
                VALUES (?, ?, ?, ?, ?, 'pending')
            """, (approval_id, agent_id, action_type, target, risk.value))

        # Request via interrupt handler
        if self.interrupt_handler:
            approved = await self.interrupt_handler.request_approval(
                approval_id, agent_id, action_type, target, risk
            )
            return approved, "user_decision"

        return False, "no_interrupt_handler"
```

---

## Human Interrupt Interface

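Human approval goes through one of two interfaces: V1 is a CLI prompt, and V2 is an asynchronous webhook with an approval queue and a timeout. Both expose the same `request_approval` coroutine, so `AutonomyGate` can use either.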

### V1: Blocking CLI Prompt

```python
import asyncio
from datetime import datetime, timedelta

class CLIInterruptHandler:
    """V1: Blocking CLI prompt for approvals"""

    def __init__(self, db: OrchestratorDB, timeout_seconds: int = 300):
        self.db = db
        self.timeout = timedelta(seconds=timeout_seconds)

    async def request_approval(
        self,
        approval_id: str,
        agent_id: str,
        action_type: str,
        target: str,
        risk: RiskLevel
    ) -> bool:
        """Request approval via CLI prompt"""
        print("\n" + "="*60)
        print("[ALERT] Approval Required")
        print("="*60)
        print(f"Agent: {agent_id}")
        print(f"Action: {action_type}")
        print(f"Target: {target}")
        print(f"Risk Level: {risk.value.upper()}")
        print("="*60)

        try:
            # Read input in a worker thread so the event loop (and the control
            # loop running on it) is not frozen while waiting for the human.
            # Caveat: after a timeout the worker thread stays blocked on input().
            response = await asyncio.wait_for(
                asyncio.to_thread(input, "Approve? [y/n/s(kip)]: "),
                timeout=self.timeout.total_seconds(),
            )
            response = response.strip().lower()

            if response == 'y':
                self._record_decision(approval_id, 'approved', 'user')
                return True
            elif response == 's':
                self._record_decision(approval_id, 'skipped', 'user')
                return False
            else:
                self._record_decision(approval_id, 'rejected', 'user')
                return False

        except asyncio.TimeoutError:
            # No answer within the timeout counts as a rejection
            self._record_decision(approval_id, 'timeout', 'system')
            return False

        except (EOFError, KeyboardInterrupt):
            self._record_decision(approval_id, 'rejected', 'interrupt')
            return False

    def _record_decision(self, approval_id: str, status: str, decided_by: str):
        """Record the approval decision"""
        with self.db.connection() as conn:
            conn.execute("""
                UPDATE approvals SET
                    status = ?,
                    decided_at = ?,
                    decided_by = ?
                WHERE id = ?
            """, (status, datetime.now().isoformat(), decided_by, approval_id))
```

### V2: Webhook/Async Interface

```python
import asyncio
import aiohttp
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class WebhookConfig:
    """Configuration for webhook notifications"""
    url: str
    secret: str
    channel: str  # Slack channel, email address, etc.

class AsyncInterruptHandler:
    """V2: Webhook + approval queue + timeout"""

    def __init__(
        self,
        db: OrchestratorDB,
        webhook: Optional[WebhookConfig] = None,
        timeout_seconds: int = 3600  # 1 hour default
    ):
        self.db = db
        self.webhook = webhook
        self.timeout = timedelta(seconds=timeout_seconds)
        self.pending_approvals: dict[str, asyncio.Event] = {}

    async def request_approval(
        self,
        approval_id: str,
        agent_id: str,
        action_type: str,
        target: str,
        risk: RiskLevel
    ) -> bool:
        """Request approval via webhook, wait with timeout"""

        # Create event to wait on
        self.pending_approvals[approval_id] = asyncio.Event()

        # Send notification
        await self._send_notification(approval_id, agent_id, action_type, target, risk)

        # Wait for approval with timeout
        try:
            await asyncio.wait_for(
                self.pending_approvals[approval_id].wait(),
                timeout=self.timeout.total_seconds()
            )

            # Check result
            with self.db.connection() as conn:
                row = conn.execute(
                    "SELECT status FROM approvals WHERE id = ?",
                    (approval_id,)
                ).fetchone()
                return row['status'] == 'approved'

        except asyncio.TimeoutError:
            self._record_decision(approval_id, 'timeout', 'system')
            return False

        finally:
            del self.pending_approvals[approval_id]

    async def _send_notification(
        self,
        approval_id: str,
        agent_id: str,
        action_type: str,
        target: str,
        risk: RiskLevel
    ):
        """Send notification via webhook"""
        if not self.webhook:
            return

        payload = {
            "approval_id": approval_id,
            "agent_id": agent_id,
            "action_type": action_type,
            "target": target,
            "risk_level": risk.value,
            "approve_url": f"/approve/{approval_id}",
            "reject_url": f"/reject/{approval_id}",
        }

        # Format for Slack
        if "slack" in self.webhook.url:
            payload = {
                "channel": self.webhook.channel,
                "text": f"🚨 Approval Required: {action_type} on {target}",
                "attachments": [{
                    "color": "warning" if risk == RiskLevel.MEDIUM else "danger",
                    "fields": [
                        {"title": "Agent", "value": agent_id, "short": True},
                        {"title": "Risk", "value": risk.value, "short": True},
                        {"title": "Target", "value": target},
                    ],
                    "actions": [
                        {"type": "button", "text": "Approve", "style": "primary",
                         "url": payload["approve_url"]},
                        {"type": "button", "text": "Reject", "style": "danger",
                         "url": payload["reject_url"]},
                    ]
                }]
            }

        async with aiohttp.ClientSession() as session:
            await session.post(self.webhook.url, json=payload)

    def handle_approval_response(self, approval_id: str, approved: bool, user: str):
        """Handle response from webhook/UI"""
        status = 'approved' if approved else 'rejected'
        self._record_decision(approval_id, status, user)

        # Signal waiting coroutine
        if approval_id in self.pending_approvals:
            self.pending_approvals[approval_id].set()

    def _record_decision(self, approval_id: str, status: str, decided_by: str):
        with self.db.connection() as conn:
            conn.execute("""
                UPDATE approvals SET
                    status = ?,
                    decided_at = ?,
                    decided_by = ?
                WHERE id = ?
            """, (status, datetime.now().isoformat(), decided_by, approval_id))
```
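
`WebhookConfig.secret` is stored but not used by the handler above. One possible use, offered here as an assumption rather than the document's design, is to sign each approval decision so that `handle_approval_response` only accepts callbacks that carry a valid signature. `sign_approval` and `verify_approval` are hypothetical helpers built on the standard `hmac` module.

```python
import hashlib
import hmac


def sign_approval(secret: str, approval_id: str, decision: str) -> str:
    """Return a hex HMAC-SHA256 signature over the approval ID and decision."""
    message = f"{approval_id}:{decision}".encode("utf-8")
    return hmac.new(secret.encode("utf-8"), message, hashlib.sha256).hexdigest()


def verify_approval(secret: str, approval_id: str, decision: str, signature: str) -> bool:
    """Check a signature in constant time before acting on a callback."""
    expected = sign_approval(secret, approval_id, decision)
    return hmac.compare_digest(expected, signature)
```

The signature would be appended to the approve and reject links when the notification is sent, and checked by the endpoint that receives the click before it calls `handle_approval_response`.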

---

## Secret Handling Subsystem

**Treat this as a design constraint, not a nice-to-have.** Tokens do end up in agent transcripts in real-world use, so redaction and secret-file guards belong in the first version.

### Log Redaction Policy

```python
import re
from typing import Callable

class SecretRedactor:
    """Redact secrets from logs and output"""

    # Patterns to redact
    REDACTION_PATTERNS = [
        # API keys
        (r'sk-[a-zA-Z0-9]{20,}', '[REDACTED_API_KEY]'),
        (r'sk-ant-[a-zA-Z0-9-]{20,}', '[REDACTED_ANTHROPIC_KEY]'),
        (r'AIza[a-zA-Z0-9_-]{35}', '[REDACTED_GOOGLE_KEY]'),

        # Tokens
        (r'ghp_[a-zA-Z0-9]{36}', '[REDACTED_GITHUB_TOKEN]'),
        (r'gho_[a-zA-Z0-9]{36}', '[REDACTED_GITHUB_OAUTH]'),
        (r'Bearer\s+[a-zA-Z0-9._-]+', 'Bearer [REDACTED]'),

        # Passwords in URLs (keeps the username; does not match across spaces or slashes)
        (r'://([^:/@\s]+):[^@/\s]+@', r'://\1:[REDACTED]@'),

        # AWS
        (r'AKIA[A-Z0-9]{16}', '[REDACTED_AWS_ACCESS_KEY]'),
        (r'aws_secret_access_key\s*=\s*\S+', 'aws_secret_access_key=[REDACTED]'),

        # Generic patterns
        (r'password\s*[=:]\s*\S+', 'password=[REDACTED]'),
        (r'secret\s*[=:]\s*\S+', 'secret=[REDACTED]'),
        (r'token\s*[=:]\s*\S+', 'token=[REDACTED]'),
    ]

    @classmethod
    def redact(cls, text: str) -> str:
        """Redact all sensitive patterns from text"""
        for pattern, replacement in cls.REDACTION_PATTERNS:
            text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
        return text

    @classmethod
    def wrap_logger(cls, logger: Callable) -> Callable:
        """Wrap a logger to automatically redact secrets"""
        def wrapped(msg: str, *args, **kwargs):
            return logger(cls.redact(msg), *args, **kwargs)
        return wrapped
```
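
`wrap_logger` covers plain callables. For the standard `logging` module, a filter is one way to apply the same idea. The sketch below is an assumption about how this could be wired up: `RedactingFilter` and `redact` are hypothetical names, and the two patterns are a subset of `REDACTION_PATTERNS` repeated so the example runs on its own.

```python
import logging
import re

# Two patterns from REDACTION_PATTERNS, precompiled for reuse
TOKEN_PATTERNS = [
    (re.compile(r"ghp_[A-Za-z0-9]{36}"), "[REDACTED_GITHUB_TOKEN]"),
    (re.compile(r"Bearer\s+[A-Za-z0-9._-]+"), "Bearer [REDACTED]"),
]


def redact(text: str) -> str:
    for pattern, replacement in TOKEN_PATTERNS:
        text = pattern.sub(replacement, text)
    return text


class RedactingFilter(logging.Filter):
    """Rewrite each record so handlers only ever see redacted text."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format the message first so secrets passed as %-args are covered too
        record.msg = redact(record.getMessage())
        record.args = None
        return True  # never drop the record, only rewrite it
```

Attach the filter to handlers (`handler.addFilter(RedactingFilter())`) rather than to a single logger, since logger-level filters are not applied to records that propagate up from child loggers.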

### Never-Store Patterns

```python
# Files that should NEVER be read or stored
NEVER_STORE_PATTERNS = [
    r"\.env$",
    r"\.env\.(local|production|development)$",
    r"secrets?\.ya?ml$",
    r"credentials\.json$",
    r".*\.pem$",
    r".*\.key$",
    r".*_rsa$",
    r".*_dsa$",
    r".*_ed25519$",
    r"\.aws/credentials$",
    r"\.ssh/.*",
    r"\.gnupg/.*",
]

class SecretFileGuard:
    """Guard against reading or storing secret files"""

    @classmethod
    def is_secret_file(cls, file_path: str) -> bool:
        """Check if file is a secret file"""
        for pattern in NEVER_STORE_PATTERNS:
            if re.search(pattern, file_path, re.IGNORECASE):
                return True
        return False

    @classmethod
    def check_read(cls, file_path: str) -> tuple[bool, str]:
        """Check if file can be read

        Returns: (allowed, reason)
        """
        if cls.is_secret_file(file_path):
            return False, f"BLOCKED: {file_path} matches secret file pattern"
        return True, "allowed"
```

### Approval-Required for Secret File Access

```python
# Secret files: reads need explicit approval (HIGH); every other action is auto-rejected (CRITICAL)
SECRET_ACCESS_ACTIONS = {
    "read": RiskLevel.HIGH,
    "write": RiskLevel.CRITICAL,  # Never allow
    "delete": RiskLevel.CRITICAL,
    "copy": RiskLevel.CRITICAL,
    "move": RiskLevel.CRITICAL,
}
```
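
The pattern list and the action table can be combined into a single lookup. The following is a minimal sketch under stated assumptions: `risk_for` is a hypothetical helper, `RiskLevel` and a three-pattern subset of `NEVER_STORE_PATTERNS` are repeated so the example is self-contained, and unknown actions on secret files fail closed.

```python
import re
from enum import Enum
from typing import Optional


class RiskLevel(Enum):  # repeated from the risk gate section
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


SECRET_PATTERNS = [r"\.env$", r".*\.pem$", r"\.ssh/.*"]  # subset for illustration
ACTION_RISK = {
    "read": RiskLevel.HIGH,
    "write": RiskLevel.CRITICAL,
    "delete": RiskLevel.CRITICAL,
}


def risk_for(path: str, action: str) -> Optional[RiskLevel]:
    """Return the risk of `action` on a secret file, or None for ordinary files."""
    is_secret = any(re.search(p, path, re.IGNORECASE) for p in SECRET_PATTERNS)
    if not is_secret:
        return None  # not a secret file: defer to the normal file policy
    # Actions missing from the table are treated as CRITICAL (fail closed)
    return ACTION_RISK.get(action, RiskLevel.CRITICAL)
```

A `None` result means the path should be classified by `RiskPolicy.classify_file` as usual.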

---

## Control Plane A: CLI/Workspace Agents

### Claude Code (CLI)

**Use For:** Interactive terminal workflows where human oversight is needed

**Headless Automation:**
```bash
# Run with structured JSON output for orchestration
claude -p "Implement feature X" --output-format json

# Stream for real-time processing (in print mode, stream-json also needs --verbose)
claude -p "Implement feature X" --output-format stream-json --verbose
```

**⚠️ CRITICAL: Headless Mode Does Not Persist**

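Per Anthropic guidance, headless mode is **per-session**: each invocation starts without the previous run's agent state unless you explicitly resume a session (for example with `--continue` or `--resume <session-id>`). Treat the orchestrator, not Claude Code, as the owner of long-lived state.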

**Your design must:**
- Externalize state to `project_state.json` + SQLite
- Re-inject context at each invocation
- Capture `StatusPacket` output at session end
- Handle session continuity in your orchestrator, not in Claude Code
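
The re-injection step can be sketched as two small functions: one that renders saved state into the prompt, and one that builds the headless command. This is a minimal sketch, assuming the state has already been loaded from `project_state.json`; `build_prompt`, `build_command`, and the state keys are hypothetical, while the CLI flags are the ones used in the headless examples above.

```python
import json


def build_prompt(state: dict, task: str) -> str:
    """Prepend the externally stored project state to the task description."""
    return (
        "Project state (restored by the orchestrator):\n"
        f"{json.dumps(state, indent=2, sort_keys=True)}\n\n"
        f"Task: {task}\n"
        "When finished, summarize what changed and what remains."
    )


def build_command(prompt: str) -> list[str]:
    """Build the argv list for one headless invocation with JSON output."""
    return ["claude", "-p", prompt, "--output-format", "json"]
```

The orchestrator would pass the result to `subprocess.run(cmd, capture_output=True, text=True, cwd=workspace)`, parse the JSON from stdout, and write the updated state back before the next invocation.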

**When to Use CLI vs SDK:**

| Scenario | Use CLI | Use SDK |
|----------|---------|---------|
| Interactive coding session | ✅ | |
| Human reviewing changes in real-time | ✅ | |
| Automated pipeline task | | ✅ |
| Scheduled batch processing | | ✅ |
| Multi-agent orchestration | | ✅ |

---

### Gemini CLI

**Repository:** https://github.com/google-gemini/gemini-cli

**Key Features:**
- ReAct loop with MCP server support
- 60 req/min, 1000 req/day free tier
- 1M token context window
- Built-in tools (search, file ops, shell)
- **Search grounding** for current best practices

**Best For:** Large context analysis, documentation tasks, "current best practices" queries

---

### Codex CLI

**Repository:** https://github.com/openai/codex

**Key Feature: Tiered Autonomy Modes**

| Mode | Behavior |
|------|----------|
| **Suggest** | Shows proposed changes, requires approval for everything |
| **Auto-Edit** | Automatically applies file edits, requires approval for commands |
| **Full Auto** | Executes everything automatically |

This is the mental model for your orchestrator's risk gate.
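
Read as a lookup table, the three modes reduce to "which action types still need a human". The sketch below is illustrative only: the mode keys and `needs_approval` are hypothetical names, not Codex CLI identifiers.

```python
# True means the action must wait for human approval in that mode
APPROVAL_MATRIX = {
    "suggest":   {"file_edit": True,  "command": True},
    "auto-edit": {"file_edit": False, "command": True},
    "full-auto": {"file_edit": False, "command": False},
}


def needs_approval(mode: str, action_type: str) -> bool:
    """Look up whether an action needs approval; unknown inputs fail closed."""
    return APPROVAL_MATRIX.get(mode, {}).get(action_type, True)
```

The orchestrator's risk gate follows the same shape, with the risk level of the target taking the place of a global mode.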

---

### Claude Squad vs DIY (AGPL Avoidance)

**Repository:** https://github.com/smtg-ai/claude-squad
**License:** AGPL-3.0 (⚠️ requires source disclosure when you distribute it or let users interact with a modified version over a network)

**If AGPL is an issue, implement the same pattern yourself:**

```bash
# DIY Claude Squad pattern (MIT-friendly)

# 1. Create git worktrees for each agent (-b creates the branch if it does not exist yet)
git worktree add -b feature/claude-work ../agent-claude
git worktree add -b feature/gemini-work ../agent-gemini
git worktree add -b feature/codex-work ../agent-codex

# 2. Create tmux sessions for isolation
tmux new-session -d -s claude -c ../agent-claude
tmux new-session -d -s gemini -c ../agent-gemini
tmux new-session -d -s codex -c ../agent-codex

# 3. Run agents in their sessions
tmux send-keys -t claude 'claude' Enter
tmux send-keys -t gemini 'gemini' Enter
tmux send-keys -t codex 'codex' Enter

# 4. Switch between sessions (from inside tmux; from a plain shell use `tmux attach -t claude`)
tmux switch-client -t claude
```

**Python wrapper:**

```python
import subprocess
from pathlib import Path

class DIYAgentWorkspace:
    """AGPL-free multi-agent workspace management"""

    def __init__(self, project_root: Path):
        self.project_root = project_root
        self.workspaces = {}

    def create_workspace(self, agent_id: str, branch: str) -> Path:
        """Create isolated workspace for an agent"""
        workspace_path = self.project_root.parent / f"agent-{agent_id}"

        # Create git worktree
        subprocess.run([
            "git", "worktree", "add",
            str(workspace_path),
            "-b", branch
        ], cwd=self.project_root, check=True)

        # Create tmux session
        subprocess.run([
            "tmux", "new-session", "-d",
            "-s", agent_id,
            "-c", str(workspace_path)
        ], check=True)

        self.workspaces[agent_id] = workspace_path
        return workspace_path

    def run_in_workspace(self, agent_id: str, command: str):
        """Run command in agent's tmux session"""
        subprocess.run([
            "tmux", "send-keys", "-t", agent_id,
            command, "Enter"
        ], check=True)

    def switch_to(self, agent_id: str):
        """Switch to agent's tmux session"""
        subprocess.run([
            "tmux", "switch-client", "-t", agent_id
        ], check=True)

    def cleanup(self, agent_id: str):
        """Clean up workspace"""
        if agent_id in self.workspaces:
            # Kill tmux session
            subprocess.run(["tmux", "kill-session", "-t", agent_id])

            # Remove worktree
            subprocess.run([
                "git", "worktree", "remove",
                str(self.workspaces[agent_id])
            ], cwd=self.project_root)

            del self.workspaces[agent_id]
```

---

## Control Plane B: API/Orchestration Agents

### Claude Agent SDK (Recommended for Orchestration)

**Repository:** https://github.com/anthropics/claude-agent-sdk-python

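**Critical:** Bundles Claude Code CLI automatically. "Claude Code as a library." The snippet below is an illustrative sketch of the control points (custom tools, a pre-edit hook); the decorator names are placeholders, so check the SDK reference for the exact tool and hook registration API.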

```python
from claude_agent_sdk import ClaudeSDKClient

client = ClaudeSDKClient()

# Custom tools
@client.tool
def run_tests(path: str) -> str:
    import subprocess
    result = subprocess.run(["pytest", path], capture_output=True)
    return SecretRedactor.redact(result.stdout.decode())  # Redact secrets!

# Hooks for control
@client.hook("before_edit")
def review_changes(file_path: str, changes: str) -> bool:
    risk = RiskPolicy.classify_file(file_path)
    if risk == RiskLevel.CRITICAL:
        return False
    if risk == RiskLevel.HIGH:
        # Queue for approval
        return False
    return True

# Run inside an async function (e.g. via asyncio.run)
response = await client.query("Implement user authentication")
```

---

### OpenAI Agents SDK

**Repository:** https://github.com/openai/openai-agents-python

**Key Features:**
- Handoffs for agent coordination
- Sessions for conversation state
- Guardrails for input validation
- Built-in usage reporting (`result.context_wrapper.usage`)

```python
from agents import Agent, Runner  # installed as `openai-agents`

reviewer = Agent(name="reviewer", instructions="...", model="gpt-4")
# Handoffs are declared on the agent that is allowed to delegate
coder = Agent(name="coder", instructions="...", model="gpt-4", handoffs=[reviewer])

# Run inside an async function (e.g. via asyncio.run)
result = await Runner.run(coder, "Implement login")

# Token usage is tracked per run; cost is not reported, so derive it
# from your own per-model price table
usage = result.context_wrapper.usage
print(f"Tokens: {usage.total_tokens}")
```

---

### LiteLLM Proxy (Routing + Policy)

**Repository:** https://github.com/BerriAI/litellm

```bash
# Run as proxy
litellm --model claude-3-opus-20240229 --port 8000
```

Then point every agent's API base URL at `http://localhost:8000`, so all requests pass through one place for logging and policy.
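
As a minimal sketch of what "hitting the proxy" looks like from an agent, the request below targets the proxy's OpenAI-compatible `/chat/completions` route using only the standard library. `build_chat_request` is a hypothetical helper and the API key is a placeholder; whether a key is required depends on how the proxy is configured.

```python
import json
import urllib.request


def build_chat_request(
    base_url: str, model: str, prompt: str, api_key: str = "placeholder-key"
) -> urllib.request.Request:
    """Build (but do not send) a chat completion request for the proxy."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
```

Sending it is a matter of `urllib.request.urlopen(request)` once the proxy is running; in practice each agent's own SDK or CLI would be configured with the same base URL instead.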

---

## Observability Strategy

### What AI Observer Gives You

**Repository:** https://github.com/tobilg/ai-observer

**Measures:**
- Token usage per session
- Cost tracking
- API latency
- Error rates
- Session activity

**Does NOT reliably infer:**
- Semantic "progress" (you need your own signals)
- Task completion status
- Code quality

**Additional Features:**
- Import historical data
- Export to Parquet/DuckDB views (handy for audits)

### Your Plan

| Component | Tool | Purpose |
|-----------|------|---------|
| **Telemetry** | AI Observer | Token/cost/latency/errors |
| **Workflow State** | SQLite | Restart recovery, analytics, audit trail |
| **Local Ledger** | Token Audit | MCP queryable, daily summaries |

### Two-Layer Observability

```
┌─────────────────────────────────────────────────────────────┐
│                    AI Observer (Telemetry)                   │
│                                                              │
│   • Token usage     • Cost tracking    • Error rates        │
│   • Latency         • Session activity • Stuck detection    │
│   • Parquet export  • DuckDB views     • Historical import  │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    SQLite (Workflow State)                   │
│                                                              │
│   • agents          • runs             • health_samples     │
│   • tasks           • approvals        • mcp_tool_usage     │
│   • usage_daily     • decisions        • audit trail        │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    Token Audit (Local Ledger)               │
│                                                              │
│   • MCP queryable   • TUI dashboard    • Daily summaries    │
│   • Cross-agent audit                  • Token spike diagnosis│
└─────────────────────────────────────────────────────────────┘
```

---

## Merge Gate Mechanics

### Definition of "Ready to Merge"

A task/branch is ready to merge when ALL conditions are met:

```python
from dataclasses import dataclass

@dataclass
class MergeReadiness:
    """Check if a branch is ready to merge"""
    branch: str
    agent_id: str

    # Required conditions
    tests_pass: bool
    no_critical_risks: bool
    reviewer_approved: bool
    no_pending_approvals: bool
    status_packet_complete: bool

    # Conditions with defaults (no_merge_conflicts still blocks; documentation_updated is advisory)
    documentation_updated: bool = True
    no_merge_conflicts: bool = True

    def is_ready(self) -> tuple[bool, list[str]]:
        """Check if ready to merge, return blockers if not"""
        blockers = []

        if not self.tests_pass:
            blockers.append("Tests not passing")
        if not self.no_critical_risks:
            blockers.append("Critical risks not addressed")
        if not self.reviewer_approved:
            blockers.append("Awaiting reviewer approval")
        if not self.no_pending_approvals:
            blockers.append("Pending approval requests")
        if not self.status_packet_complete:
            blockers.append("Status packet not submitted")
        if not self.no_merge_conflicts:
            blockers.append("Merge conflicts detected")

        return len(blockers) == 0, blockers
```

### Merge Gate Implementation

```python
import asyncio
import json
import subprocess

class MergeGate:
    """Control merges to protected branches"""

    def __init__(self, db: OrchestratorDB, protected_branches: list[str] = None):
        self.db = db
        self.protected_branches = protected_branches or ["main", "master", "prod"]
        self.merge_lock = asyncio.Lock()

    async def request_merge(
        self,
        agent_id: str,
        source_branch: str,
        target_branch: str
    ) -> tuple[bool, str]:
        """Request permission to merge"""

        # Check if target is protected
        if target_branch in self.protected_branches:
            readiness = await self._check_readiness(agent_id, source_branch)
            is_ready, blockers = readiness.is_ready()

            if not is_ready:
                return False, f"Not ready: {', '.join(blockers)}"

        # Acquire merge lock (only one merge at a time)
        async with self.merge_lock:
            return await self._execute_merge(agent_id, source_branch, target_branch)

    async def _check_readiness(self, agent_id: str, branch: str) -> MergeReadiness:
        """Check if branch is ready to merge"""
        # Get latest run for this agent/branch
        with self.db.connection() as conn:
            run = conn.execute("""
                SELECT * FROM runs
                WHERE agent_id = ? AND status_packet_json IS NOT NULL
                ORDER BY ended_at DESC LIMIT 1
            """, (agent_id,)).fetchone()

            pending = conn.execute("""
                SELECT COUNT(*) as count FROM approvals
                WHERE agent_id = ? AND status = 'pending'
            """, (agent_id,)).fetchone()

        if not run:
            return MergeReadiness(
                branch=branch, agent_id=agent_id,
                tests_pass=False, no_critical_risks=False,
                reviewer_approved=False, no_pending_approvals=False,
                status_packet_complete=False
            )

        packet = json.loads(run['status_packet_json'])
        artifacts = packet.get('artifacts', {})

        return MergeReadiness(
            branch=branch,
            agent_id=agent_id,
            tests_pass=artifacts.get('test_results', {}).get('failed', 1) == 0,
            no_critical_risks='critical' not in str(artifacts.get('risk_items_encountered', [])).lower(),
            reviewer_approved=True,  # TODO: Check reviewer agent approval
            no_pending_approvals=pending['count'] == 0,
            status_packet_complete=True,
            no_merge_conflicts=self._check_conflicts(branch)
        )

    def _check_conflicts(self, branch: str) -> bool:
        """Check for merge conflicts"""
        result = subprocess.run(
            ["git", "merge", "--no-commit", "--no-ff", branch],
            capture_output=True
        )
        # Abort the test merge
        subprocess.run(["git", "merge", "--abort"], capture_output=True)
        return result.returncode == 0

    async def _execute_merge(
        self,
        agent_id: str,
        source: str,
        target: str
    ) -> tuple[bool, str]:
        """Execute the merge"""
        try:
            subprocess.run(
                ["git", "checkout", target],
                check=True, capture_output=True
            )
            subprocess.run(
                ["git", "merge", "--no-ff", source, "-m", f"Merge {source} (by {agent_id})"],
                check=True, capture_output=True
            )
            return True, "merged"
        except subprocess.CalledProcessError as e:
            return False, f"Merge failed: {e.stderr.decode()}"
```
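The "one merge at a time" guarantee rests entirely on the `asyncio.Lock`. The sketch below isolates that behaviour with a stand-in for the git work; `fake_merge` and the branch names are hypothetical and no repository is touched:

```python
import asyncio


async def demo_serialized_merges() -> list[str]:
    """Show that an asyncio.Lock lets only one 'merge' run at a time."""
    merge_lock = asyncio.Lock()
    events: list[str] = []

    async def fake_merge(branch: str) -> None:
        async with merge_lock:
            events.append(f"start {branch}")
            await asyncio.sleep(0.01)  # stand-in for git checkout + merge
            events.append(f"end {branch}")

    # Both requests are issued concurrently; the lock forces them to run in turn.
    await asyncio.gather(fake_merge("feature/a"), fake_merge("feature/b"))
    return events


events = asyncio.run(demo_serialized_merges())
```

Each `start` is followed by its own `end` before the next merge begins. Note that an `asyncio.Lock` only serializes coroutines inside a single orchestrator process; running several orchestrator processes against the same repository would need a cross-process lock (for example a lock file or a row in SQLite), which is outside this sketch.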

---

## Implementation-Ready Architecture

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                              USER INTERFACE                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────────────┐  │
│  │  DIY Workspaces  │  │   AI Observer    │  │      Token Audit         │  │
│  │ (tmux+worktrees) │  │ (Real-time dash) │  │   (Local ledger/TUI)     │  │
│  │                  │  │                  │  │                          │  │
│  │ • Parallel agents│  │ • Stuck detection│  │ • MCP queryable          │  │
│  │ • tmux sessions  │  │ • Error alerts   │  │ • Daily summaries        │  │
│  │ • git worktrees  │  │ • Cost tracking  │  │ • Token spike diagnosis  │  │
│  │                  │  │ • Parquet export │  │                          │  │
│  │ ✅ MIT-friendly  │  └──────────────────┘  └──────────────────────────┘  │
│  └──────────────────┘                                                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                        PROJECT JOURNAL PROTOCOL                              │
├─────────────────────────────────────────────────────────────────────────────┤
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  project_state.json          │  agent_journal.md                    │   │
│  │  • objectives                │  • decisions + rationale             │   │
│  │  • branches/worktrees        │  • daily logs                        │   │
│  │  • constraints               │  • diff links                        │   │
│  │  • decisions_made            │                                      │   │
│  │  • open_risks                │  StatusPacket (required output)      │   │
│  │  • definition_of_done        │  • progress, artifacts, blockers     │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
├─────────────────────────────────────────────────────────────────────────────┤
│                           CONTROL LOOP                                       │
├─────────────────────────────────────────────────────────────────────────────┤
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                     Python Orchestrator                                │  │
│  │                                                                        │  │
│  │  ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌─────────┐ │  │
│  │  │Task Router│ │Health Chk │ │ Risk Gate │ │Usage Ldgr │ │MergeGate│ │  │
│  │  │           │ │           │ │           │ │           │ │         │ │  │
│  │  │• task type│ │• git diff │ │LOW:auto   │ │• tokens   │ │• tests  │ │  │
│  │  │• quota    │ │• tests    │ │MED:ask cmd│ │• cost     │ │• review │ │  │
│  │  │• risk     │ │• tokens   │ │HIGH:ask   │ │• rate     │ │• risks  │ │  │
│  │  │           │ │• errors   │ │CRIT:block │ │           │ │• lock   │ │  │
│  │  └───────────┘ └───────────┘ └───────────┘ └───────────┘ └─────────┘ │  │
│  │                                                                        │  │
│  │  ┌───────────────────┐  ┌───────────────────┐  ┌───────────────────┐  │  │
│  │  │ MCP Budget Control│  │ Secret Handling   │  │ Human Interrupt   │  │  │
│  │  │ • server budgets  │  │ • log redaction   │  │ V1: CLI blocking  │  │  │
│  │  │ • tool budgets    │  │ • never-store     │  │ V2: webhook/async │  │  │
│  │  │ • allowlist       │  │ • approval-req    │  │ • timeout policy  │  │  │
│  │  └───────────────────┘  └───────────────────┘  └───────────────────┘  │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
├─────────────────────────────────────────────────────────────────────────────┤
│                          PERSISTENCE (SQLite)                                │
├─────────────────────────────────────────────────────────────────────────────┤
│  agents │ tasks │ runs │ health_samples │ approvals │ usage_daily │ mcp_use │
├─────────────────────────────────────────────────────────────────────────────┤
│                          ADAPTER LAYER                                       │
├─────────────────────────────────────────────────────────────────────────────┤
│         ┌─────────────────┐              ┌─────────────────┐                │
│         │ CLIAgentAdapter │              │   LLMAdapter    │                │
│         └────────┬────────┘              └────────┬────────┘                │
│                  │                                │                         │
├──────────────────┼────────────────────────────────┼─────────────────────────┤
│   CONTROL PLANE A (CLI)                CONTROL PLANE B (API)                │
│                  │                                │                         │
│   ┌──────────────┴──────────────┐    ┌───────────┴───────────┐             │
│   ▼              ▼              ▼    ▼          ▼           ▼             │
│ ┌───────┐  ┌──────────┐  ┌─────────┐ ┌───────┐ ┌───────┐ ┌───────┐       │
│ │Claude │  │ Gemini   │  │ Codex   │ │Claude │ │OpenAI │ │Gemini │       │
│ │Code   │  │ CLI      │  │ CLI     │ │Agent  │ │Agents │ │ API   │       │
│ │       │  │          │  │         │ │SDK    │ │SDK    │ │       │       │
│ │⚠️ No  │  │ReAct+MCP │  │Autonomy │ │       │ │       │ │       │       │
│ │persist│  │1M context│  │ modes   │ │       │ │.usage │ │       │       │
│ └───────┘  └──────────┘  └─────────┘ └───────┘ └───────┘ └───────┘       │
└─────────────────────────────────────────────────────────────────────────────┘
```

---

## Critical Gotchas

### 1. Concurrency & Repo Safety

```bash
# Git worktrees per agent (REQUIRED)
# -b creates the branch; drop it if the branch already exists
git worktree add -b feature/claude-work ../agent-claude
git worktree add -b feature/gemini-work ../agent-gemini
```

**Enforce:**
- ✅ Git worktrees per agent
- ✅ Merge gate - one merge at a time
- ✅ Reviewer agent approval
- ✅ Tests must pass
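For the orchestrator to set up worktrees itself, the shell commands can be generated per agent. A minimal sketch, assuming the `agent-<name>` / `feature/<name>-work` naming used in the shell example and a base ref of `main`; both function names are hypothetical:

```python
import subprocess
from pathlib import Path


def worktree_add_command(agent: str, base_ref: str = "main", root: Path = Path("..")) -> list[str]:
    """Build the `git worktree add` command for one agent.

    The naming scheme (agent-<name>, feature/<name>-work) is a convention
    taken from the shell example, not something git requires.
    """
    path = root / f"agent-{agent}"
    branch = f"feature/{agent}-work"
    # -b creates the branch from base_ref, so it need not exist beforehand
    return ["git", "worktree", "add", "-b", branch, str(path), base_ref]


def create_worktree(agent: str, base_ref: str = "main") -> None:
    """Create the worktree; raises CalledProcessError if git refuses."""
    subprocess.run(worktree_add_command(agent, base_ref), check=True, capture_output=True)
```

Keeping command construction separate from execution makes the naming rules testable without a repository.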

### 2. Key Hygiene

```python
# DO THIS
import os
from dotenv import load_dotenv
load_dotenv()
ANTHROPIC_KEY = os.environ["ANTHROPIC_API_KEY"]

# NEVER THIS
ANTHROPIC_KEY = "sk-ant-xxxxx"  # Compromised!
```
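A bare `os.environ[...]` lookup fails with a `KeyError` that is easy to misread in orchestrator logs. A small sketch of a fail-fast loader plus a masking helper for log output, using only the standard library; `require_env` and `mask` are hypothetical helper names:

```python
import os


def require_env(name: str) -> str:
    """Return a required secret from the environment or fail with a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        # Name the variable, never echo a value
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


def mask(secret: str, visible: int = 4) -> str:
    """Mask a secret for log output, keeping only the last few characters."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]
```

Calling `require_env("ANTHROPIC_API_KEY")` at startup surfaces a missing key before any agent is launched, and `mask()` gives operators enough of the key to tell two credentials apart without exposing either.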

### 3. Headless Mode Doesn't Persist

Claude Code headless mode is per-session. State continuity is YOUR responsibility:
- `project_state.json` for context
- SQLite for orchestrator state
- StatusPacket for run outputs
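The three pieces above fit together as "inject state before the run, capture the StatusPacket after it". A minimal sketch, assuming `project_state.json` carries `objectives`, `constraints`, and `open_risks`, and a reduced `runs` table with only `agent_id`, `status_packet_json`, and `ended_at`. The function names and the prompt wording are hypothetical:

```python
import json
import sqlite3
from pathlib import Path
from typing import Optional


def build_prompt(state_path: Path, task: str) -> str:
    """Prepend persisted project context to a fresh headless session's prompt."""
    state = json.loads(state_path.read_text(encoding="utf-8"))
    context = json.dumps(
        {k: state.get(k) for k in ("objectives", "constraints", "open_risks")},
        indent=2,
    )
    return f"Project state:\n{context}\n\nTask:\n{task}"


def record_status_packet(conn: sqlite3.Connection, agent_id: str, packet: dict) -> None:
    """Store the run's StatusPacket so the next session can pick it up."""
    conn.execute(
        "INSERT INTO runs (agent_id, status_packet_json, ended_at) "
        "VALUES (?, ?, datetime('now'))",
        (agent_id, json.dumps(packet)),
    )
    conn.commit()


def latest_status_packet(conn: sqlite3.Connection, agent_id: str) -> Optional[dict]:
    """Return the most recent StatusPacket for an agent, if any."""
    row = conn.execute(
        "SELECT status_packet_json FROM runs WHERE agent_id = ? "
        "ORDER BY ended_at DESC, rowid DESC LIMIT 1",
        (agent_id,),
    ).fetchone()
    return json.loads(row[0]) if row else None
```

The `rowid DESC` tiebreak matters because `datetime('now')` has one-second resolution, so two runs that finish within the same second would otherwise come back in an undefined order.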

### 4. License Awareness

| Tool | License | Commercial Use |
|------|---------|----------------|
| Claude Squad | AGPL-3.0 | ⚠️ Requires source disclosure |
| DIY (tmux+worktrees) | N/A | ✅ Your code |
| CrewAI | MIT | ✅ Permissive |
| LangGraph | MIT | ✅ Permissive |
| AI Observer | MIT | ✅ Permissive |
| Token Audit | MIT | ✅ Permissive |

---

## Production CLI + Orchestration Stack

### The Stack

| Component | Tool | Why |
|-----------|------|-----|
| **Multi-agent workspaces** | DIY (tmux+worktrees) | AGPL-free; same tmux + worktree pattern as Claude Squad |
| **Local ledger** | Token Audit | MCP queryable, token spike diagnosis |
| **Real-time dashboard** | AI Observer | Stuck detection, Parquet export |
| **Persistence** | SQLite | Restart recovery, analytics |
| **Claude automation** | Claude Agent SDK | Programmatic sessions; the CLI equivalent is headless `claude -p` with `--output-format json` |
| **OpenAI/Codex work** | OpenAI Agents SDK | Built-in usage tracking |
| **Gemini tasks** | Gemini CLI or API | Context-heavy, search-grounded |
| **Unified routing** | LiteLLM (optional) | Only if centralized policy needed |

### Installation

```bash
# Core orchestration
pip install claude-agent-sdk
pip install openai-agents
pip install langgraph
pip install python-dotenv  # Used by the key-hygiene example

# Observability
pip install token-audit  # From littlebearapps/token-audit

# Optional routing
pip install litellm

# System dependencies (for DIY workspaces)
# tmux, git (already installed on most systems)
```
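A preflight check at orchestrator startup can confirm the stack is actually present. A small sketch using only the standard library; `missing_dependencies` is a hypothetical helper, and the import names in the usage note are assumptions to verify against each package's documentation:

```python
import importlib.util
import shutil


def missing_dependencies(modules: list[str], binaries: list[str]) -> list[str]:
    """Return the names of Python modules and executables that cannot be found.

    Pass top-level module names only (no dotted paths).
    """
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    missing += [b for b in binaries if shutil.which(b) is None]
    return missing
```

A typical call would be `missing_dependencies(["claude_agent_sdk", "agents", "langgraph"], ["git", "tmux"])`, assuming `claude-agent-sdk` imports as `claude_agent_sdk` and `openai-agents` as `agents`. An empty list means the stack is ready.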

---

## Phased Implementation Plan

### Phase 1: Foundation (Adapter Layer + Persistence)

- [ ] Create project structure
- [ ] Set up SQLite schema
- [ ] Implement `BaseAdapter` interface
- [ ] Implement `LLMAdapter` for Claude Agent SDK
- [ ] Implement `LLMAdapter` for OpenAI Agents SDK
- [ ] Implement `ProjectJournal` (project_state.json + StatusPacket)
- [ ] Environment variable configuration
- [ ] Secret redaction in logging

### Phase 2: CLI Integration

- [ ] Implement `CLIAgentAdapter` for Claude Code (JSON output mode)
- [ ] Implement `CLIAgentAdapter` for Gemini CLI
- [ ] Implement DIY workspace manager (tmux + worktrees)
- [ ] Handle headless mode state injection/capture
- [ ] Set up git worktrees per agent

### Phase 3: Control Loop + Risk Gate

- [ ] Implement health check signals (git diff, test runs, token burn)
- [ ] Implement stuck agent heuristics
- [ ] Build auto-prompt logic for recoverable situations
- [ ] Implement four-tier risk classification (LOW/MEDIUM/HIGH/CRITICAL)
- [ ] Implement CRITICAL blocklist (auto-reject)
- [ ] Build `AutonomyGate`

### Phase 4: Human Interrupt + Approvals

- [ ] Implement V1 CLI blocking prompt
- [ ] Implement V2 webhook/async handler
- [ ] Build approval queue with timeout
- [ ] Integrate with risk gate

### Phase 5: Budget Controls

- [ ] Implement MCP server budgets
- [ ] Implement per-tool budgets
- [ ] Build tool registry allowlist
- [ ] Integrate Token Audit MCP queries
- [ ] Set up budget alerts

### Phase 6: Merge Gate + Observability

- [ ] Implement merge readiness checks
- [ ] Build merge gate with lock
- [ ] Deploy AI Observer dashboard
- [ ] Configure stuck agent alerts
- [ ] Configure cost threshold alerts
- [ ] Set up daily summary reports

### Phase 7: Production Hardening

- [ ] Implement rate limiting with backoff
- [ ] Add fallback models for failures
- [ ] Set up complete audit trail
- [ ] Load testing with multiple concurrent agents
- [ ] Document operational procedures
- [ ] Create runbooks for common issues
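
Two of the Phase 7 items, rate limiting with backoff and fallback models, can be handled by one wrapper. A minimal sketch under stated assumptions: `RateLimited` stands in for whatever rate-limit exception a provider SDK raises, `call` is any function that takes a model name, and the model names are placeholders:

```python
import random
import time
from typing import Callable, Sequence


class RateLimited(Exception):
    """Hypothetical stand-in for a provider's rate-limit error."""


def call_with_backoff(
    call: Callable[[str], str],
    models: Sequence[str],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each model in order; retry rate-limited calls with exponential backoff."""
    for model in models:
        for attempt in range(max_retries):
            try:
                return call(model)
            except RateLimited:
                # Full jitter: wait a random time up to base_delay * 2^attempt
                sleep(random.uniform(0, base_delay * 2 ** attempt))
        # Retries exhausted for this model; fall through to the next one
    raise RuntimeError(f"All models exhausted: {', '.join(models)}")
```

Injecting `sleep` keeps the wrapper testable without real delays, and the jitter prevents several agents that were throttled together from retrying in lockstep.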

---

## References

### Official Documentation

- [Claude Code Headless Mode](https://docs.anthropic.com/claude-code/headless) ⚠️ Per-session only
- [Claude Agent SDK Docs](https://platform.claude.com/docs/en/agent-sdk/overview)
- [OpenAI Agents SDK Docs](https://openai.github.io/openai-agents-python/)
- [Gemini CLI Docs](https://cloud.google.com/docs/gemini-cli)
- [LangGraph Documentation](https://langchain-ai.github.io/langgraph/)
- [LiteLLM Documentation](https://docs.litellm.ai/)

### GitHub Repositories

**Orchestration SDKs:**
- [Claude Agent SDK](https://github.com/anthropics/claude-agent-sdk-python)
- [OpenAI Agents SDK](https://github.com/openai/openai-agents-python)
- [LangGraph](https://github.com/langchain-ai/langgraph)
- [CrewAI](https://github.com/crewAIInc/crewAI)

**CLI Tools:**
- [Claude Squad](https://github.com/smtg-ai/claude-squad) ⚠️ AGPL-3.0
- [Gemini CLI](https://github.com/google-gemini/gemini-cli)
- [Codex CLI](https://github.com/openai/codex)

**Observability:**
- [AI Observer](https://github.com/tobilg/ai-observer)
- [Token Audit](https://github.com/littlebearapps/token-audit)
- [Langfuse](https://github.com/langfuse/langfuse)
- [AgentOps](https://github.com/AgentOps-AI/agentops)

**Routing:**
- [LiteLLM](https://github.com/BerriAI/litellm)

### Known Issues / Context

- [Claude Code Token Overhead from MCP Tools](https://github.com/anthropics/claude-code/issues) - Tool descriptions can cause huge initial token overhead
- [Codex Access Model](https://platform.openai.com/docs/guides/codex) - Auto vs Read-only vs Full Access modes

### Comparison Articles

- [Claude Code vs Cline vs Aider - AIMultiple](https://research.aimultiple.com/agentic-cli/)
- [AI Agent Observability Tools - AIMultiple](https://research.aimultiple.com/agentic-monitoring/)
- [AI Coding Assistants CLI Comparison - DeployHQ](https://www.deployhq.com/blog/ai-coding-assistants-cli)
