⚔️ PR #17: Unified Worker State

Event log recovery + worker_events MCP tool — closing epic cic-bbd

+2,763

Lines added

Files changed

Tests passing

6/6

Subtasks closed

The Problem

Two disconnected views of worker state, neither complete:

graph LR
    subgraph Before["Before This PR"]
        direction TB
        LW["list_workers<br/>(MCP tool)"]
        EL["events.jsonl<br/>(on disk)"]
        LW -.-|"In-memory only<br/>Wiped on restart"| R1["Empty after restart"]
        EL -.-|"No MCP access<br/>Shell scripts only"| R2["Rich history, no API"]
    end
    style Before fill:#1a0000,stroke:#f85149

After a server restart, list_workers returned an empty list even when workers were still alive in tmux. Monitoring agents had to shell out and parse events.jsonl by hand.

The Solution

graph TB
    subgraph After["After This PR"]
        direction TB
        BOOT["MCP Server Starts"] --> SNAP["Read latest snapshot"]
        SNAP --> EVENTS["Read events since snapshot"]
        EVENTS --> RECOVER["recover_from_events()"]
        RECOVER --> REG["SessionRegistry<br/>(populated)"]
        REG --> LW2["list_workers<br/>shows ALL workers"]
        REG --> WE["worker_events<br/>(NEW MCP tool)"]
        LIVE["Live tmux sessions"] --> REG
    end
    style After fill:#001a00,stroke:#3fb950

Part 1: Event Log Recovery

On startup, the registry reconstructs known worker state from the event log. list_workers becomes the single source of truth.

How Recovery Works

sequenceDiagram
    participant S as MCP Server
    participant R as SessionRegistry
    participant E as events.jsonl
    participant T as tmux

    Note over S: Server starts (lifespan)
    S->>E: get_latest_snapshot()
    E-->>S: snapshot (worker states at time T)
    S->>E: read_events_since(T)
    E-->>S: events after snapshot

    S->>R: recover_from_events(snapshot, events)

    Note over R: For each worker in snapshot
    R->>R: Check if already in live registry
    alt Already live
        R->>R: Skip (don't overwrite)
    else Not live
        R->>R: Create RecoveredSession
        R->>R: Apply subsequent events
    end

    R-->>S: RecoveryReport(added, skipped, closed)

    Note over S: Later: list_workers call
    S->>R: list_all()
    R->>T: Check live sessions
    R-->>S: ManagedSession[] + RecoveredSession[]
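
The flow in the diagram above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in registry.py: the dict shapes for the snapshot and events, the event type names in `STATE_BY_EVENT`, the `Recovered` class, and the `recover` helper are all assumptions made for the example. It also assumes that a worker first seen after the snapshot is recovered too, which the PR description does not state.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Recovered:  # hypothetical stand-in for RecoveredSession
    session_id: str
    name: str
    event_state: str  # "idle" | "active" | "closed"


# Assumed mapping from event type to the state it leaves the worker in.
STATE_BY_EVENT = {
    "worker_started": "active",
    "worker_active": "active",
    "worker_idle": "idle",
    "worker_closed": "closed",
}


def recover(snapshot, events, live_ids):
    """Return (recovered sessions by id, counts of added/skipped/closed)."""
    recovered = {}
    skipped = 0
    for worker in snapshot.get("workers", []):
        sid = worker["session_id"]
        if sid in live_ids:
            skipped += 1  # never overwrite a live session
            continue
        recovered[sid] = Recovered(sid, worker["name"], worker["state"])

    # Replay events recorded after the snapshot, oldest first.
    for event in events:
        sid = event["worker_id"]
        new_state = STATE_BY_EVENT.get(event["type"])
        if sid in live_ids or new_state is None:
            continue
        if sid in recovered:
            # Frozen dataclass: build an updated copy instead of mutating.
            recovered[sid] = replace(recovered[sid], event_state=new_state)
        else:
            # Assumption: workers started after the snapshot are recovered too.
            name = event.get("data", {}).get("name", sid)
            recovered[sid] = Recovered(sid, name, new_state)

    closed = sum(1 for s in recovered.values() if s.event_state == "closed")
    report = {"added": len(recovered), "skipped": skipped, "closed": closed}
    return recovered, report
```

The returned counts mirror the `added`, `skipped`, and `closed` fields of the report in the diagram. The live-session check comes first so that recovery can only add information and never replaces a session that still has a terminal handle.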

Key Types

NEW RecoveredSession

A lightweight, frozen dataclass representing a session restored from the event log. It carries full metadata about the worker's last known state but has no terminal handle: you can see it in list_workers, but you can't send messages to it or close it.

RecoveredSession vs Adoption: Recovery gives you visibility — "this worker existed and was idle." Adoption (discover_workers → adopt_worker) gives you control — it reconnects to the live tmux pane and creates a full ManagedSession. Think of recovery as the map, adoption as taking the wheel.

Future: Auto-adopt on startup? Recovery works even when the tmux session is gone (crashed, rebooted, exited cleanly). Adoption requires a live tmux pane. A natural follow-up: after recovery, run discovery → match recovered workers to live tmux sessions → auto-adopt. This would give full control back automatically on restart for any workers still alive.

@dataclass(frozen=True)
class RecoveredSession:
    session_id: str          # "a3f2b1c9"
    name: str                # "Groucho"
    project_path: str        # "/Users/.../claude-team"
    terminal_id: TerminalId  # May be stale after restart
    agent_type: AgentType    # "claude" | "codex"
    status: SessionStatus    # Mapped from event_state
    event_state: EventState  # "idle" | "active" | "closed"
    source: str = "event_log"  # Provenance tracking
    # ... timestamps, optional fields

src/claude_team_mcp/registry.py
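
One practical consequence of `frozen=True` is that a recovered session cannot be edited in place, so any state change has to produce a new instance. The sketch below shows this with a trimmed-down, hypothetical `MiniRecovered` class; it stands in for the real dataclass and drops the project-specific field types.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class MiniRecovered:  # hypothetical, trimmed-down stand-in
    session_id: str
    name: str
    event_state: str
    source: str = "event_log"


idle = MiniRecovered("a3f2b1c9", "Groucho", "idle")

try:
    idle.event_state = "closed"  # frozen: attribute assignment is rejected
    mutated = True
except FrozenInstanceError:
    mutated = False

# Updates go through dataclasses.replace, which returns a new object
# and leaves the original untouched.
closed = replace(idle, event_state="closed")
```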

NEW RecoveryReport

Returned by recover_from_events() with counts of what happened:

@dataclass(frozen=True)
class RecoveryReport:
    added: int     # Sessions added from event log
    skipped: int   # Already in live registry
    closed: int    # Marked as closed from events
    timestamp: datetime

CHANGED ManagedSession.to_dict()

Now includes source: "registry" so clients can distinguish live vs recovered:

# Live session
{"name": "Ratchet", "source": "registry", "status": "ready", ...}

# Recovered session
{"name": "Hokusai", "source": "event_log", "event_state": "idle", ...}
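
On the client side, the `source` field is enough to separate sessions that can be controlled from those that are visible only. A small sketch, assuming list_workers output has already been parsed into a list of dicts shaped like the two entries above (the `split_by_source` helper is hypothetical):

```python
def split_by_source(workers):
    """Partition list_workers entries into (live, recovered) by their source field."""
    live = [w for w in workers if w.get("source") == "registry"]
    recovered = [w for w in workers if w.get("source") == "event_log"]
    return live, recovered


workers = [
    {"name": "Ratchet", "source": "registry", "status": "ready"},
    {"name": "Hokusai", "source": "event_log", "event_state": "idle"},
]
live, recovered = split_by_source(workers)
```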

Startup Flow

Recovery happens in two places for reliability:

Eager recovery in app_lifespan — runs once at boot, logs results
Lazy fallback in list_workers — if registry is empty and recovery hasn't been attempted, triggers it before listing
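
The two entry points above can share a single "attempted" flag so that recovery runs at most once. The following is a sketch of that pattern only: `RecoveryGate`, the dict-based registry, and this simplified `list_workers` are hypothetical and do not reflect the actual code in server.py.

```python
class RecoveryGate:
    """Hypothetical helper: remembers whether recovery has been attempted."""

    def __init__(self, recover):
        self._recover = recover
        self.attempted = False

    def run(self):
        # Set the flag before running so a failing recovery
        # is not retried on every later call.
        self.attempted = True
        return self._recover()


def list_workers(registry, gate):
    # Lazy fallback: only when nothing is registered and recovery never ran.
    if not registry and not gate.attempted:
        registry.update(gate.run())
    return sorted(registry)
```

In this sketch the eager path would call `gate.run()` once during startup; the lazy path then becomes a no-op, because `gate.attempted` is already set even if recovery found nothing.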

Part 2: worker_events MCP Tool

New MCP tool for querying the event log programmatically. Replaces shell-script parsing with a clean API.

Tool Signature

worker_events(
    since: str = "",                    # ISO timestamp filter
    limit: int = 100,                   # Max events returned
    include_snapshot: bool = False,     # Include latest snapshot
    include_summary: bool = False,      # Include computed summary
    stale_threshold_minutes: int = 10,  # For stuck detection (10min default)
    project_filter: str = ""            # Filter by project path
) -> dict
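
To make the `since` and `limit` parameters concrete, here is one way such filtering could work. This is a sketch under assumptions: the helper names are hypothetical, and whether the real tool treats `since` as inclusive or keeps the newest events when truncating is not stated in this PR.

```python
from datetime import datetime, timezone


def parse_ts(value):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: naive timestamps are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def filter_events(events, since="", limit=100):
    """Keep events at or after `since`, then return the newest `limit` of them."""
    if since:
        cutoff = parse_ts(since)
        events = [e for e in events if parse_ts(e["ts"]) >= cutoff]
    return events[-limit:] if limit > 0 else []
```

An empty `since` disables the time filter, matching the empty-string default in the signature.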

Response Shape

{
  "events": [
    {"ts": "2026-01-31T14:53:12Z", "type": "worker_started",
     "worker_id": "aa7f7606", "data": {...}},
    {"ts": "2026-01-31T15:31:26Z", "type": "worker_idle",
     "worker_id": "aa7f7606", "data": {...}},
  ],
  "count": 2,

  # When include_summary=True:
  "summary": {
    "started": ["Ratchet"],
    "closed": [],
    "idle": ["Ratchet"],
    "active": ["Clank"],
    "stuck": [],           # idle > stale_threshold
    "last_event_ts": "2026-01-31T15:31:26Z"
  },

  # When include_snapshot=True:
  "snapshot": {"ts": "...", "data": {...}}
}

Stuck Worker Detection

The summary flags a worker as "stuck" when it has been idle longer than stale_threshold_minutes (default: 10). The short default favors fast detection, so monitoring agents can act on stale workers quickly.

Edge case — thinking agents: Activity is based on event log entries. If an agent is actively thinking but hasn't produced visible output yet, it may appear stale. Increase stale_threshold_minutes for workflows with long thinking phases (e.g., Codex reading a large codebase).
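
The stuck rule can be illustrated with a short sketch. It assumes each worker's most recent event is already known as an (event type, timestamp) pair; the `find_stuck` helper, the input shape, and the `worker_active` event name are hypothetical and chosen for the example.

```python
from datetime import datetime, timedelta, timezone


def find_stuck(last_events, now, stale_threshold_minutes=10):
    """Return names of workers whose latest event is 'worker_idle'
    and is older than the threshold.

    `last_events` maps worker name -> (event type, timestamp of that event).
    """
    threshold = timedelta(minutes=stale_threshold_minutes)
    return sorted(
        name
        for name, (event_type, ts) in last_events.items()
        if event_type == "worker_idle" and now - ts > threshold
    )


now = datetime(2026, 1, 31, 16, 0, tzinfo=timezone.utc)
last_events = {
    "Ratchet": ("worker_idle", now - timedelta(minutes=29)),
    "Clank": ("worker_active", now - timedelta(minutes=45)),
    "Hokusai": ("worker_idle", now - timedelta(minutes=3)),
}
```

With the default threshold only Ratchet qualifies: Clank's last event is older but is not an idle event, and Hokusai has been idle for less than ten minutes. Raising `stale_threshold_minutes` to 30 clears the list, which is the adjustment suggested above for long thinking phases.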

Architecture Overview

graph TB
    subgraph Tools["MCP Tools"]
        LW["list_workers"]
        WE["worker_events"]
        SW["spawn_workers"]
    end

    subgraph Registry["SessionRegistry"]
        LIVE["Live Sessions<br/>(ManagedSession)"]
        REC["Recovered Sessions<br/>(RecoveredSession)"]
        RFROM["recover_from_events()"]
    end

    subgraph Storage["Persistent Storage"]
        ELOG["events.jsonl"]
        SNAP["snapshots"]
    end

    subgraph Terminal["Terminal Backend"]
        TMUX["tmux panes"]
    end

    SW -->|creates| LIVE
    LIVE -->|emits events| ELOG
    ELOG -->|periodic| SNAP

    LW -->|queries| LIVE
    LW -->|queries| REC
    WE -->|reads| ELOG
    WE -->|reads| SNAP

    SNAP -->|on startup| RFROM
    ELOG -->|on startup| RFROM
    RFROM -->|populates| REC

    LIVE ---|terminal handle| TMUX
    REC -.-|no terminal handle| TMUX

    style REC fill:#1a3a1a,stroke:#3fb950
    style WE fill:#1a2a3a,stroke:#58a6ff
    style RFROM fill:#1a3a1a,stroke:#3fb950

Subtasks

✅ cic-bbd.1 — RecoveredSession dataclass registry.py
✅ cic-bbd.2 — recover_from_events() method registry.py
✅ cic-bbd.3 — Startup recovery: seed registry on boot server.py
✅ cic-bbd.4 — worker_events MCP tool + tests tools/worker_events.py
✅ cic-bbd.5 — list_workers source fields tools/list_workers.py
✅ cic-bbd.6 — Comprehensive tests (46 tests) tests/

Files Changed

File	Change	Lines
`registry.py`	RecoveredSession, recover_from_events(), recovery types	+394
`server.py`	Startup recovery, lazy fallback	+76
`tools/__init__.py`	Register worker_events tool	+2
`tools/list_workers.py`	Source field in output	+19
`tools/worker_events.py`	NEW MCP tool	+273
`test_recovered_session.py`	NEW 16 tests	+313
`test_recover_from_events.py`	NEW 18 tests	+705
`test_startup_recovery.py`	NEW 12 tests	+413
`test_worker_events.py`	NEW 10 tests	+539
`test_registry.py`	Source field assertion	+40
