⚔️ PR #17: Unified Worker State

Event log recovery + worker_events MCP tool — closing epic cic-bbd

+2,763 lines added
10 files changed
98 tests passing
6/6 subtasks closed

The Problem

Two disconnected views of worker state, neither complete:

graph LR
    subgraph Before["Before This PR"]
        direction TB
        LW["list_workers<br/>(MCP tool)"]
        EL["events.jsonl<br/>(on disk)"]
        LW -.-|"In-memory only<br/>Wiped on restart"| R1["Empty after restart"]
        EL -.-|"No MCP access<br/>Shell scripts only"| R2["Rich history, no API"]
    end
    style Before fill:#1a0000,stroke:#f85149

After restart, list_workers returned empty even if workers were still alive in tmux. Monitoring agents had to shell out and parse events.jsonl manually.

The Solution

graph TB
    subgraph After["After This PR"]
        direction TB
        BOOT["MCP Server Starts"] --> SNAP["Read latest snapshot"]
        SNAP --> EVENTS["Read events since snapshot"]
        EVENTS --> RECOVER["recover_from_events()"]
        RECOVER --> REG["SessionRegistry<br/>(populated)"]
        REG --> LW2["list_workers<br/>shows ALL workers"]
        REG --> WE["worker_events<br/>(NEW MCP tool)"]
        LIVE["Live tmux sessions"] --> REG
    end
    style After fill:#001a00,stroke:#3fb950

Part 1: Event Log Recovery

On startup, the registry reconstructs known worker state from the event log. list_workers becomes the single source of truth.

How Recovery Works

sequenceDiagram
    participant S as MCP Server
    participant R as SessionRegistry
    participant E as events.jsonl
    participant T as tmux

    Note over S: Server starts (lifespan)
    S->>E: get_latest_snapshot()
    E-->>S: snapshot (worker states at time T)
    S->>E: read_events_since(T)
    E-->>S: events after snapshot

    S->>R: recover_from_events(snapshot, events)

    Note over R: For each worker in snapshot
    R->>R: Check if already in live registry
    alt Already live
        R->>R: Skip (don't overwrite)
    else Not live
        R->>R: Create RecoveredSession
        R->>R: Apply subsequent events
    end

    R-->>S: RecoveryReport(added, skipped, closed)

    Note over S: Later: list_workers call
    S->>R: list_all()
    R->>T: Check live sessions
    R-->>S: ManagedSession[] + RecoveredSession[]
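
The recovery flow in the diagram can be sketched roughly as follows. This is a minimal illustration, not the code in registry.py. The snapshot layout (a `workers` mapping of id to state), the `worker_active` and `worker_closed` event types, the event-to-state mapping, and the `Report` stand-in are all assumptions made for the example. Only `worker_started` and `worker_idle` appear in this write-up.

```python
from dataclasses import dataclass

# Hypothetical mapping from event type to event_state; the real mapping
# lives in the PR and may differ.
EVENT_TO_STATE = {
    "worker_started": "active",
    "worker_active": "active",
    "worker_idle": "idle",
    "worker_closed": "closed",
}


@dataclass(frozen=True)
class Report:
    """Stand-in for RecoveryReport (timestamp omitted for brevity)."""
    added: int
    skipped: int
    closed: int


def recover(live_ids, snapshot, events):
    """Rebuild last-known worker state without touching live sessions.

    live_ids: ids already present in the live registry
    snapshot: {"workers": {worker_id: {"state": ...}}}  (assumed layout)
    events:   events recorded after the snapshot, oldest first
    """
    recovered = {}
    skipped = 0

    # Seed from the snapshot, skipping anything that is already live.
    for worker_id, info in snapshot.get("workers", {}).items():
        if worker_id in live_ids:
            skipped += 1
            continue
        recovered[worker_id] = info.get("state", "idle")

    # Replay later events so the final state reflects the newest entry.
    for event in events:
        worker_id = event["worker_id"]
        if worker_id in live_ids:
            continue
        state = EVENT_TO_STATE.get(event["type"])
        if state is not None:
            recovered[worker_id] = state

    closed = sum(1 for s in recovered.values() if s == "closed")
    return recovered, Report(added=len(recovered), skipped=skipped, closed=closed)


snapshot = {"workers": {"aa7f7606": {"state": "active"},
                        "bb000001": {"state": "idle"}}}
events = [
    {"type": "worker_idle", "worker_id": "aa7f7606"},
    {"type": "worker_started", "worker_id": "cc000002"},
    {"type": "worker_closed", "worker_id": "cc000002"},
]
states, report = recover({"bb000001"}, snapshot, events)
```

Here `bb000001` is already live and is skipped, `aa7f7606` ends up idle because the later event overrides the snapshot, and `cc000002` is recovered purely from events and ends up closed.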
      

Key Types

NEW RecoveredSession

Lightweight, frozen dataclass representing a session restored from the event log. It provides full metadata about the worker's last known state, but has no terminal handle — meaning you can see it in list_workers but can't send messages to it or close it.

RecoveredSession vs Adoption: Recovery gives you visibility ("this worker existed and was idle"). Adoption (discover_workers, then adopt_worker) gives you control: it reconnects to the live tmux pane and creates a full ManagedSession. Think of recovery as the map, adoption as taking the wheel.

Future: Auto-adopt on startup? Recovery works even when the tmux session is gone (crashed, rebooted, or exited cleanly), whereas adoption requires a live tmux pane. A natural follow-up: after recovery, run discovery, match recovered workers to live tmux sessions, and auto-adopt the matches. That would restore full control automatically on restart for any workers still alive.

@dataclass(frozen=True)
class RecoveredSession:
    session_id: str          # "a3f2b1c9"
    name: str                # "Groucho"
    project_path: str        # "/Users/.../claude-team"
    terminal_id: TerminalId  # May be stale after restart
    agent_type: AgentType    # "claude" | "codex"
    status: SessionStatus    # Mapped from event_state
    event_state: EventState  # "idle" | "active" | "closed"
    source: str = "event_log"  # Provenance tracking
    # ... timestamps, optional fields

src/claude_team_mcp/registry.py

NEW RecoveryReport

Returned by recover_from_events() with counts of what happened:

@dataclass(frozen=True)
class RecoveryReport:
    added: int     # Sessions added from event log
    skipped: int   # Already in live registry
    closed: int    # Marked as closed from events
    timestamp: datetime

CHANGED ManagedSession.to_dict()

Now includes source: "registry" so clients can distinguish live vs recovered:

# Live session
{"name": "Ratchet", "source": "registry", "status": "ready", ...}

# Recovered session
{"name": "Hokusai", "source": "event_log", "event_state": "idle", ...}
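
A client can branch on the `source` field to decide which workers it may message or close. The sketch below assumes list_workers output can be treated as a list of dicts shaped like the two examples above. The helper name `split_by_source` is hypothetical.

```python
def split_by_source(workers):
    """Partition list_workers-style dicts into controllable and view-only."""
    live = [w for w in workers if w.get("source") == "registry"]
    recovered = [w for w in workers if w.get("source") == "event_log"]
    return live, recovered


workers = [
    {"name": "Ratchet", "source": "registry", "status": "ready"},
    {"name": "Hokusai", "source": "event_log", "event_state": "idle"},
]
live, recovered = split_by_source(workers)
print([w["name"] for w in live])       # ['Ratchet']
print([w["name"] for w in recovered])  # ['Hokusai']
```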

Startup Flow

Recovery happens in two places for reliability:

  1. Eager recovery in app_lifespan — runs once at boot, logs results
  2. Lazy fallback in list_workers — if registry is empty and recovery hasn't been attempted, triggers it before listing
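
The eager-plus-lazy pattern can be illustrated with a toy registry. This is a sketch under assumptions, not the server.py implementation: the `Registry` class, the `recovery_attempted` flag, and the `load_state` callable are invented for the example.

```python
class Registry:
    """Toy registry that tracks whether recovery has been attempted."""

    def __init__(self, load_state):
        self._load_state = load_state  # callable returning {worker_id: state}
        self.sessions = {}
        self.recovery_attempted = False

    def recover(self):
        # Mark the attempt first so a failing or empty load is not retried
        # on every subsequent list_workers call.
        self.recovery_attempted = True
        for worker_id, state in self._load_state().items():
            # Never overwrite a session that is already registered.
            self.sessions.setdefault(worker_id, state)

    def list_workers(self):
        # Lazy fallback: only when the registry is empty and recovery never ran.
        if not self.sessions and not self.recovery_attempted:
            self.recover()
        return dict(self.sessions)


calls = []


def load():
    calls.append(1)
    return {"aa7f7606": "idle"}


reg = Registry(load)         # eager recovery at boot did not run here
first = reg.list_workers()   # triggers the lazy fallback
second = reg.list_workers()  # recovery is not repeated
```

Setting the flag before loading means an empty event log triggers recovery at most once, which matches the "hasn't been attempted" wording above.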

Part 2: worker_events MCP Tool

New MCP tool for querying the event log programmatically. Replaces shell-script parsing with a clean API.

Tool Signature

worker_events(
    since: str = "",                    # ISO timestamp filter
    limit: int = 100,                   # Max events returned
    include_snapshot: bool = False,     # Include latest snapshot
    include_summary: bool = False,      # Include computed summary
    stale_threshold_minutes: int = 10,  # For stuck detection (10min default)
    project_filter: str = ""            # Filter by project path
) -> dict
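
To make the filter parameters concrete, here is a rough sketch of how `since`, `limit`, and `project_filter` might be applied to a list of events. Several details are assumptions: that `since` is exclusive, that the most recent events are kept when the limit is exceeded, that the project path lives under `data["project_path"]`, and that every timestamp carries a timezone. The actual tool may behave differently on each point.

```python
from datetime import datetime


def parse_ts(value):
    # fromisoformat() only accepts a trailing "Z" on Python 3.11+,
    # so normalise it for older interpreters.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def filter_events(events, since="", limit=100, project_filter=""):
    """Apply worker_events-style filters to an in-memory list of events."""
    selected = events
    if since:
        cutoff = parse_ts(since)
        selected = [e for e in selected if parse_ts(e["ts"]) > cutoff]
    if project_filter:
        # Assumption: the project path is stored under data["project_path"].
        selected = [
            e for e in selected
            if e.get("data", {}).get("project_path", "").startswith(project_filter)
        ]
    # Assumption: when over the limit, keep the most recent events.
    return selected[-limit:] if limit > 0 else []


events = [
    {"ts": "2026-01-31T14:53:12Z", "type": "worker_started",
     "worker_id": "aa7f7606", "data": {"project_path": "/repo/a"}},
    {"ts": "2026-01-31T15:31:26Z", "type": "worker_idle",
     "worker_id": "aa7f7606", "data": {"project_path": "/repo/a"}},
]
recent = filter_events(events, since="2026-01-31T15:00:00Z")
newest = filter_events(events, limit=1)
other_project = filter_events(events, project_filter="/repo/b")
```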

Response Shape

{
  "events": [
    {"ts": "2026-01-31T14:53:12Z", "type": "worker_started",
     "worker_id": "aa7f7606", "data": {...}},
    {"ts": "2026-01-31T15:31:26Z", "type": "worker_idle",
     "worker_id": "aa7f7606", "data": {...}},
  ],
  "count": 2,

  # When include_summary=True:
  "summary": {
    "started": ["Ratchet"],
    "closed": [],
    "idle": ["Ratchet"],
    "active": ["Clank"],
    "stuck": [],           # idle > stale_threshold
    "last_event_ts": "2026-01-31T15:31:26Z"
  },

  # When include_snapshot=True:
  "snapshot": {"ts": "...", "data": {...}}
}

Stuck Worker Detection

The summary flags a worker as "stuck" when it has been idle longer than stale_threshold_minutes (default: 10). The low default favors fast detection, so monitoring agents can act on stale workers quickly.

Edge case — thinking agents: Activity is based on event log entries. If an agent is actively thinking but hasn't produced visible output yet, it may appear stale. Increase stale_threshold_minutes for workflows with long thinking phases (e.g., Codex reading a large codebase).
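
The stuck check can be sketched as below. It assumes the idle-only rule from the response-shape comment ("idle > stale_threshold"), that events arrive oldest first, and that the latest event per worker decides its state. The helper name `find_stuck` and the explicit `now` argument are illustrative, not part of the tool.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(value):
    # Normalise a trailing "Z" so fromisoformat() works before Python 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def find_stuck(events, now, stale_threshold_minutes=10):
    """Return ids of workers whose latest event is worker_idle and is
    older than the threshold."""
    latest = {}
    for event in events:                    # assumed oldest first
        latest[event["worker_id"]] = event  # last event per worker wins
    threshold = timedelta(minutes=stale_threshold_minutes)
    return sorted(
        worker_id
        for worker_id, event in latest.items()
        if event["type"] == "worker_idle"
        and now - parse_ts(event["ts"]) > threshold
    )


events = [
    {"ts": "2026-01-31T14:53:12Z", "type": "worker_started", "worker_id": "aa7f7606"},
    {"ts": "2026-01-31T15:31:26Z", "type": "worker_idle", "worker_id": "aa7f7606"},
]
now = datetime(2026, 1, 31, 15, 45, 0, tzinfo=timezone.utc)
stuck_default = find_stuck(events, now)                             # idle for ~13.5 minutes
stuck_relaxed = find_stuck(events, now, stale_threshold_minutes=30)
```

With the default threshold the worker is flagged. Raising the threshold to 30 minutes clears it, which is the adjustment suggested above for long thinking phases.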

Architecture Overview

graph TB
    subgraph Tools["MCP Tools"]
        LW["list_workers"]
        WE["worker_events"]
        SW["spawn_workers"]
    end

    subgraph Registry["SessionRegistry"]
        LIVE["Live Sessions<br/>(ManagedSession)"]
        REC["Recovered Sessions<br/>(RecoveredSession)"]
        RFROM["recover_from_events()"]
    end

    subgraph Storage["Persistent Storage"]
        ELOG["events.jsonl"]
        SNAP["snapshots"]
    end

    subgraph Terminal["Terminal Backend"]
        TMUX["tmux panes"]
    end

    SW -->|creates| LIVE
    LIVE -->|emits events| ELOG
    ELOG -->|periodic| SNAP
    LW -->|queries| LIVE
    LW -->|queries| REC
    WE -->|reads| ELOG
    WE -->|reads| SNAP
    SNAP -->|on startup| RFROM
    ELOG -->|on startup| RFROM
    RFROM -->|populates| REC
    LIVE ---|terminal handle| TMUX
    REC -.-|no terminal handle| TMUX

    style REC fill:#1a3a1a,stroke:#3fb950
    style WE fill:#1a2a3a,stroke:#58a6ff
    style RFROM fill:#1a3a1a,stroke:#3fb950

Subtasks

Files Changed

| File | Change | Lines |
|---|---|---|
| registry.py | RecoveredSession, recover_from_events(), recovery types | +394 |
| server.py | Startup recovery, lazy fallback | +76 |
| tools/__init__.py | Register worker_events tool | +2 |
| tools/list_workers.py | Source field in output | +19 |
| tools/worker_events.py | NEW MCP tool | +273 |
| test_recovered_session.py | NEW 16 tests | +313 |
| test_recover_from_events.py | NEW 18 tests | +705 |
| test_startup_recovery.py | NEW 12 tests | +413 |
| test_worker_events.py | NEW 10 tests | +539 |
| test_registry.py | Source field assertion | +40 |

