# ⚔️ PR #17: Unified Worker State

Event log recovery + `worker_events` MCP tool — closing epic cic-bbd.
## The Problem
Two disconnected views of worker state, neither complete:
```mermaid
graph LR
    subgraph Before["Before This PR"]
        direction TB
        LW["list_workers<br/>(MCP tool)"]
        EL["events.jsonl<br/>(on disk)"]
        LW -.-|"In-memory only<br/>Wiped on restart"| R1["Empty after restart"]
        EL -.-|"No MCP access<br/>Shell scripts only"| R2["Rich history, no API"]
    end
    style Before fill:#1a0000,stroke:#f85149
```
After a restart, `list_workers` returned an empty list even when workers were still alive in tmux, and monitoring agents had to shell out and parse `events.jsonl` by hand.
## The Solution
```mermaid
graph TB
    subgraph After["After This PR"]
        direction TB
        BOOT["MCP Server Starts"] --> SNAP["Read latest snapshot"]
        SNAP --> EVENTS["Read events since snapshot"]
        EVENTS --> RECOVER["recover_from_events()"]
        RECOVER --> REG["SessionRegistry<br/>(populated)"]
        REG --> LW2["list_workers<br/>shows ALL workers"]
        REG --> WE["worker_events<br/>(NEW MCP tool)"]
        LIVE["Live tmux sessions"] --> REG
    end
    style After fill:#001a00,stroke:#3fb950
```
## Part 1: Event Log Recovery
On startup, the registry reconstructs known worker state from the event log, making `list_workers` the single source of truth.
### How Recovery Works
```mermaid
sequenceDiagram
    participant S as MCP Server
    participant R as SessionRegistry
    participant E as events.jsonl
    participant T as tmux
    Note over S: Server starts (lifespan)
    S->>E: get_latest_snapshot()
    E-->>S: snapshot (worker states at time T)
    S->>E: read_events_since(T)
    E-->>S: events after snapshot
    S->>R: recover_from_events(snapshot, events)
    Note over R: For each worker in snapshot
    R->>R: Check if already in live registry
    alt Already live
        R->>R: Skip (don't overwrite)
    else Not live
        R->>R: Create RecoveredSession
        R->>R: Apply subsequent events
    end
    R-->>S: RecoveryReport(added, skipped, closed)
    Note over S: Later: list_workers call
    S->>R: list_all()
    R->>T: Check live sessions
    R-->>S: ManagedSession[] + RecoveredSession[]
```
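The recovery loop above can be sketched in miniature. This is a simplified illustration only — the real `SessionRegistry` uses typed `RecoveredSession` objects and richer event handling; the dict-based state and event names here are assumptions for brevity:

```python
from dataclasses import dataclass, field

@dataclass
class MiniRegistry:
    """Toy stand-in for SessionRegistry, illustrating the recovery rules."""
    live: dict = field(default_factory=dict)       # worker_id -> live session
    recovered: dict = field(default_factory=dict)  # worker_id -> recovered state

    def recover_from_events(self, snapshot: dict, events: list) -> dict:
        added = skipped = closed = 0
        # Seed from the snapshot, but never overwrite a live session.
        for worker_id, state in snapshot.items():
            if worker_id in self.live:
                skipped += 1
                continue
            self.recovered[worker_id] = dict(state)
            added += 1
        # Replay events newer than the snapshot onto recovered workers.
        for ev in events:
            rec = self.recovered.get(ev["worker_id"])
            if rec is None:
                continue
            if ev["type"] == "worker_closed":
                rec["event_state"] = "closed"
                closed += 1
            elif ev["type"] == "worker_idle":
                rec["event_state"] = "idle"
            elif ev["type"] == "worker_active":
                rec["event_state"] = "active"
        return {"added": added, "skipped": skipped, "closed": closed}
```

The "skip if already live" check is what makes recovery safe to run at any time: a live `ManagedSession` always wins over stale event-log state.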
### Key Types
#### NEW: `RecoveredSession`

A lightweight, frozen dataclass representing a session restored from the event log. It carries full metadata about the worker's last known state but holds no terminal handle — you can see it in `list_workers`, but you can't send messages to it or close it.
> **`RecoveredSession` vs. adoption:** Recovery gives you visibility — "this worker existed and was idle." Adoption (`discover_workers` → `adopt_worker`) gives you control — it reconnects to the live tmux pane and creates a full `ManagedSession`. Think of recovery as the map, adoption as taking the wheel.

> **Future: auto-adopt on startup?** Recovery works even when the tmux session is gone (crashed, rebooted, exited cleanly); adoption requires a live tmux pane. A natural follow-up: after recovery, run discovery, match recovered workers to live tmux sessions, and auto-adopt — restoring full control automatically on restart for any workers still alive.
```python
@dataclass(frozen=True)
class RecoveredSession:
    session_id: str            # "a3f2b1c9"
    name: str                  # "Groucho"
    project_path: str          # "/Users/.../claude-team"
    terminal_id: TerminalId    # May be stale after restart
    agent_type: AgentType      # "claude" | "codex"
    status: SessionStatus      # Mapped from event_state
    event_state: EventState    # "idle" | "active" | "closed"
    source: str = "event_log"  # Provenance tracking
    # ... timestamps, optional fields
```

*src/claude_team_mcp/registry.py*
#### NEW: `RecoveryReport`

Returned by `recover_from_events()` with counts of what happened:
```python
@dataclass(frozen=True)
class RecoveryReport:
    added: int    # Sessions added from event log
    skipped: int  # Already in live registry
    closed: int   # Marked as closed from events
    timestamp: datetime
```
#### CHANGED: `ManagedSession.to_dict()`

Now includes `"source": "registry"` so clients can distinguish live from recovered sessions:
```python
# Live session
{"name": "Ratchet", "source": "registry", "status": "ready", ...}

# Recovered session
{"name": "Hokusai", "source": "event_log", "event_state": "idle", ...}
```
### Startup Flow

Recovery happens in two places for reliability:

- **Eager recovery** in `app_lifespan` — runs once at boot and logs the results
- **Lazy fallback** in `list_workers` — if the registry is empty and recovery hasn't been attempted yet, it is triggered before listing
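The eager/lazy pattern can be sketched like this. Names such as `RecoveryGate`, `run_recovery`, and `registry.is_empty()` are illustrative, not the actual `server.py` API:

```python
class RecoveryGate:
    """Toy illustration of eager startup recovery plus a lazy fallback."""

    def __init__(self, registry, run_recovery):
        self.registry = registry
        self.run_recovery = run_recovery
        self.recovery_attempted = False

    def on_startup(self):
        # Eager path: called once from the lifespan hook at boot.
        self.run_recovery()
        self.recovery_attempted = True

    def before_list_workers(self):
        # Lazy fallback: only fires if eager recovery never ran
        # and the registry has nothing to show.
        if not self.recovery_attempted and self.registry.is_empty():
            self.run_recovery()
            self.recovery_attempted = True
```

The `recovery_attempted` flag guarantees recovery runs at most once per process, whichever path reaches it first.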
## Part 2: `worker_events` MCP Tool
A new MCP tool for querying the event log programmatically, replacing shell-script parsing with a clean API.
### Tool Signature
```python
worker_events(
    since: str = "",                    # ISO timestamp filter
    limit: int = 100,                   # Max events returned
    include_snapshot: bool = False,     # Include latest snapshot
    include_summary: bool = False,      # Include computed summary
    stale_threshold_minutes: int = 10,  # For stuck detection
    project_filter: str = "",           # Filter by project path
) -> dict
```
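As an illustration of how the filter parameters might compose — a sketch only; whether `since` is exclusive and whether `limit` keeps the newest events are assumptions here, not confirmed tool behavior:

```python
def filter_events(events: list[dict], since: str = "", limit: int = 100,
                  project_filter: str = "") -> list[dict]:
    """Illustrative composition of worker_events' filters.

    Relies on ISO-8601 timestamps with a uniform "Z" suffix, so
    lexicographic comparison matches chronological order.
    """
    out = []
    for ev in events:
        if since and ev["ts"] <= since:          # assumed exclusive
            continue
        if project_filter and ev.get("project") != project_filter:
            continue
        out.append(ev)
    return out[-limit:]                          # assumed: keep most recent
```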
### Response Shape
```python
{
    "events": [
        {"ts": "2026-01-31T14:53:12Z", "type": "worker_started",
         "worker_id": "aa7f7606", "data": {...}},
        {"ts": "2026-01-31T15:31:26Z", "type": "worker_idle",
         "worker_id": "aa7f7606", "data": {...}},
    ],
    "count": 2,

    # When include_summary=True:
    "summary": {
        "started": ["Ratchet"],
        "closed": [],
        "idle": ["Ratchet"],
        "active": ["Clank"],
        "stuck": [],  # idle > stale_threshold
        "last_event_ts": "2026-01-31T15:31:26Z"
    },

    # When include_snapshot=True:
    "snapshot": {"ts": "...", "data": {...}}
}
```
### Stuck Worker Detection
The summary flags "stuck" workers — those idle longer than `stale_threshold_minutes` (default: 10 minutes) — so monitoring agents can detect stale workers quickly and act.
> **Edge case — thinking agents:** Activity is based on event log entries. An agent that is actively thinking but hasn't produced visible output yet may appear stale. Increase `stale_threshold_minutes` for workflows with long thinking phases (e.g., Codex reading a large codebase).
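The stuck rule itself reduces to a timestamp comparison; a sketch, where the input shape (worker name mapped to last-idle timestamp) is illustrative rather than the tool's internal types:

```python
from datetime import datetime, timedelta, timezone

def find_stuck(last_idle: dict[str, datetime],
               now: datetime,
               stale_threshold_minutes: int = 10) -> list[str]:
    """Return workers whose last idle event is older than the threshold."""
    cutoff = now - timedelta(minutes=stale_threshold_minutes)
    return sorted(name for name, ts in last_idle.items() if ts < cutoff)
```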
## Architecture Overview
```mermaid
graph TB
    subgraph Tools["MCP Tools"]
        LW["list_workers"]
        WE["worker_events"]
        SW["spawn_workers"]
    end
    subgraph Registry["SessionRegistry"]
        LIVE["Live Sessions<br/>(ManagedSession)"]
        REC["Recovered Sessions<br/>(RecoveredSession)"]
        RFROM["recover_from_events()"]
    end
    subgraph Storage["Persistent Storage"]
        ELOG["events.jsonl"]
        SNAP["snapshots"]
    end
    subgraph Terminal["Terminal Backend"]
        TMUX["tmux panes"]
    end
    SW -->|creates| LIVE
    LIVE -->|emits events| ELOG
    ELOG -->|periodic| SNAP
    LW -->|queries| LIVE
    LW -->|queries| REC
    WE -->|reads| ELOG
    WE -->|reads| SNAP
    SNAP -->|on startup| RFROM
    ELOG -->|on startup| RFROM
    RFROM -->|populates| REC
    LIVE ---|terminal handle| TMUX
    REC -.-|no terminal handle| TMUX
    style REC fill:#1a3a1a,stroke:#3fb950
    style WE fill:#1a2a3a,stroke:#58a6ff
    style RFROM fill:#1a3a1a,stroke:#3fb950
```
## Subtasks
- ✅ **cic-bbd.1** — `RecoveredSession` dataclass (`registry.py`)
- ✅ **cic-bbd.2** — `recover_from_events()` method (`registry.py`)
- ✅ **cic-bbd.3** — Startup recovery: seed registry on boot (`server.py`)
- ✅ **cic-bbd.4** — `worker_events` MCP tool + tests (`tools/worker_events.py`)
- ✅ **cic-bbd.5** — `list_workers` source fields (`tools/list_workers.py`)
- ✅ **cic-bbd.6** — Comprehensive tests (46 tests) (`tests/`)
## Files Changed
| File | Change | Lines |
| --- | --- | --- |
| `registry.py` | `RecoveredSession`, `recover_from_events()`, recovery types | +394 |
| `server.py` | Startup recovery, lazy fallback | +76 |
| `tools/__init__.py` | Register `worker_events` tool | +2 |
| `tools/list_workers.py` | Source field in output | +19 |
| `tools/worker_events.py` | NEW MCP tool | +273 |
| `test_recovered_session.py` | NEW, 16 tests | +313 |
| `test_recover_from_events.py` | NEW, 18 tests | +705 |
| `test_startup_recovery.py` | NEW, 12 tests | +413 |
| `test_worker_events.py` | NEW, 10 tests | +539 |
| `test_registry.py` | Source field assertion | +40 |
View PR #17 on GitHub →