Core work survives channel outages
Crons, heartbeats, active turns, task queues, and local conversations continue while Discord, Telegram, or another transport reconnects.
OpenClaw should keep conversations, crons, heartbeats, tasks, and local control working when every channel is disconnected or the channel Gateway is restarting. The public openclaw gateway experience stays intact; its internal ownership model changes.
Yes, when the goal is expressed as product resilience and local autonomy rather than process purity. The right product promise is simple: channel connectivity is optional infrastructure, not the condition under which OpenClaw can think, schedule, or manage its own state.
Keep openclaw gateway, gateway.*, the current port, authentication model, Control UI URL, config shape, and existing plugins working. Internally, make a new Host the owner of core lifecycle and state. Move the channel edge behind explicit ingress and delivery contracts. Split processes only after the ownership boundary is proven in-process.
The current Gateway is both the channel edge and the composition root for almost every always-on system. That makes a transport restart, channel fault, or Gateway deployment wider than it needs to be. Detachment turns channel connectivity into a replaceable edge while preserving a stable, local OpenClaw runtime.
The TUI and Control UI can talk to a stable Host even when no channel is configured, authenticated, enabled, or healthy.
A malformed native event, channel SDK failure, or transport memory leak cannot take down the scheduler and agent runtime with it.
A change to a port, TLS, plugin, or channel setting restarts only the runtime that owns it instead of interrupting unrelated core work.
Transport credentials, channel SDKs, and untrusted native payloads can be isolated from agent state, schedules, and durable core data.
A detachable connector contract forces channel plugins to declare portable ingress and delivery behavior instead of relying on broad in-process helpers.
Host-only proof becomes possible. Channel conformance can be tested separately. Core no longer needs Gateway RPC as an internal service locator.
Independent channel scaling or a remote edge becomes possible later, without making that operational complexity part of the initial product.
The architecture has four internal roles. The compatibility supervisor preserves today's public launcher. The Host owns core execution and state. The Control Server preserves the current protocol for UIs and clients. Channel Gateways own only transport concerns.
The Host owns agents, conversations, sessions, crons, heartbeats, tasks, routing policy, durable delivery intent, and canonical state. It must run correctly with zero Channel Gateways.
| Owner | Owns | Must not own | Failure behavior |
|---|---|---|---|
| Host | Core lifecycle, agent turns, sessions, cron, heartbeat, tasks, policy, canonical state, durable delivery intent | Native channel SDKs, reconnect loops, channel credentials, transport-specific payloads | Continues with channels absent; graceful local control remains available |
| Control Server | Existing Gateway protocol, authentication, subscriptions, UI/TUI/CLI/node access | Core scheduling or channel transport lifecycle | Restartable without cancelling Host work |
| Channel Gateway | Connection, auth to channel, native parse/normalize, acknowledgements, send, typing, receipts, transport health | Product commands, provider policy, agent state, cron, heartbeat scheduling | Can restart independently; resumes from Host delivery intent |
| Compatibility supervisor | Current launcher, process supervision, lifecycle ordering, health aggregation, legacy plugin placement | Business logic or durable state | Preserves the current operator experience and hides internal topology |
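The failure-behavior column implies that delivery intent lives in the Host and outlasts any single connector. The following is a minimal sketch of that idea, not the OpenClaw implementation: `DeliveryIntentLog`, `drain`, and `Connector` are hypothetical names, and an in-memory map stands in for a durable store.

```typescript
// Hypothetical names throughout; this is an illustration, not the OpenClaw API.
type DeliveryId = string;

type DeliveryIntent = {
  id: DeliveryId;
  target: string; // assumed format, e.g. "telegram:main"
  text: string;
};

// Host-side record of what still has to be delivered.
// A real Host would persist this; a Map stands in for the store here.
class DeliveryIntentLog {
  private readonly pendingById = new Map<DeliveryId, DeliveryIntent>();
  private readonly completed = new Set<DeliveryId>();

  // Recording the same intent twice is a no-op, so producers can retry safely.
  record(intent: DeliveryIntent): void {
    if (this.completed.has(intent.id) || this.pendingById.has(intent.id)) return;
    this.pendingById.set(intent.id, intent);
  }

  // Snapshot of pending intents in insertion order.
  pending(): DeliveryIntent[] {
    return Array.from(this.pendingById.values());
  }

  markSent(id: DeliveryId): void {
    if (this.pendingById.delete(id)) this.completed.add(id);
  }
}

// `null` models "no connector is currently connected".
type Connector = { send(intent: DeliveryIntent): boolean } | null;

// Push pending intents through the connector if one is available.
// With no connector, nothing is lost: intent simply stays queued in the Host.
function drain(log: DeliveryIntentLog, connector: Connector): number {
  if (connector === null) return 0;
  let sent = 0;
  for (const intent of log.pending()) {
    if (connector.send(intent)) {
      log.markSent(intent.id);
      sent += 1;
    }
  }
  return sent;
}
```

In this sketch a connector restart is just a period during which `drain` receives `null`; the Host keeps recording intent and the next connected drain picks up where the last one stopped.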
The current system already validates, diffs, and hot-applies most configuration changes. The architectural problem is the fallback boundary: when a change is startup-bound or lacks a safe scoped reload contract, the planner restarts the one Gateway process that also owns agents, cron, heartbeat, channels, plugins, and control connections.
restartGateway: true.
| Changed path | Current hybrid behavior | Proposed hybrid behavior | Unaffected work |
|---|---|---|---|
| messages.*, routing.* | Swap the Gateway process runtime snapshot | Swap the Host snapshot | Control Server and connectors continue |
| cron.* | Stop and rebuild Gateway-owned cron in-process | Restart the Host scheduler subsystem | Active turns, Control Server, and connectors continue |
| channels.telegram.* | Restart Telegram inside the monolithic Gateway | Restart only the affected Telegram connector or account | Host, Control Server, and other connectors continue |
| gateway.port, bind, TLS, HTTP | Restart the complete Gateway process | Restart only the Control Server | Agents, cron, heartbeat, tasks, and connectors continue |
| Host-owned plugin/runtime setting | Restart the complete Gateway process | Restart the Host or affected Host subsystem only | Control Server and detachable connectors continue or reconnect |
| Unknown or ambiguous path | Fail-safe full Gateway restart | Fail closed; use combined-service restart during migration, require owner metadata before detached default | No silent partial application |
type GatewayReloadPlan = {
  changedPaths: string[];
  restartGateway: boolean;
  restartCron: boolean;
  restartHeartbeat: boolean;
  restartChannels: Set<ChannelId>;
  reloadPlugins: boolean;
};
// Any restart-required path ultimately restarts
// the process that owns every runtime.
type ConfigApplyPlan = {
  desiredRevision: ConfigRevision;
  actions: Array<{
    owner: RuntimeOwner;
    paths: string[];
    mode: "dynamic" | "hot" | "restart-subsystem" | "restart-process";
  }>;
};
// Every owner reports its active revision.
// Unaffected owners do not restart.
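To make the owner-scoped plan more concrete, here is a rough sketch of a planner that groups changed paths by owner and fails closed on anything unowned. The owner names, the prefix rules, and the mode assigned to each prefix are assumptions chosen for illustration, not shipped metadata, and `desiredRevision` from the `ConfigApplyPlan` shape is left out for brevity.

```typescript
// Hypothetical owners and rules; the real owner matrix is still to be defined.
type RuntimeOwner = "host" | "control-server" | "connector:telegram";
type ApplyMode = "dynamic" | "hot" | "restart-subsystem" | "restart-process";

type OwnerRule = { prefix: string; owner: RuntimeOwner; mode: ApplyMode };
type OwnerEntry = { paths: string[]; mode: ApplyMode };
type PlannedAction = { owner: RuntimeOwner; paths: string[]; mode: ApplyMode };

type PlanResult =
  | { kind: "plan"; actions: PlannedAction[] }
  | { kind: "unowned"; paths: string[] };

// Assumed mapping, loosely following the example rows in the table above.
const OWNER_RULES: OwnerRule[] = [
  { prefix: "messages.", owner: "host", mode: "hot" },
  { prefix: "routing.", owner: "host", mode: "hot" },
  { prefix: "cron.", owner: "host", mode: "restart-subsystem" },
  { prefix: "channels.telegram.", owner: "connector:telegram", mode: "restart-process" },
  { prefix: "gateway.port", owner: "control-server", mode: "restart-process" },
];

// Higher rank means a more disruptive apply mode.
const MODE_RANK: Record<ApplyMode, number> = {
  dynamic: 0,
  hot: 1,
  "restart-subsystem": 2,
  "restart-process": 3,
};

function planConfigApply(changedPaths: string[]): PlanResult {
  const unowned: string[] = [];
  const byOwner = new Map<RuntimeOwner, OwnerEntry>();

  for (const path of changedPaths) {
    const rule = OWNER_RULES.find((candidate) => path.startsWith(candidate.prefix));
    if (rule === undefined) {
      unowned.push(path);
      continue;
    }
    const entry: OwnerEntry = byOwner.get(rule.owner) ?? { paths: [], mode: rule.mode };
    entry.paths.push(path);
    // An owner with several changed paths applies the most disruptive mode once.
    if (MODE_RANK[rule.mode] > MODE_RANK[entry.mode]) entry.mode = rule.mode;
    byOwner.set(rule.owner, entry);
  }

  // Fail closed: one unowned path blocks the whole plan (no partial application).
  if (unowned.length > 0) return { kind: "unowned", paths: unowned };

  const actions = Array.from(byOwner, ([owner, entry]) => ({
    owner,
    paths: entry.paths,
    mode: entry.mode,
  }));
  return { kind: "plan", actions };
}
```

Owners that do not appear in `actions` are untouched, which is the property the proposed column of the table depends on. How an `unowned` result maps to the combined-service restart used during migration is a policy decision outside this sketch.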
gateway.reload.mode="hybrid" gains owner-scoped actions by default.
"hot", "off", and explicit "restart" retain their shipped operator semantics; explicit restart remains a full compatibility-supervised service restart.
restartPrefixes retain their current safe co-located restart meaning.
Users should not need to understand the new topology. Existing configuration, commands, URLs, ports, auth, state, and plugin packages continue to work. The compatibility layer belongs at the public edge and plugin boundary, never as duplicate core execution paths.
openclaw gateway remains the normal command.
openclaw.json stays valid.
Upgrades do not require openclaw doctor --fix.
These are illustrative shapes, not final API names. The important design constraints are serializability, explicit ownership, idempotency, and no function-valued runtime objects crossing the Host/connector boundary.
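One way to enforce the serializability constraint in tests is a small structural check that rejects anything that would not survive a JSON round trip. This is a sketch under assumptions: `findNonPortable` is a hypothetical helper, and it is deliberately stricter than `JSON.stringify` (explicit `undefined`, class instances such as `Date`, and cycles are all rejected).

```typescript
// Returns the path of the first value that cannot cross a process boundary
// as plain JSON, or null when the whole value is portable.
// Hypothetical helper for conformance tests; not part of any shipped SDK.
function findNonPortable(
  value: unknown,
  path: string = "$",
  ancestors: Set<object> = new Set<object>(),
): string | null {
  if (value === null || typeof value === "string" || typeof value === "boolean") return null;
  if (typeof value === "number") return Number.isFinite(value) ? null : path;
  if (typeof value !== "object") return path; // function, symbol, bigint, undefined
  if (ancestors.has(value)) return path; // cycles cannot be written as JSON

  const proto = Object.getPrototypeOf(value);
  const isPlain = Array.isArray(value) || proto === Object.prototype || proto === null;
  if (!isPlain) return path; // class instances such as Date or Map

  ancestors.add(value);
  for (const [key, child] of Object.entries(value)) {
    const childPath = Array.isArray(value) ? `${path}[${key}]` : `${path}.${key}`;
    const bad = findNonPortable(child, childPath, ancestors);
    if (bad !== null) {
      ancestors.delete(value);
      return bad;
    }
  }
  ancestors.delete(value);
  return null;
}
```

A conformance kit could run a check like this over every envelope a connector emits, so a callback that slips into a payload is reported with its path instead of being silently dropped during serialization.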
export interface OpenClawHost {
  start(): Promise<void>;
  stop(reason: HostStopReason): Promise<void>;
  health(): HostHealthSnapshot;
  services: HostServices;
}
export interface HostServices {
  conversations: ConversationService;
  schedules: ScheduleService;
  delivery: DeliveryIntentService;
}
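As an illustration of the lifecycle guarantees a Host shell would need (ordered start, reverse-order stop, repeated start rejection, rollback on partial startup failure), consider the sketch below. It is synchronous for brevity, whereas the `OpenClawHost` shape above is Promise-based, and `HostLifecycle` and `HostService` are hypothetical names.

```typescript
// Simplified, synchronous lifecycle shell. Names are illustrative only.
type HostState = "stopped" | "starting" | "running" | "stopping";

type HostService = {
  name: string;
  start(): void;
  stop(): void;
};

class HostLifecycle {
  private state: HostState = "stopped";
  private readonly services: HostService[];
  private readonly started: HostService[] = [];

  constructor(services: HostService[]) {
    this.services = services;
  }

  // Starts services in declaration order; a second start is rejected.
  start(): void {
    if (this.state !== "stopped") {
      throw new Error(`cannot start while ${this.state}`);
    }
    this.state = "starting";
    try {
      for (const service of this.services) {
        service.start();
        this.started.push(service);
      }
      this.state = "running";
    } catch (error) {
      // Partial startup failure: stop what already started, then surface the error.
      this.stopStarted();
      this.state = "stopped";
      throw error;
    }
  }

  // Stops services in reverse start order; stopping a stopped Host is a no-op.
  stop(): void {
    if (this.state !== "running") return;
    this.state = "stopping";
    this.stopStarted();
    this.state = "stopped";
  }

  health(): { state: HostState; running: string[] } {
    return { state: this.state, running: this.started.map((service) => service.name) };
  }

  private stopStarted(): void {
    for (;;) {
      const service = this.started.pop();
      if (service === undefined) break;
      service.stop();
    }
  }
}
```

Error handling inside `stop` is omitted here; a real shell would also need to decide what happens when one service fails to stop.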
type ChannelIngressEnvelope = {
  id: IngressId;
  account: ChannelAccountRef;
  receivedAt: IsoTimestamp;
  event: PortableChannelEvent;
};
type ChannelDeliveryCommand = {
  id: DeliveryId;
  target: PortableChannelTarget;
  presentation: PortablePresentation;
};
type ChannelDeliveryReceipt = {
  deliveryId: DeliveryId;
  status: "accepted" | "sent" | "failed";
  nativeMessageId?: string;
  retry?: RetryDirective;
};
// Host owns retry policy and durable intent.
// Connector owns native transport execution.
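The split in those two comments can be sketched as a pure function on the Host side: the connector reports what happened, and the Host decides what happens next. Everything specific here is an assumption made for illustration, including the `{ afterMs }` shape of the retry directive, the attempt limit, the default and maximum delays, and the `IntentRecord` type.

```typescript
// Simplified stand-ins for the receipt shapes above; the retry shape is assumed.
type ReceiptStatus = "accepted" | "sent" | "failed";
type Receipt = { deliveryId: string; status: ReceiptStatus; retry?: { afterMs: number } };

// Host-side view of one delivery intent (hypothetical).
type IntentRecord = {
  phase: "pending" | "accepted" | "sent" | "abandoned";
  attempts: number; // failed attempts so far
  notBeforeMs: number; // earliest time the next attempt may run
};

// Assumed policy values; the Host, not the connector, owns them.
const MAX_ATTEMPTS = 3;
const DEFAULT_RETRY_MS = 1000;
const MAX_RETRY_MS = 60000;

function applyReceipt(record: IntentRecord, receipt: Receipt, nowMs: number): IntentRecord {
  // Terminal phases absorb late or duplicate receipts, which keeps replays idempotent.
  if (record.phase === "sent" || record.phase === "abandoned") return record;

  switch (receipt.status) {
    case "sent":
      return { ...record, phase: "sent" };
    case "accepted":
      return { ...record, phase: "accepted" };
    case "failed": {
      const attempts = record.attempts + 1;
      if (attempts >= MAX_ATTEMPTS) return { ...record, phase: "abandoned", attempts };
      // The connector's directive is a hint; the Host clamps it to its own policy.
      const hintMs = receipt.retry?.afterMs ?? DEFAULT_RETRY_MS;
      return { phase: "pending", attempts, notBeforeMs: nowMs + Math.min(hintMs, MAX_RETRY_MS) };
    }
  }
}
```

Because the function is pure, the same receipt stream can be replayed after a Host restart and produce the same record, provided the record itself is stored durably.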
{
  "openclaw": {
    "channel": {
      "connectorApi": "v2",
      "execution": "detachable"
    },
    "compat": {
      "pluginApi": ">=current"
    }
  }
}
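Given manifest metadata like the fragment above, a loader could decide placement with a small classifier that defaults to co-located execution whenever a plugin has not explicitly opted in. This is a sketch: `classifyChannelPlugin` is a hypothetical function, and the only field values it recognizes are the ones shown in the fragment.

```typescript
type Placement = "detached" | "co-located";
type Classification = { placement: Placement; reason: string };

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Hypothetical classifier: anything that does not explicitly declare a
// detachable v2 connector is treated as a legacy plugin and stays co-located.
function classifyChannelPlugin(manifest: unknown): Classification {
  const openclaw = isRecord(manifest) ? manifest["openclaw"] : undefined;
  const channel = isRecord(openclaw) ? openclaw["channel"] : undefined;

  if (!isRecord(channel)) {
    return { placement: "co-located", reason: "no connector metadata; treated as legacy v1" };
  }
  if (channel["connectorApi"] !== "v2") {
    return { placement: "co-located", reason: "connectorApi is not v2" };
  }
  if (channel["execution"] !== "detachable") {
    return { placement: "co-located", reason: "v2 connector did not opt in to detachable execution" };
  }
  return { placement: "detached", reason: "declares connectorApi v2 with detachable execution" };
}
```

Returning a reason alongside the placement gives diagnostics something to print without requiring any action from the plugin author.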
A one-shot rewrite would combine lifecycle, protocol, plugin SDK, state ownership, channel migration, and upgrade risk in one release. The safer route is a sequence of vertical slices where each phase leaves one canonical path and produces independently measurable product value.
Document the target ownership model, public compatibility contract, lifecycle ordering, and failure semantics before moving runtime code.
Every current Gateway-owned subsystem and restart-required config path has one named future owner. Existing public surfaces, reload modes, and plugin lanes have explicit compatibility treatment.
Status: approve now
Create the Host as the only owner of core service startup and shutdown. Move the most separable services first: cron, then heartbeat after its delivery dependency is made explicit.
OpenClawHost as a narrow composition root, initially created by the current Gateway launcher.
Cron and heartbeat continue through a simulated channel-manager restart. Config planning names affected owners without changing shipped reload behavior. No migrated service has two lifecycle owners.
Status: approve now
Replace internal Gateway RPC and the embedded Gateway stub with typed Host services for conversations, sessions, schedules, tasks, and delivery intent.
TUI conversations, session history, cron control, and agent tools work with the Host running and no Control Server or channels.
Status: approve now
Split the mixed Gateway request context into transport/auth concerns and Host service calls. Keep existing methods, subscriptions, authentication, and URLs stable.
gateway.*, current WebSocket events, Control UI behavior, CLI access, nodes, and hooks.
The current UI, TUI remote mode, CLI, and nodes pass unchanged protocol tests. A Control Server-owned config change causes no active-turn or scheduler interruption.
Status: approve now
Add a serializable connector contract while preserving all v1 channel plugins through automatic co-located execution. Migrate bundled channels one at a time.
A migrated connector can restart during live traffic with zero lost or duplicate deliveries, while an unchanged external v1 fixture still works.
Status: proof gated
After logical ownership and connector contracts are stable, let the compatibility supervisor run the Host, Control Server, and detachable connectors as separate local processes.
Fault injection and restart-required config edits prove connector and Control Server restarts do not interrupt Host work. Startup cost and resource use remain inside agreed budgets.
Status: proof gated
The sequence below keeps each change reviewable and reversible at the release level. Bundled callers migrate in the same change that introduces a modern API; compatibility remains narrowly scoped to shipped public contracts and external plugins.
| ID | Work package | Primary surfaces | Required proof |
|---|---|---|---|
| P0 | Architecture ADR, public invariants, owner matrix, lifecycle state model | docs/, architecture checks | Owner review; no runtime change |
| P1 | Dependency guardrails: block new core use of Gateway request context and channel-native runtime | src/gateway/, src/agents/, lint/architecture tests | Current main remains green; intentional exceptions enumerated |
| P2 | Create in-process Host lifecycle shell and health snapshot | src/host/, current Gateway launcher | Start/stop ordering, repeated start rejection, clean shutdown |
| P2A | Extract Config Coordinator and owner-scoped apply-plan metadata while retaining current combined execution | src/gateway/config-reload*, runtime snapshot, Host lifecycle | Current reload behavior unchanged; every restart-required path resolves to an owner or explicit compatibility fallback |
| P3 | Move cron lifecycle and canonical state access to Host | src/cron/, src/gateway/server-cron.ts | Host-only cron execution; current Gateway cron RPC parity |
| P4 | Add Host delivery-intent port and move heartbeat lifecycle | src/infra/heartbeat-*, delivery boundary | Heartbeat schedules without channels; delivery queues until connector returns |
| P5 | Introduce conversation, session, schedule, and task Host services | src/agents/, session/state owners | Typed service contract tests and deterministic event ordering |
| P6 | Move embedded TUI to Host services | src/tui/embedded-backend.ts, src/tui/tui-backend.ts | Local TUI turn and history with Gateway disabled |
| P7 | Migrate agent tools away from internal Gateway RPC; delete embedded Gateway stub | src/agents/openclaw-tools.ts, src/agents/tools/ | Agent tool parity; no internal Gateway call path remains |
| P8 | Split request context and extract Control Server facade with owned config lifecycle | src/gateway/server-request-context.ts, methods, protocol, server config | Existing protocol/auth behavior unchanged; Control Server config restart does not interrupt Host work |
| P9 | Route Control UI, CLI, remote TUI, nodes, and hooks through Control Server | ui/, gateway clients, node APIs | Current user workflows pass without config edits |
| P10 | Define connector v2 protocol and conformance kit | src/plugin-sdk/, channel contracts, docs | Serializable schema round trips; version/capability negotiation |
| P11 | Implement legacy v1 co-location classifier and diagnostics | plugin loader, compat registry, doctor diagnostics | Unchanged external fixture loads; no user action required |
| P12 | Migrate first bundled connector as a complete vertical slice | one bundled channel, Host ingress/delivery services | Crash/restart, dedupe, order, receipts, multi-account proof |
| P13 | Migrate remaining bundled connectors incrementally | bundled channel plugins | Per-channel conformance plus real channel proof where feasible |
| P14 | Add supervisor, private IPC, leases, restart budgets, config revision convergence, and health aggregation | launcher, Host/Control/connector process entrypoints | Fault injection; SQLite single-writer; bounded restart loops; desired/active revision visibility |
| P15 | Add zero-touch published-upgrade lane and cohort rollout controls | upgrade tests, release checks, telemetry/diagnostics | Last stable to new release with no doctor and no config edit |
Green unit tests are not enough for this change. The acceptance suite must deliberately remove, restart, and corrupt channel-side infrastructure while proving that core work continues and existing installations upgrade without intervention.
| Scenario | Expected result | Phase gate |
|---|---|---|
| Host only, no channels configured | Local conversation, sessions, cron, heartbeat scheduling, and tasks work | P1-P2 |
| All channels disabled | Host remains healthy; delivery intent is explicit rather than silently lost | P1-P2 |
| Connector crashes during active turn | Turn completes; delivery resumes idempotently after connector restart | P4-P5 |
| Control Server restarts | Active Host work and channel connections continue; clients reconnect | P3 |
| Control Server-owned config change | Changing port, bind, TLS, HTTP, or Control UI settings restarts only Control Server; Host work continues | P3-P5 |
| Connector-owned config change | Changing one channel account or credential restarts only the affected connector/account | P4-P5 |
| Mixed-owner config change | Complete config validates once; every owner reports desired/active revision; failure is visible with no silent partial activation | P5 |
| Existing explicit restart mode | gateway.reload.mode="restart" retains full compatibility-supervised service restart semantics | release |
| Legacy external v1 channel plugin | Loads unchanged in automatic co-located mode with clear diagnostics | P4-P5 |
| Connector v2 plugin | Runs detached using only declared serializable contracts | P4-P5 |
| Last stable published upgrade | Starts successfully with existing config/state/plugins, no doctor, no edits | release |
| Multi-account concurrent traffic | Routing, ordering, dedupe, and delivery receipts remain correct | P4-P5 |
| Process fault injection | Leases prevent dual writers; restart budgets prevent crash loops | P5 |
| Config reload and shutdown | Owner-scoped lifecycle ordering is deterministic; no orphaned work, duplicate execution, or hidden revision drift | P5 |
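The process fault injection row relies on restart budgets to stop crash loops. A minimal sliding-window sketch of such a budget follows; the class name, the limits, and the choice of a sliding window are assumptions rather than a decided design.

```typescript
// Allows at most `maxRestarts` restarts within any `windowMs` interval.
// Hypothetical supervisor helper; time is passed in so tests stay deterministic.
class RestartBudget {
  private readonly restarts: number[] = [];
  private readonly maxRestarts: number;
  private readonly windowMs: number;

  constructor(maxRestarts: number, windowMs: number) {
    this.maxRestarts = maxRestarts;
    this.windowMs = windowMs;
  }

  // Returns true and records the restart if the budget allows it at `nowMs`.
  tryRestart(nowMs: number): boolean {
    const cutoff = nowMs - this.windowMs;
    // Forget restarts that have aged out of the window.
    for (;;) {
      const oldest = this.restarts[0];
      if (oldest === undefined || oldest > cutoff) break;
      this.restarts.shift();
    }
    if (this.restarts.length >= this.maxRestarts) return false;
    this.restarts.push(nowMs);
    return true;
  }
}
```

When `tryRestart` returns false, a supervisor would stop restarting that runtime and report the failure through health aggregation instead of looping.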
Physical separation is straightforward only after delivery, lifecycle, state, and plugin contracts are unambiguous. The roadmap treats every ambiguity as a rollout blocker because those are the places that create duplicate work, lost messages, or surprise plugin failures.
| Risk | Severity | Mitigation | Verification |
|---|---|---|---|
| Dual execution or duplicate delivery | critical | Host is the single owner of intent, dedupe keys, retry policy, and completion state; leases prevent dual active owners. | Kill/restart fault injection under concurrent traffic; assert exactly-once product outcomes where supported. |
| External plugin incompatibility | critical | Automatic legacy co-location, additive connector v2, named compatibility record, fixtures from real external plugin shapes. | Published-plugin compatibility matrix and unchanged v1 fixture in every release lane. |
| Lifecycle globals and hidden coupling | high | Move services behind Host lifecycle; delete module-global ownership as each subsystem migrates; make state transitions closed and observable. | Repeated start/stop, partial startup failure, config reload, and shutdown ordering tests. |
| SQLite multi-writer corruption or lock contention | critical | Host remains the single writer for core state; connectors exchange envelopes and never open Host-owned stores. | Process concurrency test, lease loss test, and database integrity checks. |
| Protocol complexity and version drift | high | Version private connector protocol additively, keep schemas small, carry prepared facts, and use conformance tests. | Round-trip schema tests, old/new connector negotiation fixtures, deterministic payload snapshots. |
| Config revision split-brain | critical | Validate once, persist one desired revision, require explicit path ownership, expose active revision per owner, and fail closed on ambiguous or failed convergence. | Mixed-owner config fault injection, owner restart during apply, rollback/retry tests, and health assertions for desired versus active revision. |
| Authentication or trust boundary regression | high | Control Server preserves existing auth; internal IPC is private, authenticated, and never user-configurable in initial rollout. | Auth parity tests, local privilege boundary review, hostile envelope validation. |
| Operator complexity | high | One command, one health story, one log correlation ID, and supervisor-owned diagnostics; topology stays hidden by default. | Fresh install and upgrade smoke from an operator perspective. |
| Resource and startup regression | medium | Delay process split until measured; set explicit budgets; group low-volume connectors if isolation value does not justify cost. | Before/after startup, RSS, CPU, and file descriptor benchmarks. |
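Several mitigations above depend on leases to keep a single active owner for core state. The sketch below shows the general technique of an expiring lease with a fencing token, so that a stale holder can be rejected even if it has not yet noticed that it lost the lease. `LeaseTable` is a hypothetical in-memory stand-in; a real implementation would keep the lease in shared durable storage.

```typescript
type Lease = { holder: string; token: number; expiresAtMs: number };

// In-memory illustration of lease acquisition with fencing tokens.
class LeaseTable {
  private current: Lease | null = null;
  private nextToken = 1;

  // Returns the lease on success, or null while another holder's lease is live.
  acquire(holder: string, nowMs: number, ttlMs: number): Lease | null {
    const current = this.current;
    if (current !== null && current.expiresAtMs > nowMs) {
      if (current.holder !== holder) return null;
      // Renewal by the same holder keeps the same token.
      const renewed: Lease = { ...current, expiresAtMs: nowMs + ttlMs };
      this.current = renewed;
      return renewed;
    }
    // Expired or never held: issue a strictly larger fencing token.
    const granted: Lease = { holder, token: this.nextToken, expiresAtMs: nowMs + ttlMs };
    this.nextToken += 1;
    this.current = granted;
    return granted;
  }

  // A write is accepted only when it carries the token of the live lease.
  canWrite(token: number, nowMs: number): boolean {
    const current = this.current;
    return current !== null && current.token === token && current.expiresAtMs > nowMs;
  }
}
```

The token check is what prevents dual writers: after a takeover, the previous holder's token no longer matches, so its writes are refused regardless of what it believes about its own lease.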
The proposal intentionally leaves open the implementation choices that do not affect the first product outcome. Approve the ownership model and compatibility contract now; choose private transport and advanced deployment options only when the relevant phase is ready.
openclaw gateway remains the compatibility supervisor and public operator surface.
Do not decide removal in this roadmap. Treat the shipped plugin SDK as a public contract. First ship connector v2, publish migration guidance, measure adoption, and collect concrete maintenance or security reasons. Any removal would require a separate owner-approved deprecation window and major-release decision.
Choose a channel with a complete test harness, low operational blast radius, and representative ingress/delivery behavior. The first migration should validate the contract, not prove that the hardest channel can be rewritten. Follow with a high-volume channel only after conformance and fault proof are stable.
Defer the choice until phase 5. The logical API must not be designed around a transport. Evaluate local Unix domain sockets or an equivalent cross-platform private transport against authentication, lifecycle, backpressure, observability, and Windows support. It must remain an internal implementation detail.
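One way to keep the logical API independent of the transport is to code Host and connector against a narrow message port and supply an in-memory implementation first. The sketch below assumes hypothetical names (`ConnectorPort`, `createLoopbackPair`, `sendEnvelope`) and string-framed JSON messages; a socket-backed implementation could later satisfy the same interface.

```typescript
type Handler = (message: string) => void;

// Transport-neutral port: both sides only see serialized messages.
type ConnectorPort = {
  send(message: string): void;
  onMessage(handler: Handler): void;
};

// In-memory pair for in-process use and tests; delivery is synchronous.
function createLoopbackPair(): [ConnectorPort, ConnectorPort] {
  const handlers: [Handler[], Handler[]] = [[], []];
  const makeSide = (side: 0 | 1): ConnectorPort => {
    const peer = side === 0 ? 1 : 0;
    return {
      send(message) {
        for (const handler of handlers[peer]) handler(message);
      },
      onMessage(handler) {
        handlers[side].push(handler);
      },
    };
  };
  return [makeSide(0), makeSide(1)];
}

// Callers work with typed envelopes; only this helper touches the wire format.
function sendEnvelope(port: ConnectorPort, envelope: { id: string; kind: string }): void {
  port.send(JSON.stringify(envelope));
}
```

Concerns such as authentication, backpressure, and reconnects are intentionally absent here; they are exactly the criteria the transport evaluation would need to cover.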
What do the gateway_start and gateway_stop hooks mean after the split?
Preserve their shipped behavior as lifecycle hooks for the overall OpenClaw service under the compatibility supervisor. Add narrower modern lifecycle contracts for Host services and detachable connectors rather than silently changing the meaning of existing hooks.
Only after local detachment demonstrates measurable reliability value and there is concrete operator demand. Remote connectivity introduces a public trust, networking, deployment, and support contract that should not be bundled into the core resilience refactor.
The recommendation is based on current upstream main at 8b0eac7927d5e7695d058d3503edcd3f8e278b67. The current code already contains separable service contracts and a local TUI path, but the Gateway still composes and shuts down the core runtime, and current channel plugin types are too broad and function-valued to cross a process boundary.
| Finding | Current source evidence | Roadmap implication |
|---|---|---|
| Gateway is the monolithic composition root | src/gateway/server.impl.ts:548, src/gateway/server.impl.ts:953, src/gateway/server.impl.ts:1033, src/gateway/server.impl.ts:1567 | Create Host lifecycle and move service ownership before process work. |
| Current config reload already plans by path but has a process-wide restart fallback | src/gateway/config-reload.ts:117, src/gateway/config-reload-plan.ts:56, src/gateway/config-reload-plan.ts:324, src/gateway/server-reload-handlers.ts:543 | Preserve validation and path planning; replace the coarse restart result with owner-scoped actions. |
| A full Gateway close stops core and edge systems together | src/gateway/server-close.ts:797, src/gateway/server-close.ts:802, src/gateway/server-close.ts:832 | Config restart blast radius is direct evidence for separating runtime owners. |
| Cron already has a separable service contract | src/cron/service-contract.ts:22, src/cron/service.ts:14, src/gateway/server-cron.ts:130 | Use cron as the first Host-owned vertical slice. |
| Heartbeat has a lifecycle but reaches channel plugins | src/infra/heartbeat-runner.ts:150, src/infra/heartbeat-runner.ts:243, src/infra/heartbeat-runner.ts:2127 | Add a Host-owned delivery port before moving lifecycle. |
| Local TUI proves conversations can work without Gateway | src/tui/tui-backend.ts:130, src/tui/embedded-backend.ts:358, src/tui/embedded-backend.ts:1041 | Unify local conversation flow on Host services and delete duplicated orchestration. |
| Agent internals use Gateway as a capability API | src/agents/openclaw-tools.ts:385, src/agents/tools/embedded-gateway-stub.ts:211 | Replace internal Gateway RPC with injected Host capabilities. |
| Control UI is fully Gateway-bound | ui/src/ui/gateway.ts:473, ui/src/ui/app-gateway.ts:767, ui/src/ui/controllers/cron.ts:194 | Preserve the current protocol through a Control Server facade. |
| Gateway request context mixes domains | src/gateway/server-methods.ts:260, src/gateway/server-request-context.ts:15, src/gateway/server-methods/chat.ts | Split transport/auth context from Host service dependencies. |
| Current channel plugin runtime is broad and in-process | src/gateway/server-channels.ts:172, src/channels/plugins/types.adapters.ts:244, src/plugins/types.ts:2610 | Keep v1 co-located; add a narrow serializable connector v2. |
| Current turn types contain functions and callbacks | src/channels/turn/types.ts:258, src/channels/turn/types.ts:303, src/channels/turn/types.ts:371 | Do not attempt to send current assembled turns over IPC; define portable envelopes. |
| Repo already has a named plugin compatibility policy | docs/plugins/sdk-migration.md:204, docs/plugins/compatibility.md:10, src/plugins/compat/registry.ts | Use an explicit compatibility record and additive API before any deprecation. |
| Existing upgrade proof invokes doctor | docs/reference/test.md:58 | Add a new last-stable zero-touch upgrade lane that does not run doctor. |
| Product docs define Gateway as always-on control plane | docs/gateway/index.md:73, docs/web/tui.md:37, docs/web/control-ui.md:10 | Keep the public name while changing internal ownership; update docs only as behavior ships. |