docs: add smart model routing plan

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-06 20:18:23 -05:00
parent 9a8a169e95
commit 1a3aee9ee1
1 changed files with 362 additions and 0 deletions
@@ -0,0 +1,362 @@
+# 2026-04-06 Smart Model Routing
+
+Status: Proposed
+Date: 2026-04-06
+Audience: Product and engineering
+Related:
+- `doc/SPEC-implementation.md`
+- `doc/PRODUCT.md`
+- `doc/plans/2026-03-14-adapter-skill-sync-rollout.md`
+
+## 1. Purpose
+
+This document defines a V1 plan for "smart model routing" in Paperclip.
+
+The goal is not to build a generic cross-provider router in the server. The goal is:
+
+- let supported adapters use a cheaper model for lightweight heartbeat orchestration work
+- keep the main task execution on the adapter's normal primary model
+- preserve Paperclip's existing task, session, and audit invariants
+- report cost and model usage truthfully when more than one model participates in a single heartbeat
+
+The motivating use case is a local coding adapter where a cheap model can handle the first fast pass:
+
+- read the wake context
+- orient to the task and workspace
+- leave an immediate progress comment when appropriate
+- perform bounded lightweight triage
+
+Then the primary model does the substantive work.
+
+## 2. Hermes Findings
+
+Hermes does have a real "smart model routing" feature, but it is narrower than the name suggests.
+
+Observed behavior:
+
+- `agent/smart_model_routing.py` implements a conservative classifier for "simple" turns
+- the cheap path only triggers for short, single-line, non-code, non-URL, non-tool-heavy messages
+- complexity is detected with hardcoded thresholds plus a keyword denylist like `debug`, `implement`, `test`, `plan`, `tool`, `docker`, and similar terms
+- if the cheap route cannot be resolved, Hermes silently falls back to the primary model
+
+Important architectural detail:
+
+- Hermes applies this routing before constructing the agent for that turn
+- the route is resolved in `cron/scheduler.py` and passed into agent creation as the active provider/model/runtime
+
+More useful than the routing heuristic itself is Hermes' broader model-slot design:
+
+- main conversational model
+- fallback model for failover
+- auxiliary model slots for side tasks like compression and classification
+
+That separation is a better fit for Paperclip than copying Hermes' exact keyword heuristic.
+
+## 3. Current Paperclip State
+
+Paperclip already has the right execution shape for adapter-specific routing, but it currently assumes one model per heartbeat run.
+
+Current implementation facts:
+
+- `server/src/services/heartbeat.ts` builds rich run context, including `paperclipWake`, workspace metadata, and session handoff context
+- each adapter receives a single resolved `config` object and executes once
+- built-in local adapters read one `config.model` and pass it directly to the underlying CLI
+- UI config today exposes one main `model` field plus adapter-specific thinking-effort controls
+- cost accounting currently records one provider/model tuple per run via `AdapterExecutionResult`
+
+What this means:
+
+- there is no shared routing layer in the server today
+- model choice already lives at the adapter boundary, which is good
+- multi-model execution in a single heartbeat needs explicit contract work or cost reporting will become misleading
+
+## 4. Product Decision
+
+Paperclip should implement smart model routing as an adapter-local, opt-in execution pattern.
+
+V1 decision:
+
+1. Do not add a global server-side router that tries to understand every adapter.
+2. Do not copy Hermes' prompt-keyword classifier as Paperclip's default routing policy.
+3. Add an adapter-specific "cheap preflight" phase for supported adapters.
+4. Keep the primary model as the canonical work model.
+5. Persist only the primary session unless an adapter can prove that cross-model session resume is safe.
+
+Rationale:
+
+- Paperclip heartbeats are structured, issue-scoped, and already include wake metadata
+- routing by execution phase is more reliable than routing by free-text prompt complexity
+- session semantics differ by adapter, so resume behavior must stay adapter-owned
+
+## 5. Proposed V1 Behavior
+
+## 5.1 Config shape
+
+Supported adapters should add an optional routing block to `adapterConfig`.
+
+Proposed shape:
+
+```ts
+smartModelRouting?: {
+  enabled: boolean;
+  cheapModel: string;
+  cheapThinkingEffort?: string;
+  maxPreflightTurns?: number;
+  allowInitialProgressComment?: boolean;
+}
+```
+
+Notes:
+
+- keep existing `model` as the primary model
+- `cheapModel` is adapter-specific, not global
+- adapters that cannot safely support this block simply ignore it
+
+For adapters with provider-specific model fields later, the shape can expand to include provider/base-url overrides. V1 should start simple.
+
+## 5.2 Routing policy
+
+Supported adapters should run cheap preflight only when all are true:
+
+- `smartModelRouting.enabled` is true
+- `cheapModel` is configured
+- the run is issue-scoped
+- the adapter is starting a fresh session, not resuming a persisted one
+- the run is expected to do real task work rather than just resume an existing thread
+
+Supported adapters should skip cheap preflight when any are true:
+
+- a persisted task session already exists
+- the adapter cannot safely isolate preflight from the primary session
+- the issue or wake type implies the task is already mid-flight and continuity matters more than first-response speed
+
+This is intentionally phase-based, not text-heuristic-based.
+
+## 5.3 Cheap preflight responsibilities
+
+The cheap phase should be narrow and bounded.
+
+Allowed responsibilities:
+
+- ingest wake context and issue summary
+- inspect the workspace at a shallow level
+- leave a short "starting investigation" style comment when appropriate
+- collect a compact handoff summary for the primary phase
+
+Not allowed in V1:
+
+- long tool loops
+- risky file mutations
+- being the canonical persisted task session
+- deciding final completion without either explicit adapter support or a trivial success case
+
+Implementation detail:
+
+- the adapter should inject an explicit preflight prompt telling the model this is a bounded orchestration pass
+- preflight should use a very small turn budget, for example 1-2 turns
+
+## 5.4 Primary execution responsibilities
+
+After preflight, the adapter launches the normal primary execution using the existing prompt and primary model.
+
+The primary phase should receive:
+
+- the normal Paperclip prompt
+- any preflight-generated handoff summary
+- normal workspace and wake context
+
+The primary phase remains the source of truth for:
+
+- persisted session state
+- final task completion
+- most file changes
+- most cost
+
+## 6. Required Contract Changes
+
+The current `AdapterExecutionResult` is too narrow for truthful multi-model accounting.
+
+Add an optional segmented execution report, for example:
+
+```ts
+executionSegments?: Array<{
+  phase: "cheap_preflight" | "primary";
+  provider?: string | null;
+  biller?: string | null;
+  model?: string | null;
+  billingType?: AdapterBillingType | null;
+  usage?: UsageSummary;
+  costUsd?: number | null;
+  summary?: string | null;
+}>
+```
+
+V1 server behavior:
+
+- if `executionSegments` is absent, keep current single-result behavior unchanged
+- if present, write one `cost_events` row per segment that has cost or token usage
+- store the segment array in run usage/result metadata for later UI inspection
+- keep the existing top-level `provider` / `model` fields as a summary, preferably the primary phase when present
+
+This avoids breaking existing adapters while giving routed adapters truthful reporting.
+
+## 7. Adapter Rollout Plan
+
+## 7.1 Phase 1: contract and server plumbing
+
+Work:
+
+1. Extend adapter result types with segmented execution metadata.
+2. Update heartbeat cost recording to emit multiple cost events when segments are present.
+3. Include segment summaries in run metadata for transcript/debug views.
+
+Success criteria:
+
+- existing adapters behave exactly as before
+- a routed adapter can report cheap plus primary usage without collapsing them into one fake model
+
+## 7.2 Phase 2: `codex_local`
+
+Why first:
+
+- Codex already has rich prompt/handoff handling
+- the adapter already injects Paperclip skills and workspace metadata cleanly
+- the current implementation already distinguishes bootstrap, wake delta, and handoff prompt sections
+
+Implementation work:
+
+1. Add config support for `smartModelRouting`.
+2. Add a cheap-preflight prompt builder.
+3. Run cheap preflight only on fresh sessions.
+4. Pass a compact preflight handoff note into the primary prompt.
+5. Report segmented usage and model metadata.
+
+Important guardrail:
+
+- do not resume the cheap-model session as the primary session in V1
+
+## 7.3 Phase 3: `claude_local`
+
+Implementation work is similar, but the session model-switch risk is even less attractive.
+
+Same rule:
+
+- cheap preflight is ephemeral
+- primary Claude session remains canonical
+
+## 7.4 Phase 4: other adapters
+
+Candidates:
+
+- `cursor`
+- `gemini_local`
+- `opencode_local`
+- external plugin adapters through `createServerAdapter()`
+
+These should come later because each runtime has different session and model-switch semantics.
+
+## 8. UI and Config Changes
+
+For supported built-in adapters, the agent config UI should expose:
+
+- `model` as the primary model
+- `smart model routing` toggle
+- `cheap model`
+- optional cheap thinking effort
+- optional `allow initial progress comment` toggle
+
+The run detail UI should also show when routing occurred, for example:
+
+- cheap preflight model
+- primary model
+- token/cost split
+
+This matters because Paperclip's board UI is supposed to make cost and behavior legible.
+
+## 9. Why Not Copy Hermes Exactly
+
+Hermes' cheap-route heuristic is useful precedent, but Paperclip should not start there.
+
+Reasons:
+
+- Hermes is optimizing free-form conversational turns
+- Paperclip agents run structured, issue-scoped heartbeats with explicit task and workspace context
+- Paperclip already knows whether a run is fresh vs resumed, issue-scoped vs approval follow-up, and what workspace/session exists
+- those execution facts are stronger routing signals than prompt keyword matching
+
+If Paperclip later wants a cheap-only completion path for trivial runs, that can be a second-stage feature built on observed run data, not the first implementation.
+
+## 10. Risks
+
+## 10.1 Duplicate or noisy comments
+
+If the cheap phase posts an update and the primary phase posts another near-identical update, the issue thread gets worse.
+
+Mitigation:
+
+- keep cheap comments optional
+- make the preflight prompt explicitly avoid repeating status if a useful comment was already posted
+
+## 10.2 Misleading cost reporting
+
+If we only record the primary model, the board loses visibility into the routing cost tradeoff.
+
+Mitigation:
+
+- add segmented execution reporting before shipping adapter behavior
+
+## 10.3 Session corruption
+
+Cross-model session reuse may fail or degrade context quality.
+
+Mitigation:
+
+- V1 does not persist or resume cheap preflight sessions
+
+## 10.4 Cheap model overreach
+
+A cheap model with full tools and permissions may do too much low-quality work.
+
+Mitigation:
+
+- hard cap preflight turns
+- use an explicit orchestration-only prompt
+- start with supported adapters where we can test the behavior well
+
+## 11. Verification Plan
+
+Required tests:
+
+- adapter unit tests for route eligibility
+- adapter unit tests for "fresh session -> cheap preflight + primary"
+- adapter unit tests for "resumed session -> primary only"
+- heartbeat tests for segmented cost-event creation
+- UI tests for config save/load of cheap-model fields
+
+Manual checks:
+
+- create a fresh issue for a routed Codex or Claude agent
+- verify the run metadata shows both phases
+- verify only the primary session is persisted
+- verify cost rows reflect both models
+- verify the issue thread does not get duplicate kickoff comments
+
+## 12. Recommended Sequence
+
+1. Add segmented execution reporting to the adapter/server contract.
+2. Implement `codex_local` cheap preflight.
+3. Validate cost visibility and transcript UX.
+4. Implement `claude_local` cheap preflight.
+5. Decide later whether any adapters need Hermes-style text heuristics in addition to phase-based routing.
+
+## 13. Recommendation
+
+Paperclip should ship smart model routing as:
+
+- adapter-specific
+- opt-in
+- phase-based
+- session-safe
+- cost-truthful
+
+The right V1 is not "choose the cheapest model for simple prompts." The right V1 is "use a cheap model for bounded orchestration work on fresh runs, then hand off to the primary model for the real task."