forked from farhoodlabs/paperclip
docs: add smart model routing plan
Co-Authored-By: Paperclip <noreply@paperclip.ing>
This commit is contained in:
@@ -0,0 +1,362 @@
|
||||
# 2026-04-06 Smart Model Routing
|
||||
|
||||
Status: Proposed
|
||||
Date: 2026-04-06
|
||||
Audience: Product and engineering
|
||||
Related:
|
||||
- `doc/SPEC-implementation.md`
|
||||
- `doc/PRODUCT.md`
|
||||
- `doc/plans/2026-03-14-adapter-skill-sync-rollout.md`
|
||||
|
||||
## 1. Purpose
|
||||
|
||||
This document defines a V1 plan for "smart model routing" in Paperclip.
|
||||
|
||||
The goal is not to build a generic cross-provider router in the server. The goal is:
|
||||
|
||||
- let supported adapters use a cheaper model for lightweight heartbeat orchestration work
|
||||
- keep the main task execution on the adapter's normal primary model
|
||||
- preserve Paperclip's existing task, session, and audit invariants
|
||||
- report cost and model usage truthfully when more than one model participates in a single heartbeat
|
||||
|
||||
The motivating use case is a local coding adapter where a cheap model can handle the first fast pass:
|
||||
|
||||
- read the wake context
|
||||
- orient to the task and workspace
|
||||
- leave an immediate progress comment when appropriate
|
||||
- perform bounded lightweight triage
|
||||
|
||||
Then the primary model does the substantive work.
|
||||
|
||||
## 2. Hermes Findings
|
||||
|
||||
Hermes does have a real "smart model routing" feature, but it is narrower than the name suggests.
|
||||
|
||||
Observed behavior:
|
||||
|
||||
- `agent/smart_model_routing.py` implements a conservative classifier for "simple" turns
|
||||
- the cheap path only triggers for short, single-line, non-code, non-URL, non-tool-heavy messages
|
||||
- complexity is detected with hardcoded thresholds plus a keyword denylist like `debug`, `implement`, `test`, `plan`, `tool`, `docker`, and similar terms
|
||||
- if the cheap route cannot be resolved, Hermes silently falls back to the primary model
|
||||
|
||||
Important architectural detail:
|
||||
|
||||
- Hermes applies this routing before constructing the agent for that turn
|
||||
- the route is resolved in `cron/scheduler.py` and passed into agent creation as the active provider/model/runtime
|
||||
|
||||
More useful than the routing heuristic itself is Hermes' broader model-slot design:
|
||||
|
||||
- main conversational model
|
||||
- fallback model for failover
|
||||
- auxiliary model slots for side tasks like compression and classification
|
||||
|
||||
That separation is a better fit for Paperclip than copying Hermes' exact keyword heuristic.
|
||||
|
||||
## 3. Current Paperclip State
|
||||
|
||||
Paperclip already has the right execution shape for adapter-specific routing, but it currently assumes one model per heartbeat run.
|
||||
|
||||
Current implementation facts:
|
||||
|
||||
- `server/src/services/heartbeat.ts` builds rich run context, including `paperclipWake`, workspace metadata, and session handoff context
|
||||
- each adapter receives a single resolved `config` object and executes once
|
||||
- built-in local adapters read one `config.model` and pass it directly to the underlying CLI
|
||||
- UI config today exposes one main `model` field plus adapter-specific thinking-effort controls
|
||||
- cost accounting currently records one provider/model tuple per run via `AdapterExecutionResult`
|
||||
|
||||
What this means:
|
||||
|
||||
- there is no shared routing layer in the server today
|
||||
- model choice already lives at the adapter boundary, which is good
|
||||
- multi-model execution in a single heartbeat needs explicit contract work or cost reporting will become misleading
|
||||
|
||||
## 4. Product Decision
|
||||
|
||||
Paperclip should implement smart model routing as an adapter-local, opt-in execution pattern.
|
||||
|
||||
V1 decision:
|
||||
|
||||
1. Do not add a global server-side router that tries to understand every adapter.
|
||||
2. Do not copy Hermes' prompt-keyword classifier as Paperclip's default routing policy.
|
||||
3. Add an adapter-specific "cheap preflight" phase for supported adapters.
|
||||
4. Keep the primary model as the canonical work model.
|
||||
5. Persist only the primary session unless an adapter can prove that cross-model session resume is safe.
|
||||
|
||||
Rationale:
|
||||
|
||||
- Paperclip heartbeats are structured, issue-scoped, and already include wake metadata
|
||||
- routing by execution phase is more reliable than routing by free-text prompt complexity
|
||||
- session semantics differ by adapter, so resume behavior must stay adapter-owned
|
||||
|
||||
## 5. Proposed V1 Behavior
|
||||
|
||||
## 5.1 Config shape
|
||||
|
||||
Supported adapters should add an optional routing block to `adapterConfig`.
|
||||
|
||||
Proposed shape:
|
||||
|
||||
```ts
|
||||
smartModelRouting?: {
|
||||
enabled: boolean;
|
||||
cheapModel: string;
|
||||
cheapThinkingEffort?: string;
|
||||
maxPreflightTurns?: number;
|
||||
allowInitialProgressComment?: boolean;
|
||||
}
|
||||
```
|
||||
|
||||
Notes:
|
||||
|
||||
- keep existing `model` as the primary model
|
||||
- `cheapModel` is adapter-specific, not global
|
||||
- adapters that cannot safely support this block simply ignore it
|
||||
|
||||
For adapters with provider-specific model fields later, the shape can expand to include provider/base-url overrides. V1 should start simple.
|
||||
|
||||
## 5.2 Routing policy
|
||||
|
||||
Supported adapters should run cheap preflight only when all are true:
|
||||
|
||||
- `smartModelRouting.enabled` is true
|
||||
- `cheapModel` is configured
|
||||
- the run is issue-scoped
|
||||
- the adapter is starting a fresh session, not resuming a persisted one
|
||||
- the run is expected to do real task work rather than just resume an existing thread
|
||||
|
||||
Supported adapters should skip cheap preflight when any are true:
|
||||
|
||||
- a persisted task session already exists
|
||||
- the adapter cannot safely isolate preflight from the primary session
|
||||
- the issue or wake type implies the task is already mid-flight and continuity matters more than first-response speed
|
||||
|
||||
This is intentionally phase-based, not text-heuristic-based.
|
||||
|
||||
## 5.3 Cheap preflight responsibilities
|
||||
|
||||
The cheap phase should be narrow and bounded.
|
||||
|
||||
Allowed responsibilities:
|
||||
|
||||
- ingest wake context and issue summary
|
||||
- inspect the workspace at a shallow level
|
||||
- leave a short "starting investigation" style comment when appropriate
|
||||
- collect a compact handoff summary for the primary phase
|
||||
|
||||
Not allowed in V1:
|
||||
|
||||
- long tool loops
|
||||
- risky file mutations
|
||||
- being the canonical persisted task session
|
||||
- deciding final completion without either explicit adapter support or a trivial success case
|
||||
|
||||
Implementation detail:
|
||||
|
||||
- the adapter should inject an explicit preflight prompt telling the model this is a bounded orchestration pass
|
||||
- preflight should use a very small turn budget, for example 1-2 turns
|
||||
|
||||
## 5.4 Primary execution responsibilities
|
||||
|
||||
After preflight, the adapter launches the normal primary execution using the existing prompt and primary model.
|
||||
|
||||
The primary phase should receive:
|
||||
|
||||
- the normal Paperclip prompt
|
||||
- any preflight-generated handoff summary
|
||||
- normal workspace and wake context
|
||||
|
||||
The primary phase remains the source of truth for:
|
||||
|
||||
- persisted session state
|
||||
- final task completion
|
||||
- most file changes
|
||||
- most cost
|
||||
|
||||
## 6. Required Contract Changes
|
||||
|
||||
The current `AdapterExecutionResult` is too narrow for truthful multi-model accounting.
|
||||
|
||||
Add an optional segmented execution report, for example:
|
||||
|
||||
```ts
|
||||
executionSegments?: Array<{
|
||||
phase: "cheap_preflight" | "primary";
|
||||
provider?: string | null;
|
||||
biller?: string | null;
|
||||
model?: string | null;
|
||||
billingType?: AdapterBillingType | null;
|
||||
usage?: UsageSummary;
|
||||
costUsd?: number | null;
|
||||
summary?: string | null;
|
||||
}>
|
||||
```
|
||||
|
||||
V1 server behavior:
|
||||
|
||||
- if `executionSegments` is absent, keep current single-result behavior unchanged
|
||||
- if present, write one `cost_events` row per segment that has cost or token usage
|
||||
- store the segment array in run usage/result metadata for later UI inspection
|
||||
- keep the existing top-level `provider` / `model` fields as a summary, preferably the primary phase when present
|
||||
|
||||
This avoids breaking existing adapters while giving routed adapters truthful reporting.
|
||||
|
||||
## 7. Adapter Rollout Plan
|
||||
|
||||
## 7.1 Phase 1: contract and server plumbing
|
||||
|
||||
Work:
|
||||
|
||||
1. Extend adapter result types with segmented execution metadata.
|
||||
2. Update heartbeat cost recording to emit multiple cost events when segments are present.
|
||||
3. Include segment summaries in run metadata for transcript/debug views.
|
||||
|
||||
Success criteria:
|
||||
|
||||
- existing adapters behave exactly as before
|
||||
- a routed adapter can report cheap plus primary usage without collapsing them into one fake model
|
||||
|
||||
## 7.2 Phase 2: `codex_local`
|
||||
|
||||
Why first:
|
||||
|
||||
- Codex already has rich prompt/handoff handling
|
||||
- the adapter already injects Paperclip skills and workspace metadata cleanly
|
||||
- the current implementation already distinguishes bootstrap, wake delta, and handoff prompt sections
|
||||
|
||||
Implementation work:
|
||||
|
||||
1. Add config support for `smartModelRouting`.
|
||||
2. Add a cheap-preflight prompt builder.
|
||||
3. Run cheap preflight only on fresh sessions.
|
||||
4. Pass a compact preflight handoff note into the primary prompt.
|
||||
5. Report segmented usage and model metadata.
|
||||
|
||||
Important guardrail:
|
||||
|
||||
- do not resume the cheap-model session as the primary session in V1
|
||||
|
||||
## 7.3 Phase 3: `claude_local`
|
||||
|
||||
Implementation work is similar, but the session model-switch risk is even less attractive.
|
||||
|
||||
Same rule:
|
||||
|
||||
- cheap preflight is ephemeral
|
||||
- primary Claude session remains canonical
|
||||
|
||||
## 7.4 Phase 4: other adapters
|
||||
|
||||
Candidates:
|
||||
|
||||
- `cursor`
|
||||
- `gemini_local`
|
||||
- `opencode_local`
|
||||
- external plugin adapters through `createServerAdapter()`
|
||||
|
||||
These should come later because each runtime has different session and model-switch semantics.
|
||||
|
||||
## 8. UI and Config Changes
|
||||
|
||||
For supported built-in adapters, the agent config UI should expose:
|
||||
|
||||
- `model` as the primary model
|
||||
- `smart model routing` toggle
|
||||
- `cheap model`
|
||||
- optional cheap thinking effort
|
||||
- optional `allow initial progress comment` toggle
|
||||
|
||||
The run detail UI should also show when routing occurred, for example:
|
||||
|
||||
- cheap preflight model
|
||||
- primary model
|
||||
- token/cost split
|
||||
|
||||
This matters because Paperclip's board UI is supposed to make cost and behavior legible.
|
||||
|
||||
## 9. Why Not Copy Hermes Exactly
|
||||
|
||||
Hermes' cheap-route heuristic is useful precedent, but Paperclip should not start there.
|
||||
|
||||
Reasons:
|
||||
|
||||
- Hermes is optimizing free-form conversational turns
|
||||
- Paperclip agents run structured, issue-scoped heartbeats with explicit task and workspace context
|
||||
- Paperclip already knows whether a run is fresh vs resumed, issue-scoped vs approval follow-up, and what workspace/session exists
|
||||
- those execution facts are stronger routing signals than prompt keyword matching
|
||||
|
||||
If Paperclip later wants a cheap-only completion path for trivial runs, that can be a second-stage feature built on observed run data, not the first implementation.
|
||||
|
||||
## 10. Risks
|
||||
|
||||
## 10.1 Duplicate or noisy comments
|
||||
|
||||
If the cheap phase posts an update and the primary phase posts another near-identical update, the issue thread gets worse.
|
||||
|
||||
Mitigation:
|
||||
|
||||
- keep cheap comments optional
|
||||
- make the preflight prompt explicitly avoid repeating status if a useful comment was already posted
|
||||
|
||||
## 10.2 Misleading cost reporting
|
||||
|
||||
If we only record the primary model, the board loses visibility into the routing cost tradeoff.
|
||||
|
||||
Mitigation:
|
||||
|
||||
- add segmented execution reporting before shipping adapter behavior
|
||||
|
||||
## 10.3 Session corruption
|
||||
|
||||
Cross-model session reuse may fail or degrade context quality.
|
||||
|
||||
Mitigation:
|
||||
|
||||
- V1 does not persist or resume cheap preflight sessions
|
||||
|
||||
## 10.4 Cheap model overreach
|
||||
|
||||
A cheap model with full tools and permissions may do too much low-quality work.
|
||||
|
||||
Mitigation:
|
||||
|
||||
- hard cap preflight turns
|
||||
- use an explicit orchestration-only prompt
|
||||
- start with supported adapters where we can test the behavior well
|
||||
|
||||
## 11. Verification Plan
|
||||
|
||||
Required tests:
|
||||
|
||||
- adapter unit tests for route eligibility
|
||||
- adapter unit tests for "fresh session -> cheap preflight + primary"
|
||||
- adapter unit tests for "resumed session -> primary only"
|
||||
- heartbeat tests for segmented cost-event creation
|
||||
- UI tests for config save/load of cheap-model fields
|
||||
|
||||
Manual checks:
|
||||
|
||||
- create a fresh issue for a routed Codex or Claude agent
|
||||
- verify the run metadata shows both phases
|
||||
- verify only the primary session is persisted
|
||||
- verify cost rows reflect both models
|
||||
- verify the issue thread does not get duplicate kickoff comments
|
||||
|
||||
## 12. Recommended Sequence
|
||||
|
||||
1. Add segmented execution reporting to the adapter/server contract.
|
||||
2. Implement `codex_local` cheap preflight.
|
||||
3. Validate cost visibility and transcript UX.
|
||||
4. Implement `claude_local` cheap preflight.
|
||||
5. Decide later whether any adapters need Hermes-style text heuristics in addition to phase-based routing.
|
||||
|
||||
## 13. Recommendation
|
||||
|
||||
Paperclip should ship smart model routing as:
|
||||
|
||||
- adapter-specific
|
||||
- opt-in
|
||||
- phase-based
|
||||
- session-safe
|
||||
- cost-truthful
|
||||
|
||||
The right V1 is not "choose the cheapest model for simple prompts." The right V1 is "use a cheap model for bounded orchestration work on fresh runs, then hand off to the primary model for the real task."
|
||||
Reference in New Issue
Block a user