Guard cheap recovery model usage (#6371)

## Thinking Path

> - Paperclip is the control plane that coordinates AI-agent work
through issues, heartbeats, comments, approvals, and auditable recovery
paths.
> - The affected subsystem is heartbeat/recovery orchestration,
especially the optional cheap model profile used for operational
recovery overhead.
> - Cheap recovery should repair status and liveness, but it must not
become the worker lane that writes deliverables, continues source work,
or propagates cheap execution hints into downstream retries.
> - The gap was that cheap-profile hints could follow recovery wake
contexts and assignment overrides farther than intended, making real
work eligible to run on the cheap model.
> - This pull request separates status-only cheap recovery from normal
source-work continuations, adds route guards for deliverable mutations
during cheap status-only runs, and documents the invariant.
> - The benefit is safer retry/recovery behavior: cheap runs can clean
up control-plane state, while any remaining source work resumes through
a normal/original model path.

## What Changed

- Added recovery model-profile work classes so status-only recovery
carries explicit guard context and normal-model continuations scrub
cheap hints.
- Updated heartbeat, productivity review, liveness continuation, and
recovery service wakeups to request cheap only for bounded status-only
recovery work.
- Blocked cheap status-only recovery runs from writing issue documents,
plans, attachments, work products, or assigning downstream work back to
`modelProfile: "cheap"`.
- Added/updated server tests for cheap profile propagation,
artifact/document guards, route authorization, retry scheduling, and
successful-run handoff behavior.
- Documented the recovery model-profile lane in
`doc/SPEC-implementation.md` and `doc/execution-semantics.md`.
- After rebasing onto current `public-gh/master`, stabilized the new
`InstanceSidebar` plugin-filter tests so the PR check lane stays green.

## Verification

- Local: `pnpm exec vitest run --config vitest.config.ts
src/services/recovery/model-profile-hint.test.ts
src/__tests__/issue-agent-mutation-ownership-routes.test.ts
src/__tests__/issue-document-restore-routes.test.ts` from `server/` - 3
files, 37 tests passed after final edits.
- Local: `pnpm exec vitest run --config vitest.config.ts
src/__tests__/heartbeat-process-recovery.test.ts` from `server/` - 44
tests passed after rerunning the cleanup-sensitive file alone.
- Local: `pnpm --filter @paperclipai/ui exec vitest run
src/components/InstanceSidebar.test.tsx` - 4 tests passed.
- Local: `pnpm --filter @paperclipai/server typecheck` - passed.
- Local: `pnpm --filter @paperclipai/ui typecheck` - passed.
- PR checks on latest head `6f8c3b1380f5bd872c6f49f6f7188ecf3bb6d263` -
all green, including `verify`, build, typecheck,
server/general/serialized tests, e2e, Snyk, and policy.
- Greptile: pass 3 returned Confidence Score 5/5 with zero unresolved
Greptile review threads.

## Risks

- Medium risk: recovery behavior is intentionally stricter, so any path
that incorrectly relies on cheap recovery to keep doing source work will
now need to hand back to a normal-model run.
- Low migration risk: no schema changes.
- No product UI changes; the UI file touched is a test-only
stabilization after rebasing onto current `master`.

> For core feature work, check [`ROADMAP.md`](ROADMAP.md) first and
discuss it in `#dev` before opening the PR. Feature PRs that overlap
with planned core work may need to be redirected — check the roadmap
first. See `CONTRIBUTING.md`.

## Model Used

- OpenAI Codex coding agent, GPT-5 model family (`gpt-5`), tool use and
local code execution enabled; context window not exposed in this
environment.

## Checklist

- [x] I have included a thinking path that traces from project context
to this change
- [x] I have specified the model used (with version and capability
details)
- [x] I have checked ROADMAP.md and confirmed this PR does not duplicate
planned core work
- [x] I have run tests locally and they pass
- [x] I have added or updated tests where applicable
- [x] If this change affects the UI, I have included before/after
screenshots (N/A: no product UI changes)
- [x] I have updated relevant documentation to reflect my changes
- [x] I have considered and documented any risks above
- [x] I will address all Greptile and reviewer comments before
requesting merge
This commit is contained in:
Dotta
2026-05-19 13:46:02 -05:00
committed by GitHub
parent 24748de421
commit bfe6369ef5
17 changed files with 529 additions and 78 deletions
@@ -405,6 +405,7 @@ describeEmbeddedPostgres("heartbeat orphaned process recovery", () => {
includeIssue?: boolean;
runErrorCode?: string | null;
runError?: string | null;
contextSnapshot?: Record<string, unknown>;
}) {
const companyId = randomUUID();
const agentId = randomUUID();
@@ -454,7 +455,9 @@ describeEmbeddedPostgres("heartbeat orphaned process recovery", () => {
triggerDetail: "system",
status: input?.runStatus ?? "running",
wakeupRequestId,
contextSnapshot: input?.includeIssue === false ? {} : { issueId },
contextSnapshot: input?.includeIssue === false
? input?.contextSnapshot ?? {}
: { ...(input?.contextSnapshot ?? {}), issueId },
processPid: input?.processPid ?? null,
processGroupId: input?.processGroupId ?? null,
processLossRetryCount: input?.processLossRetryCount ?? 0,
@@ -765,7 +768,12 @@ describeEmbeddedPostgres("heartbeat orphaned process recovery", () => {
companyId: input.companyId,
reason: "source_scoped_recovery_action",
source: "assignment",
payload: expect.objectContaining({ modelProfile: "cheap" }),
payload: expect.objectContaining({
modelProfile: "cheap",
allowDeliverableWork: false,
allowDocumentUpdates: false,
resumeRequiresNormalModel: true,
}),
});
const recoveryRun = recoveryWakeup?.runId
@@ -783,6 +791,9 @@ describeEmbeddedPostgres("heartbeat orphaned process recovery", () => {
sourceIssueId: input.issueId,
strandedRunId: input.runId,
modelProfile: "cheap",
allowDeliverableWork: false,
allowDocumentUpdates: false,
resumeRequiresNormalModel: true,
});
await waitForHeartbeatIdle(db);
const sourceIssue = await db
@@ -920,6 +931,12 @@ describeEmbeddedPostgres("heartbeat orphaned process recovery", () => {
it("queues exactly one retry when the recorded local pid is dead", async () => {
const { agentId, runId, issueId } = await seedRunFixture({
processPid: 999_999_999,
contextSnapshot: {
modelProfile: "cheap",
allowDeliverableWork: false,
allowDocumentUpdates: false,
resumeRequiresNormalModel: true,
},
});
const heartbeat = heartbeatService(db);
@@ -947,7 +964,7 @@ describeEmbeddedPostgres("heartbeat orphaned process recovery", () => {
expect(retryRun?.status).toBe("queued");
expect(retryRun?.retryOfRunId).toBe(runId);
expect(retryRun?.processLossRetryCount).toBe(1);
expect(retryRun?.contextSnapshot).toMatchObject({ modelProfile: "cheap" });
expect(retryRun?.contextSnapshot as Record<string, unknown>).not.toHaveProperty("modelProfile");
const issue = await db
.select()
@@ -1253,8 +1270,8 @@ describeEmbeddedPostgres("heartbeat orphaned process recovery", () => {
expect(retryRun?.scheduledRetryReason).toBe("transient_failure");
expect(retryRun?.contextSnapshot).toMatchObject({
codexTransientFallbackMode: "same_session",
modelProfile: "cheap",
});
expect(retryRun?.contextSnapshot as Record<string, unknown>).not.toHaveProperty("modelProfile");
const issue = await db
.select()
@@ -1789,9 +1806,9 @@ describeEmbeddedPostgres("heartbeat orphaned process recovery", () => {
payload: expect.objectContaining({
issueId,
mutation: "assigned_todo_liveness_dispatch",
modelProfile: "cheap",
}),
});
expect(wakeups[0]?.payload as Record<string, unknown>).not.toHaveProperty("modelProfile");
const runs = await db.select().from(heartbeatRuns).where(eq(heartbeatRuns.agentId, agentId));
expect(runs).toHaveLength(1);
@@ -1801,8 +1818,8 @@ describeEmbeddedPostgres("heartbeat orphaned process recovery", () => {
taskId: issueId,
wakeReason: "issue_assigned",
source: "issue.assigned_todo_liveness_dispatch",
modelProfile: "cheap",
});
expect(runs[0]?.contextSnapshot as Record<string, unknown>).not.toHaveProperty("modelProfile");
expect((runs[0]?.contextSnapshot as Record<string, unknown>)?.retryReason).toBeUndefined();
const issue = await db.select().from(issues).where(eq(issues.id, issueId)).then((rows) => rows[0] ?? null);
@@ -1909,9 +1926,9 @@ describeEmbeddedPostgres("heartbeat orphaned process recovery", () => {
payload: expect.objectContaining({
issueId: unblocked.issueId,
mutation: "assigned_todo_liveness_dispatch",
modelProfile: "cheap",
}),
});
expect(unblockedWakeups[0]?.payload as Record<string, unknown>).not.toHaveProperty("modelProfile");
const unblockedRuns = await db
.select()
.from(heartbeatRuns)
@@ -1963,7 +1980,7 @@ describeEmbeddedPostgres("heartbeat orphaned process recovery", () => {
const retryRun = runs.find((row) => row.id !== runId);
expect(retryRun?.id).toBeTruthy();
expect((retryRun?.contextSnapshot as Record<string, unknown>)?.retryReason).toBe("assignment_recovery");
expect(retryRun?.contextSnapshot).toMatchObject({ modelProfile: "cheap" });
expect(retryRun?.contextSnapshot as Record<string, unknown>).not.toHaveProperty("modelProfile");
if (retryRun) {
await waitForRunToSettle(heartbeat, retryRun.id);
}
@@ -2002,8 +2019,8 @@ describeEmbeddedPostgres("heartbeat orphaned process recovery", () => {
retryReason: "issue_continuation_needed",
retryOfRunId: runId,
source: "issue.continuation_recovery",
modelProfile: "cheap",
});
expect(retryRun?.contextSnapshot as Record<string, unknown>).not.toHaveProperty("modelProfile");
const recoveries = await db
.select()
@@ -2054,7 +2071,7 @@ describeEmbeddedPostgres("heartbeat orphaned process recovery", () => {
const retryRun = runs.find((row) => row.id !== runId);
expect((retryRun?.contextSnapshot as Record<string, unknown>)?.retryReason).toBe("assignment_recovery");
expect(retryRun?.contextSnapshot).toMatchObject({ modelProfile: "cheap" });
expect(retryRun?.contextSnapshot as Record<string, unknown>).not.toHaveProperty("modelProfile");
if (retryRun) {
await waitForRunToSettle(heartbeat, retryRun.id);
}
@@ -2296,7 +2313,7 @@ describeEmbeddedPostgres("heartbeat orphaned process recovery", () => {
const retryRun = runs.find((row) => row.id !== runId);
expect(retryRun?.id).toBeTruthy();
expect((retryRun?.contextSnapshot as Record<string, unknown>)?.retryReason).toBe("issue_continuation_needed");
expect(retryRun?.contextSnapshot).toMatchObject({ modelProfile: "cheap" });
expect(retryRun?.contextSnapshot as Record<string, unknown>).not.toHaveProperty("modelProfile");
if (retryRun) {
await waitForRunToSettle(heartbeat, retryRun.id);
}
@@ -2786,8 +2803,8 @@ describeEmbeddedPostgres("heartbeat orphaned process recovery", () => {
retryReason: "issue_continuation_needed",
retryOfRunId: runId,
source: "issue.productive_terminal_continuation_recovery",
modelProfile: "cheap",
});
expect(retryRun?.contextSnapshot as Record<string, unknown>).not.toHaveProperty("modelProfile");
const wakeups = await db.select().from(agentWakeupRequests).where(eq(agentWakeupRequests.agentId, agentId));
expect(wakeups).toHaveLength(2);
@@ -2854,8 +2871,8 @@ describeEmbeddedPostgres("heartbeat orphaned process recovery", () => {
retryReason: "issue_continuation_needed",
retryOfRunId: runId,
source: "issue.productive_terminal_continuation_recovery",
modelProfile: "cheap",
});
expect(retryRun?.contextSnapshot as Record<string, unknown>).not.toHaveProperty("modelProfile");
});
it("does not treat a productive terminal run as healthy when in-progress work has no live path", async () => {
@@ -2910,8 +2927,8 @@ describeEmbeddedPostgres("heartbeat orphaned process recovery", () => {
retryReason: "issue_continuation_needed",
retryOfRunId: runId,
source: "issue.productive_terminal_continuation_recovery",
modelProfile: "cheap",
});
expect(retryRun?.contextSnapshot as Record<string, unknown>).not.toHaveProperty("modelProfile");
});
it("does not reconcile user-assigned work through the agent stranded-work recovery path", async () => {