forked from farhoodlabs/paperclip
[codex] Harden recovery issue handling (#4600)
## Thinking Path

> - Paperclip orchestrates AI agents for zero-human companies
> - The control plane must recover stranded agent work without creating new operational loops
> - Stranded recovery issues can themselves fail, and exposing raw retry errors in comments can leak sensitive adapter details
> - New local companies also should not force a hire-approval gate unless operators enable that policy
> - This pull request hardens recovery issue handling, redacts retry failure details in issue copy, preserves `maxConcurrentRuns: 1`, and flips new-hire approval to an opt-in default
> - The benefit is safer automatic recovery and smoother default company setup without hidden migration conflicts

## What Changed

- Added migration `0071_default_hire_approval_off` and updated company schema/import/export/docs so hire approvals default off and serialize only when enabled.
- Added migration `0072_large_sandman` with a partial unique index preventing duplicate active stranded recovery issues for the same source issue.
- Blocked failed `stranded_issue_recovery` issues in place instead of creating nested recovery issues.
- Redacted latest retry failure details from recovery issue comments while still linking reviewers to run evidence.
- Allowed `maxConcurrentRuns: 1` to be honored by heartbeat concurrency normalization.
- Added focused regression coverage for recovery recursion, redaction, migration ordering, and concurrency behavior.
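The redaction behavior described in this PR can be illustrated with a minimal TypeScript sketch. The helper name `formatRetryFailureNotice`, the `FailedRunSummary` shape, and the run-reference wording are hypothetical and are not taken from the Paperclip source; only the "withheld" phrase comes from the regression tests in this commit. The idea is that the issue thread records that a retry failed and points at run evidence, without echoing the raw error text or error code.

```typescript
// Hypothetical input shape; the real run record in Paperclip may differ.
interface FailedRunSummary {
  runId: string;
  errorCode: string | null;
  error: string | null;
}

// Phrase asserted by the regression tests in this commit.
const WITHHELD_NOTICE =
  "Latest retry failure details were withheld from the issue thread";

// Build the issue-thread line for a failed retry without echoing raw
// adapter output. Returns null when there is no recorded failure.
function formatRetryFailureNotice(run: FailedRunSummary | null): string | null {
  if (!run) return null;
  const hasFailure = run.errorCode !== null || run.error !== null;
  if (!hasFailure) return null;
  // run.error and run.errorCode are deliberately not interpolated here:
  // reviewers follow the run reference instead of reading raw text inline.
  return `- ${WITHHELD_NOTICE}; inspect run \`${run.runId}\` for evidence.`;
}
```

Under these assumptions, a run whose error is `Authorization: Bearer sk-test-recovery-secret` yields a notice that names the run but never contains the secret, and an error-code-only failure still produces the notice rather than a "none recorded" line.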
## Verification

- `pnpm --filter @paperclipai/db run check:migrations`
- `pnpm exec vitest run --project @paperclipai/server server/src/__tests__/recovery-classifiers.test.ts`
- `pnpm exec vitest run --project @paperclipai/server server/src/__tests__/company-portability.test.ts --pool=forks --poolOptions.forks.isolate=true`
- `pnpm exec vitest run --project @paperclipai/server server/src/__tests__/agent-permissions-routes.test.ts --pool=forks --poolOptions.forks.isolate=true`
- `pnpm --filter @paperclipai/server typecheck`
- `pnpm exec vitest run --project @paperclipai/server server/src/__tests__/heartbeat-process-recovery.test.ts --pool=forks --poolOptions.forks.isolate=true` exits 0, but this host skipped the embedded Postgres tests with the existing init guard.
- `pnpm exec vitest run --project @paperclipai/server server/src/__tests__/heartbeat-dependency-scheduling.test.ts --pool=forks --poolOptions.forks.isolate=true` exits 0, but this host skipped the embedded Postgres tests with the existing init guard.

## Risks

- Migration risk is low, but this PR intentionally owns both new migrations to avoid separate PR migration-journal conflicts.
- Recovery comments now require operators to inspect linked run evidence for details instead of reading raw errors inline.
- The hire approval default changes behavior for newly created/imported companies only; existing persisted company settings are not changed except by the SQL default for future rows.

> For core feature work, check [`ROADMAP.md`](ROADMAP.md) first and discuss it in `#dev` before opening the PR. Feature PRs that overlap with planned core work may need to be redirected; check the roadmap first. See `CONTRIBUTING.md`.

## Model Used

- OpenAI Codex, GPT-5 coding agent, tool-enabled terminal/GitHub workflow, reasoning mode active. Context window not exposed in this environment.
## Checklist

- [x] I have included a thinking path that traces from project context to this change
- [x] I have specified the model used (with version and capability details)
- [x] I have checked ROADMAP.md and confirmed this PR does not duplicate planned core work
- [x] I have run tests locally and they pass
- [x] I have added or updated tests where applicable
- [x] If this change affects the UI, I have included before/after screenshots
- [x] I have updated relevant documentation to reflect my changes
- [x] I have considered and documented any risks above
- [x] I will address all Greptile and reviewer comments before requesting merge

---------

Co-authored-by: Paperclip <noreply@paperclip.ing>
@@ -468,6 +468,8 @@ describeEmbeddedPostgres("heartbeat orphaned process recovery", () => {
     retryReason?: "assignment_recovery" | "issue_continuation_needed" | null;
     assignToUser?: boolean;
     activePauseHold?: boolean;
+    runErrorCode?: string | null;
+    runError?: string | null;
   }) {
     const companyId = randomUUID();
     const agentId = randomUUID();
@@ -509,7 +511,9 @@ describeEmbeddedPostgres("heartbeat orphaned process recovery", () => {
       runId,
       claimedAt: now,
       finishedAt: new Date("2026-03-19T00:05:00.000Z"),
-      error: input.runStatus === "succeeded" ? null : "run failed before issue advanced",
+      error: input.runStatus === "succeeded"
+        ? null
+        : ("runError" in input ? input.runError : "run failed before issue advanced"),
     });

     await db.insert(heartbeatRuns).values({
@@ -531,8 +535,12 @@ describeEmbeddedPostgres("heartbeat orphaned process recovery", () => {
       startedAt: now,
       finishedAt: new Date("2026-03-19T00:05:00.000Z"),
       updatedAt: new Date("2026-03-19T00:05:00.000Z"),
-      errorCode: input.runStatus === "succeeded" ? null : "process_lost",
-      error: input.runStatus === "succeeded" ? null : "run failed before issue advanced",
+      errorCode: input.runStatus === "succeeded"
+        ? null
+        : ("runErrorCode" in input ? input.runErrorCode : "process_lost"),
+      error: input.runStatus === "succeeded"
+        ? null
+        : ("runError" in input ? input.runError : "run failed before issue advanced"),
     });

     await db.insert(issues).values([
@@ -659,6 +667,20 @@ describeEmbeddedPostgres("heartbeat orphaned process recovery", () => {
     return recovery;
   }

+  async function sourceBlockerIssueIds(companyId: string, sourceIssueId: string) {
+    return db
+      .select({ blockerIssueId: issueRelations.issueId })
+      .from(issueRelations)
+      .where(
+        and(
+          eq(issueRelations.companyId, companyId),
+          eq(issueRelations.relatedIssueId, sourceIssueId),
+          eq(issueRelations.type, "blocks"),
+        ),
+      )
+      .then((rows) => rows.map((row) => row.blockerIssueId));
+  }
+
   async function seedQueuedIssueRunFixture() {
     const companyId = randomUUID();
     const agentId = randomUUID();
@@ -930,6 +952,81 @@ describeEmbeddedPostgres("heartbeat orphaned process recovery", () => {
     expect(comments[0]?.body).toContain(`Recovery issue: [${recovery.identifier}]`);
   });

+  it("blocks failed recovery work in place during immediate terminal-run cleanup", async () => {
+    const sourceIssueId = randomUUID();
+    const { companyId, agentId, runId, issueId } = await seedRunFixture({
+      agentStatus: "idle",
+      processPid: 999_999_999,
+      processLossRetryCount: 1,
+      runErrorCode: "process_lost",
+      runError: "Authorization: Bearer sk-test-recovery-secret",
+    });
+    await db
+      .update(issues)
+      .set({
+        title: "Recover stalled issue PAP-1",
+        originKind: "stranded_issue_recovery",
+        originId: sourceIssueId,
+      })
+      .where(eq(issues.id, issueId));
+    const issuePrefix = `T${companyId.replace(/-/g, "").slice(0, 6).toUpperCase()}`;
+    await db.insert(issues).values({
+      id: sourceIssueId,
+      companyId,
+      title: "Original stranded source",
+      status: "blocked",
+      priority: "medium",
+      issueNumber: 2,
+      identifier: `${issuePrefix}-2`,
+    });
+    await db.insert(issueRelations).values({
+      companyId,
+      issueId,
+      relatedIssueId: sourceIssueId,
+      type: "blocks",
+    });
+    const heartbeat = heartbeatService(db);
+
+    const result = await heartbeat.reapOrphanedRuns();
+    expect(result.reaped).toBe(1);
+    expect(result.runIds).toEqual([runId]);
+
+    const runs = await db
+      .select()
+      .from(heartbeatRuns)
+      .where(eq(heartbeatRuns.agentId, agentId));
+    expect(runs).toHaveLength(1);
+    expect(runs[0]?.status).toBe("failed");
+
+    const recoveryIssue = await waitForValue(async () =>
+      db.select().from(issues).where(eq(issues.id, issueId)).then((rows) => {
+        const issue = rows[0] ?? null;
+        return issue?.status === "blocked" ? issue : null;
+      })
+    );
+    expect(recoveryIssue?.assigneeAgentId).toBe(agentId);
+    expect(recoveryIssue?.originKind).toBe("stranded_issue_recovery");
+    expect(recoveryIssue?.originId).toBe(sourceIssueId);
+    expect(recoveryIssue?.executionRunId).toBeNull();
+
+    const nestedRecoveries = await db
+      .select()
+      .from(issues)
+      .where(and(eq(issues.companyId, companyId), eq(issues.originKind, "stranded_issue_recovery"), eq(issues.originId, issueId)));
+    expect(nestedRecoveries).toHaveLength(0);
+
+    const comments = await waitForValue(async () => {
+      const rows = await db.select().from(issueComments).where(eq(issueComments.issueId, issueId));
+      return rows.length > 0 ? rows : null;
+    });
+    expect(comments).toHaveLength(1);
+    expect(comments[0]?.body).toContain("stopped automatic stranded-work recovery");
+    expect(comments[0]?.body).toContain("recovery issues do not create nested `stranded_issue_recovery` issues");
+    expect(comments[0]?.body).toContain("Latest retry failure details were withheld from the issue thread");
+    expect(comments[0]?.body).not.toContain("sk-test-recovery-secret");
+    await expect(sourceBlockerIssueIds(companyId, sourceIssueId)).resolves.toEqual([issueId]);
+  });
+
   it("does not block paused-tree work when immediate continuation recovery is suppressed by the hold", async () => {
     const { companyId, agentId, runId, issueId } = await seedRunFixture({
       agentStatus: "idle",
@@ -1108,6 +1205,8 @@ describeEmbeddedPostgres("heartbeat orphaned process recovery", () => {
       status: "todo",
       runStatus: "failed",
       retryReason: "assignment_recovery",
+      runErrorCode: "process_lost",
+      runError: "Authorization: Bearer sk-test-recovery-secret",
     });
     const heartbeat = heartbeatService(db);

@@ -1127,11 +1226,12 @@ describeEmbeddedPostgres("heartbeat orphaned process recovery", () => {
       previousStatus: "todo",
       retryReason: "assignment_recovery",
     });
+    expect(recovery.description ?? "").not.toContain("sk-test-recovery-secret");

     const comments = await db.select().from(issueComments).where(eq(issueComments.issueId, issueId));
     expect(comments).toHaveLength(1);
     expect(comments[0]?.body).toContain("retried dispatch");
-    expect(comments[0]?.body).toContain("Latest retry failure: `process_lost` - run failed before issue advanced.");
+    expect(comments[0]?.body).toContain("Latest retry failure details were withheld from the issue thread");
     expect(comments[0]?.body).toContain(`Recovery issue: [${recovery.identifier}]`);
   });

@@ -1446,10 +1546,217 @@ describeEmbeddedPostgres("heartbeat orphaned process recovery", () => {
     const comments = await db.select().from(issueComments).where(eq(issueComments.issueId, issueId));
     expect(comments).toHaveLength(1);
     expect(comments[0]?.body).toContain("retried continuation");
-    expect(comments[0]?.body).toContain("Latest retry failure: `process_lost` - run failed before issue advanced.");
+    expect(comments[0]?.body).toContain("Latest retry failure details were withheld from the issue thread");
     expect(comments[0]?.body).toContain(`Recovery issue: [${recovery.identifier}]`);
   });

+  it("redacts error-code-only stranded recovery failures in issue copy", async () => {
+    const { companyId, agentId, issueId, runId } = await seedStrandedIssueFixture({
+      status: "in_progress",
+      runStatus: "failed",
+      retryReason: "issue_continuation_needed",
+      runErrorCode: "adapter_exit_code",
+      runError: null,
+    });
+    const heartbeat = heartbeatService(db);
+
+    const result = await heartbeat.reconcileStrandedAssignedIssues();
+    expect(result.escalated).toBe(1);
+
+    const recovery = await expectStrandedRecoveryArtifacts({
+      companyId,
+      agentId,
+      issueId,
+      runId,
+      previousStatus: "in_progress",
+      retryReason: "issue_continuation_needed",
+    });
+    expect(recovery.description).toContain("Latest retry failure details were withheld from the issue thread");
+    expect(recovery.description).not.toContain("- Failure: none recorded");
+
+    const comments = await db.select().from(issueComments).where(eq(issueComments.issueId, issueId));
+    expect(comments).toHaveLength(1);
+    expect(comments[0]?.body).toContain("Latest retry failure details were withheld from the issue thread");
+    expect(comments[0]?.body).not.toContain("- Failure: none recorded");
+  });
+
+  it("reuses the raced stranded recovery issue when duplicate active recovery creation conflicts", async () => {
+    const { companyId, issueId } = await seedStrandedIssueFixture({
+      status: "in_progress",
+      runStatus: "failed",
+      retryReason: "issue_continuation_needed",
+    });
+    const heartbeat = heartbeatService(db);
+
+    const results = await Promise.allSettled(
+      Array.from({ length: 8 }, () => heartbeat.reconcileStrandedAssignedIssues()),
+    );
+    expect(results.every((result) => result.status === "fulfilled")).toBe(true);
+
+    const recoveries = await db
+      .select()
+      .from(issues)
+      .where(and(
+        eq(issues.companyId, companyId),
+        eq(issues.originKind, "stranded_issue_recovery"),
+        eq(issues.originId, issueId),
+      ));
+    expect(recoveries).toHaveLength(1);
+    await expect(sourceBlockerIssueIds(companyId, issueId)).resolves.toEqual([recoveries[0]?.id]);
+  });
+
+  it("blocks stranded recovery issues in place instead of creating nested recovery issues", async () => {
+    const sourceIssueId = randomUUID();
+    const { companyId, agentId, issueId, runId } = await seedStrandedIssueFixture({
+      status: "in_progress",
+      runStatus: "failed",
+    });
+    await db
+      .update(issues)
+      .set({
+        title: "Recover stalled issue PAP-1",
+        originKind: "stranded_issue_recovery",
+        originId: sourceIssueId,
+      })
+      .where(eq(issues.id, issueId));
+    const issuePrefix = `T${companyId.replace(/-/g, "").slice(0, 6).toUpperCase()}`;
+    await db.insert(issues).values({
+      id: sourceIssueId,
+      companyId,
+      title: "Original stranded source",
+      status: "blocked",
+      priority: "medium",
+      issueNumber: 2,
+      identifier: `${issuePrefix}-2`,
+    });
+    await db.insert(issueRelations).values({
+      companyId,
+      issueId,
+      relatedIssueId: sourceIssueId,
+      type: "blocks",
+    });
+    const heartbeat = heartbeatService(db);
+
+    const result = await heartbeat.reconcileStrandedAssignedIssues();
+    expect(result.dispatchRequeued).toBe(0);
+    expect(result.continuationRequeued).toBe(0);
+    expect(result.escalated).toBe(1);
+    expect(result.issueIds).toEqual([issueId]);
+
+    const recoveryIssue = await db.select().from(issues).where(eq(issues.id, issueId)).then((rows) => rows[0] ?? null);
+    expect(recoveryIssue?.status).toBe("blocked");
+    expect(recoveryIssue?.assigneeAgentId).toBe(agentId);
+    expect(recoveryIssue?.originKind).toBe("stranded_issue_recovery");
+    expect(recoveryIssue?.originId).toBe(sourceIssueId);
+
+    const nestedRecoveries = await db
+      .select()
+      .from(issues)
+      .where(and(eq(issues.companyId, companyId), eq(issues.originKind, "stranded_issue_recovery"), eq(issues.originId, issueId)));
+    expect(nestedRecoveries).toHaveLength(0);
+
+    const runs = await db.select().from(heartbeatRuns).where(eq(heartbeatRuns.agentId, agentId));
+    expect(runs).toHaveLength(1);
+    expect(runs[0]?.id).toBe(runId);
+
+    const comments = await db.select().from(issueComments).where(eq(issueComments.issueId, issueId));
+    expect(comments).toHaveLength(1);
+    expect(comments[0]?.body).toContain("stopped automatic stranded-work recovery");
+    expect(comments[0]?.body).toContain("Latest retry failure details were withheld from the issue thread");
+    expect(comments[0]?.body).toContain("recovery issues do not create nested `stranded_issue_recovery` issues");
+    await expect(sourceBlockerIssueIds(companyId, sourceIssueId)).resolves.toEqual([issueId]);
+  });
+
+  it("keeps repeated recovery failures on the same canonical recovery issue", async () => {
+    const sourceIssueId = randomUUID();
+    const { companyId, agentId, issueId, runId } = await seedStrandedIssueFixture({
+      status: "in_progress",
+      runStatus: "failed",
+    });
+    const issuePrefix = `T${companyId.replace(/-/g, "").slice(0, 6).toUpperCase()}`;
+    await db.insert(issues).values({
+      id: sourceIssueId,
+      companyId,
+      title: "Original stranded source",
+      status: "blocked",
+      priority: "medium",
+      issueNumber: 2,
+      identifier: `${issuePrefix}-2`,
+    });
+    await db
+      .update(issues)
+      .set({
+        title: "Recover stalled issue PAP-1",
+        originKind: "stranded_issue_recovery",
+        originId: sourceIssueId,
+      })
+      .where(eq(issues.id, issueId));
+    await db.insert(issueRelations).values({
+      companyId,
+      issueId,
+      relatedIssueId: sourceIssueId,
+      type: "blocks",
+    });
+    const heartbeat = heartbeatService(db);
+
+    const firstResult = await heartbeat.reconcileStrandedAssignedIssues();
+    expect(firstResult.escalated).toBe(1);
+    expect(firstResult.issueIds).toEqual([issueId]);
+
+    const secondRunId = randomUUID();
+    await db.insert(heartbeatRuns).values({
+      id: secondRunId,
+      companyId,
+      agentId,
+      invocationSource: "assignment",
+      triggerDetail: "system",
+      status: "failed",
+      contextSnapshot: {
+        issueId,
+        taskId: issueId,
+        wakeReason: "issue_assigned",
+        source: "stranded_issue_recovery",
+      },
+      startedAt: new Date("2030-03-19T00:10:00.000Z"),
+      finishedAt: new Date("2030-03-19T00:15:00.000Z"),
+      createdAt: new Date("2030-03-19T00:10:00.000Z"),
+      updatedAt: new Date("2030-03-19T00:15:00.000Z"),
+      errorCode: "adapter_failed",
+      error: "adapter failed while retrying recovery issue",
+    });
+    await db
+      .update(issues)
+      .set({
+        status: "in_progress",
+        checkoutRunId: secondRunId,
+        executionRunId: null,
+      })
+      .where(eq(issues.id, issueId));
+
+    const secondResult = await heartbeat.reconcileStrandedAssignedIssues();
+    expect(secondResult.dispatchRequeued).toBe(0);
+    expect(secondResult.continuationRequeued).toBe(0);
+    expect(secondResult.escalated).toBe(1);
+    expect(secondResult.issueIds).toEqual([issueId]);
+
+    const recoveryIssuesForSource = await db
+      .select()
+      .from(issues)
+      .where(and(eq(issues.companyId, companyId), eq(issues.originKind, "stranded_issue_recovery"), eq(issues.originId, sourceIssueId)));
+    expect(recoveryIssuesForSource.map((issue) => issue.id)).toEqual([issueId]);
+
+    const nestedRecoveries = await db
+      .select()
+      .from(issues)
+      .where(and(eq(issues.companyId, companyId), eq(issues.originKind, "stranded_issue_recovery"), eq(issues.originId, issueId)));
+    expect(nestedRecoveries).toHaveLength(0);
+    await expect(sourceBlockerIssueIds(companyId, sourceIssueId)).resolves.toEqual([issueId]);
+
+    const comments = await db.select().from(issueComments).where(eq(issueComments.issueId, issueId));
+    expect(comments).toHaveLength(2);
+    expect(comments[1]?.body).toContain("Latest retry failure details were withheld from the issue thread");
+  });
+
   it("does not escalate paused-tree recovery when the automatic continuation retry was cancelled by the hold", async () => {
     const { companyId, agentId, issueId } = await seedStrandedIssueFixture({
       status: "in_progress",
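
The "block in place" rule that the new tests exercise can be sketched as a small decision function. This is an illustration under stated assumptions, not the heartbeat service's actual code: `IssueLike`, `EscalationPlan`, and `planEscalation` are hypothetical names, and only the `stranded_issue_recovery` origin kind comes from the diff above. The point is that an issue which is itself a recovery issue is never used as the source of another recovery issue.

```typescript
// Hypothetical, simplified issue shape for illustration only.
interface IssueLike {
  id: string;
  originKind: string | null;
  originId: string | null;
}

// Hypothetical result type describing what the escalation step should do.
type EscalationPlan =
  | { kind: "block_in_place"; issueId: string }
  | { kind: "create_recovery_issue"; sourceIssueId: string };

// Decide how to escalate a stranded issue. A recovery issue that strands
// itself is blocked where it is, so recovery never recurses.
function planEscalation(issue: IssueLike): EscalationPlan {
  if (issue.originKind === "stranded_issue_recovery") {
    return { kind: "block_in_place", issueId: issue.id };
  }
  return { kind: "create_recovery_issue", sourceIssueId: issue.id };
}
```

With this shape, repeated failures on the same recovery issue keep returning `block_in_place` for that issue's own id, which matches the tests' expectation that no issue ever has a recovery issue as its `originId`.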