[codex] Add run liveness continuations (#4083)

## Thinking Path

> - Paperclip orchestrates AI agents for zero-human companies.
> - Heartbeat runs are the control-plane record of each agent execution window.
> - Long-running local agents can exhaust context or stop while still holding useful next-step state.
> - Operators need that stop reason, next action, and continuation path to be durable and visible.
> - This pull request adds run liveness metadata, continuation summaries, and UI surfaces for issue run ledgers.
> - The benefit is that interrupted or long-running work can resume with clearer context instead of losing the agent's last useful handoff.

## What Changed

- Added heartbeat-run liveness fields, continuation attempt tracking, and an idempotent `0058` migration.
- Added server services and tests for run liveness, continuation summaries, stop metadata, and activity backfill.
- Wired local and HTTP adapters to surface continuation/liveness context through shared adapter utilities.
- Added shared constants, validators, and heartbeat types for liveness continuation state.
- Added issue-detail UI surfaces for continuation handoffs and the run ledger, with component tests.
- Updated agent runtime docs, heartbeat protocol docs, prompt guidance, onboarding assets, and skills instructions to explain continuation behavior.
- Addressed Greptile feedback by scoping document evidence by run, excluding system continuation-summary documents from liveness evidence, importing shared liveness types, surfacing hidden ledger run counts, documenting bounded retry behavior, and moving run-ledger liveness backfill off the request path.

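
The run-scoped evidence rule and the bounded continuation retry described above can be pictured with a minimal sketch. Everything here is an assumption for illustration: the type shapes, the `classifyLiveness` and `planContinuation` helpers, the `continuation-summary` document key, and the default bound of one attempt are hypothetical and are not taken from the Paperclip codebase. Only the state names (`advanced`, `plan_only`, `failed`), the wake reason `run_liveness_continuation`, and the payload field names mirror what the tests in this diff assert.

```typescript
// Hypothetical shapes; the real shared liveness types may differ.
type LivenessState = "advanced" | "plan_only" | "failed";

interface DocumentRevisionEvidence {
  documentKey: string; // e.g. "plan"
  createdByRunId: string | null;
}

interface RunOutcome {
  runId: string;
  issueId: string;
  exitCode: number | null;
  continuationAttempt: number; // continuation attempts already used
}

interface ContinuationWake {
  reason: "run_liveness_continuation";
  payload: {
    issueId: string;
    sourceRunId: string;
    livenessState: LivenessState;
    continuationAttempt: number;
  };
}

// Assumed key for system-written continuation summaries (illustrative only).
const SYSTEM_SUMMARY_KEY = "continuation-summary";

function classifyLiveness(
  run: RunOutcome,
  revisions: DocumentRevisionEvidence[],
): LivenessState {
  if (run.exitCode !== 0) return "failed";
  // Count only revisions written by this run, and ignore system summaries.
  const evidence = revisions.filter(
    (rev) =>
      rev.createdByRunId === run.runId &&
      rev.documentKey !== SYSTEM_SUMMARY_KEY,
  );
  return evidence.length > 0 ? "advanced" : "plan_only";
}

function planContinuation(
  run: RunOutcome,
  revisions: DocumentRevisionEvidence[],
  maxAttempts = 1, // assumed bound, not the project's configured value
): ContinuationWake | null {
  const livenessState = classifyLiveness(run, revisions);
  if (livenessState !== "plan_only") return null;
  if (run.continuationAttempt >= maxAttempts) return null; // retry budget spent
  return {
    reason: "run_liveness_continuation",
    payload: {
      issueId: run.issueId,
      sourceRunId: run.runId,
      livenessState,
      continuationAttempt: run.continuationAttempt + 1,
    },
  };
}
```

Under these assumptions, a run that exits cleanly but leaves no run-scoped document evidence yields exactly one continuation wake, while a run that wrote a plan document revision, or one whose retry budget is already used, yields none.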

## Verification

- `pnpm exec vitest run packages/adapter-utils/src/server-utils.test.ts server/src/__tests__/run-continuations.test.ts server/src/__tests__/run-liveness.test.ts server/src/__tests__/activity-service.test.ts server/src/__tests__/documents-service.test.ts server/src/__tests__/issue-continuation-summary.test.ts server/src/services/heartbeat-stop-metadata.test.ts ui/src/components/IssueRunLedger.test.tsx ui/src/components/IssueContinuationHandoff.test.tsx ui/src/components/IssueDocumentsSection.test.tsx`
- `pnpm --filter @paperclipai/db build`
- `pnpm exec vitest run server/src/__tests__/activity-service.test.ts ui/src/components/IssueRunLedger.test.tsx`
- `pnpm --filter @paperclipai/ui typecheck`
- `pnpm --filter @paperclipai/server typecheck`
- `pnpm exec vitest run server/src/__tests__/activity-service.test.ts server/src/__tests__/run-continuations.test.ts ui/src/components/IssueRunLedger.test.tsx`
- `pnpm exec vitest run server/src/__tests__/heartbeat-process-recovery.test.ts -t "treats a plan document update"`
- `pnpm exec vitest run server/src/__tests__/activity-service.test.ts server/src/__tests__/heartbeat-process-recovery.test.ts -t "activity service|treats a plan document update"`
- Remote PR checks on head `e53b1a1d`: `verify`, `e2e`, `policy`, and Snyk all passed.
- Confirmed `public-gh/master` is an ancestor of this branch after fetching `public-gh master`.
- Confirmed `pnpm-lock.yaml` is not included in the branch diff.
- Confirmed migration `0058_wealthy_starbolt.sql` is ordered after `0057` and uses `IF NOT EXISTS` guards for repeat application.
- Greptile inline review threads are resolved.

## Risks

- Medium risk: this touches heartbeat execution, liveness recovery, activity rendering, issue routes, shared contracts, docs, and UI.
- Migration risk is mitigated by additive columns/indexes and idempotent guards.
- Run-ledger liveness backfill is now asynchronous, so the first ledger response can briefly show historical missing liveness until the background backfill completes.
- UI screenshot coverage is not included in this packaging pass; validation is currently through focused component tests.

> For core feature work, check [`ROADMAP.md`](ROADMAP.md) first and discuss it in `#dev` before opening the PR. Feature PRs that overlap with planned core work may need to be redirected; check the roadmap first. See `CONTRIBUTING.md`.

## Model Used

- OpenAI Codex, GPT-5.4, local tool-use coding agent with terminal, git, GitHub connector, GitHub CLI, and Paperclip API access.

## Checklist

- [x] I have included a thinking path that traces from project context to this change
- [x] I have specified the model used (with version and capability details)
- [x] I have checked ROADMAP.md and confirmed this PR does not duplicate planned core work
- [x] I have run tests locally and they pass
- [x] I have added or updated tests where applicable
- [x] If this change affects the UI, I have included before/after screenshots
- [x] I have updated relevant documentation to reflect my changes
- [x] I have considered and documented any risks above
- [x] I will address all Greptile and reviewer comments before requesting merge

Screenshot note: no before/after screenshots were captured in this PR packaging pass; the UI changes are covered by focused component tests listed above.

---------

Co-authored-by: Paperclip <noreply@paperclip.ing>
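
The asynchronous run-ledger backfill noted under Risks can be illustrated with a small sketch of the general pattern: answer the request from current data, and queue the backfill so it runs off the request path. This is not the PR's implementation. The `createRunLedger` factory, the `LedgerRun` shape, and the injectable `schedule` parameter are hypothetical names introduced only for this example.

```typescript
// Hypothetical ledger row; the real schema has many more columns.
interface LedgerRun {
  id: string;
  livenessState: string | null; // null means "not yet classified"
}

type Schedule = (task: () => void) => void;

function createRunLedger(
  runs: LedgerRun[],
  classify: (run: LedgerRun) => string,
  // Default: defer to a later timer tick so the caller is never blocked.
  schedule: Schedule = (task) => {
    setTimeout(task, 0);
  },
) {
  let backfillQueued = false;

  return {
    // Returns a snapshot immediately; missing liveness may still be null.
    list(): LedgerRun[] {
      const snapshot = runs.map((run) => ({ ...run }));
      const needsBackfill = runs.some((run) => run.livenessState === null);
      if (needsBackfill && !backfillQueued) {
        backfillQueued = true; // avoid queueing duplicate backfills
        schedule(() => {
          try {
            for (const run of runs) {
              if (run.livenessState === null) {
                run.livenessState = classify(run);
              }
            }
          } finally {
            backfillQueued = false;
          }
        });
      }
      return snapshot;
    },
  };
}
```

With this shape, the first `list()` call can return rows whose `livenessState` is still `null`, and a later call sees the backfilled values, which matches the brief gap the Risks section describes.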
2026-04-20 06:01:49 -05:00
parent b9a80dcf22
commit 236d11d36f
71 changed files with 18254 additions and 85 deletions
@@ -10,9 +10,12 @@ import {
  companySkills,
  companies,
  createDb,
+  documentRevisions,
+  documents,
  heartbeatRunEvents,
  heartbeatRuns,
  issueComments,
+  issueDocuments,
  issues,
 } from "@paperclipai/db";
 import {
@@ -22,6 +25,17 @@ import {
 import { runningProcesses } from "../adapters/index.ts";
 const mockTelemetryClient = vi.hoisted(() => ({ track: vi.fn() }));
 const mockTrackAgentFirstHeartbeat = vi.hoisted(() => vi.fn());
+const mockAdapterExecute = vi.hoisted(() =>
+  vi.fn(async () => ({
+    exitCode: 0,
+    signal: null,
+    timedOut: false,
+    errorMessage: null,
+    summary: "Recovered stranded heartbeat work.",
+    provider: "test",
+    model: "test-model",
+  })),
+);

 vi.mock("../telemetry.ts", () => ({
  getTelemetryClient: () => mockTelemetryClient,
@@ -43,14 +57,7 @@ vi.mock("../adapters/index.ts", async () => {
    ...actual,
    getServerAdapter: vi.fn(() => ({
      supportsLocalAgentJwt: false,
-      execute: vi.fn(async () => ({
-        exitCode: 0,
-        signal: null,
-        timedOut: false,
-        errorMessage: null,
-        provider: "test",
-        model: "test-model",
-      })),
+      execute: mockAdapterExecute,
    })),
  };
 });
@@ -104,6 +111,20 @@ async function waitForRunToSettle(
  return heartbeat.getRun(runId);
 }

+async function waitForValue<T>(
+  read: () => Promise<T | null | undefined>,
+  timeoutMs = 3_000,
+) {
+  const deadline = Date.now() + timeoutMs;
+  let latest: T | null | undefined = null;
+  while (Date.now() < deadline) {
+    latest = await read();
+    if (latest) return latest;
+    await new Promise((resolve) => setTimeout(resolve, 50));
+  }
+  return latest ?? null;
+}
+
 async function spawnOrphanedProcessGroup() {
  const leader = spawn(
    process.execPath,
@@ -157,6 +178,15 @@ describeEmbeddedPostgres("heartbeat orphaned process recovery", () => {

  afterEach(async () => {
    vi.clearAllMocks();
+    mockAdapterExecute.mockImplementation(async () => ({
+      exitCode: 0,
+      signal: null,
+      timedOut: false,
+      errorMessage: null,
+      summary: "Recovered stranded heartbeat work.",
+      provider: "test",
+      model: "test-model",
+    }));
    runningProcesses.clear();
    for (const child of childProcesses) {
      child.kill("SIGKILL");
@@ -170,10 +200,26 @@ describeEmbeddedPostgres("heartbeat orphaned process recovery", () => {
      }
    }
    cleanupPids.clear();
-    for (let attempt = 0; attempt < 10; attempt += 1) {
-      const runs = await db.select({ status: heartbeatRuns.status }).from(heartbeatRuns);
-      if (runs.every((run) => run.status !== "queued" && run.status !== "running")) {
-        break;
+    let idlePolls = 0;
+    for (let attempt = 0; attempt < 100; attempt += 1) {
+      const runs = await db
+        .select({
+          status: heartbeatRuns.status,
+          processPid: heartbeatRuns.processPid,
+          processGroupId: heartbeatRuns.processGroupId,
+        })
+        .from(heartbeatRuns);
+      const managedExecutionStillActive = runs.some(
+        (run) =>
+          (run.status === "queued" || run.status === "running") &&
+          !run.processPid &&
+          !run.processGroupId,
+      );
+      if (!managedExecutionStillActive) {
+        idlePolls += 1;
+        if (idlePolls >= 3) break;
+      } else {
+        idlePolls = 0;
      }
      await new Promise((resolve) => setTimeout(resolve, 50));
    }
@@ -182,6 +228,9 @@ describeEmbeddedPostgres("heartbeat orphaned process recovery", () => {
    await db.delete(agentRuntimeState);
    await db.delete(companySkills);
    await db.delete(issueComments);
+    await db.delete(issueDocuments);
+    await db.delete(documentRevisions);
+    await db.delete(documents);
    await db.delete(issues);
    await db.delete(heartbeatRunEvents);
    await db.delete(heartbeatRuns);
@@ -439,6 +488,13 @@ describeEmbeddedPostgres("heartbeat orphaned process recovery", () => {
    const retryRun = runs.find((row) => row.id !== runId);
    expect(failedRun?.status).toBe("failed");
    expect(failedRun?.errorCode).toBe("process_lost");
+    expect(failedRun?.livenessState).toBe("failed");
+    expect(failedRun?.livenessReason).toContain("process_lost");
+    expect(failedRun?.resultJson).toMatchObject({
+      stopReason: "process_lost",
+      timeoutConfigured: false,
+      timeoutFired: false,
+    });
    expect(retryRun?.status).toBe("queued");
    expect(retryRun?.retryOfRunId).toBe(runId);
    expect(retryRun?.processLossRetryCount).toBe(1);
@@ -553,6 +609,23 @@ describeEmbeddedPostgres("heartbeat orphaned process recovery", () => {
    );
  });

+  it("records manual cancellation stop metadata", async () => {
+    const { runId } = await seedRunFixture({
+      agentStatus: "running",
+      includeIssue: false,
+    });
+    const heartbeat = heartbeatService(db);
+
+    const cancelled = await heartbeat.cancelRun(runId);
+    expect(cancelled?.status).toBe("cancelled");
+    expect(cancelled?.resultJson).toMatchObject({
+      stopReason: "cancelled",
+      effectiveTimeoutSec: 0,
+      timeoutConfigured: false,
+      timeoutFired: false,
+    });
+  });
+
  it("re-enqueues assigned todo work when the last issue run died and no wake remains", async () => {
    const { agentId, issueId, runId } = await seedStrandedIssueFixture({
      status: "todo",
@@ -629,6 +702,106 @@ describeEmbeddedPostgres("heartbeat orphaned process recovery", () => {
    }
  });

+  it("classifies actionable plan-only recovery and enqueues one liveness continuation", async () => {
+    mockAdapterExecute.mockResolvedValueOnce({
+      exitCode: 0,
+      signal: null,
+      timedOut: false,
+      errorMessage: null,
+      summary: "I will inspect the repo next and then implement the fix.",
+      provider: "test",
+      model: "test-model",
+    });
+    const { agentId, issueId, runId } = await seedStrandedIssueFixture({
+      status: "in_progress",
+      runStatus: "failed",
+    });
+    const heartbeat = heartbeatService(db);
+
+    await heartbeat.reconcileStrandedAssignedIssues();
+
+    const livenessWake = await waitForValue(async () => {
+      const rows = await db.select().from(agentWakeupRequests).where(eq(agentWakeupRequests.agentId, agentId));
+      return rows.find((row) => row.reason === "run_liveness_continuation") ?? null;
+    });
+    expect(livenessWake).toBeTruthy();
+    expect(livenessWake?.payload).toMatchObject({
+      issueId,
+      livenessState: "plan_only",
+      continuationAttempt: 1,
+    });
+
+    const sourceRunId = (livenessWake?.payload as Record<string, unknown> | null)?.sourceRunId;
+    expect(sourceRunId).toBeTruthy();
+    const sourceRun = await db
+      .select()
+      .from(heartbeatRuns)
+      .where(eq(heartbeatRuns.id, String(sourceRunId)))
+      .then((rows) => rows[0] ?? null);
+    expect(sourceRun?.id).not.toBe(runId);
+    expect(sourceRun?.livenessState).toBe("plan_only");
+  });
+
+  it("treats a plan document update as progress and does not enqueue liveness continuation", async () => {
+    const { agentId, companyId, issueId, runId } = await seedStrandedIssueFixture({
+      status: "in_progress",
+      runStatus: "failed",
+    });
+    mockAdapterExecute.mockImplementationOnce(async (ctx: { runId: string }) => {
+      const documentId = randomUUID();
+      const revisionId = randomUUID();
+      await db.insert(documents).values({
+        id: documentId,
+        companyId,
+        title: "Plan",
+        format: "markdown",
+        latestBody: "# Plan\n\n- Inspect files\n- Implement fix",
+        latestRevisionId: revisionId,
+        latestRevisionNumber: 1,
+        createdByAgentId: agentId,
+        updatedByAgentId: agentId,
+      });
+      await db.insert(documentRevisions).values({
+        id: revisionId,
+        companyId,
+        documentId,
+        revisionNumber: 1,
+        title: "Plan",
+        format: "markdown",
+        body: "# Plan\n\n- Inspect files\n- Implement fix",
+        createdByAgentId: agentId,
+        createdByRunId: ctx.runId,
+      });
+      await db.insert(issueDocuments).values({
+        companyId,
+        issueId,
+        documentId,
+        key: "plan",
+      });
+      return {
+        exitCode: 0,
+        signal: null,
+        timedOut: false,
+        errorMessage: null,
+        summary: "Plan:\n- Inspect files\n- Implement fix",
+        provider: "test",
+        model: "test-model",
+      };
+    });
+    const heartbeat = heartbeatService(db);
+
+    await heartbeat.reconcileStrandedAssignedIssues();
+
+    const retryRun = await waitForValue(async () => {
+      const rows = await db.select().from(heartbeatRuns).where(eq(heartbeatRuns.agentId, agentId));
+      return rows.find((row) => row.id !== runId && row.livenessState === "advanced") ?? null;
+    });
+    expect(retryRun?.livenessState).toBe("advanced");
+
+    const wakes = await db.select().from(agentWakeupRequests).where(eq(agentWakeupRequests.agentId, agentId));
+    expect(wakes.some((row) => row.reason === "run_liveness_continuation")).toBe(false);
+  });
+
  it("blocks stranded in-progress work after the continuation retry was already used", async () => {
    const { issueId } = await seedStrandedIssueFixture({
      status: "in_progress",