0.1.52

fix: retry-aware pod state lookup + honest truncation cause messages (FAR-107)
The single-shot getPodTerminatedState query lost a real race against kubelet's containerStatus update: when Claude exited cleanly but quickly, listNamespacedPod often returned the pod with phase=Succeeded/Failed but without a populated state.terminated, so describeTruncationCause fell into the catch-all "pod state unavailable — likely deleted before exit could be read" branch. That message is doubly wrong: the pod was not deleted, and the exit cause was readable a few hundred ms later. Operators chasing claude_truncated runs (Nancy/Privileged Escalation) had no visibility into the actual exit code, OOMKilled flag, or reason.

Two changes:

1. Introduce lookupPodState + getPodLookupWithRetry. The lookup result carries the pod phase and a podMissing flag, and the retry wrapper makes up to 4 attempts, 500 ms apart, when the pod is in a terminal phase but containerStatuses lag. When the pod is in a non-terminal phase or genuinely gone, we bail immediately without burning the retry budget.
2. describeTruncationCause now distinguishes three states:
   - "pod is gone" (eviction, preemption, external delete)
   - "container terminated state not yet observable (pod phase=…)"
   - the existing populated-state path with exit code / reason / signal

The truncation error path re-queries with the retry-aware lookup right before producing the message, so subsequent claude_truncated errors surface the actual exit cause (137=SIGKILL, commonly OOMKilled; 143=SIGTERM; kubelet reason text) instead of a misleading deletion claim.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
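
The retry rule described in this message can be sketched in isolation. This is a simplified illustration, not the adapter's code: `PodLookup`, `LookupFn`, `lookupWithRetry`, and `describeCause` are hypothetical names, the lookup is injected as a plain function instead of calling the Kubernetes API, and the terminated state is reduced to exit code and reason.

```typescript
// Hypothetical, simplified shapes modelled on the commit description.
interface PodLookup {
  terminated: { exitCode: number | null; reason: string | null } | null;
  phase: string | null;
  podMissing: boolean;
}

// Assumption: the real lookup lists pods by job name; here it is injected.
type LookupFn = () => Promise<PodLookup>;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retry only while the pod is in a terminal phase but has no terminated state.
async function lookupWithRetry(lookup: LookupFn, attempts = 4, delayMs = 500): Promise<PodLookup> {
  let last: PodLookup = { terminated: null, phase: null, podMissing: true };
  for (let i = 0; i < attempts; i++) {
    last = await lookup();
    // Done: either the exit cause is readable or the pod is genuinely gone.
    if (last.terminated || last.podMissing) return last;
    // A Running/Pending pod will not gain a terminated state soon, so bail.
    const terminalPhase = last.phase === "Succeeded" || last.phase === "Failed";
    if (!terminalPhase) return last;
    // Terminal phase with lagging container status: wait, then try again.
    if (i < attempts - 1) await sleep(delayMs);
  }
  return last;
}

// Render one of the three states the message distinguishes.
function describeCause(result: PodLookup): string {
  if (result.terminated) {
    const { exitCode, reason } = result.terminated;
    const parts = [exitCode === null ? "no exit code" : `exit code ${exitCode}`];
    if (exitCode === 137) parts.push("SIGKILL (commonly OOMKilled)");
    else if (exitCode === 143) parts.push("SIGTERM");
    if (reason) parts.push(`reason=${reason}`);
    return parts.join(", ");
  }
  if (result.podMissing) return "pod is gone";
  return `container terminated state not yet observable (pod phase=${result.phase ?? "unknown"})`;
}
```

With the defaults, a pod whose status never lands costs three sleeps (about 1.5 s) before the "not yet observable" message is produced, while a missing or still-running pod returns after a single lookup.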
2026-04-27 00:00:57 +00:00 · 2026-04-27 00:00:56 +00:00 · 2026-04-26 21:24:15 +00:00 · 2026-04-26 21:24:11 +00:00 · 2026-04-26 21:19:03 +00:00 · 2026-04-26 21:19:02 +00:00
4 changed files with 428 additions and 87 deletions
@@ -1,12 +1,12 @@
 {
  "name": "paperclip-adapter-claude-k8s",
-  "version": "0.1.45",
+  "version": "0.1.52",
  "lockfileVersion": 3,
  "requires": true,
  "packages": {
    "": {
      "name": "paperclip-adapter-claude-k8s",
-      "version": "0.1.45",
+      "version": "0.1.52",
      "license": "MIT",
      "dependencies": {
        "@kubernetes/client-node": "^1.0.0",
@@ -1,6 +1,6 @@
 {
  "name": "paperclip-adapter-claude-k8s",
-  "version": "0.1.45",
+  "version": "0.1.52",
  "description": "Paperclip adapter plugin that runs Claude Code agents as Kubernetes Jobs",
  "license": "MIT",
  "repository": {
@@ -60,7 +60,7 @@ vi.mock("@paperclipai/adapter-utils/server-utils", async (importOriginal) => {
  });
 });

-const { isK8s404, buildPartialRunError, classifyOrphan, describePodTerminatedError, streamPodLogsOnce, shouldAbortForCancellation, execute } = await import("./execute.js");
+const { isK8s404, buildPartialRunError, classifyOrphan, describePodTerminatedError, describeTruncationCause, streamPodLogsOnce, shouldAbortForCancellation, execute } = await import("./execute.js");

 function makeJob(opts: {
  runId?: string;
@@ -150,10 +150,10 @@ describe("buildPartialRunError", () => {
    expect(buildPartialRunError(null, "", "")).toBe("Claude exited with code -1");
  });

-  it("skips system/init events and returns generic message when only init captured", () => {
+  it("returns init-only message when stdout is init-only with non-zero exit code (FAR-101)", () => {
    const msg = buildPartialRunError(1, "claude-sonnet-4-6", initLine);
    expect(msg).toBe(
-      "Claude started but did not produce a result (model: claude-sonnet-4-6) — check API credentials, model support, and adapter config",
+      "Claude exited immediately after init (model: claude-sonnet-4-6) (exit code 1) — the model may be unsupported or the session may have been rejected before producing output",
    );
  });

@@ -170,15 +170,15 @@ describe("buildPartialRunError", () => {
    expect(msg).toBe("Claude exited with code 1: Error: no API key configured");
  });

-  it("skips result events (structured protocol artefact — not surfaced verbatim)", () => {
+  it("returns init-only message when stdout has init + result event but no plain content (structured artefact, not surfaced verbatim)", () => {
    // In production, buildPartialRunError is only called when parseClaudeStreamJson
    // returns null (no result event).  If somehow a result event appears here, the
-    // raw JSON blob must not be shown — the "did not produce a result" message is
-    // cleaner and avoids leaking protocol internals to the UI.
+    // raw JSON blob must not be shown — the init-only message is cleaner and avoids
+    // leaking protocol internals to the UI.
    const resultLike = JSON.stringify({ type: "result", subtype: "error", result: "rate limit" });
    const stdout = [initLine, resultLike].join("\n");
    const msg = buildPartialRunError(2, "claude-sonnet-4-6", stdout);
-    expect(msg).toContain("did not produce a result");
+    expect(msg).toContain("Claude exited immediately after init");
    expect(msg).toContain("claude-sonnet-4-6");
    expect(msg).not.toMatch(/\{.*type.*result/);
  });
@@ -245,6 +245,44 @@ describe("buildPartialRunError", () => {
    const msg = buildPartialRunError(1, "model-x", stdout);
    expect(msg).toBe("Claude exited with code 1: real error line");
  });
+
+  it("appends pod terminated reason/message when state is provided (FAR-100)", () => {
+    const msg = buildPartialRunError(1, "claude-sonnet-4-6", initLine, {
+      exitCode: 1,
+      reason: "Error",
+      message: "model not supported",
+      signal: null,
+    });
+    expect(msg).toContain("Claude exited immediately after init");
+    expect(msg).toContain("claude-sonnet-4-6");
+    expect(msg).toContain("[pod: reason=Error, message=model not supported]");
+  });
+
+  it("flags exit 137 as OOMKilled in pod cause", () => {
+    const msg = buildPartialRunError(137, "claude-sonnet-4-6", initLine, {
+      exitCode: 137,
+      reason: "OOMKilled",
+      message: null,
+      signal: null,
+    });
+    expect(msg).toContain("[pod: reason=OOMKilled, SIGKILL (commonly OOMKilled)]");
+  });
+
+  it("appends pod cause to content-line message", () => {
+    const stdout = [initLine, "Error: bad request"].join("\n");
+    const msg = buildPartialRunError(1, "claude-sonnet-4-6", stdout, {
+      exitCode: 1,
+      reason: "Error",
+      message: null,
+      signal: null,
+    });
+    expect(msg).toBe("Claude exited with code 1: Error: bad request [pod: reason=Error]");
+  });
+
+  it("does not append anything when podState is null (back-compat)", () => {
+    const msg = buildPartialRunError(1, "claude-sonnet-4-6", initLine, null);
+    expect(msg).not.toContain("[pod:");
+  });
 });

 describe("classifyOrphan", () => {
@@ -362,6 +400,33 @@ describe("describePodTerminatedError", () => {
  });
 });

+describe("describeTruncationCause", () => {
+  it("annotates exit code 137 as SIGKILL/OOM", () => {
+    const msg = describeTruncationCause({ exitCode: 137, reason: "OOMKilled", message: "Memory cgroup out of memory", signal: null });
+    expect(msg).toContain("exit code 137");
+    expect(msg).toContain("SIGKILL");
+    expect(msg).toContain("OOMKilled");
+    expect(msg).toContain("Memory cgroup out of memory");
+  });
+
+  it("annotates exit code 143 as SIGTERM", () => {
+    const msg = describeTruncationCause({ exitCode: 143, reason: null, message: null, signal: null });
+    expect(msg).toContain("exit code 143");
+    expect(msg).toContain("SIGTERM");
+  });
+
+  it("falls back to 'pod state unavailable' when state is null", () => {
+    const msg = describeTruncationCause(null);
+    expect(msg).toContain("pod state unavailable");
+  });
+
+  it("emits 'no exit code' when exitCode is null but state exists", () => {
+    const msg = describeTruncationCause({ exitCode: null, reason: "Error", message: null, signal: null });
+    expect(msg).toContain("no exit code");
+    expect(msg).toContain("reason=Error");
+  });
+});
+
 describe("execute: all-invalid agent.id (N4)", () => {
  it("returns hard error without creating a Job when agent.id sanitizes to null", async () => {
    const logs: string[] = [];
@@ -954,7 +1019,8 @@ describe("execute: happy path", () => {
    const result = await executePromise;

    expect(result.errorCode).toBe("k8s_job_deleted_externally");
-    expect(result.errorMessage).toBe("K8s Job was deleted externally before Claude could complete");
+    expect(result.errorMessage).toMatch(/^K8s Job was deleted externally before Claude could complete \[/);
+    expect(result.errorMessage).toContain("detected_via=");
    expect(result.exitCode).toBeNull();
  });

@@ -1019,7 +1085,7 @@ describe("execute: happy path", () => {
      },
    );
    mockCoreListPods.mockResolvedValue({
-      items: [{ metadata: { name: "pod-abc" }, status: { containerStatuses: [{ name: "claude", state: { terminated: { exitCode: 137 } } }] } }],
+      items: [{ metadata: { name: "pod-abc" }, status: { containerStatuses: [{ name: "claude", state: { terminated: { exitCode: 137, reason: "OOMKilled", message: "Memory cgroup out of memory" } } }] } }],
    });

    const executePromise = execute(makeCtx());
@@ -1030,6 +1096,9 @@ describe("execute: happy path", () => {
    expect(result.errorMessage).toContain("truncated mid-stream");
    expect(result.errorMessage).toContain("claude-opus-4-7");
    expect(result.errorMessage).toContain("exit code 137");
+    expect(result.errorMessage).toContain("SIGKILL");
+    expect(result.errorMessage).toContain("OOMKilled");
+    expect(result.errorMessage).toContain("Memory cgroup out of memory");
  });

  it("reconnects log stream and logs status when job completion takes > 3s", async () => {
@@ -1492,16 +1561,24 @@ describe("shouldAbortForCancellation", () => {
    expect(shouldAbortForCancellation("cancelled")).toBe(true);
  });

-  it("returns true when status is 'failed'", () => {
-    expect(shouldAbortForCancellation("failed")).toBe(true);
+  it("returns true when status is 'cancelling'", () => {
+    expect(shouldAbortForCancellation("cancelling")).toBe(true);
  });

-  it("returns true when status is 'completed'", () => {
-    expect(shouldAbortForCancellation("completed")).toBe(true);
+  // FAR-107: terminal-but-not-cancelled statuses MUST NOT trigger Job deletion.
+  // The previous "anything but running" guard caused k8s_job_deleted_externally
+  // false positives for in-flight runs whenever the API briefly reported a
+  // transient/stale status.
+  it("returns false for non-cancellation terminal statuses (FAR-107)", () => {
+    expect(shouldAbortForCancellation("succeeded")).toBe(false);
+    expect(shouldAbortForCancellation("failed")).toBe(false);
+    expect(shouldAbortForCancellation("completed")).toBe(false);
  });

-  it("returns true for any non-running non-empty string", () => {
-    expect(shouldAbortForCancellation("unknown")).toBe(true);
+  it("returns false for unknown statuses (FAR-107)", () => {
+    expect(shouldAbortForCancellation("unknown")).toBe(false);
+    expect(shouldAbortForCancellation("queued")).toBe(false);
+    expect(shouldAbortForCancellation("pending")).toBe(false);
  });
 });

@@ -1702,7 +1779,7 @@ describe("execute: SIGTERM handler best-effort cleanup", () => {
    vi.useRealTimers();
  });

-  it("deletes the active Job when SIGTERM fires during execution", async () => {
+  it("does NOT delete active Jobs on SIGTERM — leaves them for orphan reattach (FAR-107)", async () => {
    // Mock process.kill to prevent the test process from actually being killed.
    const killSpy = vi.spyOn(process, "kill").mockImplementation(() => true);

@@ -1713,17 +1790,19 @@ describe("execute: SIGTERM handler best-effort cleanup", () => {
    // Flush microtasks through the async setup chain: getSelfPodInfo, listJobs,
    // readSkillEntries, prepareBundle, createJob, onLog, activeJobs.add(), and
    // ensureSigtermHandler() all complete before the try block enters streaming.
-    // 30 rounds is more than enough for the ~7 sequential await points.
    for (let i = 0; i < 30; i++) await Promise.resolve();

-    // Emit SIGTERM — the process.once handler fires synchronously and kicks off
-    // async cleanup (deleteNamespacedJob). The mock resolves immediately.
+    // Reset deleteJob spy after setup so we can detect any SIGTERM-driven calls.
+    mockBatchDeleteJob.mockClear();
+
+    // Emit SIGTERM — the handler must re-raise to the default handler without
+    // touching the K8s Job.  Deleting the Job here would surface as
+    // k8s_job_deleted_externally in the in-flight run (FAR-107).
    process.emit("SIGTERM");

-    // Flush microtasks for deleteJob to resolve and the .then(process.kill) to run.
    for (let i = 0; i < 10; i++) await Promise.resolve();

-    expect(mockBatchDeleteJob).toHaveBeenCalled();
+    expect(mockBatchDeleteJob).not.toHaveBeenCalled();
    expect(killSpy).toHaveBeenCalledWith(process.pid, "SIGTERM");

    killSpy.mockRestore();
@@ -58,30 +58,20 @@ function ensureSigtermHandler(): void {
  if (sigtermHandlerRegistered) return;
  sigtermHandlerRegistered = true;
  process.once("SIGTERM", () => {
-    const jobs = [...activeJobs];
-    void Promise.allSettled(
-      jobs.map(async (ref) => {
-        try {
-          const batchApi = getBatchApi(ref.kubeconfigPath);
-          await batchApi.deleteNamespacedJob({
-            name: ref.jobName,
-            namespace: ref.namespace,
-            body: { propagationPolicy: "Background" },
-          });
-        } catch { /* best-effort */ }
-        if (ref.promptSecretName && ref.promptSecretNamespace) {
-          try {
-            const coreApi = getCoreApi(ref.kubeconfigPath);
-            await coreApi.deleteNamespacedSecret({
-              name: ref.promptSecretName,
-              namespace: ref.promptSecretNamespace,
-            });
-          } catch { /* best-effort */ }
-        }
-      }),
-    ).then(() => {
-      process.kill(process.pid, "SIGTERM");
-    });
+    // Do NOT delete active K8s Jobs on SIGTERM (FAR-107).  Paperclip itself
+    // receives SIGTERM during rolling deploys, evictions, scale-down, etc.
+    // Deleting the Jobs we own there causes the in-flight heartbeat to surface
+    // a false-positive `k8s_job_deleted_externally` error and tears down work
+    // the user expected to keep running.
+    //
+    // The correct behaviour with `reattachOrphanedJobs=true` (default) is to
+    // leave the Jobs alive: the next paperclip process discovers them via the
+    // orphan-classification path and reattaches their log streams.  When
+    // `reattachOrphanedJobs=false` the operator explicitly opted into manual
+    // cleanup and should not have us auto-deleting either.  The owning Job's
+    // ownerReference (FAR-15) keeps the prompt Secret tied to the Job, so
+    // both survive together and TTL cleans them up after natural completion.
+    process.kill(process.pid, "SIGTERM");
  });
 }

@@ -100,34 +90,32 @@ export function isK8s404(err: unknown): boolean {
 }

 /**
- * Returns true when the heartbeat-run status indicates the run is no longer
- * active and the K8s Job should be cancelled.
+ * Returns true when the heartbeat-run status indicates the run was explicitly
+ * cancelled and the K8s Job must be torn down.
+ *
+ * Only `cancelled` / `cancelling` qualify.  Treating any non-`running` status
+ * as cancellation (the previous behaviour) produced spurious
+ * k8s_job_deleted_externally errors for in-flight runs whenever the API
+ * briefly reported a transient or stale status — Nancy's runs at
+ * Privileged Escalation hit this without anyone actually cancelling them
+ * (FAR-107).  Other terminal statuses (`succeeded`/`failed`/`completed`)
+ * are unreachable in practice while the adapter is still executing
+ * (the adapter's own return is what flips them) and even if observed,
+ * they do not warrant our deleting a Job that may still be doing work.
 * Exported for unit tests.
 */
 export function shouldAbortForCancellation(runStatus: string | undefined): boolean {
  if (!runStatus) return false;
-  return runStatus !== "running";
+  return runStatus === "cancelled" || runStatus === "cancelling";
 }

 /**
- * Build the error message when Claude's stdout contains no result event.
- * Skips system/init event lines so the UI doesn't display the raw init JSON.
- * Exported for unit tests.
+ * Returns the first plain-text line in stdout (non-JSON, or JSON without a
+ * "type" field), or "" if none; typed JSON objects are protocol artefacts.
+ * Used by buildPartialRunError to detect init-only runs.
 */
-export function buildPartialRunError(
-  exitCode: number | null,
-  model: string,
-  stdout: string,
-): string {
-  if (exitCode === 0) return "Failed to parse Claude JSON output";
-
-  // Walk stdout lines and skip every structured streaming event (any JSON
-  // object that carries a non-empty "type" field: system, assistant, user,
-  // rate_limit_event, result, …).  All of these are protocol artefacts and
-  // produce confusing raw-JSON blobs when surfaced verbatim as an error
-  // message.  Only plain-text lines (non-JSON, or JSON without a type field)
-  // are treated as human-readable content worth including in the error.
-  const firstContentLine = stdout.split(/\r?\n/)
+function firstContentLine(stdout: string): string {
+  return stdout.split(/\r?\n/)
    .map((l) => l.trim())
    .find((l) => {
      if (!l) return false;
@@ -142,19 +130,82 @@ export function buildPartialRunError(
      }
      return true;
    }) ?? "";
+}
+
+/**
+ * Returns true when stdout contains only init/system/assistant events from the
+ * given model with no human-readable content lines.  Used to detect init-only
+ * non-zero-exit runs that should be classified as claude_init_failed rather than
+ * the generic "Claude exited with code N" message.
+ */
+function isInitOnlyRun(model: string, stdout: string): boolean {
+  if (!stdout.trim() || !model) return false;
+  const content = firstContentLine(stdout);
+  if (content) return false;
+  // Check that at least the init event for this model was seen
+  const hasModelInit = stdout.includes(`"model":"${model}"`) || stdout.includes(`"model":"${model.replace(/-/g, "_")}"`);
+  return hasModelInit;
+}
+
+/**
+ * Append the pod's terminated-state detail (reason/message/signal) to a
+ * partial-run error message when available.  Exit code is already in the
+ * caller-supplied message, so we only append fields that add new signal —
+ * specifically reason (e.g. OOMKilled, Error, ContainerCannotRun), message
+ * (kubelet diagnostic text), and signal.  Saves the operator a kubectl trip.
+ */
+function appendPodCause(message: string, state: PodTerminatedState | null): string {
+  if (!state) return message;
+  const parts: string[] = [];
+  if (state.reason) parts.push(`reason=${state.reason}`);
+  if (state.message) parts.push(`message=${state.message}`);
+  if (state.signal !== null) parts.push(`signal=${state.signal}`);
+  if (state.exitCode === 137) parts.push("SIGKILL (commonly OOMKilled)");
+  if (parts.length === 0) return message;
+  return `${message} [pod: ${parts.join(", ")}]`;
+}
+
+/**
+ * Build the error message when Claude's stdout contains no result event.
+ * Skips system/init event lines so the UI doesn't display the raw init JSON.
+ * When `podState` is provided, appends the K8s container terminated reason/
+ * message so failures self-explain without requiring `kubectl`.
+ * Exported for unit tests.
+ */
+export function buildPartialRunError(
+  exitCode: number | null,
+  model: string,
+  stdout: string,
+  podState: PodTerminatedState | null = null,
+): string {
+  if (exitCode === 0) return "Failed to parse Claude JSON output";

  // If the stream contained only structured events with no plain-text output,
  // surface the model name so the operator can diagnose missing credentials
  // or unsupported/misconfigured model.
-  const initOnlyOutput = stdout.trim() !== "" && model !== "" && !firstContentLine;
-  if (initOnlyOutput) {
-    const modelHint = model ? ` (model: ${model})` : "";
-    return `Claude started but did not produce a result${modelHint} — check API credentials, model support, and adapter config`;
+  const contentLine = firstContentLine(stdout);
+  if (contentLine) {
+    return appendPodCause(`Claude exited with code ${exitCode ?? -1}: ${contentLine}`, podState);
  }

-  return firstContentLine
-    ? `Claude exited with code ${exitCode ?? -1}: ${firstContentLine}`
-    : `Claude exited with code ${exitCode ?? -1}`;
+  if (isInitOnlyRun(model, stdout) && (exitCode ?? 0) !== 0) {
+    const modelHint = model ? ` (model: ${model})` : "";
+    return appendPodCause(
+      `Claude exited immediately after init${modelHint} (exit code ${exitCode ?? -1}) — the model may be unsupported or the session may have been rejected before producing output`,
+      podState,
+    );
+  }
+
+  const initOnlyOutput = stdout.trim() !== "" && model !== "";
+  if (initOnlyOutput) {
+    const modelHint = model ? ` (model: ${model})` : "";
+    return appendPodCause(
+      `Claude started but did not produce a result${modelHint} — check API credentials, model support, and adapter config`,
+      podState,
+    );
+  }
+
+  return appendPodCause(`Claude exited with code ${exitCode ?? -1}`, podState);
 }

 export type OrphanClassification =
@@ -531,11 +582,14 @@ async function readPodLogs(
 * is treated as a soft terminal: succeeded=false, timedOut=false, jobGone=true.
 * The caller should log this and fall through to stdout parsing.
 */
+type JobConditionSnapshot = { type?: string; status?: string; reason?: string; message?: string };
+
 async function waitForJobCompletion(
  namespace: string,
  jobName: string,
  timeoutMs: number,
  kubeconfigPath?: string,
+  observer?: { lastConditions: JobConditionSnapshot[] | null; pollCount: number },
 ): Promise<{ succeeded: boolean; timedOut: boolean; jobGone?: boolean }> {
  const batchApi = getBatchApi(kubeconfigPath);
  const deadline = timeoutMs > 0 ? Date.now() + timeoutMs : 0;
@@ -554,6 +608,12 @@ async function waitForJobCompletion(
      throw err;
    }
    const conditions = job.status?.conditions ?? [];
+    if (observer) {
+      observer.pollCount += 1;
+      observer.lastConditions = conditions.map((c) => ({
+        type: c.type, status: c.status, reason: c.reason, message: c.message,
+      }));
+    }

    const complete = conditions.find((c) => c.type === "Complete" && c.status === "True");
    if (complete) return { succeeded: true, timedOut: false };
@@ -574,16 +634,130 @@ async function waitForJobCompletion(
 * Get the exit code from the Job's pod.
 */
 async function getPodExitCode(namespace: string, jobName: string, kubeconfigPath?: string): Promise<number | null> {
+  const state = await getPodTerminatedState(namespace, jobName, kubeconfigPath);
+  return state?.exitCode ?? null;
+}
+
+/**
+ * The claude container's terminated state (exit code, reason, message,
+ * signal) as reported for the Job's pod.  Fields the API omits are null.
+ * Used by the no-result error path to explain *why* a run was truncated.
+ */
+export interface PodTerminatedState {
+  exitCode: number | null;
+  reason: string | null;
+  message: string | null;
+  signal: number | null;
+}
+
+/**
+ * Result of a pod-state lookup.  `state` is the terminated state when available;
+ * `phase` and `podMissing` give the caller enough context to render an honest
+ * truncation-cause message instead of guessing "likely deleted" (FAR-107).
+ */
+export interface PodLookupResult {
+  state: PodTerminatedState | null;
+  phase: string | null;
+  podMissing: boolean;
+}
+
+async function lookupPodState(
+  namespace: string,
+  jobName: string,
+  kubeconfigPath?: string,
+): Promise<PodLookupResult> {
  const coreApi = getCoreApi(kubeconfigPath);
  const podList = await coreApi.listNamespacedPod({
    namespace,
    labelSelector: `job-name=${jobName}`,
  });
  const pod = podList.items[0];
-  if (!pod) return null;
+  if (!pod) return { state: null, phase: null, podMissing: true };

+  const phase = pod.status?.phase ?? null;
  const containerStatus = pod.status?.containerStatuses?.find((s) => s.name === "claude");
-  return containerStatus?.state?.terminated?.exitCode ?? null;
+  const terminated = containerStatus?.state?.terminated;
+  if (!terminated) return { state: null, phase, podMissing: false };
+  return {
+    state: {
+      exitCode: terminated.exitCode ?? null,
+      reason: terminated.reason ?? null,
+      message: (terminated.message ?? "").trim() || null,
+      signal: terminated.signal ?? null,
+    },
+    phase,
+    podMissing: false,
+  };
+}
+
+/**
+ * Read the claude container's terminated state, retrying briefly when the pod
+ * exists in a terminal phase but kubelet has not yet propagated the
+ * containerStatuses[].state.terminated field.  Without this retry, fast
+ * truncated-stream exits surface as "pod state unavailable" (FAR-107) and
+ * mask the real exit code / OOMKilled / SIGTERM cause.
+ */
+async function getPodLookupWithRetry(
+  namespace: string,
+  jobName: string,
+  kubeconfigPath?: string,
+  attempts = 4,
+  delayMs = 500,
+): Promise<PodLookupResult> {
+  let last: PodLookupResult = { state: null, phase: null, podMissing: true };
+  for (let i = 0; i < attempts; i++) {
+    last = await lookupPodState(namespace, jobName, kubeconfigPath);
+    if (last.state) return last;
+    if (last.podMissing) return last;
+    // Pod exists but no terminated state.  If it is in a terminal phase the
+    // containerStatuses update is in flight — wait briefly and retry.  If it
+    // is still Running/Pending, retrying is unlikely to help, so bail.
+    if (last.phase !== "Succeeded" && last.phase !== "Failed") return last;
+    if (i < attempts - 1) await new Promise((r) => setTimeout(r, delayMs));
+  }
+  return last;
+}
+
+async function getPodTerminatedState(
+  namespace: string,
+  jobName: string,
+  kubeconfigPath?: string,
+): Promise<PodTerminatedState | null> {
+  return (await lookupPodState(namespace, jobName, kubeconfigPath)).state;
+}
+
+/**
+ * Format a human-readable explanation for a truncated run, including the
+ * pod's claude-container terminated state when available. Exit code 137
+ * is annotated as SIGKILL/OOM since that is the most common cause.
+ * Exported for unit tests.
+ */
+export function describeTruncationCause(
+  state: PodTerminatedState | null,
+  lookup?: PodLookupResult,
+): string {
+  if (!state) {
+    if (lookup?.podMissing) {
+      return "pod is gone — Job pod was removed (eviction, preemption, or external delete) before exit could be read";
+    }
+    if (lookup && !lookup.podMissing) {
+      const phaseHint = lookup.phase ? `pod phase=${lookup.phase}` : "pod present";
+      return `container terminated state not yet observable (${phaseHint}) — kubelet status update did not land within retry window; exit cause unknown`;
+    }
+    return "pod state unavailable — exit cause unknown";
+  }
+  const parts: string[] = [];
+  if (state.exitCode !== null) {
+    parts.push(`exit code ${state.exitCode}`);
+    if (state.exitCode === 137) parts.push("SIGKILL (commonly OOMKilled)");
+    else if (state.exitCode === 143) parts.push("SIGTERM");
+  } else {
+    parts.push("no exit code");
+  }
+  if (state.signal !== null) parts.push(`signal ${state.signal}`);
+  if (state.reason) parts.push(`reason=${state.reason}`);
+  if (state.message) parts.push(`message=${state.message}`);
+  return parts.join(", ");
 }

 /**
@@ -998,6 +1172,7 @@ export async function execute(ctx: AdapterExecutionContext): Promise<AdapterExec

  let stdout = "";
  let exitCode: number | null = null;
+  let podTerminatedState: PodTerminatedState | null = null;
  let jobTimedOut = false;
  let keepaliveTimer: ReturnType<typeof setInterval> | null = null;
  // Set when we return a mismatch error so the finally block knows not to
@@ -1006,6 +1181,17 @@ export async function execute(ctx: AdapterExecutionContext): Promise<AdapterExec
  // Set when the job disappeared (404) or grace-timer fired before we saw a
  // terminal condition — used to emit a clearer error when stdout parsing fails.
  let jobDeletedExternally = false;
+  // Forensics for k8s_job_deleted_externally — captures which of the three
+  // detection paths observed the 404, the last successful Job-condition read
+  // before deletion, and timing.  Surfaced in the error message so the next
+  // occurrence is self-diagnosing instead of opaque (FAR-107).
+  let jobGoneDetectionPath: string | null = null;
+  let jobGoneAt: number | null = null;
+  const jobObserver: { lastConditions: JobConditionSnapshot[] | null; pollCount: number } = {
+    lastConditions: null,
+    pollCount: 0,
+  };
+  let podRunningAt: number | null = null;

  const activeJobRef: ActiveJobRef = {
    namespace,
@@ -1038,6 +1224,7 @@ export async function execute(ctx: AdapterExecutionContext): Promise<AdapterExec
        podName = await waitForPod(namespace, jobName, scheduleTimeoutMs, onLog, kubeconfigPath);
        await onLog("stdout", `[paperclip] Pod running: ${podName}\n`);
      }
+      podRunningAt = Date.now();

    } catch (err) {
      const msg = err instanceof Error ? err.message : String(err);
@@ -1173,7 +1360,7 @@ export async function execute(ctx: AdapterExecutionContext): Promise<AdapterExec
    // while streamPodLogs reconnects, holding execute() open for minutes.
    // logStopSignal.stopped is set on every settled path (fulfilled, rejected,
    // or grace) so streamPodLogs stops reconnecting promptly.
-    type CompletionResult = { succeeded: boolean; timedOut: boolean; jobGone?: boolean };
+    type CompletionResult = { succeeded: boolean; timedOut: boolean; jobGone?: boolean; gracePeriodFired?: boolean };
    let gracePoller: ReturnType<typeof setInterval> | null = null;
    const completionWithGrace = new Promise<CompletionResult>((resolve, reject) => {
      let settled = false;
@@ -1191,11 +1378,37 @@ export async function execute(ctx: AdapterExecutionContext): Promise<AdapterExec
        logStopSignal.stopped = true;
        reject(err);
      };
-      waitForJobCompletion(namespace, jobName, completionTimeoutMs, kubeconfigPath).then(settleOk).catch(settleErr);
+      waitForJobCompletion(namespace, jobName, completionTimeoutMs, kubeconfigPath, jobObserver).then(settleOk).catch(settleErr);
      gracePoller = setInterval(() => {
        if (logExitTime !== null && Date.now() - logExitTime >= LOG_EXIT_COMPLETION_GRACE_MS) {
-          void onLog("stdout", `[paperclip] Log stream exited ${LOG_EXIT_COMPLETION_GRACE_MS / 1000}s ago without K8s Job condition update — proceeding with captured output (FAR-23)\n`).catch(() => {});
-          settleOk({ succeeded: false, timedOut: false, jobGone: true });
+          // Stop the grace poller immediately so we don't double-fire while the
+          // verification read below is in flight.
+          if (gracePoller) { clearInterval(gracePoller); gracePoller = null; }
+          // The log stream exiting only means the container stopped producing
+          // output — it does NOT prove the Job was deleted.  Verify Job
+          // presence with a one-shot read so we can distinguish:
+          //   (a) Job 404 → truly gone (TTL or external deletion)
+          //   (b) Job still present → K8s condition propagation lag (FAR-23)
+          // Without this check we mis-classify (b) as "deleted externally" and
+          // emit a false-positive k8s_job_deleted_externally error (FAR-107).
+          void (async () => {
+            try {
+              await getBatchApi(kubeconfigPath).readNamespacedJob({ name: jobName, namespace });
+              await onLog("stdout", `[paperclip] Log stream exited ${LOG_EXIT_COMPLETION_GRACE_MS / 1000}s ago without K8s Job condition update; Job ${jobName} still present — proceeding with captured output (FAR-23)\n`).catch(() => {});
+              settleOk({ succeeded: false, timedOut: false, gracePeriodFired: true });
+            } catch (err: unknown) {
+              if (isK8s404(err)) {
+                jobGoneDetectionPath = "grace-period-verify-404";
+                jobGoneAt = Date.now();
+                await onLog("stdout", `[paperclip] Log stream exited ${LOG_EXIT_COMPLETION_GRACE_MS / 1000}s ago and Job ${jobName} is gone (TTL or external deletion) — proceeding with captured output (FAR-23)\n`).catch(() => {});
+                settleOk({ succeeded: false, timedOut: false, jobGone: true });
+              } else {
+                // K8s API hiccup — bail out without claiming external deletion.
+                await onLog("stdout", `[paperclip] Log stream exited ${LOG_EXIT_COMPLETION_GRACE_MS / 1000}s ago; Job state unverifiable (${err instanceof Error ? err.message : String(err)}) — proceeding with captured output (FAR-23)\n`).catch(() => {});
+                settleOk({ succeeded: false, timedOut: false, gracePeriodFired: true });
+              }
+            }
+          })();
        }
      }, 1_000);
    });
@@ -1263,6 +1476,10 @@ export async function execute(ctx: AdapterExecutionContext): Promise<AdapterExec
        // completion), so log streaming has captured the full output — continue
        // to stdout parsing rather than returning an error.
        jobDeletedExternally = true;
+        if (!jobGoneDetectionPath) {
+          jobGoneDetectionPath = "completion-poll-404";
+          jobGoneAt = Date.now();
+        }
        await onLog("stdout", `[paperclip] Job ${jobName} was deleted before terminal condition was observed (TTL or external deletion) — proceeding with captured output.\n`);
      }
    } else {
@@ -1271,7 +1488,7 @@ export async function execute(ctx: AdapterExecutionContext): Promise<AdapterExec
      // (60s) so we don't hang the heartbeat indefinitely if the K8s API is degraded.
      jobTimedOut = false;
      const RECHECK_TIMEOUT_MS = 60_000;
-      const actualState = await waitForJobCompletion(namespace, jobName, RECHECK_TIMEOUT_MS, kubeconfigPath);
+      const actualState = await waitForJobCompletion(namespace, jobName, RECHECK_TIMEOUT_MS, kubeconfigPath, jobObserver);
      if (actualState.timedOut) {
        // Re-check itself timed out — the job may still be running.
        // Return an error so the UI knows the run is not done.
@@ -1280,6 +1497,10 @@ export async function execute(ctx: AdapterExecutionContext): Promise<AdapterExec
        // Job was deleted before we could confirm terminal state — same as the
        // fulfilled+jobGone case above: proceed with captured output.
        jobDeletedExternally = true;
+        if (!jobGoneDetectionPath) {
+          jobGoneDetectionPath = "recheck-poll-404";
+          jobGoneAt = Date.now();
+        }
        await onLog("stdout", `[paperclip] Job ${jobName} was deleted before terminal condition was observed (TTL or external deletion) — proceeding with captured output.\n`);
      } else if (!actualState.succeeded) {
        // Job still not terminal — the completion error was likely transient.
@@ -1297,7 +1518,8 @@ export async function execute(ctx: AdapterExecutionContext): Promise<AdapterExec
      }
    }

-    exitCode = await getPodExitCode(namespace, jobName, kubeconfigPath);
+    podTerminatedState = await getPodTerminatedState(namespace, jobName, kubeconfigPath);
+    exitCode = podTerminatedState?.exitCode ?? null;
  } finally {
    if (keepaliveTimer) clearInterval(keepaliveTimer);
    activeJobs.delete(activeJobRef);
@@ -1348,11 +1570,35 @@ export async function execute(ctx: AdapterExecutionContext): Promise<AdapterExec

  if (!parsed) {
    if (jobDeletedExternally && exitCode === null) {
+      // Forensic context (FAR-107): users sometimes see this error when nothing
+      // actually deleted the Job manually.  Surface enough state in the message
+      // to distinguish self-delete (SIGTERM/cancel), TTL-after-completion, and
+      // genuine external deletion without needing cluster shell access.
+      const detailParts: string[] = [];
+      if (jobGoneDetectionPath) detailParts.push(`detected_via=${jobGoneDetectionPath}`);
+      detailParts.push(`job=${jobName}`);
+      detailParts.push(`ns=${namespace}`);
+      if (podRunningAt !== null && jobGoneAt !== null) {
+        detailParts.push(`elapsed_since_pod_running=${Math.round((jobGoneAt - podRunningAt) / 1000)}s`);
+      }
+      detailParts.push(`completion_polls=${jobObserver.pollCount}`);
+      const lastConds = jobObserver.lastConditions;
+      if (lastConds && lastConds.length > 0) {
+        const summary = lastConds
+          .map((c) => `${c.type}=${c.status}${c.reason ? `(${c.reason})` : ""}`)
+          .join(",");
+        detailParts.push(`last_job_conditions=[${summary}]`);
+      } else {
+        detailParts.push("last_job_conditions=none_observed");
+      }
+      detailParts.push(`stdout_bytes=${stdout.length}`);
+      const stdoutLines = stdout.split("\n").filter((l) => l.trim()).length;
+      detailParts.push(`stdout_nonempty_lines=${stdoutLines}`);
      return {
        exitCode,
        signal: null,
        timedOut: false,
-        errorMessage: "K8s Job was deleted externally before Claude could complete",
+        errorMessage: `K8s Job was deleted externally before Claude could complete [${detailParts.join(", ")}]`,
        errorCode: "k8s_job_deleted_externally",
        resultJson: { stdout },
      };
@@ -1368,13 +1614,29 @@ export async function execute(ctx: AdapterExecutionContext): Promise<AdapterExec
      };
    }
    if (parsedStream.truncatedMidStream) {
-      const exitHint = exitCode === null ? "no exit code" : `exit code ${exitCode}`;
+      // Re-query pod state with retry — the initial single-shot read can lose
+      // to kubelet propagation lag and surface a useless "pod state unavailable"
+      // message that hides the real exit cause (OOMKilled, SIGTERM, etc).  The
+      // retry distinguishes pod-genuinely-gone from terminated-state-lag and
+      // gives the operator the actual exit code/reason where possible (FAR-107).
+      let lookup: PodLookupResult | undefined;
+      let refreshedState = podTerminatedState;
+      try {
+        lookup = await getPodLookupWithRetry(namespace, jobName, kubeconfigPath);
+        refreshedState = lookup.state;
+        if (refreshedState && refreshedState.exitCode !== null) {
+          exitCode = refreshedState.exitCode;
+        }
+      } catch (err) {
+        await onLog("stderr", `[paperclip] truncation diagnostic: pod re-query failed (${err instanceof Error ? err.message : String(err)})\n`).catch(() => {});
+      }
+      const cause = describeTruncationCause(refreshedState, lookup);
      const modelHint = parsedStream.model ? ` (model: ${parsedStream.model})` : "";
      return {
        exitCode,
        signal: null,
        timedOut: false,
-        errorMessage: `Claude run was truncated mid-stream${modelHint} — assistant produced content but no result event arrived (${exitHint}); pod may have been terminated, OOMKilled, or the CLI crashed`,
+        errorMessage: `Claude run was truncated mid-stream${modelHint} — assistant produced content but no result event arrived; ${cause}`,
        errorCode: "claude_truncated",
        resultJson: { stdout },
      };
@@ -1383,7 +1645,7 @@ export async function execute(ctx: AdapterExecutionContext): Promise<AdapterExec
      exitCode,
      signal: null,
      timedOut: false,
-      errorMessage: buildPartialRunError(exitCode, parsedStream.model, stdout),
+      errorMessage: buildPartialRunError(exitCode, parsedStream.model, stdout, podTerminatedState),
      resultJson: { stdout },
    };
  }
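
The truncation path above calls `getPodLookupWithRetry` and `describeTruncationCause`, whose bodies are not shown in these hunks. The following is a minimal sketch of the decision logic the commit message describes, not the adapter's actual implementation. The field names `phase` and `podMissing`, the helper names `shouldRetryLookup` and `describeCause`, and the exact message wording are assumptions made for illustration; only the three-way distinction and the 137/143 exit-code hints come from the commit text.

```typescript
// Hypothetical shapes: the source only says the lookup result carries a
// terminated state, the pod phase, and a podMissing flag.
interface TerminatedState {
  exitCode: number | null;
  reason?: string;
  message?: string;
}

interface PodLookup {
  state: TerminatedState | null;
  phase: string | null;
  podMissing: boolean;
}

const TERMINAL_PHASES = new Set(["Succeeded", "Failed"]);

// Retry only when the pod reports a terminal phase but the container's
// terminated state has not been published yet. A missing pod or a
// non-terminal phase returns false so the retry budget is not spent.
function shouldRetryLookup(lookup: PodLookup): boolean {
  if (lookup.podMissing || lookup.state !== null) return false;
  return lookup.phase !== null && TERMINAL_PHASES.has(lookup.phase);
}

// Map a lookup result to one of the three cause descriptions.
function describeCause(lookup: PodLookup): string {
  if (lookup.podMissing) {
    return "pod is gone (eviction, preemption, or external delete)";
  }
  if (lookup.state === null) {
    return `container terminated state not yet observable (pod phase=${lookup.phase ?? "unknown"})`;
  }
  const { exitCode, reason, message } = lookup.state;
  const parts: string[] = [exitCode === null ? "no exit code" : `exit code ${exitCode}`];
  if (exitCode === 137) parts.push("SIGKILL (commonly OOMKilled)");
  if (exitCode === 143) parts.push("SIGTERM");
  if (reason) parts.push(`reason=${reason}`);
  if (message) parts.push(`message=${message}`);
  return parts.join(", ");
}
```

A caller would loop while `shouldRetryLookup` returns true, sleeping between attempts (the commit message cites up to 4 attempts at 500 ms), then pass the final result to `describeCause`.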
Author	SHA1	Message	Date
Chris Farhood	fd7dce7239	0.1.52	2026-04-27 00:00:57 +00:00
Chris Farhood	b1878c684e	fix: retry-aware pod state lookup + honest truncation cause messages (FAR-107) The single-shot getPodTerminatedState query lost a real race against kubelet's containerStatus update: when Claude exited cleanly but quickly, listNamespacedPod often returned the pod with phase=Succeeded/Failed but without a populated state.terminated, so describeTruncationCause fell into the catch-all "pod state unavailable — likely deleted before exit could be read" branch. That message is doubly wrong: the pod was not deleted and the exit cause was readable a few hundred ms later. Operators chasing claude_truncated runs (Nancy/Privileged Escalation) had no visibility into the actual exit code, OOMKilled flag, or reason. Two changes: 1. Introduce lookupPodState + getPodLookupWithRetry — the lookup result carries the pod phase and a podMissing flag, and retries up to 4×500ms when the pod is in a terminal phase but containerStatuses lag. When the pod is in a non-terminal phase or genuinely gone we bail immediately without burning the retry budget. 2. describeTruncationCause now distinguishes three states: - "pod is gone" (eviction, preemption, external delete) - "container terminated state not yet observable (pod phase=…)" - the existing populated-state path with exit code / reason / signal The truncation error path re-queries with the retry-aware lookup right before producing the message, so subsequent claude_truncated errors surface the actual exit cause (137=OOMKilled, 143=SIGTERM, kubelet reason text) instead of a misleading deletion claim. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-27 00:00:56 +00:00
Chris Farhood	83e105393c	0.1.51	2026-04-26 21:24:15 +00:00
Chris Farhood	49288fa5c7	fix: scope cancel-polling to explicit cancellation states only (FAR-107) shouldAbortForCancellation previously treated any non-`running` runStatus as a cancellation signal — which made the keepalive's cancel-poll delete the K8s Job whenever the heartbeat-runs API briefly returned a transient or stale status (e.g. queued, pending, succeeded, failed, completed, unknown) for an in-flight run. The follow-up `waitForJobCompletion` poll then observed the 404 and surfaced a spurious `k8s_job_deleted_externally` error to the user, even though no human or external system deleted the Job. Privileged Escalation's "null-pointer-nancy" agent reproduced this on runs that were never cancelled and were not adjacent to a paperclip restart, ruling out the SIGTERM path that 0.1.50 already addressed. Tighten the guard to fire only on `cancelled` / `cancelling`. Other terminal statuses are unreachable while the adapter is still executing (the adapter's own return is what flips them) and even if observed mid-run, they do not justify deleting a Job that may still be doing real work — the natural completion path will tear it down. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-26 21:24:11 +00:00
Chris Farhood	dae9e18659	0.1.50	2026-04-26 21:19:03 +00:00
Chris Farhood	6923597b31	fix: do not delete active Jobs on SIGTERM — leave for orphan reattach (FAR-107) Root cause of Nancy's k8s_job_deleted_externally false positive: the paperclip server itself receives SIGTERM during rolling deploys, evictions, scale-down, etc. The previous SIGTERM handler iterated activeJobs and deleted every Job before exiting, which surfaced in the in-flight heartbeat as "K8s Job was deleted externally" — even though nothing external touched it. With reattachOrphanedJobs=true (default), this is exactly the wrong behaviour: leaving the Jobs alive lets the next paperclip process discover them via the orphan-classification path and reattach their log streams. With reattachOrphanedJobs=false the operator opted into manual cleanup, so we still must not auto-delete. The Job's ownerReference (FAR-15) keeps the prompt Secret tied to the Job, so both survive together and TTL handles cleanup on natural completion. Test rewritten to assert the new contract: SIGTERM must not touch K8s Jobs. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-26 21:19:02 +00:00
Chris Farhood	d184a1732b	0.1.49	2026-04-26 21:06:19 +00:00
Chris Farhood	be84428226	fix: enrich k8s_job_deleted_externally error with forensics + verify Job presence on grace fire (FAR-107) The error previously fired with no diagnostic context, making it impossible to distinguish (a) self-delete by our SIGTERM/cancel path, (b) TTL after a missed Complete condition, or (c) actual external deletion without cluster shell access. Two changes: 1. Grace-period verification: when the log stream exits and the 30s grace timer fires, do a one-shot readNamespacedJob before declaring the Job gone. If it's still there, settle as gracePeriodFired (not jobGone) so we don't mis-classify K8s condition propagation lag as deletion. 2. Forensic capture: track which of the three detection paths (completion-poll-404, grace-period-verify-404, recheck-poll-404) first observed the 404, the last successful Job conditions read, the poll count, elapsed time since pod-running, and stdout size. Append all of it to the errorMessage so the next occurrence is self-diagnosing. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-26 21:05:04 +00:00
Chris Farhood	d9928030d6	0.1.48	2026-04-26 14:48:22 +00:00
Chris Farhood	76fc6fcdfc	fix: surface pod terminated reason/message in adapter_failed errors (FAR-100) The init-only and partial-run error paths now embed the K8s container terminated state (reason, message, signal, OOM hint) directly in the errorMessage. This eliminates the kubectl round-trip when diagnosing adapter_failed runs — the surfaced error self-explains. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-26 14:48:12 +00:00
Chris Farhood	3169f49f23	0.1.47	2026-04-26 13:04:54 +00:00
Chris Farhood	e0b35d230f	fix: distinguish init-only non-zero exits in buildPartialRunError (FAR-100) Init-only runs that exit with a non-zero code now surface a more actionable message naming the exit code and the likely cause (unsupported model or rejected session) instead of the generic "did not produce a result" text. Helps operators diagnose model-id / billing-tier failures (e.g. opus 4.6). Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-26 13:04:43 +00:00
Chris Farhood	4e2c36319d	0.1.46	2026-04-26 01:57:43 +00:00
Chris Farhood	8474f78fe1	fix: include pod terminated reason/message in claude_truncated error (FAR-95) Capture the claude container's terminated state (exit code, reason, message, signal) and surface it in the truncation error so operators see why the run was cut short — e.g. "exit code 137, SIGKILL (commonly OOMKilled), reason=OOMKilled, message=Memory cgroup out of memory" instead of just a "truncated" label with no diagnostic context. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-26 01:57:43 +00:00
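
Commit 49288fa5c7 in the list above narrows the cancel-poll guard so that only explicit cancellation states delete the Job. The sketch below contrasts the old and new rules as that commit message describes them. The function names `legacyShouldAbort` and `isExplicitCancellation` are hypothetical, and the real `shouldAbortForCancellation` may take different arguments; only the status values `running`, `cancelled`, and `cancelling` come from the source.

```typescript
// Statuses that count as an explicit cancellation request, per the commit message.
const CANCEL_STATUSES: ReadonlySet<string> = new Set(["cancelled", "cancelling"]);

// Old rule as described in the commit message: anything other than "running"
// aborted the run, so a transient "queued" or "pending" reading deleted a live Job.
function legacyShouldAbort(runStatus: string): boolean {
  return runStatus !== "running";
}

// New rule: abort only on an explicit cancellation status. Missing or
// unrecognised statuses are treated as "keep running".
function isExplicitCancellation(runStatus: string | null | undefined): boolean {
  return typeof runStatus === "string" && CANCEL_STATUSES.has(runStatus);
}
```

Under the new rule a stale `queued` or `succeeded` reading leaves the Job alone, which removes one source of the spurious `k8s_job_deleted_externally` error.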