Compare commits

...

13 Commits

Author SHA1 Message Date
Chris Farhood 1d894f104f fix(models): expose static models list so UI renders entries before listModels resolves
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-27 07:42:44 -04:00
Chris Farhood fc3866924a 0.1.54 2026-04-27 00:38:06 +00:00
Chris Farhood 368254d75d fix: per-chunk activity tracking + pod-phase gate on grace timer (FAR-107)
The 0.1.53 fix tracked stream liveness by updating lastActiveAt only
after streamPodLogsOnce returned.  That worked for the
disconnect-then-reconnect-then-disconnect case, but missed the
disconnect-then-long-running-reconnect case: a streaming attempt that
runs for minutes without disconnecting never refreshes lastActiveAt,
so the grace timer fires 30s after the prior disconnect even though
the new attempt is currently producing output.  Nancy reproduced
exactly this on 0.1.53 — claude_truncated with pod phase=Running.

Two changes:

1. streamPodLogsOnce now accepts the activity ref and updates
   lastActiveAt inside its writable's write handler — every chunk
   delivered from the container refreshes the timer in real time,
   not just on stream return.

2. Before the grace timer settles, gate on pod phase: if the pod is
   still Running or Pending, the container is alive (Claude's
   long tool-use silences exceed 30s for slow upstream APIs).
   Refresh lastActiveAt, leave the poller armed, and let
   waitForJobCompletion remain the authoritative termination
   signal.  Only proceed with the grace settlement when the pod
   has actually reached a terminal phase or is gone.

The original FAR-23 fast-path (container exits, Job condition lags)
still works: when the container terminates, pod phase moves to
Succeeded/Failed and the gate falls through to the existing
Job-presence check.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-27 00:38:06 +00:00
Chris Farhood 34756f8215 0.1.53 2026-04-27 00:28:45 +00:00
Chris Farhood 07ef106c66 fix: gate grace timer on stream-output silence, not first disconnect (FAR-107)
The 30s grace timer that bounds K8s Job condition propagation lag was
armed by streamPodLogs's onFirstStreamExit callback the moment
streamPodLogsOnce returned for the first time.  A transient K8s log-API
disconnect mid-run also returns from streamPodLogsOnce — so the grace
timer fired 30s later regardless of whether streamPodLogs had already
reconnected and the container was still producing output.

Nancy / Privileged Escalation reproduced this on long Opus-4-6 runs:
the prod paperclip pod was stable, the cancel-poll guard was already
narrowed in 0.1.51, but every long run truncated with claude_truncated
+ "container terminated state not yet observable (pod phase=Running)"
because the run was being abandoned mid-output.

Replace the boolean onFirstStreamExit signal with a streamActivity ref
carrying lastActiveAt + streamHasExited.  streamPodLogs refreshes
lastActiveAt every time a streamPodLogsOnce attempt returns non-empty
output, so reconnects that resume real output keep the grace clock
reset.  The grace timer fires only once the stream has exited at least
once AND no chunk has arrived for the full grace window — which
preserves the original FAR-23 behaviour (container truly exited but
Job condition lags) while ending the false-truncation of healthy
streams.  Adds a regression test that asserts a stream drop + reconnect
+ deferred Job completion does not surface as truncated.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-27 00:28:44 +00:00
Chris Farhood fd7dce7239 0.1.52 2026-04-27 00:00:57 +00:00
Chris Farhood b1878c684e fix: retry-aware pod state lookup + honest truncation cause messages (FAR-107)
The single-shot getPodTerminatedState query lost a real race against
kubelet's containerStatus update: when Claude exited cleanly but quickly,
listNamespacedPod often returned the pod with phase=Succeeded/Failed but
without a populated state.terminated, so describeTruncationCause fell into
the catch-all "pod state unavailable — likely deleted before exit could
be read" branch.  That message is doubly wrong: the pod was not deleted
and the exit cause was readable a few hundred ms later.  Operators
chasing claude_truncated runs (Nancy/Privileged Escalation) had no
visibility into the actual exit code, OOMKilled flag, or reason.

Two changes:

1. Introduce lookupPodState + getPodLookupWithRetry — the lookup result
   carries the pod phase and a podMissing flag, and retries up to 4×500ms
   when the pod is in a terminal phase but containerStatuses lag.  When
   the pod is in a non-terminal phase or genuinely gone we bail
   immediately without burning the retry budget.

2. describeTruncationCause now distinguishes three states:
   - "pod is gone" (eviction, preemption, external delete)
   - "container terminated state not yet observable (pod phase=…)"
   - the existing populated-state path with exit code / reason / signal

The truncation error path re-queries with the retry-aware lookup right
before producing the message, so subsequent claude_truncated errors
surface the actual exit cause (137=OOMKilled, 143=SIGTERM, kubelet
reason text) instead of a misleading deletion claim.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-27 00:00:56 +00:00
Chris Farhood 83e105393c 0.1.51 2026-04-26 21:24:15 +00:00
Chris Farhood 49288fa5c7 fix: scope cancel-polling to explicit cancellation states only (FAR-107)
shouldAbortForCancellation previously treated any non-`running` runStatus
as a cancellation signal — which made the keepalive's cancel-poll delete
the K8s Job whenever the heartbeat-runs API briefly returned a transient
or stale status (e.g. queued, pending, succeeded, failed, completed,
unknown) for an in-flight run.  The follow-up `waitForJobCompletion`
poll then observed the 404 and surfaced a spurious
`k8s_job_deleted_externally` error to the user, even though no human
or external system deleted the Job.

Privileged Escalation's "null-pointer-nancy" agent reproduced this on
runs that were never cancelled and were not adjacent to a paperclip
restart, ruling out the SIGTERM path that 0.1.50 already addressed.

Tighten the guard to fire only on `cancelled` / `cancelling`.  Other
terminal statuses are unreachable while the adapter is still executing
(the adapter's own return is what flips them) and even if observed
mid-run, they do not justify deleting a Job that may still be doing
real work — the natural completion path will tear it down.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-26 21:24:11 +00:00
Chris Farhood dae9e18659 0.1.50 2026-04-26 21:19:03 +00:00
Chris Farhood 6923597b31 fix: do not delete active Jobs on SIGTERM — leave for orphan reattach (FAR-107)
Root cause of Nancy's k8s_job_deleted_externally false positive: the
paperclip server itself receives SIGTERM during rolling deploys,
evictions, scale-down, etc.  The previous SIGTERM handler iterated
activeJobs and deleted every Job before exiting, which surfaced in the
in-flight heartbeat as "K8s Job was deleted externally" — even though
nothing external touched it.

With reattachOrphanedJobs=true (default), this is exactly the wrong
behaviour: leaving the Jobs alive lets the next paperclip process
discover them via the orphan-classification path and reattach their
log streams.  With reattachOrphanedJobs=false the operator opted into
manual cleanup, so we still must not auto-delete.

The Job's ownerReference (FAR-15) keeps the prompt Secret tied to the
Job, so both survive together and TTL handles cleanup on natural
completion.  Test rewritten to assert the new contract: SIGTERM must
not touch K8s Jobs.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-26 21:19:02 +00:00
Chris Farhood d184a1732b 0.1.49 2026-04-26 21:06:19 +00:00
Chris Farhood be84428226 fix: enrich k8s_job_deleted_externally error with forensics + verify Job presence on grace fire (FAR-107)
The error previously fired with no diagnostic context, making it impossible
to distinguish (a) self-delete by our SIGTERM/cancel path, (b) TTL after a
missed Complete condition, or (c) actual external deletion without cluster
shell access.  Two changes:

1. Grace-period verification: when the log stream exits and the 30s grace
   timer fires, do a one-shot readNamespacedJob before declaring the Job
   gone.  If it's still there, settle as gracePeriodFired (not jobGone) so
   we don't mis-classify K8s condition propagation lag as deletion.

2. Forensic capture: track which of the three detection paths
   (completion-poll-404, grace-period-verify-404, recheck-poll-404)
   first observed the 404, the last successful Job conditions read, the
   poll count, elapsed time since pod-running, and stdout size.  Append
   all of it to the errorMessage so the next occurrence is self-diagnosing.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-26 21:05:04 +00:00
7 changed files with 364 additions and 83 deletions
+2 -2
View File
@@ -1,12 +1,12 @@
{
"name": "paperclip-adapter-claude-k8s",
"version": "0.1.48",
"version": "0.1.54",
"lockfileVersion": 3,
"requires": true,
"packages": {
"": {
"name": "paperclip-adapter-claude-k8s",
"version": "0.1.48",
"version": "0.1.54",
"license": "MIT",
"dependencies": {
"@kubernetes/client-node": "^1.0.0",
+1 -1
View File
@@ -1,6 +1,6 @@
{
"name": "paperclip-adapter-claude-k8s",
"version": "0.1.48",
"version": "0.1.55",
"description": "Paperclip adapter plugin that runs Claude Code agents as Kubernetes Jobs",
"license": "MIT",
"repository": {
+2 -1
View File
@@ -1,7 +1,8 @@
export const type = "claude_k8s";
export const label = "Claude (Kubernetes)";
export const models: undefined = undefined;
import { DIRECT_MODELS, BEDROCK_MODELS, isBedrockEnv } from "./server/models.js";
export const models = isBedrockEnv() ? BEDROCK_MODELS : DIRECT_MODELS;
export const agentConfigurationDoc = `# claude_k8s agent configuration
+72 -13
View File
@@ -1019,7 +1019,8 @@ describe("execute: happy path", () => {
const result = await executePromise;
expect(result.errorCode).toBe("k8s_job_deleted_externally");
expect(result.errorMessage).toBe("K8s Job was deleted externally before Claude could complete");
expect(result.errorMessage).toMatch(/^K8s Job was deleted externally before Claude could complete \[/);
expect(result.errorMessage).toContain("detected_via=");
expect(result.exitCode).toBeNull();
});
@@ -1511,6 +1512,54 @@ describe("execute: log-stream-exit grace period (FAR-23)", () => {
// (grace did not fire, real completion arrived)
expect(result.errorMessage).toBeNull();
});
it("does NOT fire grace when stream drops mid-output and reconnects with more output (FAR-107)", async () => {
// Reproduces Nancy / Privileged Escalation symptom: the K8s log API drops
// the streaming connection mid-run; streamPodLogs reconnects and the
// container is still producing. Before the fix, the grace timer was
// armed on first stream exit and fired 30s later regardless of whether
// output had resumed, surfacing claude_truncated even though the pod was
// still phase=Running.
let attemptIndex = 0;
mockLogFn.mockImplementation(
async (_ns: string, _pod: string, _ctr: string, writable: import("node:stream").Writable) => {
if (attemptIndex === 0) {
// Stream a partial init line then "drop" the connection without a
// result event — this is the transient API disconnect.
writable.write(JSON.stringify({ type: "system", subtype: "init", model: "claude-sonnet-4-6", session_id: "sess_test123" }) + "\n");
attemptIndex++;
return;
}
// Reconnect produces the rest of the stream including the result event.
writable.write(CLAUDE_HAPPY_OUTPUT);
},
);
// Job condition arrives only after the reconnect produces output, well
// beyond the 30s grace window; the old code would have grace-fired at
// ~30s and treated the run as truncated.
let readJobCalls = 0;
mockBatchReadJob.mockImplementation(async () => {
readJobCalls++;
// Stay non-terminal until the reconnect has had time to run and the
// grace window has fully elapsed since the FIRST disconnect.
if (readJobCalls < 25) return { status: { conditions: [] } };
return { status: { conditions: [{ type: "Complete", status: "True" }] } };
});
const executePromise = execute(makeCtx());
// t=3000: first reconnect sleep fires → second streamPodLogsOnce attempt
await vi.advanceTimersByTimeAsync(3_100);
// Drive past the old (buggy) 30s grace boundary without firing real completion
await vi.advanceTimersByTimeAsync(35_000);
// Then let the Job's Complete condition land
await vi.advanceTimersByTimeAsync(20_000);
const result = await executePromise;
// Run completed normally — grace must not have falsely truncated it.
expect(result.exitCode).toBe(0);
expect(result.errorCode).toBeUndefined();
expect(result.sessionId).toBe("sess_test123");
}, 80_000);
});
// ─── execute: concurrency guard — multiple orphan sorting ────────────────────
@@ -1560,16 +1609,24 @@ describe("shouldAbortForCancellation", () => {
expect(shouldAbortForCancellation("cancelled")).toBe(true);
});
it("returns true when status is 'failed'", () => {
expect(shouldAbortForCancellation("failed")).toBe(true);
it("returns true when status is 'cancelling'", () => {
expect(shouldAbortForCancellation("cancelling")).toBe(true);
});
it("returns true when status is 'completed'", () => {
expect(shouldAbortForCancellation("completed")).toBe(true);
// FAR-107: terminal-but-not-cancelled statuses MUST NOT trigger Job deletion.
// The previous "anything but running" guard caused k8s_job_deleted_externally
// false positives for in-flight runs whenever the API briefly reported a
// transient/stale status.
it("returns false for non-cancellation terminal statuses (FAR-107)", () => {
expect(shouldAbortForCancellation("succeeded")).toBe(false);
expect(shouldAbortForCancellation("failed")).toBe(false);
expect(shouldAbortForCancellation("completed")).toBe(false);
});
it("returns true for any non-running non-empty string", () => {
expect(shouldAbortForCancellation("unknown")).toBe(true);
it("returns false for unknown statuses (FAR-107)", () => {
expect(shouldAbortForCancellation("unknown")).toBe(false);
expect(shouldAbortForCancellation("queued")).toBe(false);
expect(shouldAbortForCancellation("pending")).toBe(false);
});
});
@@ -1770,7 +1827,7 @@ describe("execute: SIGTERM handler best-effort cleanup", () => {
vi.useRealTimers();
});
it("deletes the active Job when SIGTERM fires during execution", async () => {
it("does NOT delete active Jobs on SIGTERM — leaves them for orphan reattach (FAR-107)", async () => {
// Mock process.kill to prevent the test process from actually being killed.
const killSpy = vi.spyOn(process, "kill").mockImplementation(() => true);
@@ -1781,17 +1838,19 @@ describe("execute: SIGTERM handler best-effort cleanup", () => {
// Flush microtasks through the async setup chain: getSelfPodInfo, listJobs,
// readSkillEntries, prepareBundle, createJob, onLog, activeJobs.add(), and
// ensureSigtermHandler() all complete before the try block enters streaming.
// 30 rounds is more than enough for the ~7 sequential await points.
for (let i = 0; i < 30; i++) await Promise.resolve();
// Emit SIGTERM — the process.once handler fires synchronously and kicks off
// async cleanup (deleteNamespacedJob). The mock resolves immediately.
// Reset deleteJob spy after setup so we can detect any SIGTERM-driven calls.
mockBatchDeleteJob.mockClear();
// Emit SIGTERM — the handler must re-raise to the default handler without
// touching the K8s Job. Deleting the Job here would surface as
// k8s_job_deleted_externally in the in-flight run (FAR-107).
process.emit("SIGTERM");
// Flush microtasks for deleteJob to resolve and the .then(process.kill) to run.
for (let i = 0; i < 10; i++) await Promise.resolve();
expect(mockBatchDeleteJob).toHaveBeenCalled();
expect(mockBatchDeleteJob).not.toHaveBeenCalled();
expect(killSpy).toHaveBeenCalledWith(process.pid, "SIGTERM");
killSpy.mockRestore();
+264 -62
View File
@@ -58,30 +58,20 @@ function ensureSigtermHandler(): void {
if (sigtermHandlerRegistered) return;
sigtermHandlerRegistered = true;
process.once("SIGTERM", () => {
const jobs = [...activeJobs];
void Promise.allSettled(
jobs.map(async (ref) => {
try {
const batchApi = getBatchApi(ref.kubeconfigPath);
await batchApi.deleteNamespacedJob({
name: ref.jobName,
namespace: ref.namespace,
body: { propagationPolicy: "Background" },
});
} catch { /* best-effort */ }
if (ref.promptSecretName && ref.promptSecretNamespace) {
try {
const coreApi = getCoreApi(ref.kubeconfigPath);
await coreApi.deleteNamespacedSecret({
name: ref.promptSecretName,
namespace: ref.promptSecretNamespace,
});
} catch { /* best-effort */ }
}
}),
).then(() => {
process.kill(process.pid, "SIGTERM");
});
// Do NOT delete active K8s Jobs on SIGTERM (FAR-107). Paperclip itself
// receives SIGTERM during rolling deploys, evictions, scale-down, etc.
// Deleting the Jobs we own there causes the in-flight heartbeat to surface
// a false-positive `k8s_job_deleted_externally` error and tears down work
// the user expected to keep running.
//
// The correct behaviour with `reattachOrphanedJobs=true` (default) is to
// leave the Jobs alive: the next paperclip process discovers them via the
// orphan-classification path and reattaches their log streams. When
// `reattachOrphanedJobs=false` the operator explicitly opted into manual
// cleanup and should not have us auto-deleting either. The owning Job's
// ownerReference (FAR-15) keeps the prompt Secret tied to the Job, so
// both survive together and TTL cleans them up after natural completion.
process.kill(process.pid, "SIGTERM");
});
}
@@ -100,13 +90,23 @@ export function isK8s404(err: unknown): boolean {
}
/**
* Returns true when the heartbeat-run status indicates the run is no longer
* active and the K8s Job should be cancelled.
* Returns true when the heartbeat-run status indicates the run was explicitly
* cancelled and the K8s Job must be torn down.
*
* Only `cancelled` / `cancelling` qualify. Treating any non-`running` status
* as cancellation (the previous behaviour) produced spurious
* k8s_job_deleted_externally errors for in-flight runs whenever the API
* briefly reported a transient or stale status — Nancy's runs at
* Privileged Escalation hit this without anyone actually cancelling them
* (FAR-107). Other terminal statuses (`succeeded`/`failed`/`completed`)
* are unreachable in practice while the adapter is still executing
* (the adapter's own return is what flips them) and even if observed,
* they do not warrant our deleting a Job that may still be doing work.
* Exported for unit tests.
*/
export function shouldAbortForCancellation(runStatus: string | undefined): boolean {
if (!runStatus) return false;
return runStatus !== "running";
return runStatus === "cancelled" || runStatus === "cancelling";
}
/**
@@ -401,6 +401,7 @@ export async function streamPodLogsOnce(
sinceSeconds?: number,
dedup?: LogLineDedupFilter,
stopSignal?: { stopped: boolean },
activity?: { lastActiveAt: number },
): Promise<string> {
const logApi = getLogApi(kubeconfigPath);
const chunks: string[] = [];
@@ -409,6 +410,13 @@ export async function streamPodLogsOnce(
write(chunk: Buffer, _encoding, callback) {
const text = chunk.toString("utf-8");
chunks.push(text);
// Refresh stream liveness on every chunk received from the container.
// This MUST happen here (not just after streamPodLogsOnce returns) —
// a streaming attempt that never disconnects can produce output for
// hours, and the grace timer in execute() will fire 30s after the
// FIRST disconnect even if a new long-running attempt is currently
// streaming, unless we keep this timestamp fresh per-chunk (FAR-107).
if (activity) activity.lastActiveAt = Date.now();
const emitted = dedup ? dedup.filter(text) : text;
if (!emitted) {
callback();
@@ -481,10 +489,18 @@ export async function streamPodLogsOnce(
* Capped at MAX_LOG_RECONNECT_ATTEMPTS to prevent infinite reconnect
* loops during sustained API partitions.
*
* onFirstStreamExit is called the first time streamPodLogsOnce returns
* (container has exited or stream disconnected). Used by execute() to
* start the LOG_EXIT_COMPLETION_GRACE_MS grace timer (FAR-23) without
* waiting for all reconnects to exhaust.
* `activity` tracks stream liveness so execute()'s grace timer can
* distinguish a transient K8s log-API reconnect from a real container
* exit (FAR-107). Two signals:
* - `streamHasExited` becomes true on the first return from
* streamPodLogsOnce. Until then we are still in the warm-up window
* and waitForJobCompletion is the authoritative signal — grace must
* not fire.
* - `lastActiveAt` advances every time a streamPodLogsOnce attempt
* returns non-empty output (the container is still producing).
* The grace timer fires only once GRACE_MS have passed since the
* last chunk, so output that resumes after a transient drop keeps
* the run alive.
*/
async function streamPodLogs(
namespace: string,
@@ -493,7 +509,7 @@ async function streamPodLogs(
kubeconfigPath?: string,
stopSignal?: { stopped: boolean },
dedup?: LogLineDedupFilter,
onFirstStreamExit?: () => void,
activity?: { lastActiveAt: number; streamHasExited: boolean },
): Promise<string> {
const allChunks: string[] = [];
let attempt = 0;
@@ -523,15 +539,16 @@ async function streamPodLogs(
}
const preStreamTs = Math.floor(Date.now() / 1000);
const result = await streamPodLogsOnce(namespace, podName, onLog, kubeconfigPath, sinceSeconds, dedup, stopSignal);
// Signal first stream exit immediately so the grace-period timer in
// execute() can start without waiting for all reconnects to complete.
if (attempt === 0) onFirstStreamExit?.();
const result = await streamPodLogsOnce(namespace, podName, onLog, kubeconfigPath, sinceSeconds, dedup, stopSignal, activity);
if (activity) activity.streamHasExited = true;
if (result) {
allChunks.push(result);
// Update last-received timestamp to now (the stream just ended,
// so any log lines in `result` were received up to this moment).
lastLogReceivedAt = Math.floor(Date.now() / 1000);
// Refresh stream liveness so the grace timer in execute() does not
// fire while output is still flowing through reconnects (FAR-107).
if (activity) activity.lastActiveAt = Date.now();
} else if (attempt === 0) {
// First attempt returned nothing — update timestamp so reconnect
// window stays reasonable.
@@ -582,11 +599,14 @@ async function readPodLogs(
* is treated as a soft terminal: succeeded=false, timedOut=false, jobGone=true.
* The caller should log this and fall through to stdout parsing.
*/
type JobConditionSnapshot = { type?: string; status?: string; reason?: string; message?: string };
async function waitForJobCompletion(
namespace: string,
jobName: string,
timeoutMs: number,
kubeconfigPath?: string,
observer?: { lastConditions: JobConditionSnapshot[] | null; pollCount: number },
): Promise<{ succeeded: boolean; timedOut: boolean; jobGone?: boolean }> {
const batchApi = getBatchApi(kubeconfigPath);
const deadline = timeoutMs > 0 ? Date.now() + timeoutMs : 0;
@@ -605,6 +625,12 @@ async function waitForJobCompletion(
throw err;
}
const conditions = job.status?.conditions ?? [];
if (observer) {
observer.pollCount += 1;
observer.lastConditions = conditions.map((c) => ({
type: c.type, status: c.status, reason: c.reason, message: c.message,
}));
}
const complete = conditions.find((c) => c.type === "Complete" && c.status === "True");
if (complete) return { succeeded: true, timedOut: false };
@@ -641,30 +667,82 @@ export interface PodTerminatedState {
signal: number | null;
}
async function getPodTerminatedState(
/**
* Result of a pod-state lookup. `state` is the terminated state when available;
* `phase` and `podMissing` give the caller enough context to render an honest
* truncation-cause message instead of guessing "likely deleted" (FAR-107).
*/
export interface PodLookupResult {
state: PodTerminatedState | null;
phase: string | null;
podMissing: boolean;
}
async function lookupPodState(
namespace: string,
jobName: string,
kubeconfigPath?: string,
): Promise<PodTerminatedState | null> {
): Promise<PodLookupResult> {
const coreApi = getCoreApi(kubeconfigPath);
const podList = await coreApi.listNamespacedPod({
namespace,
labelSelector: `job-name=${jobName}`,
});
const pod = podList.items[0];
if (!pod) return null;
if (!pod) return { state: null, phase: null, podMissing: true };
const phase = pod.status?.phase ?? null;
const containerStatus = pod.status?.containerStatuses?.find((s) => s.name === "claude");
const terminated = containerStatus?.state?.terminated;
if (!terminated) return null;
if (!terminated) return { state: null, phase, podMissing: false };
return {
exitCode: terminated.exitCode ?? null,
reason: terminated.reason ?? null,
message: (terminated.message ?? "").trim() || null,
signal: terminated.signal ?? null,
state: {
exitCode: terminated.exitCode ?? null,
reason: terminated.reason ?? null,
message: (terminated.message ?? "").trim() || null,
signal: terminated.signal ?? null,
},
phase,
podMissing: false,
};
}
/**
* Read the claude container's terminated state, retrying briefly when the pod
* exists in a terminal phase but kubelet has not yet propagated the
* containerStatuses[].state.terminated field. Without this retry, fast
* truncated-stream exits surface as "pod state unavailable" (FAR-107) and
* mask the real exit code / OOMKilled / SIGTERM cause.
*/
async function getPodLookupWithRetry(
namespace: string,
jobName: string,
kubeconfigPath?: string,
attempts = 4,
delayMs = 500,
): Promise<PodLookupResult> {
let last: PodLookupResult = { state: null, phase: null, podMissing: true };
for (let i = 0; i < attempts; i++) {
last = await lookupPodState(namespace, jobName, kubeconfigPath);
if (last.state) return last;
if (last.podMissing) return last;
// Pod exists but no terminated state. If it is in a terminal phase the
// containerStatuses update is in flight — wait briefly and retry. If it
// is still Running/Pending, retrying is unlikely to help, so bail.
if (last.phase !== "Succeeded" && last.phase !== "Failed") return last;
if (i < attempts - 1) await new Promise((r) => setTimeout(r, delayMs));
}
return last;
}
async function getPodTerminatedState(
namespace: string,
jobName: string,
kubeconfigPath?: string,
): Promise<PodTerminatedState | null> {
return (await lookupPodState(namespace, jobName, kubeconfigPath)).state;
}
/**
* Format a human-readable explanation for a truncated run, including the
* pod's claude-container terminated state when available. Exit code 137
@@ -673,9 +751,17 @@ async function getPodTerminatedState(
*/
export function describeTruncationCause(
state: PodTerminatedState | null,
lookup?: PodLookupResult,
): string {
if (!state) {
return "pod state unavailable — likely deleted before exit could be read";
if (lookup?.podMissing) {
return "pod is gone — Job pod was removed (eviction, preemption, or external delete) before exit could be read";
}
if (lookup && !lookup.podMissing) {
const phaseHint = lookup.phase ? `pod phase=${lookup.phase}` : "pod present";
return `container terminated state not yet observable (${phaseHint}) — kubelet status update did not land within retry window; exit cause unknown`;
}
return "pod state unavailable — exit cause unknown";
}
const parts: string[] = [];
if (state.exitCode !== null) {
@@ -1112,6 +1198,17 @@ export async function execute(ctx: AdapterExecutionContext): Promise<AdapterExec
// Set when the job disappeared (404) or grace-timer fired before we saw a
// terminal condition — used to emit a clearer error when stdout parsing fails.
let jobDeletedExternally = false;
// Forensics for k8s_job_deleted_externally — captures which of the three
// detection paths observed the 404, the last successful Job-condition read
// before deletion, and timing. Surfaced in the error message so the next
// occurrence is self-diagnosing instead of opaque (FAR-107).
let jobGoneDetectionPath: string | null = null;
let jobGoneAt: number | null = null;
const jobObserver: { lastConditions: JobConditionSnapshot[] | null; pollCount: number } = {
lastConditions: null,
pollCount: 0,
};
let podRunningAt: number | null = null;
const activeJobRef: ActiveJobRef = {
namespace,
@@ -1144,6 +1241,7 @@ export async function execute(ctx: AdapterExecutionContext): Promise<AdapterExec
podName = await waitForPod(namespace, jobName, scheduleTimeoutMs, onLog, kubeconfigPath);
await onLog("stdout", `[paperclip] Pod running: ${podName}\n`);
}
podRunningAt = Date.now();
} catch (err) {
const msg = err instanceof Error ? err.message : String(err);
@@ -1259,17 +1357,16 @@ export async function execute(ctx: AdapterExecutionContext): Promise<AdapterExec
return onLog(stream, chunk);
};
// Track when the log stream first exits so the grace-period can fire
// if the K8s Job condition lags behind container exit (FAR-23).
// Set via onFirstStreamExit callback (called after attempt=0 returns)
// rather than in .then() of streamPodLogs, which would create a
// deadlock: streamPodLogs only resolves after stopSignal is set, but
// stopSignal is set by the grace timer which needs logExitTime to be
// non-null.
let logExitTime: number | null = null;
// Track stream liveness so the grace timer below only fires when output
// has actually stopped — not on a transient K8s log-API reconnect that
// streamPodLogs heals on its own (FAR-107).
const streamActivity: { lastActiveAt: number; streamHasExited: boolean } = {
lastActiveAt: Date.now(),
streamHasExited: false,
};
const trackedLogStream = streamPodLogs(
namespace, podName, wrappedOnLog, kubeconfigPath, logStopSignal, logDedup,
() => { logExitTime = Date.now(); },
streamActivity,
);
// completionWithGrace races waitForJobCompletion against a grace timer
@@ -1279,7 +1376,7 @@ export async function execute(ctx: AdapterExecutionContext): Promise<AdapterExec
// while streamPodLogs reconnects, holding execute() open for minutes.
// logStopSignal.stopped is set on every settled path (fulfilled, rejected,
// or grace) so streamPodLogs stops reconnecting promptly.
type CompletionResult = { succeeded: boolean; timedOut: boolean; jobGone?: boolean };
type CompletionResult = { succeeded: boolean; timedOut: boolean; jobGone?: boolean; gracePeriodFired?: boolean };
let gracePoller: ReturnType<typeof setInterval> | null = null;
const completionWithGrace = new Promise<CompletionResult>((resolve, reject) => {
let settled = false;
@@ -1297,11 +1394,68 @@ export async function execute(ctx: AdapterExecutionContext): Promise<AdapterExec
logStopSignal.stopped = true;
reject(err);
};
waitForJobCompletion(namespace, jobName, completionTimeoutMs, kubeconfigPath).then(settleOk).catch(settleErr);
waitForJobCompletion(namespace, jobName, completionTimeoutMs, kubeconfigPath, jobObserver).then(settleOk).catch(settleErr);
let graceCheckInFlight = false;
gracePoller = setInterval(() => {
if (logExitTime !== null && Date.now() - logExitTime >= LOG_EXIT_COMPLETION_GRACE_MS) {
void onLog("stdout", `[paperclip] Log stream exited ${LOG_EXIT_COMPLETION_GRACE_MS / 1000}s ago without K8s Job condition update — proceeding with captured output (FAR-23)\n`).catch(() => {});
settleOk({ succeeded: false, timedOut: false, jobGone: true });
// Only consider grace once the stream has exited at least once.
// Until then we are still in the warm-up window and
// waitForJobCompletion is the authoritative signal. Once the
// stream has exited, fire only after GRACE_MS of inactivity
// measured against the last received chunk — output that resumes
// through a reconnect resets the clock so transient drops do not
// truncate live runs (FAR-107).
if (graceCheckInFlight) return;
if (
streamActivity.streamHasExited &&
Date.now() - streamActivity.lastActiveAt >= LOG_EXIT_COMPLETION_GRACE_MS
) {
graceCheckInFlight = true;
void (async () => {
try {
// Pod-phase gate (FAR-107): if the pod is still Running/Pending
// the container is alive — Claude can be silent for >30s during
// long tool calls (web fetches, slow upstream APIs). Refresh
// the stream-activity timer, leave the poller armed, and let
// waitForJobCompletion remain the authoritative signal. Only
// proceed with the grace settlement when the pod has actually
// reached a terminal phase or is gone.
const podLookup = await lookupPodState(namespace, jobName, kubeconfigPath);
if (!podLookup.podMissing && (podLookup.phase === "Running" || podLookup.phase === "Pending")) {
streamActivity.lastActiveAt = Date.now();
graceCheckInFlight = false;
return;
}
} catch (err) {
await onLog("stderr", `[paperclip] grace gate: pod state lookup failed (${err instanceof Error ? err.message : String(err)}) — falling through to Job-presence check\n`).catch(() => {});
}
// Pod is no longer Running — proceed with Job-presence verification.
// Stop the grace poller immediately so we don't double-fire while the
// verification read below is in flight.
if (gracePoller) { clearInterval(gracePoller); gracePoller = null; }
// The log stream exiting only means the container stopped producing
// output — it does NOT prove the Job was deleted. Verify Job
// presence with a one-shot read so we can distinguish:
// (a) Job 404 → truly gone (TTL or external deletion)
// (b) Job still present → K8s condition propagation lag (FAR-23)
// Without this check we mis-classify (b) as "deleted externally" and
// emit a false-positive k8s_job_deleted_externally error (FAR-107).
try {
await getBatchApi(kubeconfigPath).readNamespacedJob({ name: jobName, namespace });
await onLog("stdout", `[paperclip] Log stream exited ${LOG_EXIT_COMPLETION_GRACE_MS / 1000}s ago without K8s Job condition update; Job ${jobName} still present — proceeding with captured output (FAR-23)\n`).catch(() => {});
settleOk({ succeeded: false, timedOut: false, gracePeriodFired: true });
} catch (err: unknown) {
if (isK8s404(err)) {
jobGoneDetectionPath = "grace-period-verify-404";
jobGoneAt = Date.now();
await onLog("stdout", `[paperclip] Log stream exited ${LOG_EXIT_COMPLETION_GRACE_MS / 1000}s ago and Job ${jobName} is gone (TTL or external deletion) — proceeding with captured output (FAR-23)\n`).catch(() => {});
settleOk({ succeeded: false, timedOut: false, jobGone: true });
} else {
// K8s API hiccup — bail out without claiming external deletion.
await onLog("stdout", `[paperclip] Log stream exited ${LOG_EXIT_COMPLETION_GRACE_MS / 1000}s ago; Job state unverifiable (${err instanceof Error ? err.message : String(err)}) — proceeding with captured output (FAR-23)\n`).catch(() => {});
settleOk({ succeeded: false, timedOut: false, gracePeriodFired: true });
}
}
})();
}
}, 1_000);
});
@@ -1369,6 +1523,10 @@ export async function execute(ctx: AdapterExecutionContext): Promise<AdapterExec
// completion), so log streaming has captured the full output — continue
// to stdout parsing rather than returning an error.
jobDeletedExternally = true;
if (!jobGoneDetectionPath) {
jobGoneDetectionPath = "completion-poll-404";
jobGoneAt = Date.now();
}
await onLog("stdout", `[paperclip] Job ${jobName} was deleted before terminal condition was observed (TTL or external deletion) — proceeding with captured output.\n`);
}
} else {
@@ -1377,7 +1535,7 @@ export async function execute(ctx: AdapterExecutionContext): Promise<AdapterExec
// (60s) so we don't hang the heartbeat indefinitely if the K8s API is degraded.
jobTimedOut = false;
const RECHECK_TIMEOUT_MS = 60_000;
const actualState = await waitForJobCompletion(namespace, jobName, RECHECK_TIMEOUT_MS, kubeconfigPath);
const actualState = await waitForJobCompletion(namespace, jobName, RECHECK_TIMEOUT_MS, kubeconfigPath, jobObserver);
if (actualState.timedOut) {
// Re-check itself timed out — the job may still be running.
// Return an error so the UI knows the run is not done.
@@ -1386,6 +1544,10 @@ export async function execute(ctx: AdapterExecutionContext): Promise<AdapterExec
// Job was deleted before we could confirm terminal state — same as the
// fulfilled+jobGone case above: proceed with captured output.
jobDeletedExternally = true;
if (!jobGoneDetectionPath) {
jobGoneDetectionPath = "recheck-poll-404";
jobGoneAt = Date.now();
}
await onLog("stdout", `[paperclip] Job ${jobName} was deleted before terminal condition was observed (TTL or external deletion) — proceeding with captured output.\n`);
} else if (!actualState.succeeded) {
// Job still not terminal — the completion error was likely transient.
@@ -1455,11 +1617,35 @@ export async function execute(ctx: AdapterExecutionContext): Promise<AdapterExec
if (!parsed) {
if (jobDeletedExternally && exitCode === null) {
// Forensic context (FAR-107): users sometimes see this error when nothing
// actually deleted the Job manually. Surface enough state in the message
// to distinguish self-delete (SIGTERM/cancel), TTL-after-completion, and
// genuine external deletion without needing cluster shell access.
const detailParts: string[] = [];
if (jobGoneDetectionPath) detailParts.push(`detected_via=${jobGoneDetectionPath}`);
detailParts.push(`job=${jobName}`);
detailParts.push(`ns=${namespace}`);
if (podRunningAt !== null && jobGoneAt !== null) {
detailParts.push(`elapsed_since_pod_running=${Math.round((jobGoneAt - podRunningAt) / 1000)}s`);
}
detailParts.push(`completion_polls=${jobObserver.pollCount}`);
const lastConds = jobObserver.lastConditions;
if (lastConds && lastConds.length > 0) {
const summary = lastConds
.map((c) => `${c.type}=${c.status}${c.reason ? `(${c.reason})` : ""}`)
.join(",");
detailParts.push(`last_job_conditions=[${summary}]`);
} else {
detailParts.push("last_job_conditions=none_observed");
}
detailParts.push(`stdout_bytes=${stdout.length}`);
const stdoutLines = stdout.split("\n").filter((l) => l.trim()).length;
detailParts.push(`stdout_nonempty_lines=${stdoutLines}`);
return {
exitCode,
signal: null,
timedOut: false,
errorMessage: "K8s Job was deleted externally before Claude could complete",
errorMessage: `K8s Job was deleted externally before Claude could complete [${detailParts.join(", ")}]`,
errorCode: "k8s_job_deleted_externally",
resultJson: { stdout },
};
@@ -1475,7 +1661,23 @@ export async function execute(ctx: AdapterExecutionContext): Promise<AdapterExec
};
}
if (parsedStream.truncatedMidStream) {
const cause = describeTruncationCause(podTerminatedState);
// Re-query pod state with retry — the initial single-shot read can lose
// to kubelet propagation lag and surface a useless "pod state unavailable"
// message that hides the real exit cause (OOMKilled, SIGTERM, etc). The
// retry distinguishes pod-genuinely-gone from terminated-state-lag and
// gives the operator the actual exit code/reason where possible (FAR-107).
let lookup: PodLookupResult | undefined;
let refreshedState = podTerminatedState;
try {
lookup = await getPodLookupWithRetry(namespace, jobName, kubeconfigPath);
refreshedState = lookup.state;
if (refreshedState && refreshedState.exitCode !== null) {
exitCode = refreshedState.exitCode;
}
} catch (err) {
await onLog("stderr", `[paperclip] truncation diagnostic: pod re-query failed (${err instanceof Error ? err.message : String(err)})\n`).catch(() => {});
}
const cause = describeTruncationCause(refreshedState, lookup);
const modelHint = parsedStream.model ? ` (model: ${parsedStream.model})` : "";
return {
exitCode,
+20 -1
View File
@@ -1,5 +1,5 @@
import { describe, it, expect, beforeEach, afterEach } from "vitest";
import { listK8sModels } from "./models.js";
import { listK8sModels, DIRECT_MODELS, BEDROCK_MODELS } from "./models.js";
describe("listK8sModels", () => {
const savedEnv: Record<string, string | undefined> = {};
@@ -50,3 +50,22 @@ describe("listK8sModels", () => {
expect(models.some((m) => m.id === "claude-opus-4-7")).toBe(true);
});
});
describe("static model lists", () => {
it("DIRECT_MODELS is non-empty and has valid ids", () => {
expect(DIRECT_MODELS.length).toBeGreaterThan(0);
for (const m of DIRECT_MODELS) {
expect(typeof m.id).toBe("string");
expect(m.id.length).toBeGreaterThan(0);
expect(typeof m.label).toBe("string");
}
});
it("BEDROCK_MODELS is non-empty and all ids contain 'anthropic.'", () => {
expect(BEDROCK_MODELS.length).toBeGreaterThan(0);
for (const m of BEDROCK_MODELS) {
expect(m.id).toContain("anthropic.");
expect(typeof m.label).toBe("string");
}
});
});
+3 -3
View File
@@ -1,6 +1,6 @@
import type { AdapterModel } from "@paperclipai/adapter-utils";
const DIRECT_MODELS: AdapterModel[] = [
export const DIRECT_MODELS: AdapterModel[] = [
{ id: "claude-opus-4-7", label: "Claude Opus 4.7" },
{ id: "claude-opus-4-6", label: "Claude Opus 4.6" },
{ id: "claude-sonnet-4-6", label: "Claude Sonnet 4.6" },
@@ -9,7 +9,7 @@ const DIRECT_MODELS: AdapterModel[] = [
{ id: "claude-haiku-4-5-20251001", label: "Claude Haiku 4.5" },
];
const BEDROCK_MODELS: AdapterModel[] = [
export const BEDROCK_MODELS: AdapterModel[] = [
{ id: "us.anthropic.claude-opus-4-7", label: "Bedrock Opus 4.7" },
{ id: "us.anthropic.claude-opus-4-6-v1", label: "Bedrock Opus 4.6" },
{ id: "us.anthropic.claude-sonnet-4-6", label: "Bedrock Sonnet 4.6" },
@@ -17,7 +17,7 @@ const BEDROCK_MODELS: AdapterModel[] = [
{ id: "us.anthropic.claude-haiku-4-5-20251001-v1:0", label: "Bedrock Haiku 4.5" },
];
function isBedrockEnv(): boolean {
export function isBedrockEnv(): boolean {
return (
process.env.CLAUDE_CODE_USE_BEDROCK === "1" ||
process.env.CLAUDE_CODE_USE_BEDROCK === "true" ||