paperclip-adapter-claude-k8s

farhoodlabs/paperclip-adapter-claude-k8s

Author	SHA1	Message	Date
Chris Farhood	07ef106c66	fix: gate grace timer on stream-output silence, not first disconnect (FAR-107) The 30s grace timer that bounds K8s Job condition propagation lag was armed by streamPodLogs's onFirstStreamExit callback the moment streamPodLogsOnce returned for the first time. A transient K8s log-API disconnect mid-run also returns from streamPodLogsOnce — so the grace timer fired 30s later regardless of whether streamPodLogs had already reconnected and the container was still producing output. Nancy / Privileged Escalation reproduced this on long Opus-4-6 runs: the prod paperclip pod was stable, the cancel-poll guard was already narrowed in 0.1.51, but every long run truncated with claude_truncated + "container terminated state not yet observable (pod phase=Running)" because the run was being abandoned mid-output. Replace the boolean onFirstStreamExit signal with a streamActivity ref carrying lastActiveAt + streamHasExited. streamPodLogs refreshes lastActiveAt every time a streamPodLogsOnce attempt returns non-empty output, so reconnects that resume real output keep the grace clock reset. The grace timer fires only once the stream has exited at least once AND no chunk has arrived for the full grace window — which preserves the original FAR-23 behaviour (container truly exited but Job condition lags) while ending the false-truncation of healthy streams. Adds a regression test that asserts a stream drop + reconnect + deferred Job completion does not surface as truncated. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-27 00:28:44 +00:00
Chris Farhood	49288fa5c7	fix: scope cancel-polling to explicit cancellation states only (FAR-107) shouldAbortForCancellation previously treated any non-`running` runStatus as a cancellation signal — which made the keepalive's cancel-poll delete the K8s Job whenever the heartbeat-runs API briefly returned a transient or stale status (e.g. queued, pending, succeeded, failed, completed, unknown) for an in-flight run. The follow-up `waitForJobCompletion` poll then observed the 404 and surfaced a spurious `k8s_job_deleted_externally` error to the user, even though no human or external system deleted the Job. Privileged Escalation's "null-pointer-nancy" agent reproduced this on runs that were never cancelled and were not adjacent to a paperclip restart, ruling out the SIGTERM path that 0.1.50 already addressed. Tighten the guard to fire only on `cancelled` / `cancelling`. Other terminal statuses are unreachable while the adapter is still executing (the adapter's own return is what flips them) and even if observed mid-run, they do not justify deleting a Job that may still be doing real work — the natural completion path will tear it down. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-26 21:24:11 +00:00
Chris Farhood	6923597b31	fix: do not delete active Jobs on SIGTERM — leave for orphan reattach (FAR-107) Root cause of Nancy's k8s_job_deleted_externally false positive: the paperclip server itself receives SIGTERM during rolling deploys, evictions, scale-down, etc. The previous SIGTERM handler iterated activeJobs and deleted every Job before exiting, which surfaced in the in-flight heartbeat as "K8s Job was deleted externally" — even though nothing external touched it. With reattachOrphanedJobs=true (default), this is exactly the wrong behaviour: leaving the Jobs alive lets the next paperclip process discover them via the orphan-classification path and reattach their log streams. With reattachOrphanedJobs=false the operator opted into manual cleanup, so we still must not auto-delete. The Job's ownerReference (FAR-15) keeps the prompt Secret tied to the Job, so both survive together and TTL handles cleanup on natural completion. Test rewritten to assert the new contract: SIGTERM must not touch K8s Jobs. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-26 21:19:02 +00:00
Chris Farhood	be84428226	fix: enrich k8s_job_deleted_externally error with forensics + verify Job presence on grace fire (FAR-107) The error previously fired with no diagnostic context, making it impossible to distinguish (a) self-delete by our SIGTERM/cancel path, (b) TTL after a missed Complete condition, or (c) actual external deletion without cluster shell access. Two changes: 1. Grace-period verification: when the log stream exits and the 30s grace timer fires, do a one-shot readNamespacedJob before declaring the Job gone. If it's still there, settle as gracePeriodFired (not jobGone) so we don't mis-classify K8s condition propagation lag as deletion. 2. Forensic capture: track which of the three detection paths (completion-poll-404, grace-period-verify-404, recheck-poll-404) first observed the 404, the last successful Job conditions read, the poll count, elapsed time since pod-running, and stdout size. Append all of it to the errorMessage so the next occurrence is self-diagnosing. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-26 21:05:04 +00:00
Chris Farhood	76fc6fcdfc	fix: surface pod terminated reason/message in adapter_failed errors (FAR-100) The init-only and partial-run error paths now embed the K8s container terminated state (reason, message, signal, OOM hint) directly in the errorMessage. This eliminates the kubectl round-trip when diagnosing adapter_failed runs — the surfaced error self-explains. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-26 14:48:12 +00:00
Chris Farhood	e0b35d230f	fix: distinguish init-only non-zero exits in buildPartialRunError (FAR-100) Init-only runs that exit with a non-zero code now surface a more actionable message naming the exit code and the likely cause (unsupported model or rejected session) instead of the generic "did not produce a result" text. Helps operators diagnose model-id / billing-tier failures (e.g. opus 4.6). Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-26 13:04:43 +00:00
Chris Farhood	8474f78fe1	fix: include pod terminated reason/message in claude_truncated error (FAR-95) Capture the claude container's terminated state (exit code, reason, message, signal) and surface it in the truncation error so operators see why the run was cut short — e.g. "exit code 137, SIGKILL (commonly OOMKilled), reason=OOMKilled, message=Memory cgroup out of memory" instead of just a "truncated" label with no diagnostic context. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-26 01:57:43 +00:00
Chris Farhood	a2874c0426	fix: detect mid-stream truncation and emit claude_truncated error code (FAR-95) When Claude produces assistant content (output_tokens > 0) but the stream ends without a result event, classify the run as truncated mid-stream rather than falling through to the generic "did not produce a result — check API credentials" message. The misleading hint pointed operators at auth/model config when the real cause was pod termination, OOMKill, or CLI crash. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-26 01:54:35 +00:00
Chris Farhood	818aa0f1d6	feat: log bundled skill names and add skills to onMeta commandNotes (FAR-36) Adds a diagnostic log line after skill resolution so operators can see exactly which skills were bundled into each run, making it straightforward to diagnose skill availability issues. Also surfaces the skill list in the onMeta commandNotes for run metadata visibility. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 20:41:01 +00:00
Chris Farhood	55fd3021fb	fix: add per-agent mutex to eliminate TOCTOU race in K8s concurrency guard (FAR-29) Two concurrent execute() calls for the same agent can both pass the list-then-create guard before either job appears in the other's query. The new module-level agentCreationMutex serializes the guard+create phase within the process so only one call enters listNamespacedJob at a time. The mutex is acquired after sanitizing the agent ID and released in a finally block that wraps the entire guard+create section, so all early return paths (guard blocks, create failures) cleanly release it. Variables used in both the guard+create and log-streaming phases are hoisted to before the try block. Cross-agent calls use separate mutex slots and are unaffected. Added two vitest cases verifying same-agent serialization and that different-agent calls are not serialized. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-24 20:10:01 +00:00
Chris Farhood	83b58f9207	fix: detect stop_reason:null + output_tokens:0 and emit llm_api_error (FAR-30) parseClaudeStreamJson now tracks assistant events with stop_reason:null and output_tokens:0 (the MiniMax degraded-response pattern). When no result event follows, execute() returns errorCode:"llm_api_error" with a descriptive message instead of the generic adapter_failed. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-24 20:00:42 +00:00
Chris Farhood	602afa9b84	fix: return k8s_job_deleted_externally error code when job deleted mid-run (FAR-31) When a K8s Job is deleted externally (kubectl delete job or TTL before terminal condition observed) and stdout has no result event, the adapter now returns errorCode "k8s_job_deleted_externally" with the message "K8s Job was deleted externally before Claude could complete" instead of the misleading "Claude exited with code -1". Tracks a jobDeletedExternally flag in execute() on the jobGone path and checks it in the !parsed branch before falling through to buildPartialRunError. Only applies when exitCode is null (pod gone alongside the job). Adds regression test: FAR-31 scenario where job 404s mid-run with partial stdout and missing pod produces the new error code. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-24 19:58:46 +00:00
Chris Farhood	986f2fc7fa	test: add coverage for deletionTimestamp concurrency guard bypass (FAR-34) Verifies that a terminating K8s job (deletionTimestamp set, no Complete/Failed condition) is skipped by the concurrency guard so subsequent heartbeat runs are not incorrectly blocked. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-24 19:57:10 +00:00
Chris Farhood	cabdc3df98	fix: skip all structured streaming events in buildPartialRunError (FAR-32 followup) Extends the previous fix (which only covered assistant/user) to skip every JSON object with a non-empty "type" field — system, assistant, user, rate_limit_event, result, and any future event types. This prevents all structured protocol artefacts from being surfaced verbatim as error messages. Root cause of the new repro: when Claude emits a rate_limit_event before producing output and then exits without a result event, the rate_limit_event JSON blob was becoming the "first content line" and appearing in the error: Claude exited with code -1: {"type":"rate_limit_event","rate_limit_info":{...}} With this fix, all typed events are filtered and the initOnlyOutput branch fires, producing the clean diagnostic: Claude started but did not produce a result (model: claude-opus-4-7) — check API credentials, model support, and adapter config Updated the "result event as content" test to match the new (correct) behaviour: in production buildPartialRunError is only called when parseClaudeStreamJson returns null (no result event), so the prior test was exercising a degenerate state that cannot occur through execute(). Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-24 17:17:48 +00:00
Chris Farhood	f9ff04a354	fix: skip assistant/user events in buildPartialRunError to avoid raw JSON blobs in error messages (FAR-32) When a model produces assistant events with output_tokens=0 but no result event (e.g. MiniMax-M2.7 thinking-only output), the partial-run error previously surfaced the raw assistant JSON blob verbatim, producing an unreadable message like "Claude exited with code -1: {\"type\":\"assistant\",...}". Fix: extend the content-line filter in buildPartialRunError to also skip assistant and user event types (intermediate streaming events), in addition to system events. result events are still retained since they may carry useful terminal error details. When all stdout lines are filtered, the existing initOnlyOutput branch triggers and surfaces a clean diagnostic: "Claude started but did not produce a result (model: MiniMax-M2.7) — check API credentials, model support, and adapter config". Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-24 17:11:20 +00:00
Chris Farhood	f097440f3c	feat: implement cancel support via keepalive poll and SIGTERM handler (FAR-26) - Poll GET /api/heartbeat-runs/:runId on every keepalive tick (15s); when status != 'running', delete the K8s Job, set logStopSignal, and return errorCode='cancelled' — Job gone within ~15s of external cancellation. - SIGTERM handler best-effort deletes all active Jobs/Secrets and re-emits the signal to let the process exit naturally. - Export shouldAbortForCancellation() helper; add tests for helper, cancel poll path, and SIGTERM cleanup. - Guard: PAPERCLIP_API_URL missing logs a warning and skips cancel polling; HTTP 5xx from poll treated as transient; reattach path skips cancel poll. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-24 15:20:45 +00:00
Chris Farhood	b97117e10d	test: mock readPaperclipRuntimeSkillEntries to eliminate real fs I/O under fake timers Previously the test suite relied on real fs.stat completing within the fake timer advance window (~11200ms). Under CI with 11 parallel test files the I/O could drain later than the advances allowed, causing a 1-in-4 timeout on the "logs pod pending" test. Fix: mock @paperclipai/adapter-utils/server-utils using vi.hoisted() + Object.assign so readPaperclipRuntimeSkillEntries resolves immediately as a microtask. All other exports are forwarded to the real module via importOriginal. Each beforeEach that calls vi.resetAllMocks() or vi.clearAllMocks() now also calls mockReadSkillEntries.mockResolvedValue([]) to restore the implementation. Timer advances in affected tests are simplified to reflect the purely fake-timer sequence (no I/O drain prefix). All 323 tests pass deterministically. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 13:11:04 +00:00
Chris Farhood	f9d8a2e0ce	fix: resolve grace-period deadlock for stale UI status (FAR-23) The log-stream-exit grace timer never fired because logExitTime was set in the .then() of streamPodLogs, which only resolves once stopSignal is set — but stopSignal is only set when completionWithGrace fires, which requires logExitTime to be non-null. Classic deadlock. Fix: add onFirstStreamExit callback to streamPodLogs, called after attempt=0's streamPodLogsOnce returns (the first container exit signal). execute() passes a closure that sets logExitTime immediately, breaking the circular dependency and allowing the 30s grace timer to fire correctly when K8s Job conditions lag container exit. Tests: all 323 pass including the two FAR-23 grace-period regression tests. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-24 12:20:10 +00:00
Chris Farhood	a7dfd5d502	test: fix flaky execute.ts timer tests and hit 80%+ line coverage readPaperclipRuntimeSkillEntries does real fs.stat I/O under fake timers, delaying execute()'s fake-timer registration by ~3200-4200ms of fake time when tests run in isolation (cold OS page cache). The previous approach tried vi.spyOn on an ESM module namespace export, which throws "Cannot redefine property", a fundamental ESM constraint. Fix: remove the broken spy. Instead, each timer-heavy test now uses enough advanceTimersByTimeAsync calls to (a) give the event loop sufficient turns for the I/O to drain, and (b) cover the full fake-timer sequence even with the maximum observed I/O delay. Patterns chosen: reconnects (needs t+6000): 6 advances, ~12200ms total; deadline exceeded (needs t+3000): 5 advances, ~8400ms total; pod-creation wait (needs t+5000): 5 advances, ~9400ms total. execute.ts line coverage: 82.57% (was ~24% before this task's test additions). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 04:10:49 +00:00
Chris Farhood	29a4e709d0	fix: sanitize agent/run/company labels to RFC 1123 (N4) Co-Authored-By: Claude Sonnet <noreply@anthropic.com> Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-24 00:00:56 +00:00
Chris Farhood	b91859c258	refactor: extract classifyOrphan helper with decision matrix (#8) Co-Authored-By: Claude Sonnet <noreply@anthropic.com> Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-23 23:58:23 +00:00
Gandalf the Greybeard	69d0f4972f	test: regression for streamPodLogsOnce bail timer (FAR-10) Uses vi.mock on k8s-client and vi.useFakeTimers to prove that when logApi.log() never resolves (the FAR-10 hang shape) and stopSignal fires, streamPodLogsOnce still returns within the bail window (LOG_STREAM_BAIL_TIMEOUT_MS). Exports streamPodLogsOnce so the test can call it directly. Also covers the no-stopSignal happy path. 269/269 passing (+2 new). Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-23 16:43:32 +00:00
Gandalf the Greybeard	b3c1519cf5	fix: prevent process_lost when K8s Job completes (FAR-10) Four stacked bugs caused the adapter to hang after K8s Job completion, allowing the 5-minute reaper to mark runs process_lost even when the Job actually succeeded. - streamPodLogsOnce: add stopSignal polling loop that destroys the writable every 200ms once the job-completion branch fires, aborting any in-flight follow stream that would otherwise hang indefinitely - waitForPod: treat phase=Failed as a terminal error (throw via describePodTerminatedError) instead of entering the log-stream path with a dead pod (new helper is exported for unit tests) - waitForPod: surface cs.state?.terminated in the per-tick detail line so operators see exit code / reason without needing kubectl - keepalive: add POST_TERMINAL_KEEPALIVE_MS (90s) window after Job goes terminal so onSpawn keeps refreshing updatedAt during cleanup; if execute() genuinely stalls past 90s the reaper will still catch it Regression tests added for describePodTerminatedError (phase=Failed with and without claude container status). Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-23 15:59:51 +00:00
Test User	c8968598e4	fix: reattach to orphaned K8s Jobs across Paperclip restarts (FAR-124) When the Paperclip pod restarts mid-run, the in-process setInterval keepalive dies, `updatedAt` goes stale, and the server's orphan reaper fails the run with the (misleading) "child pid 1 is no longer running" message. Paperclip then dispatches a continuation run, whose execute() finds the previous run's K8s Job still happily running and deletes it as an "orphan" — throwing away work and producing the transcript/run cascade reported on FAR-124. Changes: - job-manifest: add `paperclip.io/task-id` and `paperclip.io/session-id` labels (sanitized via new `sanitizeLabelValue` helper) so a later execute() can identify an orphan as the continuation of the same logical unit of work. - execute: in the concurrency guard, when `reattachOrphanedJobs` is on (default) and an orphan matches agent + task + session + is not terminal, pick it as the reattach target; delete only the other orphans. Branch the build/create/waitForPod block so the reattach path skips manifest building, Secret creation, Job creation, and scheduling wait — it jumps straight to streaming logs and waiting for the existing pod's completion. - config-schema: expose `reattachOrphanedJobs` toggle (default true). - Tests: `sanitizeLabelValue`, `isReattachableOrphan`, new label presence/absence, config default. No server-side changes; the misleading reaper message and lack of a non-local retry path will be addressed in a follow-up upstream PR. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-22 21:59:25 +00:00
Test User	8c8c2f2ec0	fix: address review nits — refactor fallbacks, add unit tests (FAR-122) - Merge both one-shot log fallbacks into a single conditional block using a cheap string-scan guard (`stdout.includes('"type":"result"')`) to avoid calling parseClaudeStreamJson twice and prevent double readPodLogs calls when the first fallback already ran. - Extract error-message logic into `buildPartialRunError(exitCode, model, stdout)` (exported for tests) so the `!parsed` branch is a one-liner and the logic is independently testable. - Export `isK8s404` for tests. - Add execute.test.ts with 15 unit tests covering: - isK8s404: v0.x response.statusCode, v1.0+ response.status, direct statusCode, message-based detection, non-404 codes - buildPartialRunError: exitCode=0 path, empty stdout, init-only output (model surfaced), first non-system content line, null exitCode (-1), multiple consecutive system events Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-22 19:42:57 +00:00

25 Commits
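
The narrowed cancel guard from commit 49288fa5c7 (abort only on `cancelled` / `cancelling`, never on other non-`running` statuses) might look roughly like the sketch below. The status strings are taken from the commit message; the function name isExplicitCancellation and its signature are hypothetical and are not claimed to match the exported shouldAbortForCancellation helper.

```typescript
// Statuses that count as an explicit cancellation, per the commit message.
const CANCELLATION_STATUSES: ReadonlySet<string> = new Set([
  "cancelled",
  "cancelling",
]);

// Hypothetical helper: returns true only for explicit cancellation states.
// Transient or stale statuses (queued, pending, succeeded, failed, ...) and
// missing values do not justify deleting a Job that may still be working.
function isExplicitCancellation(runStatus: string | null | undefined): boolean {
  return typeof runStatus === "string" && CANCELLATION_STATUSES.has(runStatus);
}

// Example: which polled statuses would trigger Job deletion under this guard.
const polled = ["running", "queued", "succeeded", "cancelling", "cancelled"];
const abortingStatuses = polled.filter((s) => isExplicitCancellation(s));
```

An allow-list keeps the guard safe by default: a status value the adapter has never seen before is treated as "keep running" rather than as a reason to delete the Job.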