farhoodlabs/paperclip-adapter-claude-k8s

Author	SHA1	Message	Date
Test User	8c8c2f2ec0	fix: address review nits — refactor fallbacks, add unit tests (FAR-122) - Merge both one-shot log fallbacks into a single conditional block using a cheap string-scan guard (`stdout.includes('"type":"result"')`) to avoid calling parseClaudeStreamJson twice and prevent double readPodLogs calls when the first fallback already ran. - Extract error-message logic into `buildPartialRunError(exitCode, model, stdout)` (exported for tests) so the `!parsed` branch is a one-liner and the logic is independently testable. - Export `isK8s404` for tests. - Add execute.test.ts with 15 unit tests covering: - isK8s404: v0.x response.statusCode, v1.0+ response.status, direct statusCode, message-based detection, non-404 codes - buildPartialRunError: exitCode=0 path, empty stdout, init-only output (model surfaced), first non-system content line, null exitCode (-1), multiple consecutive system events Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-22 19:42:57 +00:00
Test User	b9def0964e	fix: improve partial-log handling and error messages for fast-exit containers (FAR-122) - Add a second log fallback: if the follow stream captured partial output (init event present but no result event), attempt a one-shot readPodLogs before the pod is cleaned up. Fast-exiting containers (bad model, missing API key, etc.) can cause the follow stream to return only the init line before the connection drops; the one-shot read is more reliable for already-terminated containers. - Improve the `!parsed` error message: skip system/init events when searching for the first content line, so the error reads "Claude started but did not produce a result (model: MiniMax-M2.7) — check API credentials..." instead of "Claude exited with code -1: {"type":"system","subtype":"init",...}". Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-22 19:33:15 +00:00
Test User	2a31fe1f9b	fix: handle K8s 404 (job deleted) gracefully in waitForJobCompletion (FAR-122) - Add `isK8s404()` helper compatible with @kubernetes/client-node v0.x and v1.0+ (checks response.statusCode, response.status, err.statusCode, and message text) - `waitForJobCompletion` now catches 404 and returns `{ jobGone: true }` instead of throwing — prevents uncaught exceptions when the K8s Job is TTL-deleted or externally removed while the adapter is polling for a terminal condition - Keepalive job-liveness check now uses `isK8s404` (was checking `response.statusCode` which is absent in the v1.0+ fetch-based client, silently breaking 404 detection) - `jobGone` case in completion handler logs a diagnostic and falls through to stdout parsing rather than returning an opaque 404 error to the user Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-22 16:42:07 +00:00
Iceman	5926b302e5	fix: replace pid:-1 sentinel with process.pid to prevent false process_lost The adapter was calling onSpawn({ pid: -1 }) as a sentinel value for K8s Jobs (which run out-of-process), then the server's orphan reaper was checking isProcessAlive(-1) which always returns false, causing legitimate runs to be reaped as 'process_lost'. Using process.pid (the Paperclip server's own PID) is always alive while the adapter runs in-process, preventing false reaping. Fixes FAR-116. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-21 18:09:41 +00:00
Test User	1e517bb9bb	fix: P1 correctness and operational fixes from FAR-104/FAR-105 analysis 5. Cap log stream reconnect attempts at 50 — prevents infinite reconnect loops during sustained API partitions. 6. Fire keepalive refresh earlier — tick 1 + every 12 ticks (~3min) instead of every 16 ticks (~4min), providing better safety margin under the 5-minute reaper window. 7. Catch rejections from onLog inside keepalive — add .catch(() => {}) to prevent unhandledRejection on SSE backpressure. 8. Prevent sanitized-name collisions — extend slugs to 16 chars each, add a 6-char SHA-256 hash suffix, shorten prefix to `ac-` to stay well within the 63-char DNS label limit. 10. Fix config-hint parity for nodeSelector and labels — parse both `key=value` multiline text and JSON objects, matching what the textarea hint promises. 11. Large-prompt fallback via Secret — prompts >256 KiB are staged as a K8s Secret and mounted as a volume instead of passed via env var, protecting against the ~1 MiB PodSpec limit. 13. Track last-seen log timestamp on reconnect — anchor sinceSeconds at the last received log line instead of stream start, fixing FAR-105 duplicative logs. Belt-and-braces: dedupe assistantTexts at the parser boundary in parse.ts. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-20 19:05:07 +00:00
Test User	d74b6d34b3	fix: P0 correctness fixes from FAR-104/FAR-105 analysis 1. Inherit envFrom and env.valueFrom from self pod — secrets wired via valueFrom.secretKeyRef or envFrom.secretRef are now forwarded to Job pods, fixing credentials silently dropped for K8s-idiomatic secret patterns (e.g. ANTHROPIC_API_KEY via Secret). 2. Distinguish 404 vs transient errors in keepalive — only mark the keepalive as terminal on 404 (Job deleted). Transient 5xx/connection errors are logged and retried on the next tick, preventing premature reaper kills during API instability. 3. Fail closed on concurrency-guard read failure — a failing listNamespacedJob now returns k8s_concurrency_guard_unreachable instead of silently proceeding, protecting against zombie Jobs on shared PVCs. 4. Bound the waitForJobCompletion re-check — pass a 60s timeout instead of polling forever, preventing indefinite hangs when the K8s API is degraded. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-20 18:57:16 +00:00
Test User	5f5ae92ce7	fix: skip keepalive updatedAt refresh once K8s Job is terminal The previous fix (`df856e6`) made the keepalive timer call onSpawn every ~4 minutes to refresh the run's updatedAt in the DB, so the stale-run reaper wouldn't kill live runs in multi-instance deployments. That was correct for live jobs, but it was unconditional — if execute() stalled after the pod terminated (slow K8s API call, hung log stream drain, or a Job whose Complete condition lags pod termination), the keepalive kept the run marked "alive" indefinitely even though the pod was gone. That manifests as the opposite of the original bug: the UI shows jobs as running when they have actually finished. Two changes: 1. Verify the Job is still alive before the keepalive refreshes updatedAt. If the Job has reached a terminal Complete/Failed condition (or has been deleted / the API read fails), stop refreshing. If execute() truly ends up stuck past that point, the reaper will catch the run within the normal 5-minute staleness window instead of never. 2. Clear the keepalive interval immediately once Promise.allSettled resolves, rather than only in the finally block. Post-completion work (exit-code fetch, log fallback read, job cleanup) must not be able to emit another onSpawn refresh that keeps the run "alive". Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-17 02:57:17 +00:00
Test User	df856e6ca5	fix: clean up orphaned K8s Jobs and refresh updatedAt to prevent UI desync Two root causes behind the "plugin losing sync" issue: 1. After a server restart, the in-memory activeRunExecutions set is lost. The K8s Job keeps running but the reaper marks the server-side run as failed after 5 min (stale updatedAt). Next heartbeat fires a new run, the adapter's concurrency guard blocks it because the old Job is still alive, and this loops indefinitely. Fix: the concurrency guard now compares each running Job's paperclip.io/run-id label against the current runId. Jobs from a previous (dead) run are cleaned up automatically so the new run can proceed. 2. onLog (keepalive) does NOT update the run's updatedAt in the DB — it only writes to the log store and publishes SSE events. In multi-instance deployments, a reaper on instance B can mark a run being executed on instance A as stale after 5 min of no DB updates. Fix: the keepalive timer now calls onSpawn every ~4 min (16 ticks) to refresh updatedAt, staying within the 5-min reaper threshold. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-16 21:48:16 +00:00
Test User	4bf5cf64a4	fix: call onSpawn after pod enters Running state to prevent UI desync The k8s adapter never called ctx.onSpawn(), so the Paperclip server had no processStartedAt timestamp for the run. The stale-run reaper (reapOrphanedRuns) would then mark live k8s runs as failed/orphaned, causing the UI to show no active runs and triggering duplicate run attempts that hit the concurrency guard. Uses pid=-1 as a sentinel since there is no local process — the server's isProcessAlive check safely returns false for pid <= 0. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-16 15:46:34 +00:00
Chris Farhood	b8ba457790	fix: don't delete job when returning state-mismatch error to keep UI in sync When waitForJobCompletion threw and the job was still not terminal, we were returning an error but still deleting the job in the finally block. This left the UI holding an error while the job (still alive) would be cleaned up by Kubernetes, causing the next heartbeat to find nothing and think it was safe to retry — spawning a concurrent pod. Now we set skipCleanup=true when returning the mismatch error, so the job is retained and the heartbeat can still find and wait on it. Also removes a duplicate empty-stdout fallback block. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-16 11:29:42 -04:00
Chris Farhood	efbbfbc299	fix: re-check job state when completion waiter throws to prevent UI staleness When waitForJobCompletion threw a transient error (API disconnect, etc.), the code fell through with jobTimedOut=true and returned a result even though the job was still running. This caused the UI to think the run was complete while the job kept running, resulting in concurrency errors. Now when completion throws, we re-check the job's actual state. If still not terminal, we return a k8s_job_state_mismatch error so the UI knows the run is not done. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-16 07:26:10 -04:00
Pawla Abdul	77ba40d9bf	Reconnect K8s log stream on silent API disconnects The adapter opened a single follow-stream to the K8s API for pod logs. If that TCP connection silently dropped (API server hiccup, network timeout, load-balancer idle cut), streamPodLogs returned early and no more real Claude output reached the UI — only keepalive pings. The pod kept running and producing logs (visible via kubectl), but the adapter never reconnected. Splits streamPodLogs into streamPodLogsOnce (single follow attempt) and a reconnecting wrapper that retries with sinceSeconds until a shared stop signal fires when waitForJobCompletion resolves. On reconnect, requests logs from the original stream start time (+5s overlap) so no output is lost; the UI deduplicates chunks. Bumps version to 0.1.12. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-13 10:34:41 +00:00
Pawla Abdul	e760bf9386	Add keepalive pings during job execution to prevent UI timeout desync The adapter had no mechanism to signal liveness while a K8s Job was running. When Claude entered long thinking phases with no log output, the Paperclip UI could lose sync and consider the run stuck even though the pod was still actively working. Adds a 15-second interval keepalive that sends status messages via onLog during execution. The keepalive tracks time since last real log output and reports it, keeping the connection alive. The timer is cleaned up in the finally block to prevent leaks on any exit path. Bumps version to 0.1.11. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-12 18:44:09 +00:00
Chris Farhood	9dbb5f337e	Initial commit: Paperclip adapter for Claude Code on Kubernetes Adapter plugin that runs Claude Code agents as Kubernetes Jobs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-11 23:16:31 -04:00
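Commits 2a31fe1f9b and 8c8c2f2ec0 describe an `isK8s404()` helper that recognises a "not found" error from both @kubernetes/client-node v0.x and v1.0+. The following is a minimal sketch of that idea, not the repository's actual code. The four places it checks come from the commit message. The exact message pattern is an assumption.

```typescript
// Hypothetical sketch of a 404 detector spanning both client generations.
// The real helper in the repository may differ.
export function isK8s404(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const e = err as {
    statusCode?: unknown;
    message?: unknown;
    response?: { statusCode?: unknown; status?: unknown } | null;
  };
  // v0.x (request-based client): status lives on response.statusCode
  if (e.response?.statusCode === 404) return true;
  // v1.0+ (fetch-based client): status lives on response.status
  if (e.response?.status === 404) return true;
  // Status code set directly on the error object
  if (e.statusCode === 404) return true;
  // Assumption: the message-based check looks for "404" or "not found"
  return typeof e.message === "string" && /\b404\b|not found/i.test(e.message);
}
```

A polling loop such as `waitForJobCompletion` could use this in its `catch` block to return `{ jobGone: true }` for a deleted Job and rethrow everything else.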

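Commit b9def0964e changes the `!parsed` error path to skip system/init events when looking for the first content line in stdout. A rough sketch of that scan is shown below. The helper name `firstNonSystemLine` is hypothetical, and the sketch assumes stdout is newline-delimited stream-json in which system events carry `"type":"system"`, as in the example quoted in the commit message.

```typescript
// Hypothetical helper: returns the first non-empty stdout line that is not a
// {"type":"system",...} event, or null when only system events (or nothing)
// were captured.
export function firstNonSystemLine(stdout: string): string | null {
  for (const raw of stdout.split("\n")) {
    const line = raw.trim();
    if (!line) continue;
    try {
      const event: unknown = JSON.parse(line);
      const isSystem =
        typeof event === "object" &&
        event !== null &&
        (event as { type?: unknown }).type === "system";
      if (isSystem) continue; // skip init and other system events
    } catch {
      // Not valid JSON: treat the line as plain content.
    }
    return line;
  }
  return null;
}
```

When this returns `null` for non-empty stdout, the caller knows Claude started but produced no content, which is the case where the commit surfaces the model name in the error instead of the raw init event.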
14 Commits
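
Item 10 of commit 1e517bb9bb says the nodeSelector and labels fields accept both multiline `key=value` text and JSON objects. One way this dual parsing might look is sketched here. The function name `parseKeyValueConfig` and the handling of malformed lines are assumptions, not taken from the repository.

```typescript
// Hypothetical parser accepting either a JSON object or key=value lines.
export function parseKeyValueConfig(input: string): Record<string, string> {
  const text = input.trim();
  const result: Record<string, string> = {};
  if (!text) return result;

  if (text.startsWith("{")) {
    // JSON form, e.g. {"disktype":"ssd"}
    const parsed: unknown = JSON.parse(text);
    if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
      throw new Error("expected a JSON object");
    }
    const obj = parsed as Record<string, unknown>;
    for (const key of Object.keys(obj)) {
      result[key] = String(obj[key]);
    }
    return result;
  }

  // Multiline form, one key=value pair per line
  for (const raw of text.split("\n")) {
    const line = raw.trim();
    if (!line) continue;
    const eq = line.indexOf("=");
    if (eq <= 0) continue; // Assumption: lines without a key are skipped
    result[line.slice(0, eq).trim()] = line.slice(eq + 1).trim();
  }
  return result;
}
```

Both `parseKeyValueConfig("disktype=ssd\nzone=us-east-1a")` and the equivalent JSON string yield the same plain object, so the rest of the adapter would not need to know which form the user typed.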