paperclip-adapter-claude-k8s

farhoodlabs/paperclip-adapter-claude-k8s

Author	SHA1	Message	Date
Chris Farhood	29a4e709d0	fix: sanitize agent/run/company labels to RFC 1123 (N4) Co-Authored-By: Claude Sonnet <noreply@anthropic.com> Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-24 00:00:56 +00:00
Chris Farhood	8a08e6a6ee	fix: relabel reattached Job with current run-id and session-id (N3) Co-Authored-By: Claude Sonnet <noreply@anthropic.com> Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-23 23:59:05 +00:00
Chris Farhood	c0dba8e904	fix: never auto-delete live K8s orphans; block on mismatch (#8) Co-Authored-By: Claude Sonnet <noreply@anthropic.com> Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-23 23:58:51 +00:00
Chris Farhood	b91859c258	refactor: extract classifyOrphan helper with decision matrix (#8) Co-Authored-By: Claude Sonnet <noreply@anthropic.com> Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-23 23:58:23 +00:00
Chris Farhood	f1433b05a6	fix: reserve paperclip.io/ and app.kubernetes.io/ label prefixes (N2) Co-Authored-By: Claude Sonnet <noreply@anthropic.com> Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-23 23:54:15 +00:00
Chris Farhood	f64694f894	fix: validate companyId/instanceId against path traversal (N1) Co-Authored-By: Claude Sonnet <noreply@anthropic.com> Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-23 23:53:18 +00:00
Gandalf the Greybeard	e86b14a677	0.1.34 v0.1.34	2026-04-23 23:35:02 +00:00
Gandalf the Greybeard	98f3821f91	fix: address remaining minor code review findings (FAR-15) - #9: match Paperclip container by name in k8s-client instead of trusting spec.containers[0], which could be a service-mesh sidecar - #11: key assistant-text dedup by (message.id, index) so legitimate duplicate content across turns isn't collapsed in the summary - #16: trim trailing hyphens from sanitized K8s names so truncation doesn't produce names ending in "-" Findings #5 (keepalive re-verify) and #6 (one-shot log dedup) were already addressed in the current code — verified during this review. #8 (orphan reattach behavior) requires a product decision on whether "new session wins" is intentional, so deferring. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-23 23:34:59 +00:00
Gandalf the Greybeard	21a02da00f	fix: prevent prompt Secret leak by attaching ownerReference to Job (FAR-15) When a large prompt creates a K8s Secret, it can orphan if the process crashes before the finally block runs. Now the Secret gets an ownerReference pointing to the Job after creation, so K8s GC cleans it up automatically. Also cleans up the Secret on job creation failure. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-23 23:29:47 +00:00
Gandalf the Greybeard	346f5cc1df	fix: prevent UTF-8 corruption when RTK truncation splits multi-byte codepoints (FAR-19) The trunc function in the RTK filter script now walks back from the truncation point past continuation bytes and checks whether the full codepoint fits, avoiding replacement characters from mid-codepoint slicing. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-23 23:28:28 +00:00
Gandalf the Greybeard	ef73586a41	fix: address 6 critical/minor code review findings (FAR-15) 1. Fix resources.* dotted-key config — UI fields now correctly read 2. Fix operator precedence bug in container status key (add parens) 3. Add missing RBAC checks to testEnvironment (jobs/list, secrets/*, pvc) 4. Add bail timer log message for debuggability 5. Make result-event detection robust to JSON whitespace variations 6. Remove namespace short-circuit so all checks run on first attempt Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-23 23:15:01 +00:00
Gandalf the Greybeard	9f79efdf36	0.1.33 v0.1.33	2026-04-23 22:45:37 +00:00
Gandalf the Greybeard	4210f51937	chore: update lockfile Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-23 22:45:31 +00:00
Gandalf the Greybeard	f41ae818ef	fix: fire onSpawn immediately on job terminal transition (FAR-14) Prevents process_lost false positives for 2-3 minute K8s jobs by resetting the reaper clock when the keepalive loop detects the job has completed (or been deleted), rather than waiting for the next periodic refresh. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-23 22:29:22 +00:00
Hugh Commit	baf7e2d44d	0.1.32: port prepareClaudePromptBundle to claude_k8s (FAR-12) Co-Authored-By: Paperclip <noreply@paperclip.ing> v0.1.32	2026-04-23 19:47:26 +00:00
Gandalf the Greybeard	77ed2004f8	fix: port prepareClaudePromptBundle flow to claude_k8s adapter (FAR-11) K8s Job pods were starting without the Paperclip skill loaded, so agents could not find their heartbeat procedure and reported "no issue content in my workspace" on every wake. Root cause: claude_local materialises skills into a PVC-backed prompt-bundle directory and passes --add-dir to Claude, but claude_k8s did neither. Changes: - Add src/server/prompt-cache.ts with prepareClaudePromptBundle (ported from adapter-claude-local). Writes skill symlinks and the agent's instructions file into a content-addressed bundle directory under the shared PVC (/paperclip/instances/.../claude-prompt-cache/<hash>/). - execute.ts: read desired skills and instructions file before building the Job manifest, then call prepareClaudePromptBundle and pass the resulting bundle to buildJobManifest. - job-manifest.ts: accept optional promptBundle in JobBuildInput; when present, pass --add-dir <bundle.addDir> and use bundle.instructionsFilePath for --append-system-prompt-file. Also fix: skip --append-system-prompt-file on session resumes to avoid wasting tokens on re-injection. - skills.ts: correct the detail string to reflect actual materialisation. - job-manifest.test.ts: add 5 new tests covering --add-dir injection, bundle path preference, session-resume skipping, and fallback behaviour. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-23 19:34:35 +00:00
Gandalf the Greybeard	69d0f4972f	test: regression for streamPodLogsOnce bail timer (FAR-10) Uses vi.mock on k8s-client and vi.useFakeTimers to prove that when logApi.log() never resolves (the FAR-10 hang shape) and stopSignal fires, streamPodLogsOnce still returns within the bail window (LOG_STREAM_BAIL_TIMEOUT_MS). Exports streamPodLogsOnce so the test can call it directly. Also covers the no-stopSignal happy path. 269/269 passing (+2 new). Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-23 16:43:32 +00:00
Gandalf the Greybeard	c7706d742f	0.1.31: harden streamPodLogsOnce with Promise.race bail (FAR-10) Defensive follow-up to the FAR-10 fix. The original patch aborts the in-flight follow stream by destroying the Writable once stopSignal fires, and relies on the @kubernetes/client-node library propagating that destroy into an abort of the underlying HTTP request. If that propagation ever fails (e.g. the client is awaiting a response that never arrives), logApi.log() can still hang forever. Adds a Promise.race with a 3s bail timer that starts when stopSignal fires. In the happy path (destroy-propagation works), logApi.log() resolves first and the bail timer is cleared. In the failure path, the bail timer fires and streamPodLogsOnce returns with whatever chunks were captured — preventing the hang from reaching execute(). No test change: existing 267 tests pass and the race path needs a k8s mock to exercise end-to-end; validated by monitoring real runs. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-23 16:36:51 +00:00
Gandalf the Greybeard	8937fb2804	chore: fix repo org farhoodliquor→farhoodlabs; wire NPM_TOKEN for publish - Update repository, bugs, and homepage URLs in package.json to use the correct farhoodlabs GitHub org - Add NODE_AUTH_TOKEN: NPM_TOKEN to the CI publish step so the newly added NPM_TOKEN secret is picked up for authentication Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-23 16:20:48 +00:00
Gandalf the Greybeard	77e9aa9b37	ci: switch npm publish to OIDC trusted publishing Replaces NPM_TOKEN secret with id-token: write + --provenance so publishing uses GitHub's OIDC token directly. No repository secret required; provenance attestation is generated automatically. Also collapses the redundant second setup-node step (registry-url is now set on the first one). Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-23 16:10:39 +00:00
Gandalf the Greybeard	683ea2d8b1	0.1.30 Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-23 16:08:22 +00:00
Chris Farhood	dd859c74a8	Merge pull request #9 from farhoodlabs/fix/far-10-process-lost-after-job-complete fix: prevent process_lost when K8s Job completes (FAR-10)	2026-04-23 12:07:33 -04:00
Gandalf the Greybeard	b3c1519cf5	fix: prevent process_lost when K8s Job completes (FAR-10) Four stacked bugs caused the adapter to hang after K8s Job completion, allowing the 5-minute reaper to mark runs process_lost even when the Job actually succeeded. - streamPodLogsOnce: add stopSignal polling loop that destroys the writable every 200ms once the job-completion branch fires, aborting any in-flight follow stream that would otherwise hang indefinitely - waitForPod: treat phase=Failed as a terminal error (throw via describePodTerminatedError) instead of entering the log-stream path with a dead pod (new helper is exported for unit tests) - waitForPod: surface cs.state?.terminated in the per-tick detail line so operators see exit code / reason without needing kubectl - keepalive: add POST_TERMINAL_KEEPALIVE_MS (90s) window after Job goes terminal so onSpawn keeps refreshing updatedAt during cleanup; if execute() genuinely stalls past 90s the reaper will still catch it Regression tests added for describePodTerminatedError (phase=Failed with and without claude container status). Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-23 15:59:51 +00:00
Test User	78fd702ccb	0.1.29 v0.1.29	2026-04-23 02:48:58 +00:00
Chris Farhood	0bc1bb1dd1	Merge pull request #8 from farhoodliquor/fix/far-124-reattach-orphan-k8s-jobs fix: reattach to orphaned K8s Jobs across Paperclip restarts (FAR-124)	2026-04-22 22:33:05 -04:00
Test User	c8968598e4	fix: reattach to orphaned K8s Jobs across Paperclip restarts (FAR-124) When the Paperclip pod restarts mid-run, the in-process setInterval keepalive dies, `updatedAt` goes stale, and the server's orphan reaper fails the run with the (misleading) "child pid 1 is no longer running" message. Paperclip then dispatches a continuation run, whose execute() finds the previous run's K8s Job still happily running and deletes it as an "orphan" — throwing away work and producing the transcript/run cascade reported on FAR-124. Changes: - job-manifest: add `paperclip.io/task-id` and `paperclip.io/session-id` labels (sanitized via new `sanitizeLabelValue` helper) so a later execute() can identify an orphan as the continuation of the same logical unit of work. - execute: in the concurrency guard, when `reattachOrphanedJobs` is on (default) and an orphan matches agent + task + session + is not terminal, pick it as the reattach target; delete only the other orphans. Branch the build/create/waitForPod block so the reattach path skips manifest building, Secret creation, Job creation, and scheduling wait — it jumps straight to streaming logs and waiting for the existing pod's completion. - config-schema: expose `reattachOrphanedJobs` toggle (default true). - Tests: `sanitizeLabelValue`, `isReattachableOrphan`, new label presence/absence, config default. No server-side changes; the misleading reaper message and lack of a non-local retry path will be addressed in a follow-up upstream PR. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-22 21:59:25 +00:00
Test User	a4631ac756	0.1.28 Co-Authored-By: Paperclip <noreply@paperclip.ing> v0.1.28	2026-04-22 19:52:51 +00:00
Chris Farhood	1fc6a9c626	Merge pull request #7 from farhoodliquor/fix/far-123-duplicate-output-logs fix(FAR-123): dedup replayed K8s log lines at the streaming UI boundary	2026-04-22 15:51:40 -04:00
Chris Farhood	d71ff15443	Merge pull request #6 from farhoodliquor/fix/far-122-partial-log-stream-and-error-message fix: partial log stream + better error messages for fast-exit containers (FAR-122 follow-up)	2026-04-22 15:51:05 -04:00
Test User	5e01ae99b3	fix: dedup replayed K8s log lines at the streaming UI boundary (FAR-123) The K8s log follow stream replays the trailing few seconds of output on every reconnect because `sinceSeconds` uses integer-second granularity with a small safety buffer. FAR-105 dedupped those replays at the final parser (parse.ts), but the streaming UI consumes raw onLog chunks and still showed each replayed assistant/tool event as a fresh entry — which is how the duplicate "Three nits to fix…" blocks in the screenshot appeared between successive tool calls. Fix: add a stateful line-level dedup filter around onLog, shared across reconnects. Claude stream-json events are keyed by their stable structural IDs (message.id, tool_use_id, session_id); non-JSON output (paperclip status lines, shell output) passes through unchanged. - New `src/server/log-dedup.ts` + tests: LogLineDedupFilter handles chunk-to-line buffering, replay dedup, and end-of-stream flush. - `streamPodLogs` instantiates one filter per run so dedup state persists across reconnect attempts. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-22 19:49:04 +00:00
Test User	8c8c2f2ec0	fix: address review nits — refactor fallbacks, add unit tests (FAR-122) - Merge both one-shot log fallbacks into a single conditional block using a cheap string-scan guard (`stdout.includes('"type":"result"')`) to avoid calling parseClaudeStreamJson twice and prevent double readPodLogs calls when the first fallback already ran. - Extract error-message logic into `buildPartialRunError(exitCode, model, stdout)` (exported for tests) so the `!parsed` branch is a one-liner and the logic is independently testable. - Export `isK8s404` for tests. - Add execute.test.ts with 15 unit tests covering: - isK8s404: v0.x response.statusCode, v1.0+ response.status, direct statusCode, message-based detection, non-404 codes - buildPartialRunError: exitCode=0 path, empty stdout, init-only output (model surfaced), first non-system content line, null exitCode (-1), multiple consecutive system events Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-22 19:42:57 +00:00
Test User	b9def0964e	fix: improve partial-log handling and error messages for fast-exit containers (FAR-122) - Add a second log fallback: if the follow stream captured partial output (init event present but no result event), attempt a one-shot readPodLogs before the pod is cleaned up. Fast-exiting containers (bad model, missing API key, etc.) can cause the follow stream to return only the init line before the connection drops; the one-shot read is more reliable for already-terminated containers. - Improve the `!parsed` error message: skip system/init events when searching for the first content line, so the error reads "Claude started but did not produce a result (model: MiniMax-M2.7) — check API credentials..." instead of "Claude exited with code -1: {"type":"system","subtype":"init",...}". Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-22 19:33:15 +00:00
Test User	20e7ec43ce	0.1.27 v0.1.27	2026-04-22 17:08:48 +00:00
Chris Farhood	3e67b34baa	Merge pull request #5 from farhoodliquor/fix/far-122-404-job-not-found fix: handle K8s 404 (job deleted) gracefully in waitForJobCompletion (FAR-122)	2026-04-22 13:08:03 -04:00
Test User	2a31fe1f9b	fix: handle K8s 404 (job deleted) gracefully in waitForJobCompletion (FAR-122) - Add `isK8s404()` helper compatible with @kubernetes/client-node v0.x and v1.0+ (checks response.statusCode, response.status, err.statusCode, and message text) - `waitForJobCompletion` now catches 404 and returns `{ jobGone: true }` instead of throwing — prevents uncaught exceptions when the K8s Job is TTL-deleted or externally removed while the adapter is polling for a terminal condition - Keepalive job-liveness check now uses `isK8s404` (was checking `response.statusCode` which is absent in the v1.0+ fetch-based client, silently breaking 404 detection) - `jobGone` case in completion handler logs a diagnostic and falls through to stdout parsing rather than returning an opaque 404 error to the user Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-22 16:42:07 +00:00
Test User	99c97c1fb2	feat: add native Node.js RTK output filtering (FAR-66) Replace the init-container RTK binary approach with a self-contained Node.js implementation. When `enableRtk: true` is set in adapter config, the job's main container startup: 1. Writes a Node.js filter script to /tmp/.rtk-filter.js (base64-encoded inline — no curl, no wget, no external binary download required). 2. Merges a PostToolUse hook into ~/.claude/settings.json so Claude Code runs the filter after every tool call. 3. The filter truncates tool_response/tool_result content that exceeds `rtkMaxOutputBytes` (default: 50 000 B), handling both string and array (text-block) content formats. New config fields: enableRtk toggle — off by default rtkMaxOutputBytes number — truncation threshold (default 50 000) 9 new tests cover: command shape, ordering, no-external-binary guarantee, threshold injection, PostToolUse hook presence, and filter-script logic. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-22 02:08:24 +00:00
Iceman	5926b302e5	fix: replace pid:-1 sentinel with process.pid to prevent false process_lost The adapter was calling onSpawn({ pid: -1 }) as a sentinel value for K8s Jobs (which run out-of-process), then the server's orphan reaper was checking isProcessAlive(-1) which always returns false, causing legitimate runs to be reaped as 'process_lost'. Using process.pid (the Paperclip server's own PID) is always alive while the adapter runs in-process, preventing false reaping. Fixes FAR-116. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-21 18:09:41 +00:00
Paperclip	31328dd85b	chore: unscope package name to paperclip-adapter-claude-k8s Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-21 10:26:43 +00:00
farhoodliquor-paperclip[bot]	0660749c1f	Merge pull request #3 from farhoodliquor/fix/p0-correctness-far107 fix: P0+P1 correctness fixes (FAR-107 PR 1-2/3)	2026-04-20 19:41:16 +00:00
Test User	b45cc29787	chore: bump version to 0.1.25 for PR #3 Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-20 19:40:26 +00:00
Test User	1e517bb9bb	fix: P1 correctness and operational fixes from FAR-104/FAR-105 analysis 5. Cap log stream reconnect attempts at 50 — prevents infinite reconnect loops during sustained API partitions. 6. Fire keepalive refresh earlier — tick 1 + every 12 ticks (~3min) instead of every 16 ticks (~4min), providing better safety margin under the 5-minute reaper window. 7. Catch rejections from onLog inside keepalive — add .catch(() => {}) to prevent unhandledRejection on SSE backpressure. 8. Prevent sanitized-name collisions — extend slugs to 16 chars each, add a 6-char SHA-256 hash suffix, shorten prefix to `ac-` to stay well within the 63-char DNS label limit. 10. Fix config-hint parity for nodeSelector and labels — parse both `key=value` multiline text and JSON objects, matching what the textarea hint promises. 11. Large-prompt fallback via Secret — prompts >256 KiB are staged as a K8s Secret and mounted as a volume instead of passed via env var, protecting against the ~1 MiB PodSpec limit. 13. Track last-seen log timestamp on reconnect — anchor sinceSeconds at the last received log line instead of stream start, fixing FAR-105 duplicative logs. Belt-and-braces: dedupe assistantTexts at the parser boundary in parse.ts. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-20 19:05:07 +00:00
Test User	d74b6d34b3	fix: P0 correctness fixes from FAR-104/FAR-105 analysis 1. Inherit envFrom and env.valueFrom from self pod — secrets wired via valueFrom.secretKeyRef or envFrom.secretRef are now forwarded to Job pods, fixing credentials silently dropped for K8s-idiomatic secret patterns (e.g. ANTHROPIC_API_KEY via Secret). 2. Distinguish 404 vs transient errors in keepalive — only mark the keepalive as terminal on 404 (Job deleted). Transient 5xx/connection errors are logged and retried on the next tick, preventing premature reaper kills during API instability. 3. Fail closed on concurrency-guard read failure — a failing listNamespacedJob now returns k8s_concurrency_guard_unreachable instead of silently proceeding, protecting against zombie Jobs on shared PVCs. 4. Bound the waitForJobCompletion re-check — pass a 60s timeout instead of polling forever, preventing indefinite hangs when the K8s API is degraded. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-20 18:57:16 +00:00
Test User	c35253ddd4	0.1.24 v0.1.24	2026-04-20 18:03:53 +00:00
Test User	5f358b2a26	chore: update package-lock.json Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-20 18:03:50 +00:00
Test User	5c28e6c191	fix: use printf instead of echo in init container to prevent prompt corruption Busybox echo interprets escape sequences by default (\c, \n, \t, \0NNN, etc.). If the prompt contains \c (common in file paths or shell references), echo silently stops output at that point, truncating the prompt file. This can leave Claude CLI with an empty or garbled stdin, causing it to hang with zero output — manifesting as endless keepalive messages in the UI. printf '%s' passes content through verbatim, avoiding the issue. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-20 18:03:37 +00:00
Test User	465a947e1d	0.1.23 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> 0.1.23	2026-04-20 16:10:40 +00:00
Test User	ecd8bfc7f6	fix: correct Bedrock model list — add Sonnet 4.6, fix Sonnet 4.5 version - Add missing us.anthropic.claude-sonnet-4-6 entry - Correct sonnet version from v2:0 to v1:0 (verified against AWS docs) - All model IDs verified against current Bedrock documentation Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-20 16:01:11 +00:00
Test User	b14ec960ae	0.1.22 v0.1.22	2026-04-17 02:58:08 +00:00
Test User	5f5ae92ce7	fix: skip keepalive updatedAt refresh once K8s Job is terminal The previous fix (`df856e6`) made the keepalive timer call onSpawn every ~4 minutes to refresh the run's updatedAt in the DB, so the stale-run reaper wouldn't kill live runs in multi-instance deployments. That was correct for live jobs, but it was unconditional — if execute() stalled after the pod terminated (slow K8s API call, hung log stream drain, or a Job whose Complete condition lags pod termination), the keepalive kept the run marked "alive" indefinitely even though the pod was gone. That manifests as the opposite of the original bug: the UI shows jobs as running when they have actually finished. Two changes: 1. Verify the Job is still alive before the keepalive refreshes updatedAt. If the Job has reached a terminal Complete/Failed condition (or has been deleted / the API read fails), stop refreshing. If execute() truly ends up stuck past that point, the reaper will catch the run within the normal 5-minute staleness window instead of never. 2. Clear the keepalive interval immediately once Promise.allSettled resolves, rather than only in the finally block. Post-completion work (exit-code fetch, log fallback read, job cleanup) must not be able to emit another onSpawn refresh that keeps the run "alive". Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-17 02:57:17 +00:00
Chris Farhood	20b85b8391	feat: add serviceAccountName field to config schema Surface SA assignment in the Kubernetes section of the adapter UI. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-16 20:06:43 -04:00
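
Commit 346f5cc1df (FAR-19) describes truncating RTK output without splitting a multi-byte UTF-8 codepoint. The sketch below illustrates that general idea only; it is not the adapter's `trunc` function. The name `truncateUtf8` and its signature are hypothetical, and this variant simply drops any codepoint that straddles the cut rather than checking whether the full codepoint fits.

```typescript
// Hypothetical helper for illustration; not taken from the repository.
// Truncates `text` to at most `maxBytes` bytes of UTF-8 without cutting
// through the middle of a multi-byte codepoint.
function truncateUtf8(text: string, maxBytes: number): string {
  const bytes = new TextEncoder().encode(text);
  if (bytes.length <= maxBytes) return text;

  let end = Math.max(0, maxBytes);
  // A continuation byte has the bit pattern 10xxxxxx. If the first byte
  // after the cut is a continuation byte, the cut falls inside a codepoint,
  // so step back until the cut sits on a codepoint boundary.
  while (end > 0 && (bytes[end] & 0xc0) === 0x80) {
    end--;
  }
  return new TextDecoder().decode(bytes.subarray(0, end));
}

// "é" is two bytes in UTF-8, so a 2-byte budget keeps only "h".
console.log(truncateUtf8("héllo", 2));
```

Walking back at most three bytes is enough for valid UTF-8, since a codepoint is never longer than four bytes.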

86 Commits
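
Commits 29a4e709d0 (N4) and 98f3821f91 (#16) mention sanitizing names to RFC 1123 and trimming trailing hyphens left by truncation. A minimal sketch of that kind of sanitizer follows, assuming the usual DNS-label rules (lowercase alphanumerics and `-`, at most 63 characters, alphanumeric at both ends). The function name `toRfc1123Label` is hypothetical, and the adapter's own helpers (such as `sanitizeLabelValue`) may behave differently.

```typescript
// Hypothetical helper for illustration; not taken from the repository.
// Coerces an arbitrary string into an RFC 1123 DNS-label shape.
function toRfc1123Label(raw: string, maxLen: number = 63): string {
  return raw
    .toLowerCase()
    // Collapse each run of disallowed characters into a single hyphen.
    .replace(/[^a-z0-9-]+/g, "-")
    // Truncate before trimming, so a cut that lands on "-" is cleaned up.
    .slice(0, maxLen)
    // Labels must start and end with an alphanumeric character.
    .replace(/^-+|-+$/g, "");
}

console.log(toRfc1123Label("Agent_42/Run#7")); // agent-42-run-7
```

The result can be empty when the input has no alphanumeric characters, so a caller would need a fallback value for that case.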