paperclip-adapter-claude-k8s

farhoodlabs/paperclip-adapter-claude-k8s

Author	SHA1	Message	Date
Chris Farhood	76fc6fcdfc	fix: surface pod terminated reason/message in adapter_failed errors (FAR-100) The init-only and partial-run error paths now embed the K8s container terminated state (reason, message, signal, OOM hint) directly in the errorMessage. This eliminates the kubectl round-trip when diagnosing adapter_failed runs — the surfaced error self-explains. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-26 14:48:12 +00:00
Chris Farhood	3169f49f23	0.1.47 v0.1.47	2026-04-26 13:04:54 +00:00
Chris Farhood	e0b35d230f	fix: distinguish init-only non-zero exits in buildPartialRunError (FAR-100) Init-only runs that exit with a non-zero code now surface a more actionable message naming the exit code and the likely cause (unsupported model or rejected session) instead of the generic "did not produce a result" text. Helps operators diagnose model-id / billing-tier failures (e.g. opus 4.6). Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-26 13:04:43 +00:00
Chris Farhood	4e2c36319d	0.1.46 v0.1.46	2026-04-26 01:57:43 +00:00
Chris Farhood	8474f78fe1	fix: include pod terminated reason/message in claude_truncated error (FAR-95) Capture the claude container's terminated state (exit code, reason, message, signal) and surface it in the truncation error so operators see why the run was cut short — e.g. "exit code 137, SIGKILL (commonly OOMKilled), reason=OOMKilled, message=Memory cgroup out of memory" instead of just a "truncated" label with no diagnostic context. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-26 01:57:43 +00:00
Chris Farhood	88896eddcf	0.1.45 v0.1.45	2026-04-26 01:54:48 +00:00
Chris Farhood	a2874c0426	fix: detect mid-stream truncation and emit claude_truncated error code (FAR-95) When Claude produces assistant content (output_tokens > 0) but the stream ends without a result event, classify the run as truncated mid-stream rather than falling through to the generic "did not produce a result — check API credentials" message. The misleading hint pointed operators at auth/model config when the real cause was pod termination, OOMKill, or CLI crash. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-26 01:54:35 +00:00
Chris Farhood	818aa0f1d6	feat: log bundled skill names and add skills to onMeta commandNotes (FAR-36) Adds a diagnostic log line after skill resolution so operators can see exactly which skills were bundled into each run, making it straightforward to diagnose skill availability issues. Also surfaces the skill list in the onMeta commandNotes for run metadata visibility. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 20:41:01 +00:00
Chris Farhood	55fd3021fb	fix: add per-agent mutex to eliminate TOCTOU race in K8s concurrency guard (FAR-29) Two concurrent execute() calls for the same agent can both pass the list-then-create guard before either job appears in the other's query. The new module-level agentCreationMutex serializes the guard+create phase within the process so only one call enters listNamespacedJob at a time. The mutex is acquired after sanitizing the agent ID and released in a finally block that wraps the entire guard+create section, so all early return paths (guard blocks, create failures) cleanly release it. Variables used in both the guard+create and log-streaming phases are hoisted to before the try block. Cross-agent calls use separate mutex slots and are unaffected. Added two vitest cases verifying same-agent serialization and that different-agent calls are not serialized. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-24 20:10:01 +00:00
Chris Farhood	83b58f9207	fix: detect stop_reason:null + output_tokens:0 and emit llm_api_error (FAR-30) parseClaudeStreamJson now tracks assistant events with stop_reason:null and output_tokens:0 (the MiniMax degraded-response pattern). When no result event follows, execute() returns errorCode:"llm_api_error" with a descriptive message instead of the generic adapter_failed. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-24 20:00:42 +00:00
Chris Farhood	602afa9b84	fix: return k8s_job_deleted_externally error code when job deleted mid-run (FAR-31) When a K8s Job is deleted externally (kubectl delete job or TTL before terminal condition observed) and stdout has no result event, the adapter now returns errorCode "k8s_job_deleted_externally" with the message "K8s Job was deleted externally before Claude could complete" instead of the misleading "Claude exited with code -1". Tracks a jobDeletedExternally flag in execute() on the jobGone path and checks it in the !parsed branch before falling through to buildPartialRunError. Only applies when exitCode is null (pod gone alongside the job). Adds regression test: FAR-31 scenario where job 404s mid-run with partial stdout and missing pod produces the new error code. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-24 19:58:46 +00:00
Chris Farhood	986f2fc7fa	test: add coverage for deletionTimestamp concurrency guard bypass (FAR-34) Verifies that a terminating K8s job (deletionTimestamp set, no Complete/Failed condition) is skipped by the concurrency guard so subsequent heartbeat runs are not incorrectly blocked. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-24 19:57:10 +00:00
Chris Farhood	357f035418	fix: skip K8s jobs with deletionTimestamp in concurrency guard (FAR-34) Jobs being deleted via kubectl enter a Terminating state where deletionTimestamp is set but no Complete/Failed condition is added. The concurrency guard previously treated these as running, blocking all subsequent heartbeat runs for the agent until the job fully disappeared from the K8s API. Co-Authored-By: Paperclip <noreply@paperclip.ing> 0.1.43	2026-04-24 18:36:19 +00:00
Chris Farhood	f340ce52ee	0.1.42 v0.1.42	2026-04-24 17:56:14 +00:00
Chris Farhood	ecc477d0be	fix: stream raw stream-json to onLog so Paperclip UI renders structured transcript entries (FAR-32) The prior approach (commit `b607657`) converted Claude's stream-json into flat plain text before calling onLog. This stripped the structure the Paperclip UI needs — its adapter ui-parser (src/ui-parser.ts, exported via the package's ./ui-parser entry) expects raw stream-json lines and emits structured transcript entries (assistant / thinking / tool_call / tool_result / init / result) that the UI renders as rich blocks, just like claude_local. claude_local passes stdout through unchanged to onLog for the same reason — the server persists raw lines and the UI parser turns them into rendered transcript entries. Mirror that here. formatClaudeStreamLine stays as an internal helper for future CLI use, but is no longer applied in the K8s streaming path. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-24 17:56:10 +00:00
Chris Farhood	f9ba77527a	0.1.41 v0.1.41	2026-04-24 17:43:16 +00:00
Chris Farhood	f304c70899	fix: keep formatClaudeStreamLine internal to avoid ESM hot-reload link failure (FAR-32) Exposing formatClaudeStreamLine at the package root caused Paperclip reinstalls to fail with "'./cli/index.js' does not provide an export named 'formatClaudeStreamLine'". The host process caches child ESM module records across reinstalls; linking the new dist/index.js re-export against the cached old dist/cli/index.js fails. The symbol is only used internally by server/execute.ts (which imports from ./cli/format-event.js directly), so drop the public re-export. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-24 17:43:16 +00:00
Chris Farhood	727d9494da	0.1.40 v0.1.40	2026-04-24 17:35:08 +00:00
Chris Farhood	b60765785b	feat: format Claude stream-json events in K8s streaming path for consistency with claude_local (FAR-32) All output sent to Paperclip via onLog now passes through formatClaudeStreamLine, converting raw stream-json blobs into human-readable text consistent with how the CLI and claude_local adapter format events. Changes: - format-event.ts: add formatClaudeStreamLine(raw) -> string | null Plain-text equivalent of printClaudeStreamEvent: no ANSI colours, returns null for lines to suppress (assistant with no content, unknown events). Handles: system/init, assistant (text/thinking/tool_use), user (tool_result), result (summary + tokens), rate_limit_event. Non-JSON lines pass through. - execute.ts: wire formatClaudeStreamLine into streamPodLogsOnce write handler. raw chunks still stored in 'chunks[]' for parseClaudeStreamJson; only the onLog path receives formatted text. - 12 new tests for formatClaudeStreamLine covering all event types. - 352/352 tests pass. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-24 17:26:37 +00:00
Chris Farhood	28d6451265	feat: add rate_limit_event formatting to printClaudeStreamEvent (FAR-32) rate_limit_event was previously falling through to the debug-only branch and silently dropped in non-debug mode. Now it surfaces a concise, human-readable line for CLI consumers: rate_limit: type=five_hour status=allowed resets=2026-04-22T06:00:00.000Z Two tests cover the exact FAR-32 repro payload and graceful handling of missing rate_limit_info fields. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-24 17:22:15 +00:00
Chris Farhood	cabdc3df98	fix: skip all structured streaming events in buildPartialRunError (FAR-32 followup) Extends the previous fix (which only covered assistant/user) to skip every JSON object with a non-empty "type" field — system, assistant, user, rate_limit_event, result, and any future event types. This prevents all structured protocol artefacts from being surfaced verbatim as error messages. Root cause of the new repro: when Claude emits a rate_limit_event before producing output and then exits without a result event, the rate_limit_event JSON blob was becoming the "first content line" and appearing in the error: Claude exited with code -1: {"type":"rate_limit_event","rate_limit_info":{...}} With this fix, all typed events are filtered and the initOnlyOutput branch fires, producing the clean diagnostic: Claude started but did not produce a result (model: claude-opus-4-7) — check API credentials, model support, and adapter config Updated the "result event as content" test to match the new (correct) behaviour: in production buildPartialRunError is only called when parseClaudeStreamJson returns null (no result event), so the prior test was exercising a degenerate state that cannot occur through execute(). Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-24 17:17:48 +00:00
Chris Farhood	f9ff04a354	fix: skip assistant/user events in buildPartialRunError to avoid raw JSON blobs in error messages (FAR-32) When a model produces assistant events with output_tokens=0 but no result event (e.g. MiniMax-M2.7 thinking-only output), the partial-run error previously surfaced the raw assistant JSON blob verbatim, producing an unreadable message like "Claude exited with code -1: {\"type\":\"assistant\",...}". Fix: extend the content-line filter in buildPartialRunError to also skip assistant and user event types (intermediate streaming events), in addition to system events. result events are still retained since they may carry useful terminal error details. When all stdout lines are filtered, the existing initOnlyOutput branch triggers and surfaces a clean diagnostic: "Claude started but did not produce a result (model: MiniMax-M2.7) — check API credentials, model support, and adapter config". Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-24 17:11:20 +00:00
Chris Farhood	e611f26d32	0.1.39 v0.1.39	2026-04-24 15:20:59 +00:00
Chris Farhood	f097440f3c	feat: implement cancel support via keepalive poll and SIGTERM handler (FAR-26) - Poll GET /api/heartbeat-runs/:runId on every keepalive tick (15s); when status != 'running', delete the K8s Job, set logStopSignal, and return errorCode='cancelled' — Job gone within ~15s of external cancellation. - SIGTERM handler best-effort deletes all active Jobs/Secrets and re-emits the signal to let the process exit naturally. - Export shouldAbortForCancellation() helper; add tests for helper, cancel poll path, and SIGTERM cleanup. - Guard: PAPERCLIP_API_URL missing logs a warning and skips cancel polling; HTTP 5xx from poll treated as transient; reattach path skips cancel poll. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-24 15:20:45 +00:00
Chris Farhood	c55d6c61fc	feat: declare hasOutOfProcessLiveness and remove onSpawn workarounds (FAR-24) - Add `hasOutOfProcessLiveness: true` to createServerAdapter() so the reaper skips local PID checks and uses the staleness window instead. - Remove the initial onSpawn call and all periodic keepalive onSpawn refreshes that were compensating for the missing flag. - Remove POST_TERMINAL_KEEPALIVE_MS constant and keepaliveTick counter that backed those workarounds. - Cast required: adapter-utils ServerAdapterModule type predates this field. - Bump to 0.1.38. Co-Authored-By: Paperclip <noreply@paperclip.ing> v0.1.38	2026-04-24 14:14:10 +00:00
Chris Farhood	32d6308eae	0.1.37 v0.1.37	2026-04-24 13:11:13 +00:00
Chris Farhood	b97117e10d	test: mock readPaperclipRuntimeSkillEntries to eliminate real fs I/O under fake timers Previously the test suite relied on real fs.stat completing within the fake timer advance window (~11200ms). Under CI with 11 parallel test files the I/O could drain later than the advances allowed, causing a 1-in-4 timeout on the "logs pod pending" test. Fix: mock @paperclipai/adapter-utils/server-utils using vi.hoisted() + Object.assign so readPaperclipRuntimeSkillEntries resolves immediately as a microtask. All other exports are forwarded to the real module via importOriginal. Each beforeEach that calls vi.resetAllMocks() or vi.clearAllMocks() now also calls mockReadSkillEntries.mockResolvedValue([]) to restore the implementation. Timer advances in affected tests are simplified to reflect the purely fake-timer sequence (no I/O drain prefix). All 323 tests pass deterministically. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 13:11:04 +00:00
Chris Farhood	abdce817f3	0.1.36 Co-Authored-By: Paperclip <noreply@paperclip.ing> v0.1.36	2026-04-24 12:36:21 +00:00
Chris Farhood	f9d8a2e0ce	fix: resolve grace-period deadlock for stale UI status (FAR-23) The log-stream-exit grace timer never fired because logExitTime was set in the .then() of streamPodLogs, which only resolves once stopSignal is set — but stopSignal is only set when completionWithGrace fires, which requires logExitTime to be non-null. Classic deadlock. Fix: add onFirstStreamExit callback to streamPodLogs, called after attempt=0's streamPodLogsOnce returns (the first container exit signal). execute() passes a closure that sets logExitTime immediately, breaking the circular dependency and allowing the 30s grace timer to fire correctly when K8s Job conditions lag container exit. Tests: all 323 pass including the two FAR-23 grace-period regression tests. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-24 12:20:10 +00:00
Chris Farhood	a7dfd5d502	test: fix flaky execute.ts timer tests and hit 80%+ line coverage readPaperclipRuntimeSkillEntries does real fs.stat I/O under fake timers, delaying execute()'s fake-timer registration by ~3200-4200ms of fake time when tests run in isolation (cold OS page cache). The previous approach tried vi.spyOn on an ESM module namespace export, which throws "Cannot redefine property" — a fundamental ESM constraint. Fix: remove the broken spy. Instead, each timer-heavy test now uses enough advanceTimersByTimeAsync calls to (a) give the event loop sufficient turns for the I/O to drain, and (b) cover the full fake-timer sequence even with the maximum observed I/O delay. Patterns chosen: reconnects (needs t+6000): 6 advances, ~12200ms total deadline exceeded (needs t+3000): 5 advances, ~8400ms total pod-creation wait (needs t+5000): 5 advances, ~9400ms total execute.ts line coverage: 82.57% (was ~24% before this task's test additions). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 04:10:49 +00:00
Chris Farhood	e310ba4156	0.1.35 v0.1.35	2026-04-24 00:44:59 +00:00
Chris Farhood	ae7adb0847	docs: add enableRtk, rtkMaxOutputBytes, reattachOrphanedJobs to config doc (N6) Co-Authored-By: Claude Sonnet <noreply@anthropic.com> Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-24 00:01:57 +00:00
Chris Farhood	d24510172e	fix: remove misleading dangerouslySkipPermissions UI toggle (N5) Co-Authored-By: Claude Sonnet <noreply@anthropic.com> Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-24 00:01:38 +00:00
Chris Farhood	29a4e709d0	fix: sanitize agent/run/company labels to RFC 1123 (N4) Co-Authored-By: Claude Sonnet <noreply@anthropic.com> Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-24 00:00:56 +00:00
Chris Farhood	8a08e6a6ee	fix: relabel reattached Job with current run-id and session-id (N3) Co-Authored-By: Claude Sonnet <noreply@anthropic.com> Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-23 23:59:05 +00:00
Chris Farhood	c0dba8e904	fix: never auto-delete live K8s orphans; block on mismatch (#8) Co-Authored-By: Claude Sonnet <noreply@anthropic.com> Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-23 23:58:51 +00:00
Chris Farhood	b91859c258	refactor: extract classifyOrphan helper with decision matrix (#8) Co-Authored-By: Claude Sonnet <noreply@anthropic.com> Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-23 23:58:23 +00:00
Chris Farhood	f1433b05a6	fix: reserve paperclip.io/ and app.kubernetes.io/ label prefixes (N2) Co-Authored-By: Claude Sonnet <noreply@anthropic.com> Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-23 23:54:15 +00:00
Chris Farhood	f64694f894	fix: validate companyId/instanceId against path traversal (N1) Co-Authored-By: Claude Sonnet <noreply@anthropic.com> Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-23 23:53:18 +00:00
Gandalf the Greybeard	e86b14a677	0.1.34 v0.1.34	2026-04-23 23:35:02 +00:00
Gandalf the Greybeard	98f3821f91	fix: address remaining minor code review findings (FAR-15) - #9: match Paperclip container by name in k8s-client instead of trusting spec.containers[0], which could be a service-mesh sidecar - #11: key assistant-text dedup by (message.id, index) so legitimate duplicate content across turns isn't collapsed in the summary - #16: trim trailing hyphens from sanitized K8s names so truncation doesn't produce names ending in "-" Findings #5 (keepalive re-verify) and #6 (one-shot log dedup) were already addressed in the current code — verified during this review. #8 (orphan reattach behavior) requires a product decision on whether "new session wins" is intentional, so deferring. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-23 23:34:59 +00:00
Gandalf the Greybeard	21a02da00f	fix: prevent prompt Secret leak by attaching ownerReference to Job (FAR-15) When a large prompt creates a K8s Secret, it can orphan if the process crashes before the finally block runs. Now the Secret gets an ownerReference pointing to the Job after creation, so K8s GC cleans it up automatically. Also cleans up the Secret on job creation failure. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-23 23:29:47 +00:00
Gandalf the Greybeard	346f5cc1df	fix: prevent UTF-8 corruption when RTK truncation splits multi-byte codepoints (FAR-19) The trunc function in the RTK filter script now walks back from the truncation point past continuation bytes and checks whether the full codepoint fits, avoiding replacement characters from mid-codepoint slicing. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-23 23:28:28 +00:00
Gandalf the Greybeard	ef73586a41	fix: address 6 critical/minor code review findings (FAR-15) 1. Fix resources.* dotted-key config — UI fields now correctly read 2. Fix operator precedence bug in container status key (add parens) 3. Add missing RBAC checks to testEnvironment (jobs/list, secrets/*, pvc) 4. Add bail timer log message for debuggability 5. Make result-event detection robust to JSON whitespace variations 6. Remove namespace short-circuit so all checks run on first attempt Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-23 23:15:01 +00:00
Gandalf the Greybeard	9f79efdf36	0.1.33 v0.1.33	2026-04-23 22:45:37 +00:00
Gandalf the Greybeard	4210f51937	chore: update lockfile Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-23 22:45:31 +00:00
Gandalf the Greybeard	f41ae818ef	fix: fire onSpawn immediately on job terminal transition (FAR-14) Prevents process_lost false positives for 2-3 minute K8s jobs by resetting the reaper clock when the keepalive loop detects the job has completed (or been deleted), rather than waiting for the next periodic refresh. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-23 22:29:22 +00:00
Hugh Commit	baf7e2d44d	0.1.32: port prepareClaudePromptBundle to claude_k8s (FAR-12) Co-Authored-By: Paperclip <noreply@paperclip.ing> v0.1.32	2026-04-23 19:47:26 +00:00
Gandalf the Greybeard	77ed2004f8	fix: port prepareClaudePromptBundle flow to claude_k8s adapter (FAR-11) K8s Job pods were starting without the Paperclip skill loaded, so agents could not find their heartbeat procedure and reported "no issue content in my workspace" on every wake. Root cause: claude_local materialises skills into a PVC-backed prompt-bundle directory and passes --add-dir to Claude, but claude_k8s did neither. Changes: - Add src/server/prompt-cache.ts with prepareClaudePromptBundle (ported from adapter-claude-local). Writes skill symlinks and the agent's instructions file into a content-addressed bundle directory under the shared PVC (/paperclip/instances/.../claude-prompt-cache/<hash>/). - execute.ts: read desired skills and instructions file before building the Job manifest, then call prepareClaudePromptBundle and pass the resulting bundle to buildJobManifest. - job-manifest.ts: accept optional promptBundle in JobBuildInput; when present, pass --add-dir <bundle.addDir> and use bundle.instructionsFilePath for --append-system-prompt-file. Also fix: skip --append-system-prompt-file on session resumes to avoid wasting tokens on re-injection. - skills.ts: correct the detail string to reflect actual materialisation. - job-manifest.test.ts: add 5 new tests covering --add-dir injection, bundle path preference, session-resume skipping, and fallback behaviour. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-23 19:34:35 +00:00
Gandalf the Greybeard	69d0f4972f	test: regression for streamPodLogsOnce bail timer (FAR-10) Uses vi.mock on k8s-client and vi.useFakeTimers to prove that when logApi.log() never resolves (the FAR-10 hang shape) and stopSignal fires, streamPodLogsOnce still returns within the bail window (LOG_STREAM_BAIL_TIMEOUT_MS). Exports streamPodLogsOnce so the test can call it directly. Also covers the no-stopSignal happy path. 269/269 passing (+2 new). Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-23 16:43:32 +00:00
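
Commits 29a4e709d0 (N4) and 98f3821f91 (#16) describe sanitizing labels to RFC 1123 and trimming trailing hyphens so that truncation never yields a name ending in "-". The log does not show the adapter's code, so the following is only a minimal sketch of that idea. The function name `sanitizeRfc1123Label` and the constant `MAX_LABEL_LENGTH` are hypothetical and do not come from the repository.

```typescript
// Hypothetical helper illustrating the idea behind N4 / #16.
// It is not the adapter's actual implementation.

// RFC 1123 labels are limited to 63 characters.
const MAX_LABEL_LENGTH = 63;

function sanitizeRfc1123Label(raw: string, maxLength: number = MAX_LABEL_LENGTH): string {
  const cleaned = raw
    .toLowerCase()
    // Replace each run of characters outside [a-z0-9-] with a single hyphen.
    .replace(/[^a-z0-9-]+/g, "-")
    // A label must start with an alphanumeric character.
    .replace(/^-+/, "");

  // Truncate first, then trim trailing hyphens, so a cut that lands
  // right after a hyphen cannot leave a name ending in "-".
  return cleaned.slice(0, maxLength).replace(/-+$/, "");
}
```

An input made up entirely of disallowed characters sanitizes to an empty string, so a caller would need a fallback value in that case. The order of operations is the point of the sketch: trimming before truncation would let a hyphen reappear at the end of the cut string.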


119 Commits
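
Commit 346f5cc1df (FAR-19) describes walking back from the truncation point past UTF-8 continuation bytes so that a multi-byte codepoint is never split. The RTK filter script itself is not shown in this log, so the sketch below only illustrates the general technique. The function name `truncateUtf8` is hypothetical, and the sketch assumes the global `TextEncoder` and `TextDecoder` are available (as they are in Node.js and browsers).

```typescript
// Hypothetical illustration of codepoint-safe truncation (FAR-19).
// It is not the RTK filter script's actual trunc function.

function truncateUtf8(text: string, maxBytes: number): string {
  const bytes = new TextEncoder().encode(text);
  if (bytes.length <= maxBytes) return text; // already fits
  if (maxBytes <= 0) return "";

  let end = maxBytes;
  // bytes[end] is the first byte that would be dropped. If it is a
  // continuation byte (bit pattern 10xxxxxx), the cut falls inside a
  // codepoint, so walk back until the cut sits on a codepoint boundary.
  while (end > 0 && (bytes[end] & 0xc0) === 0x80) {
    end--;
  }
  return new TextDecoder().decode(bytes.subarray(0, end));
}
```

Slicing the raw bytes at `maxBytes` without the walk-back would leave a partial sequence that decodes to U+FFFD, which is the corruption the commit message describes. The result may be up to three bytes shorter than `maxBytes`, since a UTF-8 codepoint occupies at most four bytes.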