farhoodlabs/paperclip-adapter-claude-k8s

Author	SHA1	Message	Date
Chris Farhood	5179544fd6	docs: mark repo as abandoned in favor of paperclip-plugin-k8s Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-01 08:46:55 -04:00
Chris Farhood	160d6b49e9	0.2.5 v0.2.5	2026-04-30 09:06:19 -04:00
Chris Farhood	9007762390	chore(deps): bump @paperclipai/adapter-utils from canary.7 pin to ^2026.428.0 The peerDep floor and devDep were pinned to a pre-release canary from April 15, 13 days behind the current stable. Move both to the latest stable 2026.428.0. All 328 tests pass against the new types; the imported surface (asString, parseObject, runChildProcess, AdapterExecutionContext, etc.) is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 09:06:12 -04:00
Chris Farhood	506007984c	0.2.4 v0.2.4	2026-04-30 08:46:57 -04:00
Chris Farhood	7a6d1a44f2	fix(ui-parser): restore esbuild CJS bundle step lost in PR #11 merge Commit `0e43811` added an esbuild step to bundle src/ui-parser.ts as CJS because the UI's sandboxed worker can't evaluate ESM `export` syntax. PR #11 (filesystem-log-tail) was based on a commit predating that fix, so the merge clobbered both the build:ui-parser script and the esbuild devDependency. Every release since has shipped a tsc-emitted ESM ui-parser.js that the worker silently fails to load — parseStdoutLine never registers and the run transcript falls back to dumping raw stream-json lines as plain text instead of rendering structured assistant/thinking/tool_call/tool_result entries. Restore the script and dep verbatim from `0e43811`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 08:46:46 -04:00
Chris Farhood	3960d746f4	ci: serialize publish jobs sharing the same SHA to fix race When a tagged release lands on master, both the master-push and tag-push events trigger the publish job. The skip-on-exists check (`npm view`) runs concurrently on both, both see the version as not-yet-published, and both proceed to `npm publish`. The first wins; the second gets E403 ("cannot publish over previously published versions") and reds out the run. Fixes the race by adding a publish-${{ github.sha }} concurrency group so the second run queues until the first finishes — by then npm view sees the published version and the skip path takes over cleanly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 08:28:01 -04:00
Chris Farhood	cc942ca818	0.2.3 v0.2.3	2026-04-30 08:03:08 -04:00
Chris Farhood	83a2d25062	fix(execute): assign captured stdout to outer binding so parse sees it The filesystem-tail rewrite (`8bd5042`) declared `const stdout` inside the try block, shadowing the outer `let stdout = ""`. parseClaudeStreamJson then ran on the empty outer binding, so every run failed with "Failed to parse Claude JSON output" and resultJson={stdout:""} despite live log-streaming working fine. Drop the `const` so the assignment lands on the outer let. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 08:03:05 -04:00
Chris Farhood	c8429cfde1	fix: write logs to /paperclip/instances/default/data/run-logs/ to match server PVC layout v0.2.1 introduced filesystem-tail log delivery with buildPodLogPath() returning /paperclip/instances/default/run-logs/... but the paperclip server creates and tails from /paperclip/instances/default/data/run-logs/ on the shared PVC. The missing /data/ segment meant: 1. The init container's mkdir -p /paperclip/instances/... ran in a directory busybox UID 1000 can't write to — it's the init container's ephemeral rootfs, since the PVC is only mounted in the main container. Init exited 1, the && short-circuited, and the prompt copy never happened. Job failed with "Init container 'write-prompt' failed with exit code 1". 2. Even if the mkdir had worked, the main container's tee would have written to a path the server doesn't tail. Fix: drop the misplaced mkdir from both init container variants and correct buildPodLogPath() to include /data/. The directory already exists on the PVC because the paperclip server creates it; both containers run as UID 1000 with fsGroup 1000, so the main container's tee writes to the pre-existing path with no setup needed. Bump to 0.2.2. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-29 22:15:15 -04:00
Chris Farhood	1502039d70	Merge pull request #11 from farhoodlabs/feat/filesystem-log-tail feat: replace k8s log API streaming with filesystem tailing	2026-04-27 22:26:02 -04:00
Chris Farhood	c326d2571e	fix(ci): run on tags, publish on both master push and tags v0.2.0	2026-04-27 22:25:48 -04:00
Chris Farhood	e6df8fad98	chore: bump to 0.2.0	2026-04-27 22:20:00 -04:00
Chris Farhood	8bd5042b5d	feat: replace k8s log API streaming with filesystem tailing Replaces K8s log API streaming (which was dropping every ~3 seconds at production scale) with filesystem tailing via tee to a pod log file on the shared PVC. Core changes: - Add tee to claudeInvocation to write pod log file - Add mkdir -p to init container to create log directory - Add assertSafePathComponent and buildPodLogPath helper - Add tailPodLogFile function with adaptive 250ms/1s polling - Replace k8s log streaming with tailPodLogFile in Promise.allSettled - Delete log-dedup.ts (RTK output truncation no longer needed) - Update config-schema.ts and index.ts to remove RTK references - Clean up log file in cleanupJob when retainJobs=false Note: 14 tests in execute.test.ts test the obsolete k8s log streaming approach and need to be rewritten or deleted (streamPodLogsOnce tests). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-27 22:13:39 -04:00
Chris Farhood	568f571d8c	fix(models): inline static model list in index.ts to break circular dep with server/models Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> v0.1.57	2026-04-27 09:16:35 -04:00
Chris Farhood	8a9376b40e	chore: bump to 0.1.56 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> v0.1.56	2026-04-27 08:05:29 -04:00
Chris Farhood	0c8aa4d1ea	fix(models): move import to top of index.ts before export declarations Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-27 08:04:46 -04:00
Chris Farhood	1d894f104f	fix(models): expose static models list so UI renders entries before listModels resolves Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> v0.1.55	2026-04-27 07:42:44 -04:00
Chris Farhood	fc3866924a	0.1.54 v0.1.54	2026-04-27 00:38:06 +00:00
Chris Farhood	368254d75d	fix: per-chunk activity tracking + pod-phase gate on grace timer (FAR-107) The 0.1.53 fix tracked stream liveness by updating lastActiveAt only after streamPodLogsOnce returned. That worked for the disconnect-then-reconnect-then-disconnect case, but missed the disconnect-then-long-running-reconnect case: a streaming attempt that runs for minutes without disconnecting never refreshes lastActiveAt, so the grace timer fires 30s after the prior disconnect even though the new attempt is currently producing output. Nancy reproduced exactly this on 0.1.53 — claude_truncated with pod phase=Running. Two changes: 1. streamPodLogsOnce now accepts the activity ref and updates lastActiveAt inside its writable's write handler — every chunk delivered from the container refreshes the timer in real time, not just on stream return. 2. Before the grace timer settles, gate on pod phase: if the pod is still Running or Pending, the container is alive (Claude's long tool-use silences exceed 30s for slow upstream APIs). Refresh lastActiveAt, leave the poller armed, and let waitForJobCompletion remain the authoritative termination signal. Only proceed with the grace settlement when the pod has actually reached a terminal phase or is gone. The original FAR-23 fast-path (container exits, Job condition lags) still works: when the container terminates, pod phase moves to Succeeded/Failed and the gate falls through to the existing Job-presence check. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-27 00:38:06 +00:00
Chris Farhood	34756f8215	0.1.53 v0.1.53	2026-04-27 00:28:45 +00:00
Chris Farhood	07ef106c66	fix: gate grace timer on stream-output silence, not first disconnect (FAR-107) The 30s grace timer that bounds K8s Job condition propagation lag was armed by streamPodLogs's onFirstStreamExit callback the moment streamPodLogsOnce returned for the first time. A transient K8s log-API disconnect mid-run also returns from streamPodLogsOnce — so the grace timer fired 30s later regardless of whether streamPodLogs had already reconnected and the container was still producing output. Nancy / Privileged Escalation reproduced this on long Opus-4-6 runs: the prod paperclip pod was stable, the cancel-poll guard was already narrowed in 0.1.51, but every long run truncated with claude_truncated + "container terminated state not yet observable (pod phase=Running)" because the run was being abandoned mid-output. Replace the boolean onFirstStreamExit signal with a streamActivity ref carrying lastActiveAt + streamHasExited. streamPodLogs refreshes lastActiveAt every time a streamPodLogsOnce attempt returns non-empty output, so reconnects that resume real output keep the grace clock reset. The grace timer fires only once the stream has exited at least once AND no chunk has arrived for the full grace window — which preserves the original FAR-23 behaviour (container truly exited but Job condition lags) while ending the false-truncation of healthy streams. Adds a regression test that asserts a stream drop + reconnect + deferred Job completion does not surface as truncated. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-27 00:28:44 +00:00
Chris Farhood	fd7dce7239	0.1.52 v0.1.52	2026-04-27 00:00:57 +00:00
Chris Farhood	b1878c684e	fix: retry-aware pod state lookup + honest truncation cause messages (FAR-107) The single-shot getPodTerminatedState query lost a real race against kubelet's containerStatus update: when Claude exited cleanly but quickly, listNamespacedPod often returned the pod with phase=Succeeded/Failed but without a populated state.terminated, so describeTruncationCause fell into the catch-all "pod state unavailable — likely deleted before exit could be read" branch. That message is doubly wrong: the pod was not deleted and the exit cause was readable a few hundred ms later. Operators chasing claude_truncated runs (Nancy/Privileged Escalation) had no visibility into the actual exit code, OOMKilled flag, or reason. Two changes: 1. Introduce lookupPodState + getPodLookupWithRetry — the lookup result carries the pod phase and a podMissing flag, and retries up to 4×500ms when the pod is in a terminal phase but containerStatuses lag. When the pod is in a non-terminal phase or genuinely gone we bail immediately without burning the retry budget. 2. describeTruncationCause now distinguishes three states: - "pod is gone" (eviction, preemption, external delete) - "container terminated state not yet observable (pod phase=…)" - the existing populated-state path with exit code / reason / signal The truncation error path re-queries with the retry-aware lookup right before producing the message, so subsequent claude_truncated errors surface the actual exit cause (137=OOMKilled, 143=SIGTERM, kubelet reason text) instead of a misleading deletion claim. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-27 00:00:56 +00:00
Chris Farhood	83e105393c	0.1.51 v0.1.51	2026-04-26 21:24:15 +00:00
Chris Farhood	49288fa5c7	fix: scope cancel-polling to explicit cancellation states only (FAR-107) shouldAbortForCancellation previously treated any non-`running` runStatus as a cancellation signal — which made the keepalive's cancel-poll delete the K8s Job whenever the heartbeat-runs API briefly returned a transient or stale status (e.g. queued, pending, succeeded, failed, completed, unknown) for an in-flight run. The follow-up `waitForJobCompletion` poll then observed the 404 and surfaced a spurious `k8s_job_deleted_externally` error to the user, even though no human or external system deleted the Job. Privileged Escalation's "null-pointer-nancy" agent reproduced this on runs that were never cancelled and were not adjacent to a paperclip restart, ruling out the SIGTERM path that 0.1.50 already addressed. Tighten the guard to fire only on `cancelled` / `cancelling`. Other terminal statuses are unreachable while the adapter is still executing (the adapter's own return is what flips them) and even if observed mid-run, they do not justify deleting a Job that may still be doing real work — the natural completion path will tear it down. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-26 21:24:11 +00:00
Chris Farhood	dae9e18659	0.1.50 v0.1.50	2026-04-26 21:19:03 +00:00
Chris Farhood	6923597b31	fix: do not delete active Jobs on SIGTERM — leave for orphan reattach (FAR-107) Root cause of Nancy's k8s_job_deleted_externally false positive: the paperclip server itself receives SIGTERM during rolling deploys, evictions, scale-down, etc. The previous SIGTERM handler iterated activeJobs and deleted every Job before exiting, which surfaced in the in-flight heartbeat as "K8s Job was deleted externally" — even though nothing external touched it. With reattachOrphanedJobs=true (default), this is exactly the wrong behaviour: leaving the Jobs alive lets the next paperclip process discover them via the orphan-classification path and reattach their log streams. With reattachOrphanedJobs=false the operator opted into manual cleanup, so we still must not auto-delete. The Job's ownerReference (FAR-15) keeps the prompt Secret tied to the Job, so both survive together and TTL handles cleanup on natural completion. Test rewritten to assert the new contract: SIGTERM must not touch K8s Jobs. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-26 21:19:02 +00:00
Chris Farhood	d184a1732b	0.1.49 v0.1.49	2026-04-26 21:06:19 +00:00
Chris Farhood	be84428226	fix: enrich k8s_job_deleted_externally error with forensics + verify Job presence on grace fire (FAR-107) The error previously fired with no diagnostic context, making it impossible to distinguish (a) self-delete by our SIGTERM/cancel path, (b) TTL after a missed Complete condition, or (c) actual external deletion without cluster shell access. Two changes: 1. Grace-period verification: when the log stream exits and the 30s grace timer fires, do a one-shot readNamespacedJob before declaring the Job gone. If it's still there, settle as gracePeriodFired (not jobGone) so we don't mis-classify K8s condition propagation lag as deletion. 2. Forensic capture: track which of the three detection paths (completion-poll-404, grace-period-verify-404, recheck-poll-404) first observed the 404, the last successful Job conditions read, the poll count, elapsed time since pod-running, and stdout size. Append all of it to the errorMessage so the next occurrence is self-diagnosing. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-26 21:05:04 +00:00
Chris Farhood	d9928030d6	0.1.48 v0.1.48	2026-04-26 14:48:22 +00:00
Chris Farhood	76fc6fcdfc	fix: surface pod terminated reason/message in adapter_failed errors (FAR-100) The init-only and partial-run error paths now embed the K8s container terminated state (reason, message, signal, OOM hint) directly in the errorMessage. This eliminates the kubectl round-trip when diagnosing adapter_failed runs — the surfaced error self-explains. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-26 14:48:12 +00:00
Chris Farhood	3169f49f23	0.1.47 v0.1.47	2026-04-26 13:04:54 +00:00
Chris Farhood	e0b35d230f	fix: distinguish init-only non-zero exits in buildPartialRunError (FAR-100) Init-only runs that exit with a non-zero code now surface a more actionable message naming the exit code and the likely cause (unsupported model or rejected session) instead of the generic "did not produce a result" text. Helps operators diagnose model-id / billing-tier failures (e.g. opus 4.6). Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-26 13:04:43 +00:00
Chris Farhood	4e2c36319d	0.1.46 v0.1.46	2026-04-26 01:57:43 +00:00
Chris Farhood	8474f78fe1	fix: include pod terminated reason/message in claude_truncated error (FAR-95) Capture the claude container's terminated state (exit code, reason, message, signal) and surface it in the truncation error so operators see why the run was cut short — e.g. "exit code 137, SIGKILL (commonly OOMKilled), reason=OOMKilled, message=Memory cgroup out of memory" instead of just a "truncated" label with no diagnostic context. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-26 01:57:43 +00:00
Chris Farhood	88896eddcf	0.1.45 v0.1.45	2026-04-26 01:54:48 +00:00
Chris Farhood	a2874c0426	fix: detect mid-stream truncation and emit claude_truncated error code (FAR-95) When Claude produces assistant content (output_tokens > 0) but the stream ends without a result event, classify the run as truncated mid-stream rather than falling through to the generic "did not produce a result — check API credentials" message. The misleading hint pointed operators at auth/model config when the real cause was pod termination, OOMKill, or CLI crash. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-26 01:54:35 +00:00
Chris Farhood	818aa0f1d6	feat: log bundled skill names and add skills to onMeta commandNotes (FAR-36) Adds a diagnostic log line after skill resolution so operators can see exactly which skills were bundled into each run, making it straightforward to diagnose skill availability issues. Also surfaces the skill list in the onMeta commandNotes for run metadata visibility. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 20:41:01 +00:00
Chris Farhood	55fd3021fb	fix: add per-agent mutex to eliminate TOCTOU race in K8s concurrency guard (FAR-29) Two concurrent execute() calls for the same agent can both pass the list-then-create guard before either job appears in the other's query. The new module-level agentCreationMutex serializes the guard+create phase within the process so only one call enters listNamespacedJob at a time. The mutex is acquired after sanitizing the agent ID and released in a finally block that wraps the entire guard+create section, so all early return paths (guard blocks, create failures) cleanly release it. Variables used in both the guard+create and log-streaming phases are hoisted to before the try block. Cross-agent calls use separate mutex slots and are unaffected. Added two vitest cases verifying same-agent serialization and that different-agent calls are not serialized. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-24 20:10:01 +00:00
Chris Farhood	83b58f9207	fix: detect stop_reason:null + output_tokens:0 and emit llm_api_error (FAR-30) parseClaudeStreamJson now tracks assistant events with stop_reason:null and output_tokens:0 (the MiniMax degraded-response pattern). When no result event follows, execute() returns errorCode:"llm_api_error" with a descriptive message instead of the generic adapter_failed. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-24 20:00:42 +00:00
Chris Farhood	602afa9b84	fix: return k8s_job_deleted_externally error code when job deleted mid-run (FAR-31) When a K8s Job is deleted externally (kubectl delete job or TTL before terminal condition observed) and stdout has no result event, the adapter now returns errorCode "k8s_job_deleted_externally" with the message "K8s Job was deleted externally before Claude could complete" instead of the misleading "Claude exited with code -1". Tracks a jobDeletedExternally flag in execute() on the jobGone path and checks it in the !parsed branch before falling through to buildPartialRunError. Only applies when exitCode is null (pod gone alongside the job). Adds regression test: FAR-31 scenario where job 404s mid-run with partial stdout and missing pod produces the new error code. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-24 19:58:46 +00:00
Chris Farhood	986f2fc7fa	test: add coverage for deletionTimestamp concurrency guard bypass (FAR-34) Verifies that a terminating K8s job (deletionTimestamp set, no Complete/Failed condition) is skipped by the concurrency guard so subsequent heartbeat runs are not incorrectly blocked. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-24 19:57:10 +00:00
Chris Farhood	357f035418	fix: skip K8s jobs with deletionTimestamp in concurrency guard (FAR-34) Jobs being deleted via kubectl enter a Terminating state where deletionTimestamp is set but no Complete/Failed condition is added. The concurrency guard previously treated these as running, blocking all subsequent heartbeat runs for the agent until the job fully disappeared from the K8s API. Co-Authored-By: Paperclip <noreply@paperclip.ing> 0.1.43	2026-04-24 18:36:19 +00:00
Chris Farhood	f340ce52ee	0.1.42 v0.1.42	2026-04-24 17:56:14 +00:00
Chris Farhood	ecc477d0be	fix: stream raw stream-json to onLog so Paperclip UI renders structured transcript entries (FAR-32) The prior approach (commit `b607657`) converted Claude's stream-json into flat plain text before calling onLog. This stripped the structure the Paperclip UI needs — its adapter ui-parser (src/ui-parser.ts, exported via the package's ./ui-parser entry) expects raw stream-json lines and emits structured transcript entries (assistant / thinking / tool_call / tool_result / init / result) that the UI renders as rich blocks, just like claude_local. claude_local passes stdout through unchanged to onLog for the same reason — the server persists raw lines and the UI parser turns them into rendered transcript entries. Mirror that here. formatClaudeStreamLine stays as an internal helper for future CLI use, but is no longer applied in the K8s streaming path. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-24 17:56:10 +00:00
Chris Farhood	f9ba77527a	0.1.41 v0.1.41	2026-04-24 17:43:16 +00:00
Chris Farhood	f304c70899	fix: keep formatClaudeStreamLine internal to avoid ESM hot-reload link failure (FAR-32) Exposing formatClaudeStreamLine at the package root caused Paperclip reinstalls to fail with "'./cli/index.js' does not provide an export named 'formatClaudeStreamLine'". The host process caches child ESM module records across reinstalls; linking the new dist/index.js re-export against the cached old dist/cli/index.js fails. The symbol is only used internally by server/execute.ts (which imports from ./cli/format-event.js directly), so drop the public re-export. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-24 17:43:16 +00:00
Chris Farhood	727d9494da	0.1.40 v0.1.40	2026-04-24 17:35:08 +00:00
Chris Farhood	b60765785b	feat: format Claude stream-json events in K8s streaming path for consistency with claude_local (FAR-32) All output sent to Paperclip via onLog now passes through formatClaudeStreamLine, converting raw stream-json blobs into human-readable text consistent with how the CLI and claude_local adapter format events. Changes: - format-event.ts: add formatClaudeStreamLine(raw) -> string | null Plain-text equivalent of printClaudeStreamEvent — no ANSI colours, returns null for lines to suppress (assistant with no content, unknown events). Handles: system/init, assistant (text/thinking/tool_use), user (tool_result), result (summary + tokens), rate_limit_event. Non-JSON lines pass through. - execute.ts: wire formatClaudeStreamLine into streamPodLogsOnce write handler. raw chunks still stored in 'chunks[]' for parseClaudeStreamJson; only the onLog path receives formatted text. - 12 new tests for formatClaudeStreamLine covering all event types. - 352/352 tests pass. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-24 17:26:37 +00:00
Chris Farhood	28d6451265	feat: add rate_limit_event formatting to printClaudeStreamEvent (FAR-32) rate_limit_event was previously falling through to the debug-only branch and silently dropped in non-debug mode. Now it surfaces a concise, human-readable line for CLI consumers: rate_limit: type=five_hour status=allowed resets=2026-04-22T06:00:00.000Z Two tests cover the exact FAR-32 repro payload and graceful handling of missing rate_limit_info fields. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-24 17:22:15 +00:00
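
The cancel-polling fix in commit 49288fa5c7 (0.1.51) narrows the abort condition from "any status other than `running`" to the two explicit cancellation states. A minimal sketch of that guard follows. The log names `shouldAbortForCancellation` but does not show its signature, so the plain-string parameter, the `CANCELLATION_STATUSES` constant, and the `legacyShouldAbort` helper are assumptions made for illustration.

```typescript
// Hypothetical sketch of the guard described in commit 49288fa5c7.
// Assumption: the run status arrives as a plain string.
const CANCELLATION_STATUSES: ReadonlySet<string> = new Set(["cancelled", "cancelling"]);

// Pre-fix behaviour as the commit message describes it:
// every status other than "running" was treated as a cancellation signal.
function legacyShouldAbort(runStatus: string): boolean {
  return runStatus !== "running";
}

// Post-fix behaviour: only explicit cancellation states abort the run,
// so a transient "queued" or "unknown" status no longer deletes the Job.
function shouldAbortForCancellation(runStatus: string): boolean {
  return CANCELLATION_STATUSES.has(runStatus);
}
```

Under this sketch a stale `queued` status would have aborted the run before the fix and is ignored after it, which matches the spurious `k8s_job_deleted_externally` scenario the commit describes.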

149 Commits
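
The grace-timer fixes in commits 07ef106c66 (0.1.53) and 368254d75d (0.1.54) combine two conditions: the timer may fire only after the log stream has exited at least once and no chunk has arrived for the full grace window, and even then settlement is deferred while the pod is still Running or Pending. The sketch below shows one way that decision could be expressed as a pure function. The field names `lastActiveAt` and `streamHasExited` and the 30s window come from the commit messages; the `StreamActivity` interface, the `shouldSettleGrace` function, and its parameters are hypothetical and not taken from the adapter's source.

```typescript
// Hypothetical shape for the "streamActivity ref" named in the commit messages.
interface StreamActivity {
  lastActiveAt: number; // epoch ms of the most recent chunk
  streamHasExited: boolean; // true once any stream attempt has returned
}

// Assumption: only these four phases are modelled; null means the pod is gone.
type PodPhase = "Pending" | "Running" | "Succeeded" | "Failed";

const GRACE_MS = 30000; // 30s grace window, per the commit messages

// Returns true when the grace settlement may proceed.
function shouldSettleGrace(
  activity: StreamActivity,
  podPhase: PodPhase | null,
  now: number,
  graceMs: number = GRACE_MS,
): boolean {
  // The stream must have exited at least once.
  if (!activity.streamHasExited) return false;
  // A chunk arrived within the grace window: the stream is still live.
  if (now - activity.lastActiveAt < graceMs) return false;
  // Pod-phase gate: a Running or Pending pod means the container is alive,
  // so refresh the clock and leave the poller armed.
  if (podPhase === "Running" || podPhase === "Pending") {
    activity.lastActiveAt = now;
    return false;
  }
  // Terminal phase or missing pod: proceed with the grace settlement.
  return true;
}
```

Keeping the decision in a side-effect-light function like this makes the regression case from 0.1.53 (stream drop, reconnect, deferred Job completion) straightforward to unit test with fixed timestamps.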