149 Commits

Author SHA1 Message Date
Chris Farhood 5179544fd6 docs: mark repo as abandoned in favor of paperclip-plugin-k8s
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-01 08:46:55 -04:00
Chris Farhood 160d6b49e9 0.2.5 v0.2.5 2026-04-30 09:06:19 -04:00
Chris Farhood 9007762390 chore(deps): bump @paperclipai/adapter-utils from canary.7 pin to ^2026.428.0
The peerDep floor and devDep were pinned to a pre-release canary from
April 15, 13 days behind the current stable. Move both to the latest
stable 2026.428.0. All 328 tests pass against the new types; the
imported surface (asString, parseObject, runChildProcess,
AdapterExecutionContext, etc.) is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 09:06:12 -04:00
Chris Farhood 506007984c 0.2.4 v0.2.4 2026-04-30 08:46:57 -04:00
Chris Farhood 7a6d1a44f2 fix(ui-parser): restore esbuild CJS bundle step lost in PR #11 merge
Commit 0e43811 added an esbuild step to bundle src/ui-parser.ts as CJS
because the UI's sandboxed worker can't evaluate ESM `export` syntax.
PR #11 (filesystem-log-tail) was based on a commit predating that fix,
so the merge clobbered both the build:ui-parser script and the esbuild
devDependency. Every release since has shipped a tsc-emitted ESM
ui-parser.js that the worker silently fails to load — parseStdoutLine
never registers and the run transcript falls back to dumping raw
stream-json lines as plain text instead of rendering structured
assistant/thinking/tool_call/tool_result entries.

Restore the script and dep verbatim from 0e43811.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:46:46 -04:00
Chris Farhood 3960d746f4 ci: serialize publish jobs sharing the same SHA to fix race
When a tagged release lands on master, both the master-push and tag-push
events trigger the publish job. The skip-on-exists check (`npm view`)
runs concurrently on both, both see the version as not-yet-published,
and both proceed to `npm publish`. The first wins; the second gets
E403 ("cannot publish over previously published versions") and reds
out the run.

Fixes the race by adding a publish-${{ github.sha }} concurrency group
so the second run queues until the first finishes — by then npm view
sees the published version and the skip path takes over cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:28:01 -04:00
Chris Farhood cc942ca818 0.2.3 v0.2.3 2026-04-30 08:03:08 -04:00
Chris Farhood 83a2d25062 fix(execute): assign captured stdout to outer binding so parse sees it
The filesystem-tail rewrite (8bd5042) declared `const stdout` inside the
try block, shadowing the outer `let stdout = ""`. parseClaudeStreamJson
then ran on the empty outer binding, so every run failed with "Failed to
parse Claude JSON output" and resultJson={stdout:""} despite live
log-streaming working fine. Drop the `const` so the assignment lands on
the outer let.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:03:05 -04:00
Chris Farhood c8429cfde1 fix: write logs to /paperclip/instances/default/data/run-logs/ to match server PVC layout
v0.2.1 introduced filesystem-tail log delivery with buildPodLogPath()
returning /paperclip/instances/default/run-logs/... but the paperclip
server creates and tails from /paperclip/instances/default/data/run-logs/
on the shared PVC. The missing /data/ segment meant:

  1. The init container's mkdir -p /paperclip/instances/... ran in a
     directory busybox UID 1000 can't write to — it's the init
     container's ephemeral rootfs, since the PVC is only mounted in
     the main container. Init exited 1, the && short-circuited, and
     the prompt copy never happened. Job failed with "Init container
     'write-prompt' failed with exit code 1".
  2. Even if the mkdir had worked, the main container's tee would
     have written to a path the server doesn't tail.

Fix: drop the misplaced mkdir from both init container variants and
correct buildPodLogPath() to include /data/. The directory already
exists on the PVC because the paperclip server creates it; both
containers run as UID 1000 with fsGroup 1000, so the main container's
tee writes to the pre-existing path with no setup needed.

Bump to 0.2.2.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-29 22:15:15 -04:00
Chris Farhood 1502039d70 Merge pull request #11 from farhoodlabs/feat/filesystem-log-tail
feat: replace k8s log API streaming with filesystem tailing
2026-04-27 22:26:02 -04:00
Chris Farhood c326d2571e fix(ci): run on tags, publish on both master push and tags v0.2.0 2026-04-27 22:25:48 -04:00
Chris Farhood e6df8fad98 chore: bump to 0.2.0 2026-04-27 22:20:00 -04:00
Chris Farhood 8bd5042b5d feat: replace k8s log API streaming with filesystem tailing
Replaces K8s log API streaming (which was dropping every ~3 seconds at
production scale) with filesystem tailing via tee to a pod log file on
the shared PVC.

Core changes:
- Add tee to claudeInvocation to write pod log file
- Add mkdir -p to init container to create log directory
- Add assertSafePathComponent and buildPodLogPath helper
- Add tailPodLogFile function with adaptive 250ms/1s polling
- Replace k8s log streaming with tailPodLogFile in Promise.allSettled
- Delete log-dedup.ts (RTK output truncation no longer needed)
- Update config-schema.ts and index.ts to remove RTK references
- Clean up log file in cleanupJob when retainJobs=false

Note: 14 tests in execute.test.ts test the obsolete k8s log streaming
approach and need to be rewritten or deleted (streamPodLogsOnce tests).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-27 22:13:39 -04:00
Chris Farhood 568f571d8c fix(models): inline static model list in index.ts to break circular dep with server/models
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v0.1.57
2026-04-27 09:16:35 -04:00
Chris Farhood 8a9376b40e chore: bump to 0.1.56
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v0.1.56
2026-04-27 08:05:29 -04:00
Chris Farhood 0c8aa4d1ea fix(models): move import to top of index.ts before export declarations
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-27 08:04:46 -04:00
Chris Farhood 1d894f104f fix(models): expose static models list so UI renders entries before listModels resolves
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v0.1.55
2026-04-27 07:42:44 -04:00
Chris Farhood fc3866924a 0.1.54 v0.1.54 2026-04-27 00:38:06 +00:00
Chris Farhood 368254d75d fix: per-chunk activity tracking + pod-phase gate on grace timer (FAR-107)
The 0.1.53 fix tracked stream liveness by updating lastActiveAt only
after streamPodLogsOnce returned.  That worked for the
disconnect-then-reconnect-then-disconnect case, but missed the
disconnect-then-long-running-reconnect case: a streaming attempt that
runs for minutes without disconnecting never refreshes lastActiveAt,
so the grace timer fires 30s after the prior disconnect even though
the new attempt is currently producing output.  Nancy reproduced
exactly this on 0.1.53 — claude_truncated with pod phase=Running.

Two changes:

1. streamPodLogsOnce now accepts the activity ref and updates
   lastActiveAt inside its writable's write handler — every chunk
   delivered from the container refreshes the timer in real time,
   not just on stream return.

2. Before the grace timer settles, gate on pod phase: if the pod is
   still Running or Pending, the container is alive (Claude's
   long tool-use silences exceed 30s for slow upstream APIs).
   Refresh lastActiveAt, leave the poller armed, and let
   waitForJobCompletion remain the authoritative termination
   signal.  Only proceed with the grace settlement when the pod
   has actually reached a terminal phase or is gone.

The original FAR-23 fast-path (container exits, Job condition lags)
still works: when the container terminates, pod phase moves to
Succeeded/Failed and the gate falls through to the existing
Job-presence check.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-27 00:38:06 +00:00
Chris Farhood 34756f8215 0.1.53 v0.1.53 2026-04-27 00:28:45 +00:00
Chris Farhood 07ef106c66 fix: gate grace timer on stream-output silence, not first disconnect (FAR-107)
The 30s grace timer that bounds K8s Job condition propagation lag was
armed by streamPodLogs's onFirstStreamExit callback the moment
streamPodLogsOnce returned for the first time.  A transient K8s log-API
disconnect mid-run also returns from streamPodLogsOnce — so the grace
timer fired 30s later regardless of whether streamPodLogs had already
reconnected and the container was still producing output.

Nancy / Privileged Escalation reproduced this on long Opus-4-6 runs:
the prod paperclip pod was stable, the cancel-poll guard was already
narrowed in 0.1.51, but every long run truncated with claude_truncated
+ "container terminated state not yet observable (pod phase=Running)"
because the run was being abandoned mid-output.

Replace the boolean onFirstStreamExit signal with a streamActivity ref
carrying lastActiveAt + streamHasExited.  streamPodLogs refreshes
lastActiveAt every time a streamPodLogsOnce attempt returns non-empty
output, so reconnects that resume real output keep the grace clock
reset.  The grace timer fires only once the stream has exited at least
once AND no chunk has arrived for the full grace window — which
preserves the original FAR-23 behaviour (container truly exited but
Job condition lags) while ending the false-truncation of healthy
streams.  Adds a regression test that asserts a stream drop + reconnect
+ deferred Job completion does not surface as truncated.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-27 00:28:44 +00:00
Chris Farhood fd7dce7239 0.1.52 v0.1.52 2026-04-27 00:00:57 +00:00
Chris Farhood b1878c684e fix: retry-aware pod state lookup + honest truncation cause messages (FAR-107)
The single-shot getPodTerminatedState query lost a real race against
kubelet's containerStatus update: when Claude exited cleanly but quickly,
listNamespacedPod often returned the pod with phase=Succeeded/Failed but
without a populated state.terminated, so describeTruncationCause fell into
the catch-all "pod state unavailable — likely deleted before exit could
be read" branch.  That message is doubly wrong: the pod was not deleted
and the exit cause was readable a few hundred ms later.  Operators
chasing claude_truncated runs (Nancy/Privileged Escalation) had no
visibility into the actual exit code, OOMKilled flag, or reason.

Two changes:

1. Introduce lookupPodState + getPodLookupWithRetry — the lookup result
   carries the pod phase and a podMissing flag, and retries up to 4×500ms
   when the pod is in a terminal phase but containerStatuses lag.  When
   the pod is in a non-terminal phase or genuinely gone we bail
   immediately without burning the retry budget.

2. describeTruncationCause now distinguishes three states:
   - "pod is gone" (eviction, preemption, external delete)
   - "container terminated state not yet observable (pod phase=…)"
   - the existing populated-state path with exit code / reason / signal

The truncation error path re-queries with the retry-aware lookup right
before producing the message, so subsequent claude_truncated errors
surface the actual exit cause (137=OOMKilled, 143=SIGTERM, kubelet
reason text) instead of a misleading deletion claim.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-27 00:00:56 +00:00
Chris Farhood 83e105393c 0.1.51 v0.1.51 2026-04-26 21:24:15 +00:00
Chris Farhood 49288fa5c7 fix: scope cancel-polling to explicit cancellation states only (FAR-107)
shouldAbortForCancellation previously treated any non-`running` runStatus
as a cancellation signal — which made the keepalive's cancel-poll delete
the K8s Job whenever the heartbeat-runs API briefly returned a transient
or stale status (e.g. queued, pending, succeeded, failed, completed,
unknown) for an in-flight run.  The follow-up `waitForJobCompletion`
poll then observed the 404 and surfaced a spurious
`k8s_job_deleted_externally` error to the user, even though no human
or external system deleted the Job.

Privileged Escalation's "null-pointer-nancy" agent reproduced this on
runs that were never cancelled and were not adjacent to a paperclip
restart, ruling out the SIGTERM path that 0.1.50 already addressed.

Tighten the guard to fire only on `cancelled` / `cancelling`.  Other
terminal statuses are unreachable while the adapter is still executing
(the adapter's own return is what flips them) and even if observed
mid-run, they do not justify deleting a Job that may still be doing
real work — the natural completion path will tear it down.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-26 21:24:11 +00:00
Chris Farhood dae9e18659 0.1.50 v0.1.50 2026-04-26 21:19:03 +00:00
Chris Farhood 6923597b31 fix: do not delete active Jobs on SIGTERM — leave for orphan reattach (FAR-107)
Root cause of Nancy's k8s_job_deleted_externally false positive: the
paperclip server itself receives SIGTERM during rolling deploys,
evictions, scale-down, etc.  The previous SIGTERM handler iterated
activeJobs and deleted every Job before exiting, which surfaced in the
in-flight heartbeat as "K8s Job was deleted externally" — even though
nothing external touched it.

With reattachOrphanedJobs=true (default), this is exactly the wrong
behaviour: leaving the Jobs alive lets the next paperclip process
discover them via the orphan-classification path and reattach their
log streams.  With reattachOrphanedJobs=false the operator opted into
manual cleanup, so we still must not auto-delete.

The Job's ownerReference (FAR-15) keeps the prompt Secret tied to the
Job, so both survive together and TTL handles cleanup on natural
completion.  Test rewritten to assert the new contract: SIGTERM must
not touch K8s Jobs.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-26 21:19:02 +00:00
Chris Farhood d184a1732b 0.1.49 v0.1.49 2026-04-26 21:06:19 +00:00
Chris Farhood be84428226 fix: enrich k8s_job_deleted_externally error with forensics + verify Job presence on grace fire (FAR-107)
The error previously fired with no diagnostic context, making it impossible
to distinguish (a) self-delete by our SIGTERM/cancel path, (b) TTL after a
missed Complete condition, or (c) actual external deletion without cluster
shell access.  Two changes:

1. Grace-period verification: when the log stream exits and the 30s grace
   timer fires, do a one-shot readNamespacedJob before declaring the Job
   gone.  If it's still there, settle as gracePeriodFired (not jobGone) so
   we don't mis-classify K8s condition propagation lag as deletion.

2. Forensic capture: track which of the three detection paths
   (completion-poll-404, grace-period-verify-404, recheck-poll-404)
   first observed the 404, the last successful Job conditions read, the
   poll count, elapsed time since pod-running, and stdout size.  Append
   all of it to the errorMessage so the next occurrence is self-diagnosing.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-26 21:05:04 +00:00
Chris Farhood d9928030d6 0.1.48 v0.1.48 2026-04-26 14:48:22 +00:00
Chris Farhood 76fc6fcdfc fix: surface pod terminated reason/message in adapter_failed errors (FAR-100)
The init-only and partial-run error paths now embed the K8s container
terminated state (reason, message, signal, OOM hint) directly in the
errorMessage. This eliminates the kubectl round-trip when diagnosing
adapter_failed runs — the surfaced error self-explains.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-26 14:48:12 +00:00
Chris Farhood 3169f49f23 0.1.47 v0.1.47 2026-04-26 13:04:54 +00:00
Chris Farhood e0b35d230f fix: distinguish init-only non-zero exits in buildPartialRunError (FAR-100)
Init-only runs that exit with a non-zero code now surface a more actionable
message naming the exit code and the likely cause (unsupported model or
rejected session) instead of the generic "did not produce a result" text.
Helps operators diagnose model-id / billing-tier failures (e.g. opus 4.6).

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-26 13:04:43 +00:00
Chris Farhood 4e2c36319d 0.1.46 v0.1.46 2026-04-26 01:57:43 +00:00
Chris Farhood 8474f78fe1 fix: include pod terminated reason/message in claude_truncated error (FAR-95)
Capture the claude container's terminated state (exit code, reason, message,
signal) and surface it in the truncation error so operators see *why* the run
was cut short — e.g. "exit code 137, SIGKILL (commonly OOMKilled),
reason=OOMKilled, message=Memory cgroup out of memory" instead of just a
"truncated" label with no diagnostic context.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-26 01:57:43 +00:00
Chris Farhood 88896eddcf 0.1.45 v0.1.45 2026-04-26 01:54:48 +00:00
Chris Farhood a2874c0426 fix: detect mid-stream truncation and emit claude_truncated error code (FAR-95)
When Claude produces assistant content (output_tokens > 0) but the stream ends
without a result event, classify the run as truncated mid-stream rather than
falling through to the generic "did not produce a result — check API
credentials" message. The misleading hint pointed operators at auth/model
config when the real cause was pod termination, OOMKill, or CLI crash.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-26 01:54:35 +00:00
Chris Farhood 818aa0f1d6 feat: log bundled skill names and add skills to onMeta commandNotes (FAR-36)
Adds a diagnostic log line after skill resolution so operators can see exactly
which skills were bundled into each run, making it straightforward to diagnose
skill availability issues. Also surfaces the skill list in the onMeta
commandNotes for run metadata visibility.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 20:41:01 +00:00
Chris Farhood 55fd3021fb fix: add per-agent mutex to eliminate TOCTOU race in K8s concurrency guard (FAR-29)
Two concurrent execute() calls for the same agent can both pass the
list-then-create guard before either job appears in the other's query.
The new module-level agentCreationMutex serializes the guard+create phase
within the process so only one call enters listNamespacedJob at a time.

The mutex is acquired after sanitizing the agent ID and released in a
finally block that wraps the entire guard+create section, so all early
return paths (guard blocks, create failures) cleanly release it. Variables
used in both the guard+create and log-streaming phases are hoisted to
before the try block. Cross-agent calls use separate mutex slots and are
unaffected.

Added two vitest cases verifying same-agent serialization and that
different-agent calls are not serialized.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-24 20:10:01 +00:00
Chris Farhood 83b58f9207 fix: detect stop_reason:null + output_tokens:0 and emit llm_api_error (FAR-30)
parseClaudeStreamJson now tracks assistant events with stop_reason:null and
output_tokens:0 (the MiniMax degraded-response pattern). When no result event
follows, execute() returns errorCode:"llm_api_error" with a descriptive message
instead of the generic adapter_failed.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-24 20:00:42 +00:00
Chris Farhood 602afa9b84 fix: return k8s_job_deleted_externally error code when job deleted mid-run (FAR-31)
When a K8s Job is deleted externally (kubectl delete job or TTL before
terminal condition observed) and stdout has no result event, the adapter
now returns errorCode "k8s_job_deleted_externally" with the message
"K8s Job was deleted externally before Claude could complete" instead of
the misleading "Claude exited with code -1".

Tracks a jobDeletedExternally flag in execute() on the jobGone path and
checks it in the !parsed branch before falling through to buildPartialRunError.
Only applies when exitCode is null (pod gone alongside the job).

Adds regression test: FAR-31 scenario where job 404s mid-run with partial
stdout and missing pod produces the new error code.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-24 19:58:46 +00:00
Chris Farhood 986f2fc7fa test: add coverage for deletionTimestamp concurrency guard bypass (FAR-34)
Verifies that a terminating K8s job (deletionTimestamp set, no
Complete/Failed condition) is skipped by the concurrency guard so
subsequent heartbeat runs are not incorrectly blocked.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-24 19:57:10 +00:00
Chris Farhood 357f035418 fix: skip K8s jobs with deletionTimestamp in concurrency guard (FAR-34)
Jobs being deleted via kubectl enter a Terminating state where
deletionTimestamp is set but no Complete/Failed condition is added.
The concurrency guard previously treated these as running, blocking
all subsequent heartbeat runs for the agent until the job fully
disappeared from the K8s API.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
0.1.43
2026-04-24 18:36:19 +00:00
Chris Farhood f340ce52ee 0.1.42 v0.1.42 2026-04-24 17:56:14 +00:00
Chris Farhood ecc477d0be fix: stream raw stream-json to onLog so Paperclip UI renders structured transcript entries (FAR-32)
The prior approach (commit b607657) converted Claude's stream-json into
flat plain text before calling onLog.  This stripped the structure the
Paperclip UI needs — its adapter ui-parser (src/ui-parser.ts, exported
via the package's ./ui-parser entry) expects raw stream-json lines and
emits structured transcript entries (assistant / thinking / tool_call /
tool_result / init / result) that the UI renders as rich blocks, just
like claude_local.

claude_local passes stdout through unchanged to onLog for the same
reason — the server persists raw lines and the UI parser turns them
into rendered transcript entries.  Mirror that here.

formatClaudeStreamLine stays as an internal helper for future CLI use,
but is no longer applied in the K8s streaming path.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-24 17:56:10 +00:00
Chris Farhood f9ba77527a 0.1.41 v0.1.41 2026-04-24 17:43:16 +00:00
Chris Farhood f304c70899 fix: keep formatClaudeStreamLine internal to avoid ESM hot-reload link failure (FAR-32)
Exposing formatClaudeStreamLine at the package root caused Paperclip reinstalls
to fail with "'./cli/index.js' does not provide an export named
'formatClaudeStreamLine'".  The host process caches child ESM module records
across reinstalls; linking the new dist/index.js re-export against the cached
old dist/cli/index.js fails.

The symbol is only used internally by server/execute.ts (which imports from
./cli/format-event.js directly), so drop the public re-export.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-24 17:43:16 +00:00
Chris Farhood 727d9494da 0.1.40 v0.1.40 2026-04-24 17:35:08 +00:00
Chris Farhood b60765785b feat: format Claude stream-json events in K8s streaming path for consistency with claude_local (FAR-32)
All output sent to Paperclip via onLog now passes through formatClaudeStreamLine,
converting raw stream-json blobs into human-readable text consistent with how
the CLI and claude_local adapter format events.

Changes:
- format-event.ts: add formatClaudeStreamLine(raw) -> string | null
  Plain-text equivalent of printClaudeStreamEvent — no ANSI colours, returns
  null for lines to suppress (assistant with no content, unknown events).
  Handles: system/init, assistant (text/thinking/tool_use), user (tool_result),
  result (summary + tokens), rate_limit_event. Non-JSON lines pass through.
- execute.ts: wire formatClaudeStreamLine into streamPodLogsOnce write handler.
  raw chunks still stored in 'chunks[]' for parseClaudeStreamJson; only the
  onLog path receives formatted text.
- 12 new tests for formatClaudeStreamLine covering all event types.
- 352/352 tests pass.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-24 17:26:37 +00:00
Chris Farhood 28d6451265 feat: add rate_limit_event formatting to printClaudeStreamEvent (FAR-32)
rate_limit_event was previously falling through to the debug-only branch
and silently dropped in non-debug mode.  Now it surfaces a concise,
human-readable line for CLI consumers:

  rate_limit: type=five_hour status=allowed resets=2026-04-22T06:00:00.000Z

Two tests cover the exact FAR-32 repro payload and graceful handling of
missing rate_limit_info fields.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-24 17:22:15 +00:00