Commit Graph

48 Commits

Author SHA1 Message Date
Chris Farhood 07ef106c66 fix: gate grace timer on stream-output silence, not first disconnect (FAR-107)
The 30s grace timer that bounds K8s Job condition propagation lag was
armed by streamPodLogs's onFirstStreamExit callback the moment
streamPodLogsOnce returned for the first time.  A transient K8s log-API
disconnect mid-run also returns from streamPodLogsOnce — so the grace
timer fired 30s later regardless of whether streamPodLogs had already
reconnected and the container was still producing output.

Nancy / Privileged Escalation reproduced this on long Opus-4-6 runs:
the prod paperclip pod was stable, the cancel-poll guard was already
narrowed in 0.1.51, but every long run truncated with claude_truncated
+ "container terminated state not yet observable (pod phase=Running)"
because the run was being abandoned mid-output.

Replace the boolean onFirstStreamExit signal with a streamActivity ref
carrying lastActiveAt + streamHasExited.  streamPodLogs refreshes
lastActiveAt every time a streamPodLogsOnce attempt returns non-empty
output, so reconnects that resume real output keep the grace clock
reset.  The grace timer fires only once the stream has exited at least
once AND no chunk has arrived for the full grace window — which
preserves the original FAR-23 behaviour (container truly exited but
Job condition lags) while ending the false-truncation of healthy
streams.  Adds a regression test that asserts a stream drop + reconnect
+ deferred Job completion does not surface as truncated.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-27 00:28:44 +00:00
Chris Farhood b1878c684e fix: retry-aware pod state lookup + honest truncation cause messages (FAR-107)
The single-shot getPodTerminatedState query lost a real race against
kubelet's containerStatus update: when Claude exited cleanly but quickly,
listNamespacedPod often returned the pod with phase=Succeeded/Failed but
without a populated state.terminated, so describeTruncationCause fell into
the catch-all "pod state unavailable — likely deleted before exit could
be read" branch.  That message is doubly wrong: the pod was not deleted
and the exit cause was readable a few hundred ms later.  Operators
chasing claude_truncated runs (Nancy/Privileged Escalation) had no
visibility into the actual exit code, OOMKilled flag, or reason.

Two changes:

1. Introduce lookupPodState + getPodLookupWithRetry — the lookup result
   carries the pod phase and a podMissing flag, and retries up to 4×500ms
   when the pod is in a terminal phase but containerStatuses lag.  When
   the pod is in a non-terminal phase or genuinely gone we bail
   immediately without burning the retry budget.

2. describeTruncationCause now distinguishes three states:
   - "pod is gone" (eviction, preemption, external delete)
   - "container terminated state not yet observable (pod phase=…)"
   - the existing populated-state path with exit code / reason / signal

The truncation error path re-queries with the retry-aware lookup right
before producing the message, so subsequent claude_truncated errors
surface the actual exit cause (137=OOMKilled, 143=SIGTERM, kubelet
reason text) instead of a misleading deletion claim.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-27 00:00:56 +00:00
Chris Farhood 49288fa5c7 fix: scope cancel-polling to explicit cancellation states only (FAR-107)
shouldAbortForCancellation previously treated any non-`running` runStatus
as a cancellation signal — which made the keepalive's cancel-poll delete
the K8s Job whenever the heartbeat-runs API briefly returned a transient
or stale status (e.g. queued, pending, succeeded, failed, completed,
unknown) for an in-flight run.  The follow-up `waitForJobCompletion`
poll then observed the 404 and surfaced a spurious
`k8s_job_deleted_externally` error to the user, even though no human
or external system deleted the Job.

Privileged Escalation's "null-pointer-nancy" agent reproduced this on
runs that were never cancelled and were not adjacent to a paperclip
restart, ruling out the SIGTERM path that 0.1.50 already addressed.

Tighten the guard to fire only on `cancelled` / `cancelling`.  Other
terminal statuses are unreachable while the adapter is still executing
(the adapter's own return is what flips them) and even if observed
mid-run, they do not justify deleting a Job that may still be doing
real work — the natural completion path will tear it down.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-26 21:24:11 +00:00
Chris Farhood 6923597b31 fix: do not delete active Jobs on SIGTERM — leave for orphan reattach (FAR-107)
Root cause of Nancy's k8s_job_deleted_externally false positive: the
paperclip server itself receives SIGTERM during rolling deploys,
evictions, scale-down, etc.  The previous SIGTERM handler iterated
activeJobs and deleted every Job before exiting, which surfaced in the
in-flight heartbeat as "K8s Job was deleted externally" — even though
nothing external touched it.

With reattachOrphanedJobs=true (default), this is exactly the wrong
behaviour: leaving the Jobs alive lets the next paperclip process
discover them via the orphan-classification path and reattach their
log streams.  With reattachOrphanedJobs=false the operator opted into
manual cleanup, so we still must not auto-delete.

The Job's ownerReference (FAR-15) keeps the prompt Secret tied to the
Job, so both survive together and TTL handles cleanup on natural
completion.  Test rewritten to assert the new contract: SIGTERM must
not touch K8s Jobs.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-26 21:19:02 +00:00
Chris Farhood be84428226 fix: enrich k8s_job_deleted_externally error with forensics + verify Job presence on grace fire (FAR-107)
The error previously fired with no diagnostic context, making it impossible
to distinguish (a) self-delete by our SIGTERM/cancel path, (b) TTL after a
missed Complete condition, or (c) actual external deletion without cluster
shell access.  Two changes:

1. Grace-period verification: when the log stream exits and the 30s grace
   timer fires, do a one-shot readNamespacedJob before declaring the Job
   gone.  If it's still there, settle as gracePeriodFired (not jobGone) so
   we don't mis-classify K8s condition propagation lag as deletion.

2. Forensic capture: track which of the three detection paths
   (completion-poll-404, grace-period-verify-404, recheck-poll-404)
   first observed the 404, the last successful Job conditions read, the
   poll count, elapsed time since pod-running, and stdout size.  Append
   all of it to the errorMessage so the next occurrence is self-diagnosing.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-26 21:05:04 +00:00
Chris Farhood 76fc6fcdfc fix: surface pod terminated reason/message in adapter_failed errors (FAR-100)
The init-only and partial-run error paths now embed the K8s container
terminated state (reason, message, signal, OOM hint) directly in the
errorMessage. This eliminates the kubectl round-trip when diagnosing
adapter_failed runs — the surfaced error self-explains.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-26 14:48:12 +00:00
Chris Farhood e0b35d230f fix: distinguish init-only non-zero exits in buildPartialRunError (FAR-100)
Init-only runs that exit with a non-zero code now surface a more actionable
message naming the exit code and the likely cause (unsupported model or
rejected session) instead of the generic "did not produce a result" text.
Helps operators diagnose model-id / billing-tier failures (e.g. opus 4.6).

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-26 13:04:43 +00:00
Chris Farhood 8474f78fe1 fix: include pod terminated reason/message in claude_truncated error (FAR-95)
Capture the claude container's terminated state (exit code, reason, message,
signal) and surface it in the truncation error so operators see *why* the run
was cut short — e.g. "exit code 137, SIGKILL (commonly OOMKilled),
reason=OOMKilled, message=Memory cgroup out of memory" instead of just a
"truncated" label with no diagnostic context.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-26 01:57:43 +00:00
Chris Farhood a2874c0426 fix: detect mid-stream truncation and emit claude_truncated error code (FAR-95)
When Claude produces assistant content (output_tokens > 0) but the stream ends
without a result event, classify the run as truncated mid-stream rather than
falling through to the generic "did not produce a result — check API
credentials" message. The misleading hint pointed operators at auth/model
config when the real cause was pod termination, OOMKill, or CLI crash.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-26 01:54:35 +00:00
Chris Farhood 818aa0f1d6 feat: log bundled skill names and add skills to onMeta commandNotes (FAR-36)
Adds a diagnostic log line after skill resolution so operators can see exactly
which skills were bundled into each run, making it straightforward to diagnose
skill availability issues. Also surfaces the skill list in the onMeta
commandNotes for run metadata visibility.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 20:41:01 +00:00
Chris Farhood 55fd3021fb fix: add per-agent mutex to eliminate TOCTOU race in K8s concurrency guard (FAR-29)
Two concurrent execute() calls for the same agent can both pass the
list-then-create guard before either job appears in the other's query.
The new module-level agentCreationMutex serializes the guard+create phase
within the process so only one call enters listNamespacedJob at a time.

The mutex is acquired after sanitizing the agent ID and released in a
finally block that wraps the entire guard+create section, so all early
return paths (guard blocks, create failures) cleanly release it. Variables
used in both the guard+create and log-streaming phases are hoisted to
before the try block. Cross-agent calls use separate mutex slots and are
unaffected.

Added two vitest cases verifying same-agent serialization and that
different-agent calls are not serialized.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-24 20:10:01 +00:00
Chris Farhood 602afa9b84 fix: return k8s_job_deleted_externally error code when job deleted mid-run (FAR-31)
When a K8s Job is deleted externally (kubectl delete job or TTL before
terminal condition observed) and stdout has no result event, the adapter
now returns errorCode "k8s_job_deleted_externally" with the message
"K8s Job was deleted externally before Claude could complete" instead of
the misleading "Claude exited with code -1".

Tracks a jobDeletedExternally flag in execute() on the jobGone path and
checks it in the !parsed branch before falling through to buildPartialRunError.
Only applies when exitCode is null (pod gone alongside the job).

Adds regression test: FAR-31 scenario where job 404s mid-run with partial
stdout and missing pod produces the new error code.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-24 19:58:46 +00:00
Chris Farhood 357f035418 fix: skip K8s jobs with deletionTimestamp in concurrency guard (FAR-34)
Jobs being deleted via kubectl enter a Terminating state where
deletionTimestamp is set but no Complete/Failed condition is added.
The concurrency guard previously treated these as running, blocking
all subsequent heartbeat runs for the agent until the job fully
disappeared from the K8s API.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-24 18:36:19 +00:00
Chris Farhood ecc477d0be fix: stream raw stream-json to onLog so Paperclip UI renders structured transcript entries (FAR-32)
The prior approach (commit b607657) converted Claude's stream-json into
flat plain text before calling onLog.  This stripped the structure the
Paperclip UI needs — its adapter ui-parser (src/ui-parser.ts, exported
via the package's ./ui-parser entry) expects raw stream-json lines and
emits structured transcript entries (assistant / thinking / tool_call /
tool_result / init / result) that the UI renders as rich blocks, just
like claude_local.

claude_local passes stdout through unchanged to onLog for the same
reason — the server persists raw lines and the UI parser turns them
into rendered transcript entries.  Mirror that here.

formatClaudeStreamLine stays as an internal helper for future CLI use,
but is no longer applied in the K8s streaming path.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-24 17:56:10 +00:00
Chris Farhood b60765785b feat: format Claude stream-json events in K8s streaming path for consistency with claude_local (FAR-32)
All output sent to Paperclip via onLog now passes through formatClaudeStreamLine,
converting raw stream-json blobs into human-readable text consistent with how
the CLI and claude_local adapter format events.

Changes:
- format-event.ts: add formatClaudeStreamLine(raw) -> string | null
  Plain-text equivalent of printClaudeStreamEvent — no ANSI colours, returns
  null for lines to suppress (assistant with no content, unknown events).
  Handles: system/init, assistant (text/thinking/tool_use), user (tool_result),
  result (summary + tokens), rate_limit_event. Non-JSON lines pass through.
- execute.ts: wire formatClaudeStreamLine into streamPodLogsOnce write handler.
  raw chunks still stored in 'chunks[]' for parseClaudeStreamJson; only the
  onLog path receives formatted text.
- 12 new tests for formatClaudeStreamLine covering all event types.
- 352/352 tests pass.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-24 17:26:37 +00:00
Chris Farhood cabdc3df98 fix: skip all structured streaming events in buildPartialRunError (FAR-32 followup)
Extends the previous fix (which only covered assistant/user) to skip every
JSON object with a non-empty "type" field — system, assistant, user,
rate_limit_event, result, and any future event types.  This prevents all
structured protocol artefacts from being surfaced verbatim as error messages.

Root cause of the new repro: when Claude emits a rate_limit_event before
producing output and then exits without a result event, the rate_limit_event
JSON blob was becoming the "first content line" and appearing in the error:

  Claude exited with code -1: {"type":"rate_limit_event","rate_limit_info":{...}}

With this fix, all typed events are filtered and the initOnlyOutput branch
fires, producing the clean diagnostic:

  Claude started but did not produce a result (model: claude-opus-4-7)
  — check API credentials, model support, and adapter config

Updated the "result event as content" test to match the new (correct) behaviour:
in production buildPartialRunError is only called when parseClaudeStreamJson
returns null (no result event), so the prior test was exercising a degenerate
state that cannot occur through execute().

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-24 17:17:48 +00:00
Chris Farhood f9ff04a354 fix: skip assistant/user events in buildPartialRunError to avoid raw JSON blobs in error messages (FAR-32)
When a model produces assistant events with output_tokens=0 but no result
event (e.g. MiniMax-M2.7 thinking-only output), the partial-run error
previously surfaced the raw assistant JSON blob verbatim, producing an
unreadable message like "Claude exited with code -1: {\"type\":\"assistant\",...}".

Fix: extend the content-line filter in buildPartialRunError to also skip
assistant and user event types (intermediate streaming events), in addition
to system events. result events are still retained since they may carry
useful terminal error details. When all stdout lines are filtered, the
existing initOnlyOutput branch triggers and surfaces a clean diagnostic:
"Claude started but did not produce a result (model: MiniMax-M2.7) — check
API credentials, model support, and adapter config".

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-24 17:11:20 +00:00
Chris Farhood f097440f3c feat: implement cancel support via keepalive poll and SIGTERM handler (FAR-26)
- Poll GET /api/heartbeat-runs/:runId on every keepalive tick (15s); when
  status != 'running', delete the K8s Job, set logStopSignal, and return
  errorCode='cancelled' — Job gone within ~15s of external cancellation.
- SIGTERM handler best-effort deletes all active Jobs/Secrets and re-emits
  the signal to let the process exit naturally.
- Export shouldAbortForCancellation() helper; add tests for helper, cancel
  poll path, and SIGTERM cleanup.
- Guard: PAPERCLIP_API_URL missing logs a warning and skips cancel polling;
  HTTP 5xx from poll treated as transient; reattach path skips cancel poll.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-24 15:20:45 +00:00
Chris Farhood c55d6c61fc feat: declare hasOutOfProcessLiveness and remove onSpawn workarounds (FAR-24)
- Add `hasOutOfProcessLiveness: true` to createServerAdapter() so the
  reaper skips local PID checks and uses the staleness window instead.
- Remove the initial onSpawn call and all periodic keepalive onSpawn
  refreshes that were compensating for the missing flag.
- Remove POST_TERMINAL_KEEPALIVE_MS constant and keepaliveTick counter
  that backed those workarounds.
- Cast required: adapter-utils ServerAdapterModule type predates this field.
- Bump to 0.1.38.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-24 14:14:10 +00:00
Chris Farhood f9d8a2e0ce fix: resolve grace-period deadlock for stale UI status (FAR-23)
The log-stream-exit grace timer never fired because logExitTime was set
in the .then() of streamPodLogs, which only resolves once stopSignal is
set — but stopSignal is only set when completionWithGrace fires, which
requires logExitTime to be non-null. Classic deadlock.

Fix: add onFirstStreamExit callback to streamPodLogs, called after
attempt=0's streamPodLogsOnce returns (the first container exit signal).
execute() passes a closure that sets logExitTime immediately, breaking
the circular dependency and allowing the 30s grace timer to fire
correctly when K8s Job conditions lag container exit.

Tests: all 323 pass including the two FAR-23 grace-period regression tests.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-24 12:20:10 +00:00
Chris Farhood 29a4e709d0 fix: sanitize agent/run/company labels to RFC 1123 (N4)
Co-Authored-By: Claude Sonnet <noreply@anthropic.com>
Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-24 00:00:56 +00:00
Chris Farhood 8a08e6a6ee fix: relabel reattached Job with current run-id and session-id (N3)
Co-Authored-By: Claude Sonnet <noreply@anthropic.com>
Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-23 23:59:05 +00:00
Chris Farhood c0dba8e904 fix: never auto-delete live K8s orphans; block on mismatch (#8)
Co-Authored-By: Claude Sonnet <noreply@anthropic.com>
Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-23 23:58:51 +00:00
Chris Farhood b91859c258 refactor: extract classifyOrphan helper with decision matrix (#8)
Co-Authored-By: Claude Sonnet <noreply@anthropic.com>
Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-23 23:58:23 +00:00
Chris Farhood f1433b05a6 fix: reserve paperclip.io/ and app.kubernetes.io/ label prefixes (N2)
Co-Authored-By: Claude Sonnet <noreply@anthropic.com>
Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-23 23:54:15 +00:00
Gandalf the Greybeard 21a02da00f fix: prevent prompt Secret leak by attaching ownerReference to Job (FAR-15)
When a large prompt creates a K8s Secret, it can orphan if the process
crashes before the finally block runs. Now the Secret gets an
ownerReference pointing to the Job after creation, so K8s GC cleans it
up automatically. Also cleans up the Secret on job creation failure.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-23 23:29:47 +00:00
Gandalf the Greybeard ef73586a41 fix: address 6 critical/minor code review findings (FAR-15)
1. Fix resources.* dotted-key config — UI fields now correctly read
2. Fix operator precedence bug in container status key (add parens)
3. Add missing RBAC checks to testEnvironment (jobs/list, secrets/*, pvc)
4. Add bail timer log message for debuggability
5. Make result-event detection robust to JSON whitespace variations
6. Remove namespace short-circuit so all checks run on first attempt

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-23 23:15:01 +00:00
Gandalf the Greybeard f41ae818ef fix: fire onSpawn immediately on job terminal transition (FAR-14)
Prevents process_lost false positives for 2-3 minute K8s jobs by
resetting the reaper clock when the keepalive loop detects the job
has completed (or been deleted), rather than waiting for the next
periodic refresh.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-23 22:29:22 +00:00
Gandalf the Greybeard 77ed2004f8 fix: port prepareClaudePromptBundle flow to claude_k8s adapter (FAR-11)
K8s Job pods were starting without the Paperclip skill loaded, so agents
could not find their heartbeat procedure and reported "no issue content in
my workspace" on every wake. Root cause: claude_local materialises skills
into a PVC-backed prompt-bundle directory and passes --add-dir to Claude,
but claude_k8s did neither.

Changes:
- Add src/server/prompt-cache.ts with prepareClaudePromptBundle (ported
  from adapter-claude-local). Writes skill symlinks and the agent's
  instructions file into a content-addressed bundle directory under the
  shared PVC (/paperclip/instances/.../claude-prompt-cache/<hash>/).
- execute.ts: read desired skills and instructions file before building
  the Job manifest, then call prepareClaudePromptBundle and pass the
  resulting bundle to buildJobManifest.
- job-manifest.ts: accept optional promptBundle in JobBuildInput; when
  present, pass --add-dir <bundle.addDir> and use bundle.instructionsFilePath
  for --append-system-prompt-file. Also fix: skip --append-system-prompt-file
  on session resumes to avoid wasting tokens on re-injection.
- skills.ts: correct the detail string to reflect actual materialisation.
- job-manifest.test.ts: add 5 new tests covering --add-dir injection,
  bundle path preference, session-resume skipping, and fallback behaviour.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-23 19:34:35 +00:00
Gandalf the Greybeard 69d0f4972f test: regression for streamPodLogsOnce bail timer (FAR-10)
Uses vi.mock on k8s-client and vi.useFakeTimers to prove that when
logApi.log() never resolves (the FAR-10 hang shape) and stopSignal
fires, streamPodLogsOnce still returns within the bail window
(LOG_STREAM_BAIL_TIMEOUT_MS).  Exports streamPodLogsOnce so the test
can call it directly.  Also covers the no-stopSignal happy path.

269/269 passing (+2 new).

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-23 16:43:32 +00:00
Gandalf the Greybeard c7706d742f 0.1.31: harden streamPodLogsOnce with Promise.race bail (FAR-10)
Defensive follow-up to the FAR-10 fix.  The original patch aborts the
in-flight follow stream by destroying the Writable once stopSignal
fires, and relies on the @kubernetes/client-node library propagating
that destroy into an abort of the underlying HTTP request.  If that
propagation ever fails (e.g. the client is awaiting a response that
never arrives), logApi.log() can still hang forever.

Adds a Promise.race with a 3s bail timer that starts when stopSignal
fires.  In the happy path (destroy-propagation works), logApi.log()
resolves first and the bail timer is cleared.  In the failure path,
the bail timer fires and streamPodLogsOnce returns with whatever
chunks were captured — preventing the hang from reaching execute().

No test change: existing 267 tests pass and the race path needs a k8s
mock to exercise end-to-end; validated by monitoring real runs.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-23 16:36:51 +00:00
Gandalf the Greybeard b3c1519cf5 fix: prevent process_lost when K8s Job completes (FAR-10)
Four stacked bugs caused the adapter to hang after K8s Job completion,
allowing the 5-minute reaper to mark runs process_lost even when the Job
actually succeeded.

- streamPodLogsOnce: add stopSignal polling loop that destroys the
  writable every 200ms once the job-completion branch fires, aborting
  any in-flight follow stream that would otherwise hang indefinitely
- waitForPod: treat phase=Failed as a terminal error (throw via
  describePodTerminatedError) instead of entering the log-stream path
  with a dead pod (new helper is exported for unit tests)
- waitForPod: surface cs.state?.terminated in the per-tick detail line
  so operators see exit code / reason without needing kubectl
- keepalive: add POST_TERMINAL_KEEPALIVE_MS (90s) window after Job goes
  terminal so onSpawn keeps refreshing updatedAt during cleanup; if
  execute() genuinely stalls past 90s the reaper will still catch it

Regression tests added for describePodTerminatedError (phase=Failed
with and without claude container status).

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-23 15:59:51 +00:00
Test User c8968598e4 fix: reattach to orphaned K8s Jobs across Paperclip restarts (FAR-124)
When the Paperclip pod restarts mid-run, the in-process setInterval
keepalive dies, `updatedAt` goes stale, and the server's orphan reaper
fails the run with the (misleading) "child pid 1 is no longer running"
message.  Paperclip then dispatches a continuation run, whose execute()
finds the previous run's K8s Job still happily running and deletes it
as an "orphan" — throwing away work and producing the transcript/run
cascade reported on FAR-124.

Changes:

- job-manifest: add `paperclip.io/task-id` and `paperclip.io/session-id`
  labels (sanitized via new `sanitizeLabelValue` helper) so a later
  execute() can identify an orphan as the continuation of the same
  logical unit of work.
- execute: in the concurrency guard, when `reattachOrphanedJobs` is on
  (default) and an orphan matches agent + task + session + is not
  terminal, pick it as the reattach target; delete only the other
  orphans.  Branch the build/create/waitForPod block so the reattach
  path skips manifest building, Secret creation, Job creation, and
  scheduling wait — it jumps straight to streaming logs and waiting
  for the existing pod's completion.
- config-schema: expose `reattachOrphanedJobs` toggle (default true).
- Tests: `sanitizeLabelValue`, `isReattachableOrphan`, new label
  presence/absence, config default.

No server-side changes; the misleading reaper message and lack of a
non-local retry path will be addressed in a follow-up upstream PR.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-22 21:59:25 +00:00
Test User 5e01ae99b3 fix: dedup replayed K8s log lines at the streaming UI boundary (FAR-123)
The K8s log follow stream replays the trailing few seconds of output on
every reconnect because `sinceSeconds` uses integer-second granularity
with a small safety buffer.  FAR-105 dedupped those replays at the final
parser (parse.ts), but the streaming UI consumes raw onLog chunks and
still showed each replayed assistant/tool event as a fresh entry — which
is how the duplicate "Three nits to fix…" blocks in the screenshot
appeared between successive tool calls.

Fix: add a stateful line-level dedup filter around onLog, shared across
reconnects.  Claude stream-json events are keyed by their stable
structural IDs (message.id, tool_use_id, session_id); non-JSON output
(paperclip status lines, shell output) passes through unchanged.

- New `src/server/log-dedup.ts` + tests: LogLineDedupFilter handles
  chunk-to-line buffering, replay dedup, and end-of-stream flush.
- `streamPodLogs` instantiates one filter per run so dedup state persists
  across reconnect attempts.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-22 19:49:04 +00:00
Test User 8c8c2f2ec0 fix: address review nits — refactor fallbacks, add unit tests (FAR-122)
- Merge both one-shot log fallbacks into a single conditional block using a
  cheap string-scan guard (`stdout.includes('"type":"result"')`) to avoid
  calling parseClaudeStreamJson twice and prevent double readPodLogs calls
  when the first fallback already ran.

- Extract error-message logic into `buildPartialRunError(exitCode, model, stdout)`
  (exported for tests) so the `!parsed` branch is a one-liner and the logic
  is independently testable.

- Export `isK8s404` for tests.

- Add execute.test.ts with 15 unit tests covering:
  - isK8s404: v0.x response.statusCode, v1.0+ response.status, direct
    statusCode, message-based detection, non-404 codes
  - buildPartialRunError: exitCode=0 path, empty stdout, init-only output
    (model surfaced), first non-system content line, null exitCode (-1),
    multiple consecutive system events

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-22 19:42:57 +00:00
Test User b9def0964e fix: improve partial-log handling and error messages for fast-exit containers (FAR-122)
- Add a second log fallback: if the follow stream captured partial output (init
  event present but no result event), attempt a one-shot readPodLogs before the
  pod is cleaned up. Fast-exiting containers (bad model, missing API key, etc.)
  can cause the follow stream to return only the init line before the connection
  drops; the one-shot read is more reliable for already-terminated containers.

- Improve the `!parsed` error message: skip system/init events when searching
  for the first content line, so the error reads "Claude started but did not
  produce a result (model: MiniMax-M2.7) — check API credentials..." instead of
  "Claude exited with code -1: {"type":"system","subtype":"init",...}".

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-22 19:33:15 +00:00
Test User 2a31fe1f9b fix: handle K8s 404 (job deleted) gracefully in waitForJobCompletion (FAR-122)
- Add `isK8s404()` helper compatible with @kubernetes/client-node v0.x and v1.0+
  (checks response.statusCode, response.status, err.statusCode, and message text)
- `waitForJobCompletion` now catches 404 and returns `{ jobGone: true }` instead
  of throwing — prevents uncaught exceptions when the K8s Job is TTL-deleted or
  externally removed while the adapter is polling for a terminal condition
- Keepalive job-liveness check now uses `isK8s404` (was checking `response.statusCode`
  which is absent in the v1.0+ fetch-based client, silently breaking 404 detection)
- `jobGone` case in completion handler logs a diagnostic and falls through to stdout
  parsing rather than returning an opaque 404 error to the user

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-22 16:42:07 +00:00
Iceman 5926b302e5 fix: replace pid:-1 sentinel with process.pid to prevent false process_lost
The adapter was calling onSpawn({ pid: -1 }) as a sentinel value for
K8s Jobs (which run out-of-process), then the server's orphan reaper
was checking isProcessAlive(-1) which always returns false, causing
legitimate runs to be reaped as 'process_lost'.

Using process.pid (the Paperclip server's own PID) is always alive
while the adapter runs in-process, preventing false reaping.

Fixes FAR-116.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-21 18:09:41 +00:00
Test User 1e517bb9bb fix: P1 correctness and operational fixes from FAR-104/FAR-105 analysis
5. Cap log stream reconnect attempts at 50 — prevents infinite
   reconnect loops during sustained API partitions.

6. Fire keepalive refresh earlier — tick 1 + every 12 ticks (~3min)
   instead of every 16 ticks (~4min), providing better safety margin
   under the 5-minute reaper window.

7. Catch rejections from onLog inside keepalive — add .catch(() => {})
   to prevent unhandledRejection on SSE backpressure.

8. Prevent sanitized-name collisions — extend slugs to 16 chars each,
   add a 6-char SHA-256 hash suffix, shorten prefix to `ac-` to stay
   well within the 63-char DNS label limit.

10. Fix config-hint parity for nodeSelector and labels — parse both
    `key=value` multiline text and JSON objects, matching what the
    textarea hint promises.

11. Large-prompt fallback via Secret — prompts >256 KiB are staged as a
    K8s Secret and mounted as a volume instead of passed via env var,
    protecting against the ~1 MiB PodSpec limit.

13. Track last-seen log timestamp on reconnect — anchor sinceSeconds at
    the last received log line instead of stream start, fixing FAR-105
    duplicative logs. Belt-and-braces: dedupe assistantTexts at the
    parser boundary in parse.ts.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-20 19:05:07 +00:00
Test User d74b6d34b3 fix: P0 correctness fixes from FAR-104/FAR-105 analysis
1. Inherit envFrom and env.valueFrom from self pod — secrets wired via
   valueFrom.secretKeyRef or envFrom.secretRef are now forwarded to Job
   pods, fixing credentials silently dropped for K8s-idiomatic secret
   patterns (e.g. ANTHROPIC_API_KEY via Secret).

2. Distinguish 404 vs transient errors in keepalive — only mark the
   keepalive as terminal on 404 (Job deleted). Transient 5xx/connection
   errors are logged and retried on the next tick, preventing premature
   reaper kills during API instability.

3. Fail closed on concurrency-guard read failure — a failing
   listNamespacedJob now returns k8s_concurrency_guard_unreachable
   instead of silently proceeding, protecting against zombie Jobs on
   shared PVCs.

4. Bound the waitForJobCompletion re-check — pass a 60s timeout instead
   of polling forever, preventing indefinite hangs when the K8s API is
   degraded.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-20 18:57:16 +00:00
Test User 5f5ae92ce7 fix: skip keepalive updatedAt refresh once K8s Job is terminal
The previous fix (df856e6) made the keepalive timer call onSpawn every
~4 minutes to refresh the run's updatedAt in the DB, so the stale-run
reaper wouldn't kill live runs in multi-instance deployments.  That was
correct for live jobs, but it was unconditional — if execute() stalled
after the pod terminated (slow K8s API call, hung log stream drain, or
a Job whose Complete condition lags pod termination), the keepalive
kept the run marked "alive" indefinitely even though the pod was gone.

That manifests as the opposite of the original bug: the UI shows jobs
as running when they have actually finished.

Two changes:

1. Verify the Job is still alive before the keepalive refreshes
   updatedAt.  If the Job has reached a terminal Complete/Failed
   condition (or has been deleted / the API read fails), stop
   refreshing.  If execute() truly ends up stuck past that point, the
   reaper will catch the run within the normal 5-minute staleness
   window instead of never.

2. Clear the keepalive interval immediately once Promise.allSettled
   resolves, rather than only in the finally block.  Post-completion
   work (exit-code fetch, log fallback read, job cleanup) must not be
   able to emit another onSpawn refresh that keeps the run "alive".

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-17 02:57:17 +00:00
Test User df856e6ca5 fix: clean up orphaned K8s Jobs and refresh updatedAt to prevent UI desync
Two root causes behind the "plugin losing sync" issue:

1. After a server restart, the in-memory activeRunExecutions set is lost.
   The K8s Job keeps running but the reaper marks the server-side run as
   failed after 5 min (stale updatedAt).  Next heartbeat fires a new run,
   the adapter's concurrency guard blocks it because the old Job is still
   alive, and this loops indefinitely.

   Fix: the concurrency guard now compares each running Job's
   paperclip.io/run-id label against the current runId.  Jobs from a
   previous (dead) run are cleaned up automatically so the new run
   can proceed.

2. onLog (keepalive) does NOT update the run's updatedAt in the DB —
   it only writes to the log store and publishes SSE events.  In
   multi-instance deployments, a reaper on instance B can mark a run
   being executed on instance A as stale after 5 min of no DB updates.

   Fix: the keepalive timer now calls onSpawn every ~4 min (16 ticks)
   to refresh updatedAt, staying within the 5-min reaper threshold.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-16 21:48:16 +00:00
Test User 4bf5cf64a4 fix: call onSpawn after pod enters Running state to prevent UI desync
The k8s adapter never called ctx.onSpawn(), so the Paperclip server
had no processStartedAt timestamp for the run. The stale-run reaper
(reapOrphanedRuns) would then mark live k8s runs as failed/orphaned,
causing the UI to show no active runs and triggering duplicate run
attempts that hit the concurrency guard.

Uses pid=-1 as a sentinel since there is no local process — the
server's isProcessAlive check safely returns false for pid <= 0.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-16 15:46:34 +00:00
Chris Farhood b8ba457790 fix: don't delete job when returning state-mismatch error to keep UI in sync
When waitForJobCompletion threw and the job was still not terminal, we
were returning an error but still deleting the job in the finally block.
This left the UI holding an error while the job (still alive) would be
cleaned up by Kubernetes, causing the next heartbeat to find nothing and
think it was safe to retry — spawning a concurrent pod.

Now we set skipCleanup=true when returning the mismatch error, so the
job is retained and the heartbeat can still find and wait on it.

Also removes a duplicate empty-stdout fallback block.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 11:29:42 -04:00
Chris Farhood efbbfbc299 fix: re-check job state when completion waiter throws to prevent UI staleness
When waitForJobCompletion threw a transient error (API disconnect, etc.),
the code fell through with jobTimedOut=true and returned a result even
though the job was still running. This caused the UI to think the run
was complete while the job kept running, resulting in concurrency errors.

Now when completion throws, we re-check the job's actual state. If still
not terminal, we return a k8s_job_state_mismatch error so the UI knows
the run is not done.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 07:26:10 -04:00
Pawla Abdul 77ba40d9bf Reconnect K8s log stream on silent API disconnects
The adapter opened a single follow-stream to the K8s API for pod logs.
If that TCP connection silently dropped (API server hiccup, network
timeout, load-balancer idle cut), streamPodLogs returned early and no
more real Claude output reached the UI — only keepalive pings.  The
pod kept running and producing logs (visible via kubectl), but the
adapter never reconnected.

Splits streamPodLogs into streamPodLogsOnce (single follow attempt) and
a reconnecting wrapper that retries with sinceSeconds until a shared
stop signal fires when waitForJobCompletion resolves.  On reconnect,
requests logs from the original stream start time (+5s overlap) so no
output is lost; the UI deduplicates chunks.

Bumps version to 0.1.12.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-13 10:34:41 +00:00
Pawla Abdul e760bf9386 Add keepalive pings during job execution to prevent UI timeout desync
The adapter had no mechanism to signal liveness while a K8s Job was
running. When Claude entered long thinking phases with no log output,
the Paperclip UI could lose sync and consider the run stuck even though
the pod was still actively working.

Adds a 15-second interval keepalive that sends status messages via
onLog during execution. The keepalive tracks time since last real log
output and reports it, keeping the connection alive. The timer is
cleaned up in the finally block to prevent leaks on any exit path.

Bumps version to 0.1.11.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-12 18:44:09 +00:00
Chris Farhood 9dbb5f337e Initial commit: Paperclip adapter for Claude Code on Kubernetes
Adapter plugin that runs Claude Code agents as Kubernetes Jobs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-11 23:16:31 -04:00