- #9: match Paperclip container by name in k8s-client instead of
trusting spec.containers[0], which could be a service-mesh sidecar
- #11: key assistant-text dedup by (message.id, index) so legitimate
duplicate content across turns isn't collapsed in the summary
- #16: trim trailing hyphens from sanitized K8s names so truncation
doesn't produce names ending in "-"
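A minimal sketch of the #16 trim, assuming sanitized names are lowercased DNS-1123 labels capped at 63 characters. `sanitizeK8sNameSketch` and `MAX_LABEL_LEN` are hypothetical names, not the adapter's actual helper:

```typescript
const MAX_LABEL_LEN = 63; // DNS-1123 label limit

// Hypothetical helper: lowercases, replaces invalid characters, truncates,
// then trims hyphens so the result never starts or ends with "-".
function sanitizeK8sNameSketch(raw: string, maxLen: number = MAX_LABEL_LEN): string {
  const cleaned = raw
    .toLowerCase()
    .replace(/[^a-z0-9-]+/g, "-") // collapse each invalid run into one hyphen
    .replace(/^-+/, ""); // leading hyphens are invalid too
  // Truncate first, then trim: the cut itself can expose a trailing "-".
  return cleaned.slice(0, maxLen).replace(/-+$/, "");
}
```

The order matters: trimming before truncating would leave the bug in place whenever the cut lands right after a hyphen.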
Findings #5 (keepalive re-verify) and #6 (one-shot log dedup) were
already addressed in the current code — verified during this review.
#8 (orphan reattach behavior) requires a product decision on whether
"new session wins" is intentional, so deferring.
Co-Authored-By: Paperclip <noreply@paperclip.ing>
When a large prompt creates a K8s Secret, it can orphan if the process
crashes before the finally block runs. Now the Secret gets an
ownerReference pointing to the Job after creation, so K8s GC cleans it
up automatically. Also cleans up the Secret on job creation failure.
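For illustration, the ownerReference could be built roughly like this. It is a sketch under assumptions: `secretOwnerPatch` and `JobIdentity` are hypothetical names, and the Secret and Job are assumed to live in the same namespace (Kubernetes GC requires that for namespaced owners):

```typescript
interface OwnerReference {
  apiVersion: string;
  kind: string;
  name: string;
  uid: string;
}

interface JobIdentity {
  name: string;
  uid: string; // metadata.uid, only known after the API server creates the Job
}

// Builds a metadata patch that makes the Job the owner of the prompt Secret.
// Once the Job is deleted, the garbage collector removes the Secret as well.
function secretOwnerPatch(job: JobIdentity): { metadata: { ownerReferences: OwnerReference[] } } {
  return {
    metadata: {
      ownerReferences: [
        { apiVersion: "batch/v1", kind: "Job", name: job.name, uid: job.uid },
      ],
    },
  };
}
```

Because the uid only exists after Job creation, the Secret has to be created first and patched afterwards, which is why the failure path still needs an explicit delete.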
Co-Authored-By: Paperclip <noreply@paperclip.ing>
The trunc function in the RTK filter script now walks back from the
truncation point past continuation bytes and checks whether the full
codepoint fits, avoiding replacement characters from mid-codepoint slicing.
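A sketch of the walk-back logic, written in TypeScript for illustration rather than copied from the filter script. `utf8SafeLength` is a hypothetical name, and the input is assumed to be valid UTF-8:

```typescript
// Returns how many leading bytes can be kept without splitting a codepoint.
function utf8SafeLength(bytes: Uint8Array, maxBytes: number): number {
  if (maxBytes <= 0) return 0;
  if (bytes.length <= maxBytes) return bytes.length;

  // Walk back from the last kept byte past continuation bytes (10xxxxxx)
  // until we reach the lead byte of the final codepoint.
  let lead = maxBytes - 1;
  while (lead > 0 && (bytes[lead] & 0xc0) === 0x80) lead--;

  // The lead byte encodes the codepoint's total width in bytes.
  const b = bytes[lead];
  const width = b >= 0xf0 ? 4 : b >= 0xe0 ? 3 : b >= 0xc0 ? 2 : 1;

  // Keep the codepoint only if all of its bytes fit before the cut.
  return lead + width <= maxBytes ? maxBytes : lead;
}
```

Slicing at the returned length and decoding the result should then never yield U+FFFD for well-formed input.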
Co-Authored-By: Paperclip <noreply@paperclip.ing>
Prevents process_lost false positives for 2-3 minute K8s jobs by
resetting the reaper clock when the keepalive loop detects the job
has completed (or been deleted), rather than waiting for the next
periodic refresh.
Co-Authored-By: Paperclip <noreply@paperclip.ing>
K8s Job pods were starting without the Paperclip skill loaded, so agents
could not find their heartbeat procedure and reported "no issue content in
my workspace" on every wake. Root cause: claude_local materialises skills
into a PVC-backed prompt-bundle directory and passes --add-dir to Claude,
but claude_k8s did neither.
Changes:
- Add src/server/prompt-cache.ts with prepareClaudePromptBundle (ported
from adapter-claude-local). Writes skill symlinks and the agent's
instructions file into a content-addressed bundle directory under the
shared PVC (/paperclip/instances/.../claude-prompt-cache/<hash>/).
- execute.ts: read desired skills and instructions file before building
the Job manifest, then call prepareClaudePromptBundle and pass the
resulting bundle to buildJobManifest.
- job-manifest.ts: accept optional promptBundle in JobBuildInput; when
present, pass --add-dir <bundle.addDir> and use bundle.instructionsFilePath
for --append-system-prompt-file. Also fix: skip --append-system-prompt-file
on session resumes to avoid wasting tokens on re-injection.
- skills.ts: correct the detail string to reflect actual materialisation.
- job-manifest.test.ts: add 5 new tests covering --add-dir injection,
bundle path preference, session-resume skipping, and fallback behaviour.
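The argument selection described for job-manifest.ts can be sketched as follows. Only `addDir`, `instructionsFilePath`, and the two CLI flags come from the change itself; `buildClaudeArgsSketch`, `fallbackInstructionsPath`, and `isSessionResume` are hypothetical names for illustration:

```typescript
interface PromptBundleSketch {
  addDir: string;
  instructionsFilePath: string | null;
}

interface ArgsInput {
  promptBundle?: PromptBundleSketch;
  fallbackInstructionsPath?: string; // hypothetical: path used when no bundle exists
  isSessionResume: boolean;
}

function buildClaudeArgsSketch(input: ArgsInput): string[] {
  const args: string[] = [];
  if (input.promptBundle) {
    args.push("--add-dir", input.promptBundle.addDir);
  }
  // Prefer the bundled instructions file; fall back to the older path.
  const instructions =
    input.promptBundle?.instructionsFilePath ?? input.fallbackInstructionsPath;
  // On a resume the session already contains the instructions, so skip them.
  if (instructions && !input.isSessionResume) {
    args.push("--append-system-prompt-file", instructions);
  }
  return args;
}
```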
Co-Authored-By: Paperclip <noreply@paperclip.ing>
Uses vi.mock on k8s-client and vi.useFakeTimers to prove that when
logApi.log() never resolves (the FAR-10 hang shape) and stopSignal
fires, streamPodLogsOnce still returns within the bail window
(LOG_STREAM_BAIL_TIMEOUT_MS). Exports streamPodLogsOnce so the test
can call it directly. Also covers the no-stopSignal happy path.
269/269 passing (+2 new).
Co-Authored-By: Paperclip <noreply@paperclip.ing>
Defensive follow-up to the FAR-10 fix. The original patch aborts the
in-flight follow stream by destroying the Writable once stopSignal
fires, and relies on the @kubernetes/client-node library propagating
that destroy into an abort of the underlying HTTP request. If that
propagation ever fails (e.g. the client is awaiting a response that
never arrives), logApi.log() can still hang forever.
Adds a Promise.race with a 3s bail timer that starts when stopSignal
fires. In the happy path (destroy-propagation works), logApi.log()
resolves first and the bail timer is cleared. In the failure path,
the bail timer fires and streamPodLogsOnce returns with whatever
chunks were captured — preventing the hang from reaching execute().
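The race can be sketched like this. It is not the adapter's code: `raceWithBail` is a hypothetical name, and the stop signal is modelled as a promise that resolves when the signal fires, which may differ from the real `stopSignal` shape:

```typescript
type Bailed = { bailed: true };

// Resolves with the work result, or with { bailed: true } if the work is
// still pending bailMs after the stop signal fired.
async function raceWithBail<T>(
  work: Promise<T>,
  stopped: Promise<void>,
  bailMs: number,
): Promise<T | Bailed> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  // The bail clock only starts once the stop signal has fired.
  const bail = stopped.then(
    () =>
      new Promise<Bailed>((resolve) => {
        timer = setTimeout(() => resolve({ bailed: true }), bailMs);
      }),
  );
  try {
    return await Promise.race([work, bail]);
  } finally {
    // Happy path: the work won, so the pending bail timer is cleared.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

A caller would pass the pending `logApi.log()` promise as `work` and treat a `{ bailed: true }` result as "return whatever chunks were captured so far".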
No test change: the existing 267 tests pass, and exercising the race path
end-to-end needs a k8s mock, so it was validated by monitoring real runs.
Co-Authored-By: Paperclip <noreply@paperclip.ing>
- Update repository, bugs, and homepage URLs in package.json to use
the correct farhoodlabs GitHub org
- Add NODE_AUTH_TOKEN: NPM_TOKEN to the CI publish step so the newly
added NPM_TOKEN secret is picked up for authentication
Co-Authored-By: Paperclip <noreply@paperclip.ing>
Replaces NPM_TOKEN secret with id-token: write + --provenance so
publishing uses GitHub's OIDC token directly. No repository secret
required; provenance attestation is generated automatically.
Also collapses the redundant second setup-node step (registry-url is
now set on the first one).
Co-Authored-By: Paperclip <noreply@paperclip.ing>
Four stacked bugs caused the adapter to hang after K8s Job completion,
allowing the 5-minute reaper to mark runs process_lost even when the Job
actually succeeded.
- streamPodLogsOnce: add a loop that polls stopSignal every 200ms and
destroys the writable once the job-completion branch fires, aborting
any in-flight follow stream that would otherwise hang indefinitely
- waitForPod: treat phase=Failed as a terminal error (throw via
describePodTerminatedError) instead of entering the log-stream path
with a dead pod (new helper is exported for unit tests)
- waitForPod: surface cs.state?.terminated in the per-tick detail line
so operators see exit code / reason without needing kubectl
- keepalive: add POST_TERMINAL_KEEPALIVE_MS (90s) window after Job goes
terminal so onSpawn keeps refreshing updatedAt during cleanup; if
execute() genuinely stalls past 90s the reaper will still catch it
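As a rough illustration of the post-terminal window, the keepalive decision reduces to a time comparison. `shouldRefreshUpdatedAt` and `terminalAtMs` are hypothetical names; only the constant and its 90s value come from this change:

```typescript
const POST_TERMINAL_KEEPALIVE_MS = 90000;

// terminalAtMs is null while the Job is still running.
function shouldRefreshUpdatedAt(nowMs: number, terminalAtMs: number | null): boolean {
  if (terminalAtMs === null) return true; // live Job: always refresh
  // Terminal Job: keep refreshing only during the cleanup window, so a
  // genuinely stalled execute() still becomes visible to the reaper.
  return nowMs - terminalAtMs < POST_TERMINAL_KEEPALIVE_MS;
}
```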
Regression tests added for describePodTerminatedError (phase=Failed
with and without claude container status).
Co-Authored-By: Paperclip <noreply@paperclip.ing>
When the Paperclip pod restarts mid-run, the in-process setInterval
keepalive dies, `updatedAt` goes stale, and the server's orphan reaper
fails the run with the (misleading) "child pid 1 is no longer running"
message. Paperclip then dispatches a continuation run, whose execute()
finds the previous run's K8s Job still happily running and deletes it
as an "orphan" — throwing away work and producing the transcript/run
cascade reported on FAR-124.
Changes:
- job-manifest: add `paperclip.io/task-id` and `paperclip.io/session-id`
labels (sanitized via new `sanitizeLabelValue` helper) so a later
execute() can identify an orphan as the continuation of the same
logical unit of work.
- execute: in the concurrency guard, when `reattachOrphanedJobs` is on
(default) and a non-terminal orphan matches agent, task, and session,
pick it as the reattach target; delete only the other
orphans. Branch the build/create/waitForPod block so the reattach
path skips manifest building, Secret creation, Job creation, and
scheduling wait — it jumps straight to streaming logs and waiting
for the existing pod's completion.
- config-schema: expose `reattachOrphanedJobs` toggle (default true).
- Tests: `sanitizeLabelValue`, `isReattachableOrphan`, new label
presence/absence, config default.
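A sketch of what label-value sanitizing has to guarantee, based on the Kubernetes rule that a label value is at most 63 characters, uses only alphanumerics plus `-`, `_`, and `.`, and starts and ends with an alphanumeric. `sanitizeLabelValueSketch` is an illustrative stand-in, not the adapter's `sanitizeLabelValue`:

```typescript
function sanitizeLabelValueSketch(raw: string): string {
  return raw
    .replace(/[^A-Za-z0-9._-]/g, "_") // replace characters labels cannot hold
    .slice(0, 63) // enforce the length limit
    .replace(/^[^A-Za-z0-9]+/, "") // must start with an alphanumeric
    .replace(/[^A-Za-z0-9]+$/, ""); // must end with an alphanumeric
}
```

An empty result is still a valid label value, so callers that need the label for matching would have to decide whether to omit it in that case.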
No server-side changes; the misleading reaper message and lack of a
non-local retry path will be addressed in a follow-up upstream PR.
Co-Authored-By: Paperclip <noreply@paperclip.ing>
The K8s log follow stream replays the trailing few seconds of output on
every reconnect because `sinceSeconds` uses integer-second granularity
with a small safety buffer. FAR-105 deduplicated those replays at the final
parser (parse.ts), but the streaming UI consumes raw onLog chunks and
still showed each replayed assistant/tool event as a fresh entry — which
is how the duplicate "Three nits to fix…" blocks in the screenshot
appeared between successive tool calls.
Fix: add a stateful line-level dedup filter around onLog, shared across
reconnects. Claude stream-json events are keyed by their stable
structural IDs (message.id, tool_use_id, session_id); non-JSON output
(paperclip status lines, shell output) passes through unchanged.
- New `src/server/log-dedup.ts` + tests: LogLineDedupFilter handles
chunk-to-line buffering, replay dedup, and end-of-stream flush.
- `streamPodLogs` instantiates one filter per run so dedup state persists
across reconnect attempts.
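The filter's behaviour can be sketched as below. This is a simplified stand-in for `LogLineDedupFilter`, not its implementation: the class name, the `type:id` key format, and the ID precedence are assumptions made for illustration:

```typescript
// Derives a dedup key from a stream-json line, or null for lines that
// should always pass through (non-JSON output, events without an ID).
function eventKey(line: string): string | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(line);
  } catch {
    return null; // shell output, status lines
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const ev = parsed as {
    type?: unknown;
    message?: { id?: unknown } | null;
    tool_use_id?: unknown;
    session_id?: unknown;
  };
  const id = ev.message?.id ?? ev.tool_use_id ?? ev.session_id;
  return typeof id === "string" ? `${String(ev.type)}:${id}` : null;
}

class LineDedupSketch {
  private buffer = "";
  private seen = new Set<string>();

  // Accepts an arbitrary chunk and returns the complete, non-replayed lines.
  push(chunk: string): string[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    this.buffer = lines.pop() ?? ""; // keep the trailing partial line
    return lines.filter((line) => this.accept(line));
  }

  // Emits whatever partial line is left at end of stream.
  flush(): string[] {
    const rest = this.buffer;
    this.buffer = "";
    return rest !== "" && this.accept(rest) ? [rest] : [];
  }

  private accept(line: string): boolean {
    const key = eventKey(line);
    if (key === null) return true;
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}
```

Creating one instance per run and reusing it across reconnects is what lets the `seen` set recognise replayed events.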
Co-Authored-By: Paperclip <noreply@paperclip.ing>
- Merge both one-shot log fallbacks into a single conditional block using a
cheap string-scan guard (`stdout.includes('"type":"result"')`) to avoid
calling parseClaudeStreamJson twice and prevent double readPodLogs calls
when the first fallback already ran.
- Extract error-message logic into `buildPartialRunError(exitCode, model, stdout)`
(exported for tests) so the `!parsed` branch is a one-liner and the logic
is independently testable.
- Export `isK8s404` for tests.
- Add execute.test.ts with 15 unit tests covering:
- isK8s404: v0.x response.statusCode, v1.0+ response.status, direct
statusCode, message-based detection, non-404 codes
- buildPartialRunError: exitCode=0 path, empty stdout, init-only output
(model surfaced), first non-system content line, null exitCode (-1),
multiple consecutive system events
Co-Authored-By: Paperclip <noreply@paperclip.ing>
- Add a second log fallback: if the follow stream captured partial output (init
event present but no result event), attempt a one-shot readPodLogs before the
pod is cleaned up. Fast-exiting containers (bad model, missing API key, etc.)
can cause the follow stream to return only the init line before the connection
drops; the one-shot read is more reliable for already-terminated containers.
- Improve the `!parsed` error message: skip system/init events when searching
for the first content line, so the error reads "Claude started but did not
produce a result (model: MiniMax-M2.7) — check API credentials..." instead of
"Claude exited with code -1: {"type":"system","subtype":"init",...}".
Co-Authored-By: Paperclip <noreply@paperclip.ing>
- Add `isK8s404()` helper compatible with @kubernetes/client-node v0.x and v1.0+
(checks response.statusCode, response.status, err.statusCode, and message text)
- `waitForJobCompletion` now catches 404 and returns `{ jobGone: true }` instead
of throwing — prevents uncaught exceptions when the K8s Job is TTL-deleted or
externally removed while the adapter is polling for a terminal condition
- Keepalive job-liveness check now uses `isK8s404` (was checking `response.statusCode`
which is absent in the v1.0+ fetch-based client, silently breaking 404 detection)
- `jobGone` case in completion handler logs a diagnostic and falls through to stdout
parsing rather than returning an opaque 404 error to the user
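A sketch of the 404 detection across both client generations. The three status fields are the ones listed above; the message pattern is an assumption, since the exact text the real `isK8s404` matches is not stated here:

```typescript
function isK8s404Sketch(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const e = err as {
    statusCode?: unknown;
    message?: unknown;
    response?: { statusCode?: unknown; status?: unknown } | null;
  };
  if (e.response?.statusCode === 404) return true; // v0.x client
  if (e.response?.status === 404) return true; // v1.0+ fetch-based client
  if (e.statusCode === 404) return true; // status set directly on the error
  // Assumed fallback pattern for errors that only carry a message.
  return typeof e.message === "string" && /\b404\b|not found/i.test(e.message);
}
```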
Co-Authored-By: Paperclip <noreply@paperclip.ing>
Replace the init-container RTK binary approach with a self-contained
Node.js implementation. When `enableRtk: true` is set in adapter config,
the job's main container startup:
1. Writes a Node.js filter script to /tmp/.rtk-filter.js (base64-encoded
inline — no curl, no wget, no external binary download required).
2. Merges a PostToolUse hook into ~/.claude/settings.json so Claude Code
runs the filter after every tool call.
3. The filter truncates tool_response/tool_result content that exceeds
`rtkMaxOutputBytes` (default: 50 000 B), handling both string and
array (text-block) content formats.
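Step 2 could look roughly like the following. This is a sketch, not the startup script's logic: `mergePostToolUseHook` is a hypothetical name, and the settings shape (a `hooks.PostToolUse` array of matcher entries, with `"*"` matching every tool) is assumed from the Claude Code hooks format:

```typescript
interface HookCommand {
  type: "command";
  command: string;
}

interface HookMatcher {
  matcher: string;
  hooks: HookCommand[];
}

interface ClaudeSettingsSketch {
  hooks?: { PostToolUse?: HookMatcher[]; [event: string]: HookMatcher[] | undefined };
  [key: string]: unknown;
}

// Adds the filter command without clobbering existing settings or hooks,
// and without adding it twice if the container restarts.
function mergePostToolUseHook(
  settings: ClaudeSettingsSketch,
  command: string,
): ClaudeSettingsSketch {
  const existing = settings.hooks?.PostToolUse ?? [];
  const alreadyPresent = existing.some((m) => m.hooks.some((h) => h.command === command));
  if (alreadyPresent) return settings;
  const entry: HookMatcher = { matcher: "*", hooks: [{ type: "command", command }] };
  return {
    ...settings,
    hooks: { ...settings.hooks, PostToolUse: [...existing, entry] },
  };
}
```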
New config fields:
- `enableRtk` (toggle): off by default
- `rtkMaxOutputBytes` (number): truncation threshold (default 50 000)
9 new tests cover: command shape, ordering, no-external-binary guarantee,
threshold injection, PostToolUse hook presence, and filter-script logic.
Co-Authored-By: Paperclip <noreply@paperclip.ing>
The adapter was calling onSpawn({ pid: -1 }) as a sentinel value for
K8s Jobs (which run out-of-process), then the server's orphan reaper
was checking isProcessAlive(-1) which always returns false, causing
legitimate runs to be reaped as 'process_lost'.
process.pid (the Paperclip server's own PID) is always alive while the
adapter runs in-process, so passing it to onSpawn prevents false reaping.
Fixes FAR-116.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
5. Cap log stream reconnect attempts at 50 — prevents infinite
reconnect loops during sustained API partitions.
6. Fire keepalive refresh earlier — tick 1 + every 12 ticks (~3min)
instead of every 16 ticks (~4min), providing better safety margin
under the 5-minute reaper window.
7. Catch rejections from onLog inside keepalive — add .catch(() => {})
to prevent unhandledRejection on SSE backpressure.
8. Prevent sanitized-name collisions — extend slugs to 16 chars each,
add a 6-char SHA-256 hash suffix, shorten prefix to `ac-` to stay
well within the 63-char DNS label limit.
10. Fix config-hint parity for nodeSelector and labels — parse both
`key=value` multiline text and JSON objects, matching what the
textarea hint promises.
11. Large-prompt fallback via Secret — prompts >256 KiB are staged as a
K8s Secret and mounted as a volume instead of passed via env var,
protecting against the ~1 MiB PodSpec limit.
13. Track last-seen log timestamp on reconnect — anchor sinceSeconds at
the last received log line instead of stream start, fixing the FAR-105
duplicate logs. Belt-and-braces: dedupe assistantTexts at the
parser boundary in parse.ts.
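For item 10, accepting both input styles can be sketched as one parser. `parseKeyValueMap` is a hypothetical name, and the error handling shown is an assumption rather than the adapter's behaviour:

```typescript
// Parses either a JSON object or multiline key=value text into a string map.
function parseKeyValueMap(input: string): Record<string, string> {
  const text = input.trim();
  const out: Record<string, string> = {};
  if (text === "") return out;

  if (text.charAt(0) === "{") {
    const parsed: unknown = JSON.parse(text);
    if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
      throw new Error("expected a JSON object");
    }
    const obj = parsed as Record<string, unknown>;
    for (const key of Object.keys(obj)) out[key] = String(obj[key]);
    return out;
  }

  for (const line of text.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // tolerate blank lines in the textarea
    const eq = trimmed.indexOf("=");
    if (eq <= 0) throw new Error(`expected key=value, got "${trimmed}"`);
    out[trimmed.slice(0, eq).trim()] = trimmed.slice(eq + 1).trim();
  }
  return out;
}
```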
Co-Authored-By: Paperclip <noreply@paperclip.ing>
1. Inherit envFrom and env.valueFrom from self pod — secrets wired via
valueFrom.secretKeyRef or envFrom.secretRef are now forwarded to Job
pods, fixing credentials silently dropped for K8s-idiomatic secret
patterns (e.g. ANTHROPIC_API_KEY via Secret).
2. Distinguish 404 vs transient errors in keepalive — only mark the
keepalive as terminal on 404 (Job deleted). Transient 5xx/connection
errors are logged and retried on the next tick, preventing premature
reaper kills during API instability.
3. Fail closed on concurrency-guard read failure — a failing
listNamespacedJob now returns k8s_concurrency_guard_unreachable
instead of silently proceeding, protecting against zombie Jobs on
shared PVCs.
4. Bound the waitForJobCompletion re-check — pass a 60s timeout instead
of polling forever, preventing indefinite hangs when the K8s API is
degraded.
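Item 1 amounts to forwarding env entries intact instead of keeping only those with a literal `value`. A sketch with trimmed-down types follows; `inheritEnvSketch`, the `*Sketch` interfaces, and the `overrides` parameter are hypothetical:

```typescript
interface EnvVarSketch {
  name: string;
  value?: string;
  valueFrom?: { secretKeyRef?: { name: string; key: string } };
}

interface EnvFromSketch {
  secretRef?: { name: string };
  configMapRef?: { name: string };
}

interface ContainerSketch {
  env?: EnvVarSketch[];
  envFrom?: EnvFromSketch[];
}

// Copies the self pod's env and envFrom to the Job container. Entries are
// forwarded whole, so valueFrom references survive; adapter-set overrides
// replace inherited entries with the same name.
function inheritEnvSketch(
  self: ContainerSketch,
  overrides: EnvVarSketch[] = [],
): Required<ContainerSketch> {
  const overridden = new Set(overrides.map((e) => e.name));
  const inherited = (self.env ?? []).filter((e) => !overridden.has(e.name));
  return {
    env: [...inherited, ...overrides],
    envFrom: [...(self.envFrom ?? [])],
  };
}
```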
Co-Authored-By: Paperclip <noreply@paperclip.ing>
Busybox echo interprets escape sequences by default (\c, \n, \t, \0NNN, etc.).
If the prompt contains \c (common in file paths or shell references), echo
silently stops output at that point, truncating the prompt file. This can
leave Claude CLI with an empty or garbled stdin, causing it to hang with
zero output — manifesting as endless keepalive messages in the UI.
printf '%s' passes content through verbatim, avoiding the issue.
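As an illustration, a manifest builder might assemble the shell command like this. It assumes the prompt reaches the container in an environment variable; `promptWriteCommand`, `shQuote`, and the variable and path names are hypothetical:

```typescript
// Single-quote a string for POSIX sh: close the quote, emit \', reopen.
function shQuote(s: string): string {
  return "'" + s.replace(/'/g, "'\\''") + "'";
}

// Emits a command that writes the variable's content verbatim. With
// printf, escape sequences are only interpreted in the format string
// ('%s'), never in the argument, so "\c" in the prompt is harmless.
function promptWriteCommand(envVar: string, path: string): string {
  if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(envVar)) {
    throw new Error(`invalid environment variable name: ${envVar}`);
  }
  return `printf '%s' "$${envVar}" > ${shQuote(path)}`;
}
```

Unlike echo, this also writes no trailing newline, which is worth keeping in mind if anything downstream expects one.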
Co-Authored-By: Paperclip <noreply@paperclip.ing>
- Add missing us.anthropic.claude-sonnet-4-6 entry
- Correct sonnet version from v2:0 to v1:0 (verified against AWS docs)
- All model IDs verified against current Bedrock documentation
Co-Authored-By: Paperclip <noreply@paperclip.ing>
The previous fix (df856e6) made the keepalive timer call onSpawn every
~4 minutes to refresh the run's updatedAt in the DB, so the stale-run
reaper wouldn't kill live runs in multi-instance deployments. That was
correct for live jobs, but it was unconditional — if execute() stalled
after the pod terminated (slow K8s API call, hung log stream drain, or
a Job whose Complete condition lags pod termination), the keepalive
kept the run marked "alive" indefinitely even though the pod was gone.
That manifests as the opposite of the original bug: the UI shows jobs
as running when they have actually finished.
Two changes:
1. Verify the Job is still alive before the keepalive refreshes
updatedAt. If the Job has reached a terminal Complete/Failed
condition (or has been deleted, or the API read fails), stop
refreshing. If execute() truly ends up stuck past that point, the
reaper will catch the run within the normal 5-minute staleness
window instead of never.
2. Clear the keepalive interval immediately once Promise.allSettled
resolves, rather than only in the finally block. Post-completion
work (exit-code fetch, log fallback read, job cleanup) must not be
able to emit another onSpawn refresh that keeps the run "alive".
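The liveness check in change 1 can be sketched as a pure decision over the Job read result. `keepaliveDecision` and the `"gone"`/`"error"` markers are hypothetical; the condition types and the `"True"` status string follow the Kubernetes Job status format:

```typescript
interface JobConditionSketch {
  type: string;
  status: string; // "True", "False", or "Unknown"
}

interface JobStatusSketch {
  conditions?: JobConditionSketch[];
}

type KeepaliveDecision = "refresh" | "stop";

// "gone" stands for a deleted Job, "error" for a failed API read.
function keepaliveDecision(read: JobStatusSketch | "gone" | "error"): KeepaliveDecision {
  if (read === "gone" || read === "error") return "stop";
  const terminal = (read.conditions ?? []).some(
    (c) => (c.type === "Complete" || c.type === "Failed") && c.status === "True",
  );
  return terminal ? "stop" : "refresh";
}
```

Stopping rather than refreshing on any doubt is the conservative choice here: the worst case is that the reaper catches a stuck run within its normal window.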
Co-Authored-By: Paperclip <noreply@paperclip.ing>