Extends the previous fix (which only covered assistant/user) to skip every
JSON object with a non-empty "type" field — system, assistant, user,
rate_limit_event, result, and any future event types. This prevents all
structured protocol artefacts from being surfaced verbatim as error messages.
Root cause of the new repro: when Claude emits a rate_limit_event before
producing output and then exits without a result event, the rate_limit_event
JSON blob was becoming the "first content line" and appearing in the error:
    Claude exited with code -1: {"type":"rate_limit_event","rate_limit_info":{...}}
With this fix, all typed events are filtered and the initOnlyOutput branch
fires, producing the clean diagnostic:
    Claude started but did not produce a result (model: claude-opus-4-7)
    — check API credentials, model support, and adapter config
Updated the "result event as content" test to match the new (correct) behaviour:
in production buildPartialRunError is only called when parseClaudeStreamJson
returns null (no result event), so the prior test was exercising a degenerate
state that cannot occur through execute().
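
The filtering rule described above can be sketched as follows. This is a minimal illustration, not the code in buildPartialRunError: the helper names `isTypedEvent` and `firstContentLine` are hypothetical, and the sketch assumes stdout is newline-delimited with one JSON event per line.

```typescript
// Hypothetical helper: true when a line is a JSON object whose "type"
// field is a non-empty string (system, assistant, rate_limit_event, ...).
function isTypedEvent(line: string): boolean {
  const trimmed = line.trim();
  if (!trimmed.startsWith("{")) return false;
  try {
    const parsed: unknown = JSON.parse(trimmed);
    if (typeof parsed !== "object" || parsed === null) return false;
    const type = (parsed as { type?: unknown }).type;
    return typeof type === "string" && type.length > 0;
  } catch {
    // Not valid JSON: treat it as ordinary content, not a protocol event.
    return false;
  }
}

// Hypothetical helper: first non-empty stdout line that is not a typed
// event, or null when every line was filtered (the init-only case).
function firstContentLine(stdout: string): string | null {
  for (const line of stdout.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    if (isTypedEvent(trimmed)) continue;
    return trimmed;
  }
  return null;
}
```

A null return here corresponds to the case where the initOnlyOutput branch would fire; a JSON object with an empty "type" is deliberately kept as content, matching the "non-empty" wording above.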
Co-Authored-By: Paperclip <noreply@paperclip.ing>
When a model produces assistant events with output_tokens=0 but no result
event (e.g. MiniMax-M2.7 thinking-only output), the partial-run error
previously surfaced the raw assistant JSON blob verbatim, producing an
unreadable message like "Claude exited with code -1: {\"type\":\"assistant\",...}".
Fix: extend the content-line filter in buildPartialRunError to also skip
assistant and user event types (intermediate streaming events), in addition
to system events. result events are still retained since they may carry
useful terminal error details. When all stdout lines are filtered, the
existing initOnlyOutput branch triggers and surfaces a clean diagnostic:
"Claude started but did not produce a result (model: MiniMax-M2.7) — check
API credentials, model support, and adapter config".
Co-Authored-By: Paperclip <noreply@paperclip.ing>
- Poll GET /api/heartbeat-runs/:runId on every keepalive tick (15s); when
status != 'running', delete the K8s Job, set logStopSignal, and return
errorCode='cancelled', so the Job is gone within ~15s of external cancellation.
- SIGTERM handler best-effort deletes all active Jobs/Secrets and re-emits
the signal to let the process exit naturally.
- Export shouldAbortForCancellation() helper; add tests for helper, cancel
poll path, and SIGTERM cleanup.
- Guards: a missing PAPERCLIP_API_URL logs a warning and skips cancel polling;
HTTP 5xx from the poll is treated as transient; the reattach path skips the
cancel poll.
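
The abort decision can be illustrated with a small pure function. This is a sketch under assumptions: the real shouldAbortForCancellation() signature is not shown in this message, so the `PollOutcome` type and the name `shouldAbortSketch` are hypothetical, and treating non-5xx HTTP errors as "do not abort" is a guess rather than documented behaviour.

```typescript
// Hypothetical shape for the result of one heartbeat-run poll.
type PollOutcome =
  | { kind: "ok"; status: string }
  | { kind: "http_error"; httpStatus: number };

// Decide whether the adapter should delete the Job and report 'cancelled'.
function shouldAbortSketch(outcome: PollOutcome): boolean {
  if (outcome.kind === "http_error") {
    // 5xx is treated as transient per the notes above. Other HTTP errors
    // are also ignored here, which is an assumption of this sketch.
    return false;
  }
  // Any status other than 'running' means the run was ended externally.
  return outcome.status !== "running";
}
```

Keeping the decision separate from the HTTP call is what makes it easy to unit-test without a server.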
Co-Authored-By: Paperclip <noreply@paperclip.ing>
Previously the test suite relied on real fs.stat completing within the fake
timer advance window (~11200ms). Under CI with 11 parallel test files the I/O
could drain later than the advances allowed, causing a 1-in-4 timeout on the
"logs pod pending" test.
Fix: mock @paperclipai/adapter-utils/server-utils using vi.hoisted() + Object.assign
so readPaperclipRuntimeSkillEntries resolves immediately as a microtask. All other
exports are forwarded to the real module via importOriginal. Each beforeEach that
calls vi.resetAllMocks() or vi.clearAllMocks() now also calls
mockReadSkillEntries.mockResolvedValue([]) to restore the implementation.
Timer advances in affected tests are simplified to reflect the purely fake-timer
sequence (no I/O drain prefix). All 323 tests pass deterministically.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The log-stream-exit grace timer never fired because logExitTime was set
in the .then() of streamPodLogs, which only resolves once stopSignal is
set — but stopSignal is only set when completionWithGrace fires, which
requires logExitTime to be non-null. Classic deadlock.
Fix: add onFirstStreamExit callback to streamPodLogs, called after
attempt=0's streamPodLogsOnce returns (the first container exit signal).
execute() passes a closure that sets logExitTime immediately, breaking
the circular dependency and allowing the 30s grace timer to fire
correctly when K8s Job conditions lag container exit.
Tests: all 323 pass including the two FAR-23 grace-period regression tests.
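
The shape of the fix can be pictured with a simplified, synchronous sketch. Only `onFirstStreamExit`, `logExitTime`, and the 30s grace come from the message above; `GraceState`, `graceElapsed`, and `streamWithRetriesSketch` are illustrative names, and the real streamPodLogs is asynchronous.

```typescript
const LOG_EXIT_GRACE_MS = 30_000; // 30s grace, as stated above

interface GraceState {
  logExitTime: number | null;
}

// The grace timer can only fire once logExitTime has been recorded.
function graceElapsed(state: GraceState, now: number): boolean {
  return state.logExitTime !== null && now - state.logExitTime >= LOG_EXIT_GRACE_MS;
}

// Stand-in for the reconnect loop: the callback runs right after the
// first attempt returns, not when the whole loop finishes.
function streamWithRetriesSketch(
  attempts: number,
  streamOnce: (attempt: number) => void,
  onFirstStreamExit?: () => void,
): void {
  for (let attempt = 0; attempt < attempts; attempt++) {
    streamOnce(attempt);
    if (attempt === 0) onFirstStreamExit?.();
  }
}
```

Because the caller's closure sets `logExitTime` inside the loop, the grace check no longer waits on the loop's own completion, which is the circular dependency described above.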
Co-Authored-By: Paperclip <noreply@paperclip.ing>
readPaperclipRuntimeSkillEntries does real fs.stat I/O under fake timers,
delaying execute()'s fake-timer registration by ~3200-4200ms of fake time
when tests run in isolation (cold OS page cache). The previous approach
tried vi.spyOn on an ESM module namespace export, which throws
"Cannot redefine property" — a fundamental ESM constraint.
Fix: remove the broken spy. Instead, each timer-heavy test now uses enough
advanceTimersByTimeAsync calls to (a) give the event loop sufficient turns
for the I/O to drain, and (b) cover the full fake-timer sequence even with
the maximum observed I/O delay. Patterns chosen:
    reconnects (needs t+6000):        6 advances, ~12200ms total
    deadline exceeded (needs t+3000): 5 advances, ~8400ms total
    pod-creation wait (needs t+5000): 5 advances, ~9400ms total
execute.ts line coverage: 82.57% (was ~24% before this task's test additions).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Uses vi.mock on k8s-client and vi.useFakeTimers to prove that when
logApi.log() never resolves (the FAR-10 hang shape) and stopSignal
fires, streamPodLogsOnce still returns within the bail window
(LOG_STREAM_BAIL_TIMEOUT_MS). Exports streamPodLogsOnce so the test
can call it directly. Also covers the no-stopSignal happy path.
269/269 passing (+2 new).
Co-Authored-By: Paperclip <noreply@paperclip.ing>
Four stacked bugs caused the adapter to hang after K8s Job completion,
allowing the 5-minute reaper to mark runs process_lost even when the Job
actually succeeded.
- streamPodLogsOnce: add stopSignal polling loop that destroys the
writable every 200ms once the job-completion branch fires, aborting
any in-flight follow stream that would otherwise hang indefinitely
- waitForPod: treat phase=Failed as a terminal error (throw via
describePodTerminatedError) instead of entering the log-stream path
with a dead pod (new helper is exported for unit tests)
- waitForPod: surface cs.state?.terminated in the per-tick detail line
so operators see exit code / reason without needing kubectl
- keepalive: add POST_TERMINAL_KEEPALIVE_MS (90s) window after Job goes
terminal so onSpawn keeps refreshing updatedAt during cleanup; if
execute() genuinely stalls past 90s the reaper will still catch it
Regression tests added for describePodTerminatedError (phase=Failed
with and without claude container status).
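
A helper of this kind might look like the sketch below. The field names (`status.phase`, `containerStatuses[].state.terminated.exitCode` and `.reason`) follow the Kubernetes Pod status schema, but the function name `describePodTerminatedSketch`, its signature, and the message wording are assumptions; the real describePodTerminatedError may differ.

```typescript
// Minimal structural types covering only the fields the sketch reads.
interface TerminatedStateLike {
  exitCode: number;
  reason?: string;
}
interface ContainerStatusLike {
  name: string;
  state?: { terminated?: TerminatedStateLike };
}
interface PodLike {
  metadata?: { name?: string };
  status?: { phase?: string; reason?: string; containerStatuses?: ContainerStatusLike[] };
}

// Build a human-readable description for a pod whose phase is Failed.
function describePodTerminatedSketch(pod: PodLike, containerName = "claude"): string {
  const podName = pod.metadata?.name ?? "<unknown>";
  const status = pod.status?.containerStatuses?.find((c) => c.name === containerName);
  const terminated = status?.state?.terminated;
  if (!terminated) {
    // No container status for the target container: fall back to pod-level reason.
    const podReason = pod.status?.reason ?? "unknown";
    return `Pod ${podName} failed (reason: ${podReason}); no terminated status for container "${containerName}"`;
  }
  const reason = terminated.reason ?? "unknown";
  return `Pod ${podName} failed: container "${containerName}" exited with code ${terminated.exitCode} (reason: ${reason})`;
}
```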
Co-Authored-By: Paperclip <noreply@paperclip.ing>
When the Paperclip pod restarts mid-run, the in-process setInterval
keepalive dies, `updatedAt` goes stale, and the server's orphan reaper
fails the run with the (misleading) "child pid 1 is no longer running"
message. Paperclip then dispatches a continuation run, whose execute()
finds the previous run's K8s Job still happily running and deletes it
as an "orphan" — throwing away work and producing the transcript/run
cascade reported on FAR-124.
Changes:
- job-manifest: add `paperclip.io/task-id` and `paperclip.io/session-id`
labels (sanitized via new `sanitizeLabelValue` helper) so a later
execute() can identify an orphan as the continuation of the same
logical unit of work.
- execute: in the concurrency guard, when `reattachOrphanedJobs` is on
(default) and an orphan matches agent, task, and session and is not
terminal, pick it as the reattach target and delete only the other
orphans. Branch the build/create/waitForPod block so the reattach
path skips manifest building, Secret creation, Job creation, and the
scheduling wait; it jumps straight to streaming logs and waiting
for the existing pod's completion.
- config-schema: expose `reattachOrphanedJobs` toggle (default true).
- Tests: `sanitizeLabelValue`, `isReattachableOrphan`, new label
presence/absence, config default.
No server-side changes; the misleading reaper message and lack of a
non-local retry path will be addressed in a follow-up upstream PR.
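
For context on the label sanitization mentioned above: Kubernetes label values must be at most 63 characters, may contain only alphanumerics, `-`, `_`, and `.`, and must start and end with an alphanumeric (or be empty). The sketch below shows one way to enforce that. The name `sanitizeLabelValueSketch` and the choice to replace invalid characters with `-` are assumptions; the actual `sanitizeLabelValue` helper may behave differently.

```typescript
const MAX_LABEL_VALUE_LENGTH = 63; // Kubernetes limit for label values

function sanitizeLabelValueSketch(raw: string): string {
  // Replace anything outside the allowed character set.
  const replaced = raw.replace(/[^A-Za-z0-9_.-]/g, "-");
  // Truncate first, then trim, so the result still ends on an alphanumeric.
  const truncated = replaced.slice(0, MAX_LABEL_VALUE_LENGTH);
  return truncated.replace(/^[^A-Za-z0-9]+/, "").replace(/[^A-Za-z0-9]+$/, "");
}
```

An input with no alphanumeric characters collapses to the empty string, which is itself a valid label value.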
Co-Authored-By: Paperclip <noreply@paperclip.ing>
- Merge both one-shot log fallbacks into a single conditional block using a
cheap string-scan guard (`stdout.includes('"type":"result"')`) to avoid
calling parseClaudeStreamJson twice and prevent double readPodLogs calls
when the first fallback already ran.
- Extract error-message logic into `buildPartialRunError(exitCode, model, stdout)`
(exported for tests) so the `!parsed` branch is a one-liner and the logic
is independently testable.
- Export `isK8s404` for tests.
- Add execute.test.ts with 15 unit tests covering:
  - isK8s404: v0.x response.statusCode, v1.0+ response.status, direct
    statusCode, message-based detection, non-404 codes
  - buildPartialRunError: exitCode=0 path, empty stdout, init-only output
    (model surfaced), first non-system content line, null exitCode (-1),
    multiple consecutive system events
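
The error shapes listed for isK8s404 suggest a check along these lines. This is a sketch only: the name `isK8s404Sketch` is hypothetical, and the message-based match on a standalone "404" is an assumed heuristic, not necessarily what the exported `isK8s404` does.

```typescript
// Detect a Kubernetes "not found" error across several error shapes.
function isK8s404Sketch(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const e = err as {
    statusCode?: unknown;
    message?: unknown;
    response?: { statusCode?: unknown; status?: unknown } | null;
  };
  if (e.response?.statusCode === 404) return true; // response.statusCode shape
  if (e.response?.status === 404) return true; // response.status shape
  if (e.statusCode === 404) return true; // statusCode directly on the error
  // Fallback: look for a standalone 404 in the message text (assumed heuristic).
  return typeof e.message === "string" && /\b404\b/.test(e.message);
}
```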
Co-Authored-By: Paperclip <noreply@paperclip.ing>