fix: reattach to orphaned K8s Jobs across Paperclip restarts (FAR-124) #8

Merged
farhoodliquor-paperclip[bot] merged 1 commit from fix/far-124-reattach-orphan-k8s-jobs into master 2026-04-23 02:33:05 +00:00
farhoodliquor-paperclip[bot] commented 2026-04-22 22:00:05 +00:00 (Migrated from github.com)

Summary

  • Add paperclip.io/task-id and paperclip.io/session-id labels to K8s Jobs (via new sanitizeLabelValue helper) so a later execute() can identify an orphan as the continuation of the same logical unit of work.
  • In execute()'s concurrency guard, when reattachOrphanedJobs is on (the default) and an orphan matches the current agent, task, and session and is not in a terminal state, pick it as the reattach target: stream its logs and wait for completion instead of deleting it and starting a new pod.
  • New reattachOrphanedJobs config toggle (default true).
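
As a rough illustration of the labelling step, the sketch below shows one way a helper like sanitizeLabelValue could coerce arbitrary IDs into valid Kubernetes label values (at most 63 characters, alphanumeric at both ends, with only `-`, `_`, and `.` allowed in between). The function body and the buildJobLabels helper are assumptions made for illustration; only the helper name sanitizeLabelValue and the two label keys come from this PR, and the actual implementation may differ.

```typescript
// Hypothetical sketch, not the PR's actual code.
// Kubernetes label values: <= 63 chars, empty or alphanumeric at both ends,
// with only '-', '_', '.' and alphanumerics in between.
const MAX_LABEL_VALUE_LENGTH = 63;

function sanitizeLabelValue(raw: string): string {
  return raw
    .replace(/[^A-Za-z0-9._-]/g, "-") // replace disallowed characters
    .slice(0, MAX_LABEL_VALUE_LENGTH) // enforce the length limit
    .replace(/^[^A-Za-z0-9]+/, "") // must start with an alphanumeric
    .replace(/[^A-Za-z0-9]+$/, ""); // must end with an alphanumeric
}

// Hypothetical helper: builds the two labels named in the summary.
// Assumption: a label is omitted when its ID is missing or sanitizes to "".
function buildJobLabels(ids: { taskId?: string; sessionId?: string }): Record<string, string> {
  const labels: Record<string, string> = {};
  const task = ids.taskId ? sanitizeLabelValue(ids.taskId) : "";
  const session = ids.sessionId ? sanitizeLabelValue(ids.sessionId) : "";
  if (task) labels["paperclip.io/task-id"] = task;
  if (session) labels["paperclip.io/session-id"] = session;
  return labels;
}

// Usage: the session ID contains characters that are not valid in a label value.
const exampleLabels = buildJobLabels({ taskId: "FAR-124", sessionId: "session 7/a" });
console.log(exampleLabels);
```

Truncating before trimming the ends keeps the result valid even when the 63rd character happens to be a separator.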

Why

When the Paperclip pod restarts mid-run, the in-process setInterval keepalive dies, updatedAt goes stale, and the server's orphan reaper fails the run with the (misleading) "child pid 1 is no longer running" message. Paperclip then dispatches a continuation run, whose execute() finds the previous run's K8s Job still happily running and deletes it as an "orphan" — throwing away work and producing the transcript/run cascade reported on FAR-124.

Reading the server source confirmed the "child pid 1 is no longer running" message is a hardcoded template for non-local adapters (it's not an actual PID-liveness check; see heartbeat.ts:2378). The real trigger is the 5-minute updatedAt staleness. A follow-up upstream PR will fix the message and add a non-local retry path; this PR fixes the adapter-side half by preserving in-flight work across Paperclip restarts.
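
To make the failure mode concrete, here is a minimal sketch of a staleness check of the kind described above. It is not the server's code from heartbeat.ts; the RunRecord shape, the isStale name, and the strict "greater than" comparison are assumptions. Only the 5-minute window and the updatedAt field come from the description.

```typescript
// Hypothetical sketch of an updatedAt staleness check; not the real reaper.
const STALE_AFTER_MS = 5 * 60 * 1000; // 5-minute window from the description

interface RunRecord {
  id: string;
  updatedAt: Date; // refreshed by the adapter's in-process keepalive
}

// A run counts as stale once its last keepalive is older than the window.
function isStale(run: RunRecord, now: Date): boolean {
  return now.getTime() - run.updatedAt.getTime() > STALE_AFTER_MS;
}

// Usage: the keepalive last fired at t = 0, then the Paperclip pod restarted.
const run: RunRecord = { id: "run-1", updatedAt: new Date(0) };
console.log(isStale(run, new Date(4 * 60 * 1000))); // still inside the window
console.log(isStale(run, new Date(6 * 60 * 1000))); // past the window
```

Nothing in a check like this looks at the K8s Job itself, which is why a Job can keep running while its run is marked failed.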

Test plan

  • npm run typecheck — clean
  • npm test — 262 tests pass (was 241; added 21)
  • New unit tests:
    • sanitizeLabelValue sanitization & edge cases
    • isReattachableOrphan match/mismatch on every label dimension + terminal state
    • paperclip.io/task-id / paperclip.io/session-id label presence/absence in manifests
    • reattachOrphanedJobs config default
  • Post-deploy (not yet verified): confirm on a live cluster that after a Paperclip pod restart, a new run attaches to the prior K8s Job instead of deleting it (log line "Reattaching to in-flight K8s Job").
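
The match/mismatch cases in the test plan can be pictured with a predicate along the following lines. This is a sketch under stated assumptions, not the PR's isReattachableOrphan: the JobLike and RunIdentity shapes are invented for the example, the agent label key paperclip.io/agent-id is hypothetical (the description names only the task and session keys), and the run's IDs are assumed to be already sanitized. Treating a Job as terminal when it has a Complete or Failed condition with status "True" follows the standard Kubernetes Job status conventions.

```typescript
// Hypothetical sketch; shapes and the agent label key are assumptions.
interface JobCondition { type: string; status: string }
interface JobLike {
  metadata: { labels?: Record<string, string> };
  status?: { conditions?: JobCondition[] };
}
interface RunIdentity { agentId: string; taskId: string; sessionId: string }

const AGENT_LABEL = "paperclip.io/agent-id"; // hypothetical key, not from the PR
const TASK_LABEL = "paperclip.io/task-id";
const SESSION_LABEL = "paperclip.io/session-id";

// A Job is terminal once it reports a Complete or Failed condition.
function isTerminal(job: JobLike): boolean {
  return (job.status?.conditions ?? []).some(
    (c) => (c.type === "Complete" || c.type === "Failed") && c.status === "True",
  );
}

// Reattach only when every label dimension matches and the Job is still live.
function isReattachableOrphan(job: JobLike, run: RunIdentity): boolean {
  const labels = job.metadata.labels ?? {};
  return (
    labels[AGENT_LABEL] === run.agentId &&
    labels[TASK_LABEL] === run.taskId &&
    labels[SESSION_LABEL] === run.sessionId &&
    !isTerminal(job)
  );
}

// Usage: a running Job left behind by the previous run of the same session.
const current: RunIdentity = { agentId: "a1", taskId: "t1", sessionId: "s1" };
const orphan: JobLike = {
  metadata: { labels: { [AGENT_LABEL]: "a1", [TASK_LABEL]: "t1", [SESSION_LABEL]: "s1" } },
  status: { conditions: [] },
};
console.log(isReattachableOrphan(orphan, current));
```

A Job with no labels at all never matches, so Jobs created before the labels existed would fall through to whatever the guard does with non-matching orphans.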

Fixes FAR-124.
