fix: reattach to orphaned K8s Jobs across Paperclip restarts (FAR-124) #8

Merged
farhoodliquor-paperclip[bot] merged 1 commits from fix/far-124-reattach-orphan-k8s-jobs into master 2026-04-23 02:33:05 +00:00

1 Commits

Author SHA1 Message Date
Test User c8968598e4 fix: reattach to orphaned K8s Jobs across Paperclip restarts (FAR-124)
When the Paperclip pod restarts mid-run, the in-process setInterval
keepalive dies, `updatedAt` goes stale, and the server's orphan reaper
fails the run with the (misleading) "child pid 1 is no longer running"
message.  Paperclip then dispatches a continuation run, whose execute()
finds the previous run's K8s Job still happily running and deletes it
as an "orphan" — throwing away work and producing the transcript/run
cascade reported on FAR-124.

Changes:

- job-manifest: add `paperclip.io/task-id` and `paperclip.io/session-id`
  labels (sanitized via new `sanitizeLabelValue` helper) so a later
  execute() can identify an orphan as the continuation of the same
  logical unit of work.
- execute: in the concurrency guard, when `reattachOrphanedJobs` is on
  (default) and an orphan matches agent + task + session + is not
  terminal, pick it as the reattach target; delete only the other
  orphans.  Branch the build/create/waitForPod block so the reattach
  path skips manifest building, Secret creation, Job creation, and
  scheduling wait — it jumps straight to streaming logs and waiting
  for the existing pod's completion.
- config-schema: expose `reattachOrphanedJobs` toggle (default true).
- Tests: `sanitizeLabelValue`, `isReattachableOrphan`, new label
  presence/absence, config default.

No server-side changes; the misleading reaper message and lack of a
non-local retry path will be addressed in a follow-up upstream PR.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-22 21:59:25 +00:00