When the Paperclip pod restarts mid-run, the in-process setInterval
keepalive dies, `updatedAt` goes stale, and the server's orphan reaper
fails the run with the (misleading) "child pid 1 is no longer running"
message. Paperclip then dispatches a continuation run, whose execute()
finds the previous run's K8s Job still happily running and deletes it
as an "orphan" — throwing away work and producing the transcript/run
cascade reported on FAR-124.

Changes:
- job-manifest: add `paperclip.io/task-id` and `paperclip.io/session-id`
labels (sanitized via new `sanitizeLabelValue` helper) so a later
execute() can identify an orphan as the continuation of the same
logical unit of work.
- execute: in the concurrency guard, when `reattachOrphanedJobs` is on
(default) and an orphan matches the current agent, task, and session
and is not terminal, pick it as the reattach target and delete only
the other orphans. Branch the build/create/waitForPod block so the
reattach path skips manifest building, Secret creation, Job creation,
and the scheduling wait; it jumps straight to streaming logs and
waiting for the existing pod's completion.
- config-schema: expose `reattachOrphanedJobs` toggle (default true).
- Tests: `sanitizeLabelValue`, `isReattachableOrphan`, new label
presence/absence, config default.
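
For reviewers, a rough sketch of how the two new helpers could fit together. This is illustrative only and may differ from the code in this change: the `OrphanJob` and `RunIdentity` shapes, the `isTerminal` helper, the `paperclip.io/agent-id` label key, and the exact sanitizing rules are assumptions. Only the `paperclip.io/task-id` and `paperclip.io/session-id` keys and the two helper names come from the change list above.

```typescript
// Hypothetical shapes; the adapter's real types are not shown in this message.
interface OrphanJob {
  labels: Record<string, string>;
  conditions?: { type: string; status: string }[];
}

interface RunIdentity {
  agentId: string;
  taskId: string;
  sessionId: string;
}

// Kubernetes label values: at most 63 characters, only [A-Za-z0-9._-],
// and they must start and end with an alphanumeric character (or be empty).
function sanitizeLabelValue(raw: string): string {
  return raw
    .replace(/[^A-Za-z0-9._-]/g, "-") // replace disallowed characters
    .slice(0, 63)                     // enforce the length limit
    .replace(/^[^A-Za-z0-9]+/, "")    // trim non-alphanumeric prefix
    .replace(/[^A-Za-z0-9]+$/, "");   // trim non-alphanumeric suffix
}

// A Job is terminal once it carries a Complete or Failed condition set to True.
function isTerminal(job: OrphanJob): boolean {
  return (job.conditions ?? []).some(
    (c) => (c.type === "Complete" || c.type === "Failed") && c.status === "True",
  );
}

// An orphan is reattachable when it is still live and its labels identify
// the same agent, task, and session as the run that is about to execute.
function isReattachableOrphan(job: OrphanJob, run: RunIdentity): boolean {
  const task = sanitizeLabelValue(run.taskId);
  const session = sanitizeLabelValue(run.sessionId);
  // Without a usable task or session id there is nothing to match on.
  if (task === "" || session === "") return false;
  return (
    !isTerminal(job) &&
    // "paperclip.io/agent-id" is an assumed key, not confirmed by this change.
    job.labels["paperclip.io/agent-id"] === sanitizeLabelValue(run.agentId) &&
    job.labels["paperclip.io/task-id"] === task &&
    job.labels["paperclip.io/session-id"] === session
  );
}
```

In this sketch the run's ids are sanitized before comparison, since the labels written by job-manifest hold sanitized values rather than the raw ids.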

No server-side changes; the misleading reaper message and lack of a
non-local retry path will be addressed in a follow-up upstream PR.

Co-Authored-By: Paperclip <noreply@paperclip.ing>