fix: reattach to orphaned K8s Jobs across Paperclip restarts (FAR-124) #8
Summary
- Add `paperclip.io/task-id` and `paperclip.io/session-id` labels to K8s Jobs (via a new `sanitizeLabelValue` helper) so a later `execute()` can identify an orphan as the continuation of the same logical unit of work.
- In `execute()`'s concurrency guard, when `reattachOrphanedJobs` is on (the default) and an orphan matches on agent, task, and session and is not terminal, pick it as the reattach target and stream its logs and wait for completion instead of deleting it and starting a new pod.
- Add a `reattachOrphanedJobs` config toggle (default `true`).

Why
When the Paperclip pod restarts mid-run, the in-process `setInterval` keepalive dies, `updatedAt` goes stale, and the server's orphan reaper fails the run with the (misleading) "child pid 1 is no longer running" message. Paperclip then dispatches a continuation run, whose `execute()` finds the previous run's K8s Job still happily running and deletes it as an "orphan", throwing away work and producing the transcript/run cascade reported on FAR-124.

Reading the server source confirmed that the "child pid 1 no longer running" message is a hardcoded template for non-local adapters (it is not an actual PID-liveness check; see `heartbeat.ts:2378`). The real trigger is the 5-minute `updatedAt` staleness. A follow-up upstream PR will fix the message and add a non-local retry path; this PR fixes the adapter-side half by preserving in-flight work across Paperclip restarts.

Test plan
- `npm run typecheck`: clean
- `npm test`: 262 tests pass (was 241; added 21)
  - `sanitizeLabelValue` sanitization & edge cases
  - `isReattachableOrphan` match/mismatch on every label dimension + terminal state
  - `paperclip.io/task-id` / `paperclip.io/session-id` label presence/absence in manifests
  - `reattachOrphanedJobs` config default

Fixes FAR-124.
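
For reviewers who want a concrete picture of the matching rule described in the Summary, here is a minimal sketch of how label sanitization and the orphan check could fit together. It is not the code in this PR: the `JobLike` and `RunIdentity` shapes, the `isTerminal` helper, and the `paperclip.io/agent-id` label key are assumptions made for illustration, and the real `sanitizeLabelValue` and `isReattachableOrphan` may use different signatures and rules. The only external fact relied on is the Kubernetes label-value syntax (at most 63 characters, alphanumeric at both ends, with `-`, `_`, and `.` allowed in between) and the `Complete`/`Failed` Job condition types.

```typescript
// Illustrative sketch only; shapes and the agent label key are hypothetical.
const MAX_LABEL_VALUE_LENGTH = 63;

// Coerce an arbitrary string into a valid Kubernetes label value.
function sanitizeLabelValue(raw: string): string {
  // Replace every character outside the allowed set with "_".
  const replaced = raw.replace(/[^A-Za-z0-9._-]/g, "_");
  // Enforce the 63-character limit before trimming the ends.
  const truncated = replaced.slice(0, MAX_LABEL_VALUE_LENGTH);
  // A label value must start and end with an alphanumeric character.
  return truncated.replace(/^[^A-Za-z0-9]+/, "").replace(/[^A-Za-z0-9]+$/, "");
}

// Simplified view of a Job: just its labels and status conditions.
interface JobLike {
  labels: Record<string, string>;
  conditions: { type: string; status: string }[];
}

// Identity of the run that is about to execute (hypothetical shape).
interface RunIdentity {
  agentId: string;
  taskId: string;
  sessionId: string;
}

// A Job is terminal once it reports a Complete or Failed condition as "True".
function isTerminal(job: JobLike): boolean {
  return job.conditions.some(
    (c) => (c.type === "Complete" || c.type === "Failed") && c.status === "True",
  );
}

// Reattach only when agent, task, and session all match and the Job is still live.
// A missing label never matches, so older unlabeled Jobs are not reattached.
function isReattachableOrphan(job: JobLike, run: RunIdentity): boolean {
  return (
    job.labels["paperclip.io/agent-id"] === sanitizeLabelValue(run.agentId) &&
    job.labels["paperclip.io/task-id"] === sanitizeLabelValue(run.taskId) &&
    job.labels["paperclip.io/session-id"] === sanitizeLabelValue(run.sessionId) &&
    !isTerminal(job)
  );
}
```

Comparing against the sanitized form of each identifier matters because the label written at Job creation time has already been sanitized; comparing raw values would miss any identifier that contains characters Kubernetes does not allow.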