fix: do not delete active Jobs on SIGTERM — leave for orphan reattach (FAR-107)

Root cause of Nancy's k8s_job_deleted_externally false positive: the
paperclip server itself receives SIGTERM during rolling deploys,
evictions, scale-down, etc.  The previous SIGTERM handler iterated
activeJobs and deleted every Job before exiting, which surfaced in the
in-flight heartbeat as "K8s Job was deleted externally" — even though
nothing external touched it.

With reattachOrphanedJobs=true (default), this is exactly the wrong
behaviour: leaving the Jobs alive lets the next paperclip process
discover them via the orphan-classification path and reattach their
log streams.  With reattachOrphanedJobs=false the operator opted into
manual cleanup, so we still must not auto-delete.

The Job's ownerReference (FAR-15) keeps the prompt Secret tied to the
Job, so both survive together and TTL handles cleanup on natural
completion.  Test rewritten to assert the new contract: SIGTERM must
not touch K8s Jobs.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
This commit is contained in:
2026-04-26 21:19:02 +00:00
committed by Hugh Commit [agent]
parent d184a1732b
commit 6923597b31
2 changed files with 22 additions and 30 deletions
+14 -24
View File
@@ -58,30 +58,20 @@ function ensureSigtermHandler(): void {
if (sigtermHandlerRegistered) return;
sigtermHandlerRegistered = true;
process.once("SIGTERM", () => {
const jobs = [...activeJobs];
void Promise.allSettled(
jobs.map(async (ref) => {
try {
const batchApi = getBatchApi(ref.kubeconfigPath);
await batchApi.deleteNamespacedJob({
name: ref.jobName,
namespace: ref.namespace,
body: { propagationPolicy: "Background" },
});
} catch { /* best-effort */ }
if (ref.promptSecretName && ref.promptSecretNamespace) {
try {
const coreApi = getCoreApi(ref.kubeconfigPath);
await coreApi.deleteNamespacedSecret({
name: ref.promptSecretName,
namespace: ref.promptSecretNamespace,
});
} catch { /* best-effort */ }
}
}),
).then(() => {
process.kill(process.pid, "SIGTERM");
});
// Do NOT delete active K8s Jobs on SIGTERM (FAR-107). Paperclip itself
// receives SIGTERM during rolling deploys, evictions, scale-down, etc.
// Deleting the Jobs we own there causes the in-flight heartbeat to surface
// a false-positive `k8s_job_deleted_externally` error and tears down work
// the user expected to keep running.
//
// The correct behaviour with `reattachOrphanedJobs=true` (default) is to
// leave the Jobs alive: the next paperclip process discovers them via the
// orphan-classification path and reattaches their log streams. When
// `reattachOrphanedJobs=false` the operator explicitly opted into manual
// cleanup and should not have us auto-deleting either. The owning Job's
// ownerReference (FAR-15) keeps the prompt Secret tied to the Job, so
// both survive together and TTL cleans them up after natural completion.
process.kill(process.pid, "SIGTERM");
});
}