fix: do not delete active Jobs on SIGTERM — leave for orphan reattach (FAR-107)
Root cause of Nancy's k8s_job_deleted_externally false positive: the paperclip server itself receives SIGTERM during rolling deploys, evictions, scale-down, etc. The previous SIGTERM handler iterated activeJobs and deleted every Job before exiting, which surfaced in the in-flight heartbeat as "K8s Job was deleted externally", even though nothing external touched it.

With reattachOrphanedJobs=true (default), this is exactly the wrong behaviour: leaving the Jobs alive lets the next paperclip process discover them via the orphan-classification path and reattach their log streams. With reattachOrphanedJobs=false the operator opted into manual cleanup, so we still must not auto-delete.

The Job's ownerReference (FAR-15) keeps the prompt Secret tied to the Job, so both survive together and TTL handles cleanup on natural completion.

Test rewritten to assert the new contract: SIGTERM must not touch K8s Jobs.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
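For readers unfamiliar with the ownerReference and TTL mechanism the message relies on, here is a rough sketch of the manifest fields involved. How FAR-15 actually wires this up is not shown in this commit: the helper `buildPromptSecretManifest`, the `JobIdentity` type, the example names, and the TTL value are all hypothetical. Only the field names (`metadata.ownerReferences`, `spec.ttlSecondsAfterFinished`) are standard Kubernetes API fields.

```typescript
// Plain object shapes mirroring the Kubernetes manifest fields used below.
interface OwnerReference {
  apiVersion: string;
  kind: string;
  name: string;
  uid: string;
}

// Hypothetical: the identity of an already-created Job.
interface JobIdentity {
  name: string;
  namespace: string;
  uid: string;
}

// Hypothetical helper: build a prompt Secret manifest that is owned by the Job.
// When the Job is deleted, the garbage collector removes the Secret as well.
function buildPromptSecretManifest(job: JobIdentity, secretName: string, prompt: string) {
  const owner: OwnerReference = {
    apiVersion: "batch/v1",
    kind: "Job",
    name: job.name,
    uid: job.uid, // must be the UID the API server assigned to the Job
  };
  return {
    apiVersion: "v1",
    kind: "Secret",
    metadata: {
      name: secretName,
      namespace: job.namespace, // a namespaced dependent lives in its owner's namespace
      ownerReferences: [owner],
    },
    stringData: { prompt },
  };
}

// Job spec fragment: once the Job finishes, the TTL controller deletes it after
// this many seconds, and the owned Secret goes with it. The value is assumed.
const jobSpecFragment = { ttlSecondsAfterFinished: 3600 };

const secret = buildPromptSecretManifest(
  { name: "run-1", namespace: "agents", uid: "00000000-0000-0000-0000-000000000000" },
  "run-1-prompt",
  "example prompt",
);
```

Under this arrangement nothing in the SIGTERM path has to delete the Secret: its lifetime follows the Job's.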
14 insertions(+), 24 deletions(-)
@@ -58,30 +58,20 @@ function ensureSigtermHandler(): void {
   if (sigtermHandlerRegistered) return;
   sigtermHandlerRegistered = true;
   process.once("SIGTERM", () => {
-    const jobs = [...activeJobs];
-    void Promise.allSettled(
-      jobs.map(async (ref) => {
-        try {
-          const batchApi = getBatchApi(ref.kubeconfigPath);
-          await batchApi.deleteNamespacedJob({
-            name: ref.jobName,
-            namespace: ref.namespace,
-            body: { propagationPolicy: "Background" },
-          });
-        } catch { /* best-effort */ }
-        if (ref.promptSecretName && ref.promptSecretNamespace) {
-          try {
-            const coreApi = getCoreApi(ref.kubeconfigPath);
-            await coreApi.deleteNamespacedSecret({
-              name: ref.promptSecretName,
-              namespace: ref.promptSecretNamespace,
-            });
-          } catch { /* best-effort */ }
-        }
-      }),
-    ).then(() => {
-      process.kill(process.pid, "SIGTERM");
-    });
+    // Do NOT delete active K8s Jobs on SIGTERM (FAR-107). Paperclip itself
+    // receives SIGTERM during rolling deploys, evictions, scale-down, etc.
+    // Deleting the Jobs we own there causes the in-flight heartbeat to surface
+    // a false-positive `k8s_job_deleted_externally` error and tears down work
+    // the user expected to keep running.
+    //
+    // The correct behaviour with `reattachOrphanedJobs=true` (default) is to
+    // leave the Jobs alive: the next paperclip process discovers them via the
+    // orphan-classification path and reattaches their log streams. When
+    // `reattachOrphanedJobs=false` the operator explicitly opted into manual
+    // cleanup and should not have us auto-deleting either. The owning Job's
+    // ownerReference (FAR-15) keeps the prompt Secret tied to the Job, so
+    // both survive together and TTL cleans them up after natural completion.
+    process.kill(process.pid, "SIGTERM");
   });
 }
 
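The rewritten test itself is not part of this hunk. As a rough illustration of the contract it is said to assert (SIGTERM must not touch K8s Jobs), a check could look something like the sketch below. Everything here is a hypothetical stand-in: `FakeSignals`, `FakeBatchApi`, `registerSigtermHandler`, and the `reraise` callback are invented for the example and are not the adapter's real API. The real handler registers on `process` and re-raises with `process.kill`.

```typescript
// Hypothetical minimal shapes; the adapter's real types are not shown in the diff.
type JobRef = { jobName: string; namespace: string };

interface SignalSource {
  once(signal: "SIGTERM", handler: () => void): unknown;
}

// Test double for `process`: records once-handlers and fires them on demand.
class FakeSignals implements SignalSource {
  private handlers: Array<() => void> = [];
  once(_signal: "SIGTERM", handler: () => void): this {
    this.handlers.push(handler);
    return this;
  }
  // Deliver SIGTERM: each handler fires and is then dropped, as with `once`.
  deliver(): void {
    const pending = this.handlers;
    this.handlers = [];
    for (const handler of pending) handler();
  }
}

// Test double for the batch API: records every delete request it receives.
class FakeBatchApi {
  readonly deleted: string[] = [];
  async deleteNamespacedJob(args: { name: string; namespace: string }): Promise<void> {
    this.deleted.push(`${args.namespace}/${args.name}`);
  }
}

// Sketch of the post-FAR-107 behaviour: on SIGTERM, re-raise and nothing else.
// The Jobs and the API client are accepted but deliberately left untouched.
function registerSigtermHandler(
  signals: SignalSource,
  _activeJobs: ReadonlySet<JobRef>,
  _batchApi: FakeBatchApi,
  reraise: () => void,
): void {
  signals.once("SIGTERM", () => {
    // No deleteNamespacedJob calls here: the Jobs stay alive for orphan reattach.
    reraise();
  });
}

const signals = new FakeSignals();
const batchApi = new FakeBatchApi();
const activeJobs = new Set<JobRef>([{ jobName: "run-1", namespace: "agents" }]);
let reraised = 0;

registerSigtermHandler(signals, activeJobs, batchApi, () => { reraised += 1; });
signals.deliver();
signals.deliver(); // second delivery is a no-op: the handler was registered once
```

A test built on these doubles would assert that `batchApi.deleted` stays empty and that `reraised` equals 1. The same assertions would fail against the removed handler, which issued a delete for every entry in `activeJobs` before re-raising.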