fix: do not delete active Jobs on SIGTERM — leave for orphan reattach (FAR-107)

Root cause of Nancy's k8s_job_deleted_externally false positive: the
paperclip server itself receives SIGTERM during rolling deploys,
evictions, scale-down, etc. The previous SIGTERM handler iterated
activeJobs and deleted every Job before exiting, which surfaced in the
in-flight heartbeat as "K8s Job was deleted externally" even though
nothing external had touched the Job.

With reattachOrphanedJobs=true (default), deleting on SIGTERM is exactly
the wrong behaviour: leaving the Jobs alive lets the next paperclip
process discover them via the orphan-classification path and reattach
their log streams. With reattachOrphanedJobs=false the operator has
opted into manual cleanup, so we still must not auto-delete.
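
A minimal sketch of the new handler contract, not the adapter's actual
code. installSigtermHandler, SignalHost, and ShutdownDeps are
hypothetical names introduced for illustration; only activeJobs and the
"re-raise without deleting" behaviour come from this commit. The sketch
assumes no other SIGTERM listener is registered.

```typescript
// Hypothetical shape; Node's global `process` satisfies it structurally.
interface SignalHost {
  pid: number;
  once(signal: "SIGTERM", handler: () => void): unknown;
  kill(pid: number, signal: "SIGTERM"): unknown;
}

// Hypothetical dependency bundle; the real adapter's wiring is not shown
// in this commit.
interface ShutdownDeps {
  host: SignalHost;
  activeJobs: Set<string>; // names of Jobs this process is streaming
  log: (msg: string) => void;
}

function installSigtermHandler(deps: ShutdownDeps): void {
  deps.host.once("SIGTERM", () => {
    // Deliberately no Job deletion here. The Jobs stay alive so the next
    // process can reattach them (reattachOrphanedJobs=true) or the
    // operator can clean them up manually (reattachOrphanedJobs=false).
    deps.log(`SIGTERM: leaving ${deps.activeJobs.size} active Job(s) in place`);

    // Re-raise the signal. A once() listener is removed before it runs,
    // so with no other SIGTERM listeners the default action applies and
    // the process terminates.
    deps.host.kill(deps.host.pid, "SIGTERM");
  });
}
```

In production this would be called with `host: process`; passing the
host in keeps the handler testable without killing the test runner.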

The Job's ownerReference (FAR-15) keeps the prompt Secret tied to the
Job, so both survive together and TTL handles cleanup on natural
completion. The test is rewritten to assert the new contract: SIGTERM
must not touch K8s Jobs.
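
For illustration only, the ownership link could look roughly like the
manifest built below. The FAR-15 implementation is not part of this
commit, so buildPromptSecret, JobIdentity, the "-prompt" name suffix,
and the "prompt" key are all assumptions; the ownerReferences fields
themselves are standard Kubernetes metadata.

```typescript
// Standard Kubernetes owner reference fields (subset).
interface OwnerReference {
  apiVersion: string;
  kind: string;
  name: string;
  uid: string;
}

// Hypothetical: the identity of an already-created Job.
interface JobIdentity {
  name: string;
  namespace: string;
  uid: string;
}

// Hypothetical helper: builds a Secret manifest owned by the Job. When
// the Job is deleted (for example by its TTL after completion), the
// Kubernetes garbage collector removes the Secret as well.
function buildPromptSecret(job: JobIdentity, prompt: string) {
  const owner: OwnerReference = {
    apiVersion: "batch/v1",
    kind: "Job",
    name: job.name,
    uid: job.uid, // must be the live Job's UID, not just its name
  };
  return {
    apiVersion: "v1",
    kind: "Secret",
    metadata: {
      name: `${job.name}-prompt`,
      namespace: job.namespace, // owner and dependent share a namespace
      ownerReferences: [owner],
    },
    stringData: { prompt },
  };
}
```

Because the Secret only goes away when its owning Job does, skipping
Job deletion on SIGTERM leaves the Secret in place as well.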

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-26 21:19:02 +00:00
committed by Hugh Commit [agent]
parent d184a1732b
commit 6923597b31
2 changed files with 22 additions and 30 deletions
+8 -6
@@ -1771,7 +1771,7 @@ describe("execute: SIGTERM handler best-effort cleanup", () => {
     vi.useRealTimers();
   });
-  it("deletes the active Job when SIGTERM fires during execution", async () => {
+  it("does NOT delete active Jobs on SIGTERM — leaves them for orphan reattach (FAR-107)", async () => {
     // Mock process.kill to prevent the test process from actually being killed.
     const killSpy = vi.spyOn(process, "kill").mockImplementation(() => true);
@@ -1782,17 +1782,19 @@ describe("execute: SIGTERM handler best-effort cleanup", () => {
     // Flush microtasks through the async setup chain: getSelfPodInfo, listJobs,
     // readSkillEntries, prepareBundle, createJob, onLog, activeJobs.add(), and
     // ensureSigtermHandler() all complete before the try block enters streaming.
     // 30 rounds is more than enough for the ~7 sequential await points.
     for (let i = 0; i < 30; i++) await Promise.resolve();
-    // Emit SIGTERM — the process.once handler fires synchronously and kicks off
-    // async cleanup (deleteNamespacedJob). The mock resolves immediately.
+    // Reset deleteJob spy after setup so we can detect any SIGTERM-driven calls.
+    mockBatchDeleteJob.mockClear();
+    // Emit SIGTERM — the handler must re-raise to the default handler without
+    // touching the K8s Job. Deleting the Job here would surface as
+    // k8s_job_deleted_externally in the in-flight run (FAR-107).
     process.emit("SIGTERM");
-    // Flush microtasks for deleteJob to resolve and the .then(process.kill) to run.
     for (let i = 0; i < 10; i++) await Promise.resolve();
-    expect(mockBatchDeleteJob).toHaveBeenCalled();
+    expect(mockBatchDeleteJob).not.toHaveBeenCalled();
     expect(killSpy).toHaveBeenCalledWith(process.pid, "SIGTERM");
     killSpy.mockRestore();