fix: enrich k8s_job_deleted_externally error with forensics + verify Job presence on grace fire (FAR-107)

The error previously fired with no diagnostic context, making it impossible
to distinguish (a) self-delete by our SIGTERM/cancel path, (b) TTL after a
missed Complete condition, or (c) actual external deletion without cluster
shell access.  Two changes:

1. Grace-period verification: when the log stream exits and the 30s grace
   timer fires, do a one-shot readNamespacedJob before declaring the Job
   gone.  If it's still there, settle as gracePeriodFired (not jobGone) so
   we don't mis-classify K8s condition propagation lag as deletion.

2. Forensic capture: track which of the three detection paths
   (completion-poll-404, grace-period-verify-404, recheck-poll-404)
   first observed the 404, the last successful Job conditions read, the
   poll count, elapsed time since pod-running, and stdout size.  Append
   all of it to the errorMessage so the next occurrence is self-diagnosing.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
This commit is contained in:
2026-04-26 21:05:04 +00:00
committed by Hugh Commit [agent]
parent d9928030d6
commit be84428226
2 changed files with 87 additions and 7 deletions
+2 -1
View File
@@ -1019,7 +1019,8 @@ describe("execute: happy path", () => {
const result = await executePromise;
expect(result.errorCode).toBe("k8s_job_deleted_externally");
expect(result.errorMessage).toBe("K8s Job was deleted externally before Claude could complete");
expect(result.errorMessage).toMatch(/^K8s Job was deleted externally before Claude could complete \[/);
expect(result.errorMessage).toContain("detected_via=");
expect(result.exitCode).toBeNull();
});