fix: enrich k8s_job_deleted_externally error with forensics + verify Job presence on grace fire (FAR-107)

The error previously fired with no diagnostic context, so without cluster shell access it was impossible to distinguish (a) self-delete by our SIGTERM/cancel path, (b) TTL deletion after a missed Complete condition, or (c) actual external deletion.

Two changes:

1. Grace-period verification: when the log stream exits and the 30s grace timer fires, do a one-shot readNamespacedJob before declaring the Job gone. If the Job is still there, settle as gracePeriodFired (not jobGone) so we don't misclassify K8s condition propagation lag as deletion.

2. Forensic capture: track which of the three detection paths (completion-poll-404, grace-period-verify-404, recheck-poll-404) first observed the 404, the last successfully read Job conditions, the poll count, the elapsed time since pod-running, and the stdout size. Append all of it to the errorMessage so the next occurrence is self-diagnosing.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
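The grace-period verification described above could look roughly like the sketch below. It is not the code from this commit: `settleOnGraceFire`, `isNotFound`, and the injected `readJob` callback are hypothetical names, and the assumption that a missing Job surfaces as an error carrying `statusCode === 404` may not match the client version in use. Only the outcome names `gracePeriodFired` and `jobGone` come from the message.

```typescript
// Outcome names taken from the commit message; everything else here is hypothetical.
type GraceOutcome = "gracePeriodFired" | "jobGone";

// Assumed shape of a "not found" error raised by the API client.
interface HttpLikeError {
  statusCode?: number;
}

function isNotFound(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as HttpLikeError).statusCode === 404
  );
}

// readJob stands in for a one-shot readNamespacedJob call, injected so the
// sketch does not depend on a particular client signature.
async function settleOnGraceFire(
  readJob: () => Promise<unknown>,
): Promise<GraceOutcome> {
  try {
    await readJob();
    // The Job still exists: treat the silence as condition propagation lag.
    return "gracePeriodFired";
  } catch (err) {
    if (isNotFound(err)) {
      // Only a confirmed 404 counts as evidence that the Job is gone.
      return "jobGone";
    }
    // Assumption: any other failure (timeout, 5xx) does not prove deletion,
    // so this sketch falls back to the non-deletion outcome.
    return "gracePeriodFired";
  }
}
```

Injecting the read as a callback keeps the classification logic testable without a cluster; how non-404 errors should be handled is a design choice the commit message does not spell out.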
@@ -1019,7 +1019,8 @@ describe("execute: happy path", () => {
 
 const result = await executePromise;
 
 expect(result.errorCode).toBe("k8s_job_deleted_externally");
-expect(result.errorMessage).toBe("K8s Job was deleted externally before Claude could complete");
+expect(result.errorMessage).toMatch(/^K8s Job was deleted externally before Claude could complete \[/);
+expect(result.errorMessage).toContain("detected_via=");
 expect(result.exitCode).toBeNull();
 });
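For context on what the updated assertions check, here is a minimal sketch of how the forensic suffix might be assembled. The base message, the `detected_via=` key, and the three detection-path values come from this commit; the interface, the function name `buildDeletedExternallyMessage`, and every other key name are assumptions made for illustration and may differ from the real implementation.

```typescript
// The three detection paths named in the commit message.
type DetectionPath =
  | "completion-poll-404"
  | "grace-period-verify-404"
  | "recheck-poll-404";

// Hypothetical container for the captured forensics.
interface DeletionForensics {
  detectedVia: DetectionPath;
  lastConditions: string[]; // from the last successful Job read, e.g. "Complete=True"
  pollCount: number;
  elapsedSincePodRunningMs: number | null; // null if the pod never reached running
  stdoutBytes: number;
}

const BASE_MESSAGE =
  "K8s Job was deleted externally before Claude could complete";

function buildDeletedExternallyMessage(f: DeletionForensics): string {
  const conditions =
    f.lastConditions.length > 0 ? f.lastConditions.join(",") : "none";
  // Key names other than detected_via are invented for this sketch.
  const parts = [
    `detected_via=${f.detectedVia}`,
    `last_conditions=${conditions}`,
    `poll_count=${f.pollCount}`,
    `elapsed_since_running_ms=${f.elapsedSincePodRunningMs ?? "unknown"}`,
    `stdout_bytes=${f.stdoutBytes}`,
  ];
  // The base message stays a stable prefix, so prefix-based matching still works.
  return `${BASE_MESSAGE} [${parts.join(" ")}]`;
}
```

Keeping the original sentence as a fixed prefix and appending the context in brackets is what lets the test switch from an exact `toBe` match to a prefix `toMatch` plus a `toContain` check.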