fix: prevent process_lost when K8s Job completes (FAR-10) #9

Merged
cpfarhood merged 1 commits from fix/far-10-process-lost-after-job-complete into master 2026-04-23 16:07:33 +00:00
cpfarhood commented 2026-04-23 16:00:10 +00:00 (Migrated from github.com)

Summary

Four stacked bugs in src/server/execute.ts caused the claude_k8s adapter to hang after a K8s Job completed, letting the 5-minute reaper mark the heartbeat run failed/process_lost even when the Job succeeded.

Bug 1 — Log follow outlives the Job: streamPodLogsOnce had no mechanism to abort an in-flight logApi.log(..., follow: true) call when the job-completion branch fired. Added a 200ms polling loop that destroys the Writable once stopSignal.stopped is set, aborting the hung HTTP stream. The stopSignal is also now threaded through streamPodLogsstreamPodLogsOnce.

Bug 2 — waitForPod entered log-stream path on phase=Failed: The original code treated phase === "Failed" the same as Running, returning the pod name and continuing into log streaming against a dead pod. Fixed to throw a structured error via describePodTerminatedError (new exported helper) so callers get a real error with exit code and reason.

Bug 3 — Terminated container state not surfaced: The per-tick status detail loop omitted cs.state?.terminated, so operators saw only phase=Failed with no indication of why the claude container exited. Added the terminated case with exit code and reason.

Bug 4 — Keepalive stopped refreshing updatedAt immediately at Job terminal: Once keepaliveJobTerminal = true, onSpawn stopped firing, meaning any cleanup activity (job deletion, log parsing, K8s API calls) after the Job went terminal would trip the 5-minute reaper. Added a POST_TERMINAL_KEEPALIVE_MS (90s) window during which onSpawn continues refreshing at a lower cadence.

Test plan

  • describePodTerminatedError unit tests: phase=Failed with claude container (exit code + reason), fallback to message field, empty container list, non-claude containers, null exitCode
  • All 267 existing tests pass (npm test)
  • Manual evidence: the 15:38Z run hung ~2.5 min after Job completion due to bugs 1+4; these fixes eliminate both hang paths

🤖 Generated with Claude Code

## Summary Four stacked bugs in `src/server/execute.ts` caused the `claude_k8s` adapter to hang after a K8s Job completed, letting the 5-minute reaper mark the heartbeat run `failed/process_lost` even when the Job succeeded. **Bug 1 — Log follow outlives the Job:** `streamPodLogsOnce` had no mechanism to abort an in-flight `logApi.log(..., follow: true)` call when the job-completion branch fired. Added a 200ms polling loop that destroys the `Writable` once `stopSignal.stopped` is set, aborting the hung HTTP stream. The `stopSignal` is also now threaded through `streamPodLogs` → `streamPodLogsOnce`. **Bug 2 — `waitForPod` entered log-stream path on `phase=Failed`:** The original code treated `phase === "Failed"` the same as `Running`, returning the pod name and continuing into log streaming against a dead pod. Fixed to throw a structured error via `describePodTerminatedError` (new exported helper) so callers get a real error with exit code and reason. **Bug 3 — Terminated container state not surfaced:** The per-tick status detail loop omitted `cs.state?.terminated`, so operators saw only `phase=Failed` with no indication of why the claude container exited. Added the terminated case with exit code and reason. **Bug 4 — Keepalive stopped refreshing `updatedAt` immediately at Job terminal:** Once `keepaliveJobTerminal = true`, `onSpawn` stopped firing, meaning any cleanup activity (job deletion, log parsing, K8s API calls) after the Job went terminal would trip the 5-minute reaper. Added a `POST_TERMINAL_KEEPALIVE_MS` (90s) window during which `onSpawn` continues refreshing at a lower cadence. ## Test plan - [x] `describePodTerminatedError` unit tests: phase=Failed with claude container (exit code + reason), fallback to `message` field, empty container list, non-claude containers, null `exitCode` - [x] All 267 existing tests pass (`npm test`) - [x] Manual evidence: the 15:38Z run hung ~2.5 min after Job completion due to bugs 1+4; these fixes eliminate both hang paths 🤖 Generated with [Claude Code](https://claude.com/claude-code)
Sign in to join this conversation.