Test User 5f5ae92ce7 fix: skip keepalive updatedAt refresh once K8s Job is terminal
The previous fix (df856e6) made the keepalive timer call onSpawn every
~4 minutes to refresh the run's updatedAt in the DB, so the stale-run
reaper wouldn't kill live runs in multi-instance deployments.  That was
correct for live jobs, but it was unconditional: if execute() stalled
after the pod terminated (slow K8s API call, hung log stream drain, or
a Job whose Complete condition lags pod termination), the keepalive
kept the run marked "alive" indefinitely even though the pod was gone.

That manifests as the opposite of the original bug: the UI shows jobs
as running when they have actually finished.

Two changes:

1. Verify the Job is still alive before the keepalive refreshes
   updatedAt.  If the Job has reached a terminal Complete/Failed
   condition, has been deleted, or cannot be read from the API, stop
   refreshing.  If execute() then stays stuck past that point, the
   reaper will catch the run within the normal 5-minute staleness
   window instead of never catching it.

2. Clear the keepalive interval immediately once Promise.allSettled
   resolves, rather than only in the finally block.  Post-completion
   work (exit-code fetch, log fallback read, job cleanup) must not be
   able to emit another onSpawn refresh that keeps the run "alive".
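
A minimal sketch of the two changes, for illustration only. It is not
the code in this commit: the JobState type, readJobState, and the
shouldRefresh, keepaliveTick and runWithKeepalive helpers are
hypothetical names, and onSpawn is assumed to be the callback that
refreshes updatedAt.

```typescript
// Hypothetical model of what a Job status read can tell us.
// "unknown" stands in for a failed API read.
type JobState = "active" | "complete" | "failed" | "deleted" | "unknown";

interface KeepaliveDeps {
  // Assumed wrapper around a K8s Job status read.
  readJobState: () => Promise<JobState>;
  // Assumed callback that refreshes the run's updatedAt in the DB.
  onSpawn: () => Promise<void>;
}

// Change 1: only a Job that is still active may refresh updatedAt.
function shouldRefresh(state: JobState): boolean {
  return state === "active";
}

// One keepalive tick. Resolves to true if updatedAt was refreshed.
async function keepaliveTick(deps: KeepaliveDeps): Promise<boolean> {
  let state: JobState;
  try {
    state = await deps.readJobState();
  } catch {
    state = "unknown"; // API read failed: treat as not refreshable
  }
  if (!shouldRefresh(state)) {
    return false; // leave updatedAt alone so the reaper can act
  }
  await deps.onSpawn();
  return true;
}

// Change 2: stop the timer as soon as the main work settles, before
// any post-completion work (exit-code fetch, log fallback, cleanup).
async function runWithKeepalive(
  deps: KeepaliveDeps,
  work: Promise<unknown>[],
  postCompletion: () => Promise<void>,
  intervalMs: number,
): Promise<void> {
  const timer = setInterval(() => {
    // Swallow refresh errors so a failed tick never crashes the run.
    void keepaliveTick(deps).catch(() => {});
  }, intervalMs);
  try {
    await Promise.allSettled(work);
    clearInterval(timer); // no refresh can fire past this point
    await postCompletion();
  } finally {
    clearInterval(timer); // harmless if already cleared
  }
}
```

Treating a failed read the same as a terminal Job errs on the side of
letting the reaper act, matching the behaviour described in change 1.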

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-17 02:57:17 +00:00