fix: skip keepalive updatedAt refresh once K8s Job is terminal

The previous fix (df856e6) made the keepalive timer call onSpawn every ~4 minutes to refresh the run's updatedAt in the DB, so the stale-run reaper wouldn't kill live runs in multi-instance deployments. That was correct for live jobs, but it was unconditional — if execute() stalled after the pod terminated (slow K8s API call, hung log stream drain, or a Job whose Complete condition lags pod termination), the keepalive kept the run marked "alive" indefinitely even though the pod was gone. That manifests as the opposite of the original bug: the UI shows jobs as running when they have actually finished. Two changes: 1. Verify the Job is still alive before the keepalive refreshes updatedAt. If the Job has reached a terminal Complete/Failed condition (or has been deleted / the API read fails), stop refreshing. If execute() truly ends up stuck past that point, the reaper will catch the run within the normal 5-minute staleness window instead of never. 2. Clear the keepalive interval immediately once Promise.allSettled resolves, rather than only in the finally block. Post-completion work (exit-code fetch, log fallback read, job cleanup) must not be able to emit another onSpawn refresh that keeps the run "alive". Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-17 02:57:05 +00:00
parent 20b85b8391
commit 5f5ae92ce7
1 changed files with 53 additions and 8 deletions
@@ -456,18 +456,54 @@ export async function execute(ctx: AdapterExecutionContext): Promise<AdapterExec
    // onSpawn periodically to refresh it.  Without this, multi-instance
    // deployments can reap a live run from another server instance
    // after the 5-minute staleness window.
+    //
+    // BUT: the keepalive must NEVER refresh updatedAt if the underlying
+    // K8s Job is already terminal.  Otherwise, if execute() stalls after
+    // the pod finishes (e.g. a slow K8s API call, a hung log stream
+    // drain, or a Job whose Complete condition lags pod termination),
+    // we would keep the run marked "alive" indefinitely while the pod
+    // is actually gone — the exact "UI thinks jobs are running when
+    // they are not" bug.  We verify Job liveness every tick and stop
+    // refreshing as soon as the Job reaches a terminal state; if
+    // execute() is truly stuck, the reaper will then catch it within
+    // the normal 5-minute staleness window.
    let lastLogAt = Date.now();
    let keepaliveTick = 0;
+    let keepaliveJobTerminal = false;
    keepaliveTimer = setInterval(() => {
-      const silenceSec = Math.round((Date.now() - lastLogAt) / 1000);
-      void onLog("stdout", `[paperclip] keepalive — job ${jobName} running (${silenceSec}s since last output)\n`);
+      // Fire-and-forget the async work; setInterval callbacks must be
+      // synchronous or the timer will drift.
+      void (async () => {
+        if (keepaliveJobTerminal) return;

-      // Refresh updatedAt every ~4 minutes (16 ticks × 15s) to stay
-      // well within the 5-minute reaper staleness window.
-      keepaliveTick++;
-      if (ctx.onSpawn && keepaliveTick % 16 === 0) {
-        void ctx.onSpawn({ pid: -1, processGroupId: null, startedAt: new Date().toISOString() }).catch(() => {});
-      }
+        // Verify the Job is still alive before announcing or refreshing.
+        try {
+          const job = await batchApi.readNamespacedJob({ name: jobName, namespace });
+          const terminal = job.status?.conditions?.some(
+            (c) => (c.type === "Complete" || c.type === "Failed") && c.status === "True",
+          );
+          if (terminal) {
+            keepaliveJobTerminal = true;
+            return;
+          }
+        } catch {
+          // Job may have been deleted out from under us, or the API call
+          // transiently failed.  Either way, do not refresh updatedAt —
+          // either the Job really is gone, or the next tick will re-check.
+          keepaliveJobTerminal = true;
+          return;
+        }
+
+        const silenceSec = Math.round((Date.now() - lastLogAt) / 1000);
+        void onLog("stdout", `[paperclip] keepalive — job ${jobName} running (${silenceSec}s since last output)\n`);
+
+        // Refresh updatedAt every ~4 minutes (16 ticks × 15s) to stay
+        // well within the 5-minute reaper staleness window.
+        keepaliveTick++;
+        if (ctx.onSpawn && keepaliveTick % 16 === 0) {
+          void ctx.onSpawn({ pid: -1, processGroupId: null, startedAt: new Date().toISOString() }).catch(() => {});
+        }
+      })();
    }, KEEPALIVE_INTERVAL_MS);
    const wrappedOnLog: typeof onLog = async (stream, chunk) => {
      lastLogAt = Date.now();
@@ -486,6 +522,15 @@ export async function execute(ctx: AdapterExecutionContext): Promise<AdapterExec
      }),
    ]);

+    // Stop the keepalive immediately once the job has reached a terminal
+    // state — do not wait for the finally block.  Any K8s API call or
+    // cleanup that happens after this point should not keep the run
+    // marked "alive" in the DB via onSpawn refreshes.
+    if (keepaliveTimer) {
+      clearInterval(keepaliveTimer);
+      keepaliveTimer = null;
+    }
+
    if (logResult.status === "fulfilled") {
      stdout = logResult.value;
    }