fix: skip keepalive updatedAt refresh once K8s Job is terminal

The previous fix (df856e6) made the keepalive timer call onSpawn every
~4 minutes to refresh the run's updatedAt in the DB, so the stale-run
reaper wouldn't kill live runs in multi-instance deployments.  That was
correct for live jobs, but it was unconditional — if execute() stalled
after the pod terminated (slow K8s API call, hung log stream drain, or
a Job whose Complete condition lags pod termination), the keepalive
kept the run marked "alive" indefinitely even though the pod was gone.

That manifests as the opposite of the original bug: the UI shows jobs
as running when they have actually finished.

Two changes:

1. Verify the Job is still alive before the keepalive refreshes
   updatedAt.  If the Job has reached a terminal Complete/Failed
   condition (or has been deleted / the API read fails), stop
   refreshing.  If execute() truly ends up stuck past that point, the
   reaper will catch the run within the normal 5-minute staleness
   window instead of never.

2. Clear the keepalive interval immediately once Promise.allSettled
   resolves, rather than only in the finally block.  Post-completion
   work (exit-code fetch, log fallback read, job cleanup) must not be
   able to emit another onSpawn refresh that keeps the run "alive".

Co-Authored-By: Paperclip <noreply@paperclip.ing>
This commit is contained in:
Test User
2026-04-17 02:57:05 +00:00
parent 20b85b8391
commit 5f5ae92ce7
+53 -8
View File
@@ -456,18 +456,54 @@ export async function execute(ctx: AdapterExecutionContext): Promise<AdapterExec
// onSpawn periodically to refresh it. Without this, multi-instance
// deployments can reap a live run from another server instance
// after the 5-minute staleness window.
//
// BUT: the keepalive must NEVER refresh updatedAt if the underlying
// K8s Job is already terminal. Otherwise, if execute() stalls after
// the pod finishes (e.g. a slow K8s API call, a hung log stream
// drain, or a Job whose Complete condition lags pod termination),
// we would keep the run marked "alive" indefinitely while the pod
// is actually gone — the exact "UI thinks jobs are running when
// they are not" bug. We verify Job liveness every tick and stop
// refreshing as soon as the Job reaches a terminal state; if
// execute() is truly stuck, the reaper will then catch it within
// the normal 5-minute staleness window.
let lastLogAt = Date.now();
let keepaliveTick = 0;
let keepaliveJobTerminal = false;
keepaliveTimer = setInterval(() => {
const silenceSec = Math.round((Date.now() - lastLogAt) / 1000);
void onLog("stdout", `[paperclip] keepalive — job ${jobName} running (${silenceSec}s since last output)\n`);
// Fire-and-forget the async work; setInterval callbacks must be
// synchronous or the timer will drift.
void (async () => {
if (keepaliveJobTerminal) return;
// Refresh updatedAt every ~4 minutes (16 ticks × 15s) to stay
// well within the 5-minute reaper staleness window.
keepaliveTick++;
if (ctx.onSpawn && keepaliveTick % 16 === 0) {
void ctx.onSpawn({ pid: -1, processGroupId: null, startedAt: new Date().toISOString() }).catch(() => {});
}
// Verify the Job is still alive before announcing or refreshing.
try {
const job = await batchApi.readNamespacedJob({ name: jobName, namespace });
const terminal = job.status?.conditions?.some(
(c) => (c.type === "Complete" || c.type === "Failed") && c.status === "True",
);
if (terminal) {
keepaliveJobTerminal = true;
return;
}
} catch {
// Job may have been deleted out from under us, or the API call
// transiently failed. Either way, do not refresh updatedAt —
// either the Job really is gone, or the next tick will re-check.
keepaliveJobTerminal = true;
return;
}
const silenceSec = Math.round((Date.now() - lastLogAt) / 1000);
void onLog("stdout", `[paperclip] keepalive — job ${jobName} running (${silenceSec}s since last output)\n`);
// Refresh updatedAt every ~4 minutes (16 ticks × 15s) to stay
// well within the 5-minute reaper staleness window.
keepaliveTick++;
if (ctx.onSpawn && keepaliveTick % 16 === 0) {
void ctx.onSpawn({ pid: -1, processGroupId: null, startedAt: new Date().toISOString() }).catch(() => {});
}
})();
}, KEEPALIVE_INTERVAL_MS);
const wrappedOnLog: typeof onLog = async (stream, chunk) => {
lastLogAt = Date.now();
@@ -486,6 +522,15 @@ export async function execute(ctx: AdapterExecutionContext): Promise<AdapterExec
}),
]);
// Stop the keepalive immediately once the job has reached a terminal
// state — do not wait for the finally block. Any K8s API call or
// cleanup that happens after this point should not keep the run
// marked "alive" in the DB via onSpawn refreshes.
if (keepaliveTimer) {
clearInterval(keepaliveTimer);
keepaliveTimer = null;
}
if (logResult.status === "fulfilled") {
stdout = logResult.value;
}