fix: skip keepalive updatedAt refresh once K8s Job is terminal
The previous fix (df856e6) made the keepalive timer call onSpawn every
~4 minutes to refresh the run's updatedAt in the DB, so the stale-run
reaper wouldn't kill live runs in multi-instance deployments. That was
correct for live jobs, but it was unconditional — if execute() stalled
after the pod terminated (slow K8s API call, hung log stream drain, or
a Job whose Complete condition lags pod termination), the keepalive
kept the run marked "alive" indefinitely even though the pod was gone.
That manifests as the opposite of the original bug: the UI shows jobs
as running when they have actually finished.
Two changes:
1. Verify the Job is still alive before the keepalive refreshes
updatedAt. If the Job has reached a terminal Complete/Failed
condition (or has been deleted / the API read fails), stop
refreshing. If execute() truly ends up stuck past that point, the
reaper will catch the run within the normal 5-minute staleness
window instead of never.
2. Clear the keepalive interval immediately once Promise.allSettled
resolves, rather than only in the finally block. Post-completion
work (exit-code fetch, log fallback read, job cleanup) must not be
able to emit another onSpawn refresh that keeps the run "alive".
Co-Authored-By: Paperclip <noreply@paperclip.ing>
This commit is contained in:
+53
-8
@@ -456,18 +456,54 @@ export async function execute(ctx: AdapterExecutionContext): Promise<AdapterExec
|
||||
// onSpawn periodically to refresh it. Without this, multi-instance
|
||||
// deployments can reap a live run from another server instance
|
||||
// after the 5-minute staleness window.
|
||||
//
|
||||
// BUT: the keepalive must NEVER refresh updatedAt if the underlying
|
||||
// K8s Job is already terminal. Otherwise, if execute() stalls after
|
||||
// the pod finishes (e.g. a slow K8s API call, a hung log stream
|
||||
// drain, or a Job whose Complete condition lags pod termination),
|
||||
// we would keep the run marked "alive" indefinitely while the pod
|
||||
// is actually gone — the exact "UI thinks jobs are running when
|
||||
// they are not" bug. We verify Job liveness every tick and stop
|
||||
// refreshing as soon as the Job reaches a terminal state; if
|
||||
// execute() is truly stuck, the reaper will then catch it within
|
||||
// the normal 5-minute staleness window.
|
||||
let lastLogAt = Date.now();
|
||||
let keepaliveTick = 0;
|
||||
let keepaliveJobTerminal = false;
|
||||
keepaliveTimer = setInterval(() => {
|
||||
const silenceSec = Math.round((Date.now() - lastLogAt) / 1000);
|
||||
void onLog("stdout", `[paperclip] keepalive — job ${jobName} running (${silenceSec}s since last output)\n`);
|
||||
// Fire-and-forget the async work; setInterval callbacks must be
|
||||
// synchronous or the timer will drift.
|
||||
void (async () => {
|
||||
if (keepaliveJobTerminal) return;
|
||||
|
||||
// Refresh updatedAt every ~4 minutes (16 ticks × 15s) to stay
|
||||
// well within the 5-minute reaper staleness window.
|
||||
keepaliveTick++;
|
||||
if (ctx.onSpawn && keepaliveTick % 16 === 0) {
|
||||
void ctx.onSpawn({ pid: -1, processGroupId: null, startedAt: new Date().toISOString() }).catch(() => {});
|
||||
}
|
||||
// Verify the Job is still alive before announcing or refreshing.
|
||||
try {
|
||||
const job = await batchApi.readNamespacedJob({ name: jobName, namespace });
|
||||
const terminal = job.status?.conditions?.some(
|
||||
(c) => (c.type === "Complete" || c.type === "Failed") && c.status === "True",
|
||||
);
|
||||
if (terminal) {
|
||||
keepaliveJobTerminal = true;
|
||||
return;
|
||||
}
|
||||
} catch {
|
||||
// Job may have been deleted out from under us, or the API call
|
||||
// transiently failed. Either way, do not refresh updatedAt —
|
||||
// either the Job really is gone, or the next tick will re-check.
|
||||
keepaliveJobTerminal = true;
|
||||
return;
|
||||
}
|
||||
|
||||
const silenceSec = Math.round((Date.now() - lastLogAt) / 1000);
|
||||
void onLog("stdout", `[paperclip] keepalive — job ${jobName} running (${silenceSec}s since last output)\n`);
|
||||
|
||||
// Refresh updatedAt every ~4 minutes (16 ticks × 15s) to stay
|
||||
// well within the 5-minute reaper staleness window.
|
||||
keepaliveTick++;
|
||||
if (ctx.onSpawn && keepaliveTick % 16 === 0) {
|
||||
void ctx.onSpawn({ pid: -1, processGroupId: null, startedAt: new Date().toISOString() }).catch(() => {});
|
||||
}
|
||||
})();
|
||||
}, KEEPALIVE_INTERVAL_MS);
|
||||
const wrappedOnLog: typeof onLog = async (stream, chunk) => {
|
||||
lastLogAt = Date.now();
|
||||
@@ -486,6 +522,15 @@ export async function execute(ctx: AdapterExecutionContext): Promise<AdapterExec
|
||||
}),
|
||||
]);
|
||||
|
||||
// Stop the keepalive immediately once the job has reached a terminal
|
||||
// state — do not wait for the finally block. Any K8s API call or
|
||||
// cleanup that happens after this point should not keep the run
|
||||
// marked "alive" in the DB via onSpawn refreshes.
|
||||
if (keepaliveTimer) {
|
||||
clearInterval(keepaliveTimer);
|
||||
keepaliveTimer = null;
|
||||
}
|
||||
|
||||
if (logResult.status === "fulfilled") {
|
||||
stdout = logResult.value;
|
||||
}
|
||||
|
||||
Reference in New Issue
Block a user