fix: gate grace timer on stream-output silence, not first disconnect (FAR-107)

The 30s grace timer that bounds K8s Job condition propagation lag was
armed by streamPodLogs's onFirstStreamExit callback the moment
streamPodLogsOnce returned for the first time.  A transient K8s log-API
disconnect mid-run also returns from streamPodLogsOnce — so the grace
timer fired 30s later regardless of whether streamPodLogs had already
reconnected and the container was still producing output.

Nancy / Privileged Escalation reproduced this on long Opus-4-6 runs:
the prod paperclip pod was stable, the cancel-poll guard was already
narrowed in 0.1.51, but every long run truncated with claude_truncated
+ "container terminated state not yet observable (pod phase=Running)"
because the run was being abandoned mid-output.

Replace the boolean onFirstStreamExit signal with a streamActivity ref
carrying lastActiveAt + streamHasExited.  streamPodLogs refreshes
lastActiveAt every time a streamPodLogsOnce attempt returns non-empty
output, so reconnects that resume real output keep the grace clock
reset.  The grace timer fires only once the stream has exited at least
once AND no chunk has arrived for the full grace window — which
preserves the original FAR-23 behaviour (container truly exited but
Job condition lags) while ending the false-truncation of healthy
streams.  Adds a regression test that asserts a stream drop + reconnect
+ deferred Job completion does not surface as truncated.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
This commit is contained in:
2026-04-27 00:28:44 +00:00
committed by Hugh Commit [agent]
parent fd7dce7239
commit 07ef106c66
2 changed files with 84 additions and 18 deletions
+48
View File
@@ -1512,6 +1512,54 @@ describe("execute: log-stream-exit grace period (FAR-23)", () => {
// (grace did not fire, real completion arrived)
expect(result.errorMessage).toBeNull();
});
it("does NOT fire grace when stream drops mid-output and reconnects with more output (FAR-107)", async () => {
// Reproduces Nancy / Privileged Escalation symptom: the K8s log API drops
// the streaming connection mid-run; streamPodLogs reconnects and the
// container is still producing. Before the fix, the grace timer was
// armed on first stream exit and fired 30s later regardless of whether
// output had resumed, surfacing claude_truncated even though the pod was
// still phase=Running.
let attemptIndex = 0;
mockLogFn.mockImplementation(
async (_ns: string, _pod: string, _ctr: string, writable: import("node:stream").Writable) => {
if (attemptIndex === 0) {
// Stream a partial init line then "drop" the connection without a
// result event — this is the transient API disconnect.
writable.write(JSON.stringify({ type: "system", subtype: "init", model: "claude-sonnet-4-6", session_id: "sess_test123" }) + "\n");
attemptIndex++;
return;
}
// Reconnect produces the rest of the stream including the result event.
writable.write(CLAUDE_HAPPY_OUTPUT);
},
);
// Job condition arrives only after the reconnect produces output, well
// beyond the 30s grace window; the old code would have grace-fired at
// ~30s and treated the run as truncated.
let readJobCalls = 0;
mockBatchReadJob.mockImplementation(async () => {
readJobCalls++;
// Stay non-terminal until the reconnect has had time to run and the
// grace window has fully elapsed since the FIRST disconnect.
if (readJobCalls < 25) return { status: { conditions: [] } };
return { status: { conditions: [{ type: "Complete", status: "True" }] } };
});
const executePromise = execute(makeCtx());
// t=3000: first reconnect sleep fires → second streamPodLogsOnce attempt
await vi.advanceTimersByTimeAsync(3_100);
// Drive past the old (buggy) 30s grace boundary without firing real completion
await vi.advanceTimersByTimeAsync(35_000);
// Then let the Job's Complete condition land
await vi.advanceTimersByTimeAsync(20_000);
const result = await executePromise;
// Run completed normally — grace must not have falsely truncated it.
expect(result.exitCode).toBe(0);
expect(result.errorCode).toBeUndefined();
expect(result.sessionId).toBe("sess_test123");
}, 80_000);
});
// ─── execute: concurrency guard — multiple orphan sorting ────────────────────