diff --git a/LOGGINGCHANGE.md b/LOGGINGCHANGE.md new file mode 100644 index 0000000..ffc34d6 --- /dev/null +++ b/LOGGINGCHANGE.md @@ -0,0 +1,361 @@ + You are implementing two coordinated changes to a Paperclip adapter plugin. + The repo is at /Users/Repositories/paperclip-adapter-opencode-k8s on branch + master. Work on a new branch off master — do NOT commit directly to master. + + Before you start, read these files fully: + - src/server/execute.ts (~1218 lines; this is the main file you'll edit) + - src/server/job-manifest.ts + - src/server/log-dedup.ts (you will delete this) + - src/server/parse.ts + - src/server/index.ts + - src/index.ts + + Run `npm install` and then `npm test`. Confirm green. Note the test count + for later comparison. Do NOT run `npm run build` — CI handles that. + + ============================================================================= + WHY WE ARE DOING THIS + ============================================================================= + + Two tightly coupled bugs: + + (A) The adapter doesn't declare `hasOutOfProcessLiveness: true` on its + ServerAdapterModule. The revitalize reaper therefore treats it as an + in-process adapter, expects a local child PID, finds none, and marks + every run `process_lost` after 5 minutes of staleness. + + (B) The adapter reads pod logs via the Kubernetes log API (follow mode). + At production scale the stream drops every few seconds, exhausting + the 50-reconnect cap within 2.5 minutes. Long runs lose live UI + output, and combined with (A) they fail entirely. + + Fix both in one PR: + + 1. Declare `hasOutOfProcessLiveness: true` in createServerAdapter(). + 2. Have the pod tee opencode's stdout to a file on the shared PVC, and + have the adapter tail that file from the Paperclip server process. + + We are NOT going to: + - wrap the opencode binary + - use hooks + - add a sidecar + - change revitalize + - keep the k8s log API as a fallback + + We ARE going to: + - replace k8s log streaming with filesystem tailing entirely + - delete all reconnect logic and the log-dedup filter + - keep `kubectl logs -f` working (tee preserves stdout) + - add the liveness flag so the reaper uses staleness-based liveness + + ============================================================================= + SCOPE OF CHANGES + ============================================================================= + + --- hasOutOfProcessLiveness flag (src/server/index.ts) --- + + The file today returns a plain object from createServerAdapter(). Add + `hasOutOfProcessLiveness: true` to the returned object, matching the + pattern from paperclip-adapter-claude-k8s. The adapter-utils type predates + this field, so the return needs a cast. + + Before (approximately): + export function createServerAdapter(): ServerAdapterModule { + return { + type, + execute, + // ... other fields ... + }; + } + + After: + export function createServerAdapter(): ServerAdapterModule { + return { + type, + execute, + // ... other fields ... + // Tells the reaper to skip local PID checks and use the staleness-based + // liveness window instead (adapter spawns K8s Jobs in separate pods). + // Cast required: adapter-utils ServerAdapterModule type predates this field. + hasOutOfProcessLiveness: true, + } as ServerAdapterModule; + } + + --- Job manifest (src/server/job-manifest.ts) --- + + 1. MODIFY the main container command to tee stdout. Current code at + approximately line 409: + const mainCommand = `${configSetup}cat /tmp/prompt/prompt.txt | opencode ${opencodeArgsEscaped}`; + Change to: + const podLogPath = + `/paperclip/instances/default/run-logs/${companyId}/${agentId}/${runId}.pod.ndjson`; + const mainCommand = `${configSetup}cat /tmp/prompt/prompt.txt | opencode ${opencodeArgsEscaped} | tee ${podLogPath}`; + + `companyId`, `agentId`, `runId` are already in scope in buildJobManifest + via the destructuring at line 219 (`agent`, `runId`) — use `agent.id` + and `agent.companyId`. If you prefer cleaner code, add a local: + const companyId = agent.companyId; + const agentId = agent.id; + + 2. MODIFY the init container command to create the parent directory before + the main container starts. The existing init container today writes the + prompt file with `printf`. Amend its command to also `mkdir -p` the log + directory. The init container is at approximately line 444 (prompt + secret path) and line 451 (direct printf path) — there are TWO init + container variants. Amend BOTH to prepend the mkdir: + + Variant 1 (large-prompt path, approx line 444): + Before: `cp /tmp/prompt-secret/prompt /tmp/prompt/prompt.txt` + After: `mkdir -p /paperclip/instances/default/run-logs/${companyId}/${agentId} && cp /tmp/prompt-secret/prompt /tmp/prompt/prompt.txt` + + Variant 2 (direct path, approx line 451): + Before: `printf '%s' "$PROMPT_CONTENT" > /tmp/prompt/prompt.txt` + After: `mkdir -p /paperclip/instances/default/run-logs/${companyId}/${agentId} && printf '%s' "$PROMPT_CONTENT" > /tmp/prompt/prompt.txt` + + Use template substitution for companyId/agentId/runId — these are all + in scope in the builder. + + 3. EXPORT the log path builder so execute.ts can compute the same path: + export function buildPodLogPath(companyId: string, agentId: string, runId: string): string { + return `/paperclip/instances/default/run-logs/${companyId}/${agentId}/${runId}.pod.ndjson`; + } + Return this path from buildJobManifest alongside other fields in + JobBuildResult (add `podLogPath: string` to the interface at approx + line 46). Update the final `return { job, jobName, namespace, prompt, + opencodeArgs, promptMetrics }` (approx line 482) to include podLogPath. + + 4. ID SANITIZATION: before using companyId/agentId/runId in the path, + validate they match `^[a-zA-Z0-9-]+$`. Add a helper at the top of + job-manifest.ts: + function assertSafePathComponent(field: string, value: string): void { + if (!/^[a-zA-Z0-9-]+$/.test(value)) { + throw new Error(`Invalid ${field} for log path: ${value}`); + } + } + Call it for companyId, agentId, and runId before computing podLogPath + and before interpolating into the init container commands. + + --- Adapter (src/server/execute.ts) --- + + 1. DELETE the `LogLineDedupFilter` import (approx line 13). + 2. DELETE constants (approx lines 19-26): + LOG_STREAM_RECONNECT_DELAY_MS + LOG_STREAM_RECONNECT_MAX_DELAY_MS + MAX_LOG_RECONNECT_ATTEMPTS + LOG_STREAM_BAIL_TIMEOUT_MS + 3. DELETE functions: + streamPodLogsOnce (approx line 168) + streamPodLogs (approx line 252) + readPodLogs (approx line 330) + waitForPodTermination (approx line 355) — only used by the fallback + 4. DELETE the bail timer machinery inside any function being removed + (bailTimer, bailResolve, bailPromise, stopPoller). + 5. DELETE the fallback path in `execute` around lines 675-693: + if (!stdout.trim()) { + // ... waitForPodTermination + readPodLogs fallback + } else if (!parseOpenCodeJsonl(stdout).sessionId) { + // ... partial-stdout fallback + } + + 6. ADD a new function `tailPodLogFile` in execute.ts. Inline is fine; do + not create a new module. Signature: + + interface TailOptions { + onLog: AdapterExecutionContext["onLog"]; + stopSignal: { stopped: boolean }; + } + + async function tailPodLogFile( + filePath: string, + opts: TailOptions, + ): Promise { ... } + + Behavior: + - Wait up to 30 seconds for the file to exist. Poll with + fs.promises.stat every 250ms. If the file doesn't appear in 30s, + throw an Error: `Pod log file never appeared at ${filePath}`. + - Once it exists, open with fs.promises.open(filePath, 'r'). + - Track a byte offset starting at 0. + - Poll loop: 250ms active cadence, backs off to 1000ms if the file + hasn't grown for 5 consecutive polls (reset to 250ms on any + growth). For each poll: + a. stat the file, compare size to offset + b. if size > offset, read bytes from [offset, size) into a Buffer + c. update offset = size + d. concatenate any pending partial line with the new buffer, + split on '\n' + e. last element is the new pending partial line (if no trailing + newline) or empty + f. for every complete line, call onLog("stdout", line + "\n") + and append to an in-memory accumulator (string) + - Exit when opts.stopSignal.stopped === true. Before returning, do + ONE final read-to-EOF to drain tail bytes. Close the handle. + Return the accumulator. + + Use fs.promises.open / FileHandle.read / FileHandle.close. Do NOT use + fs.watch or chokidar. + + 7. REPLACE the existing log-streaming section of `execute`. Find where + streamPodLogs is invoked inside a `Promise.allSettled` with + waitForJobCompletion (approx line 660). Replace that call with + tailPodLogFile. Pattern: + + const { /* ..., */ podLogPath } = built; + // ... create secret, create job, wait for pod ... + const stopSignal = { stopped: false }; + const [tailResult, completionResult] = await Promise.allSettled([ + tailPodLogFile(podLogPath, { onLog, stopSignal }), + waitForJobCompletion(namespace, jobName, ...).then(r => { stopSignal.stopped = true; return r; }), + ]); + const stdout = tailResult.status === "fulfilled" ? tailResult.value : ""; + + Keep waitForJobCompletion unchanged. Keep the existing `keepaliveTimer` + and `cancelSignal` / cancel-polling machinery unchanged — those are + independent of log streaming. + + 8. ADD log file cleanup. Find `cleanupJob` (the function that deletes the + K8s Job). After successful deletion, best-effort delete the log file: + try { await fs.promises.unlink(podLogPath); } catch { /* non-fatal */ } + Skip the unlink if `retainJobs === true`. + cleanupJob will need podLogPath passed in; thread it from the caller. + + --- Delete entire files --- + + - src/server/log-dedup.ts + - src/server/log-dedup.test.ts + + --- Tests --- + + - Delete any execute.test.ts tests covering streamPodLogsOnce, + streamPodLogs, readPodLogs, waitForPodTermination, the bail timer, or + LogLineDedupFilter. Search for those identifiers; remove matching + describe/it blocks. Non-log-streaming tests in the same file stay. + - Add test cases for tailPodLogFile to execute.test.ts. Cover: + 1. File appears within 30s; content is tailed line-by-line + 2. File never appears; function throws with expected message + 3. Partial trailing line buffered and emitted on next poll + 4. Stop signal exits the loop; final drain reads remaining bytes + 5. Adaptive backoff: idle polls slow; active polls speed up + Use vitest fake timers (vi.useFakeTimers) and a tmpdir via + `fs.mkdtempSync(path.join(os.tmpdir(), 'opencode-tailer-'))`. + + ============================================================================= + TESTING + ============================================================================= + + After all changes: + 1. `npm run typecheck` — must pass (the `as ServerAdapterModule` cast + may be needed; mirror claude-k8s's pattern) + 2. `npm test` — must pass. Test count will drop vs baseline because you + deleted tests. Record the new passing count. + + Do NOT run the adapter end-to-end. Do NOT require a k8s cluster. + + ============================================================================= + BRANCH, COMMIT, PUSH, PR + ============================================================================= + + 1. Create a new branch off master: + git checkout master && git pull && git checkout -b feat/filesystem-log-tail-and-liveness-flag + + 2. Make all changes above. Commit as ONE commit: + + feat: declare hasOutOfProcessLiveness and tail pod log from filesystem + + Two coordinated fixes for long-running agent failures: + + (1) Declare hasOutOfProcessLiveness: true on the ServerAdapterModule. + Without it the reaper treated this adapter as in-process, expected + a local child PID, and marked every run process_lost after 5min + staleness. Flag tells the reaper to use the staleness-based + liveness window for out-of-process adapters. + + (2) Replace k8s log API streaming with filesystem tailing. The k8s + follow stream drops every ~3 seconds at production scale, + exhausting the 50-attempt reconnect cap within 2.5 minutes. Pod + now tees opencode's stdout to + /paperclip/instances/default/run-logs///.pod.ndjson + on the shared PVC; adapter tails the file directly. kubectl logs -f + still works (tee preserves stdout). + + Deletes: + - LogLineDedupFilter and all reconnect logic + - streamPodLogsOnce, streamPodLogs, readPodLogs, waitForPodTermination + - Both fallback paths (empty-stream and missing-sessionId) + + Adds: + - tailPodLogFile: adaptive 250ms/1s poll loop with partial-line + buffering and tail-drain on stopSignal + - Log file cleanup tied to retainJobs + - Path-component sanitization (companyId/agentId/runId must match + [a-zA-Z0-9-]+) + + Co-Authored-By: Claude Sonnet + + 3. Push: + git push -u origin feat/filesystem-log-tail-and-liveness-flag + + 4. Open a PR against master with `gh pr create`: + Title: `feat: declare hasOutOfProcessLiveness and tail pod log from filesystem` + Body (use a heredoc): + + ## Summary + - Declares `hasOutOfProcessLiveness: true` so the reaper uses + staleness-based liveness instead of expecting a local PID + - Pod tees opencode stdout to PVC; adapter tails the file directly + - Eliminates k8s log API dependency for streaming + - Deletes LogLineDedupFilter, reconnect logic, both fallback paths + + ## Why + At production scale (144 concurrent runs), two bugs combined: + (a) no liveness flag → reaper marked runs process_lost at 5min + (b) k8s log follow stream drops every ~3s, exhausting the 50-reconnect + cap. Runs over ~2.5min lost live output; over 5min failed outright. + + Both must be fixed together — the flag alone doesn't help if the log + stream still drops, and the log tail alone doesn't help if the reaper + kills the run for missing PID. + + ## Path + `/paperclip/instances/default/run-logs///.pod.ndjson` + — the `.pod.ndjson` suffix distinguishes the pod-written file from + revitalize's server-side `.ndjson` log store. + + ## Breaking + Old Job manifests (pre-tee) are incompatible — the tailer's 30s + "file missing" window will surface an error on in-flight runs at + deploy time. Operator retry required. Consistent with the companion + change in paperclip-adapter-claude-k8s. + + ## Test plan + - [ ] npm test passes + - [ ] Manual: deploy to cluster, run a >5min agent, confirm live UI + output and no reaper fire + - [ ] Manual: verify kubectl logs -f still works on the Job pod + - [ ] Manual: confirm log file is cleaned up when Job cleanup runs + (retainJobs=false) and preserved when retainJobs=true + + ============================================================================= + WRAPPING UP + ============================================================================= + + Report back with: + 1. Branch name and commit hash + 2. PR URL + 3. Final test count (numbers will drop vs baseline because you deleted + tests — record baseline and final) + 4. Line count of execute.ts before and after (should drop significantly) + 5. Any deviation from these instructions, with reason + + If ANY of the following happens, STOP and report instead of improvising: + - A file path doesn't match what's described (e.g. the mainCommand + pattern has changed) + - A function you're supposed to delete has other callers you didn't + expect (streamPodLogsOnce in particular may have test-only imports + that need untangling) + - A test you're supposed to keep depends on something you deleted + - Typecheck fails and the fix is non-obvious + - The `as ServerAdapterModule` cast doesn't satisfy TypeScript + + Do NOT push to master. Do NOT tag a version. Do NOT bump package.json + version — leave it as-is.