Paperclip Adapter Logging Methodology Change #10

Open
opened 2026-04-27 20:53:52 +00:00 by cpfarhood · 0 comments
cpfarhood commented 2026-04-27 20:53:52 +00:00 (Migrated from github.com)

You are implementing a change to a Paperclip adapter plugin. The repo is at
/Users/Repositories/paperclip-adapter-claude-k8s on branch master. Work on a
new branch off master — do NOT commit directly to master.

Before you start, read these files fully:
- src/server/execute.ts (large; this is the main file you'll edit)
- src/server/job-manifest.ts
- src/server/log-dedup.ts (you will delete this)
- src/server/parse.ts
- src/server/config-schema.ts
- src/index.ts

Run npm install and then npm test. Confirm it's green (should be 379 tests
passing). Do NOT run npm run build — CI handles that.

=============================================================================
WHY WE ARE DOING THIS

Today the adapter reads pod logs via the Kubernetes log API (follow mode). At
production scale this stream drops every few seconds, the adapter hits the
50-reconnect cap after ~2.5 minutes, and long-running agents fail. The fix is
to have the pod's claude command tee its stdout to a file on the shared
PVC, and have the adapter tail that file directly from the Paperclip server
process. The PVC is mounted at /paperclip in both pods so the file is
visible on both sides.

We are NOT going to:
- wrap the claude binary
- use Claude hooks
- add a sidecar
- change revitalize (the consumer app)
- keep the k8s log API as a fallback

We ARE going to:
- replace k8s log streaming with filesystem tailing entirely
- delete all reconnect logic and the log-dedup filter
- delete the RTK tool-output truncation feature entirely
- keep kubectl logs -f working (tee preserves stdout)

=============================================================================
SCOPE OF CHANGES

--- Job manifest (src/server/job-manifest.ts) ---

  1. DELETE the buildRtkSetupCommands function entirely.

  2. DELETE the enableRtk and rtkMaxOutputBytes config reads inside
    buildJobManifest.

  3. DELETE the mainCommand conditional that prepends RTK setup:
    const mainCommand = enableRtk
      ? `${buildRtkSetupCommands(rtkMaxOutputBytes)} && ${claudeInvocation}`
      : claudeInvocation;
    Replace with just claudeInvocation.

  4. MODIFY claudeInvocation to add tee:
    Before (approximately):
    const claudeInvocation = `cat /tmp/prompt/prompt.txt | claude ${claudeArgsEscaped}`;
    After:
    const podLogPath =
      `/paperclip/instances/default/run-logs/${companyId}/${agentId}/${runId}.pod.ndjson`;
    const claudeInvocation =
      `cat /tmp/prompt/prompt.txt | claude ${claudeArgsEscaped} | tee ${podLogPath}`;

    companyId, agentId, and runId come from ctx (search surrounding
    code — they're already in scope in buildJobManifest via destructuring).

  5. MODIFY the init container command to create the parent directory before
    the main container starts. The init container today writes the prompt
    file. Amend its command to also mkdir -p the log directory:
    const initCommand = `mkdir -p /paperclip/instances/default/run-logs/${companyId}/${agentId} && printf '%s' "$PROMPT" > /tmp/prompt/prompt.txt`;
    (Your existing init command may differ slightly — keep its behavior,
    just prepend the mkdir.)

  6. EXPORT the log path builder as a helper so execute.ts can compute the
    same path without duplicating the template:
    export function buildPodLogPath(companyId: string, agentId: string, runId: string): string {
      return `/paperclip/instances/default/run-logs/${companyId}/${agentId}/${runId}.pod.ndjson`;
    }
    Return this path from buildJobManifest alongside the other fields in
    JobBuildResult (add podLogPath: string to the interface).

  7. ID SANITIZATION (critical): before using companyId/agentId/runId in the
    path, validate they match ^[a-zA-Z0-9-]+$. If any of them doesn't,
    throw an Error with message:
    Invalid ${field} for log path: ${value}
    The existing code probably does not validate; add a helper at the top of
    job-manifest.ts:
    function assertSafePathComponent(field: string, value: string): void {
      if (!/^[a-zA-Z0-9-]+$/.test(value)) {
        throw new Error(`Invalid ${field} for log path: ${value}`);
      }
    }
    Call it for all three before buildPodLogPath.
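A minimal sketch of how items 4 through 7 could fit together, so that validation runs before any path or shell string is built. The RunIds interface, the RUN_LOG_ROOT constant, and the buildLogCommands helper are hypothetical names used only for illustration; the real code builds these strings inside buildJobManifest, and the existing init command may differ.

```typescript
// Hypothetical stand-in for the IDs that buildJobManifest destructures from ctx.
interface RunIds {
  companyId: string;
  agentId: string;
  runId: string;
}

// Hypothetical constant; the instructions spell this prefix out inline.
const RUN_LOG_ROOT = "/paperclip/instances/default/run-logs";

function assertSafePathComponent(field: string, value: string): void {
  if (!/^[a-zA-Z0-9-]+$/.test(value)) {
    throw new Error(`Invalid ${field} for log path: ${value}`);
  }
}

export function buildPodLogPath(companyId: string, agentId: string, runId: string): string {
  return `${RUN_LOG_ROOT}/${companyId}/${agentId}/${runId}.pod.ndjson`;
}

// Hypothetical helper: validates all three IDs, then derives the path and both commands.
export function buildLogCommands(ids: RunIds, claudeArgsEscaped: string) {
  assertSafePathComponent("companyId", ids.companyId);
  assertSafePathComponent("agentId", ids.agentId);
  assertSafePathComponent("runId", ids.runId);

  const podLogPath = buildPodLogPath(ids.companyId, ids.agentId, ids.runId);
  // Parent directory of the log file, created by the init container.
  const logDir = podLogPath.slice(0, podLogPath.lastIndexOf("/"));

  return {
    podLogPath,
    initCommand: `mkdir -p ${logDir} && printf '%s' "$PROMPT" > /tmp/prompt/prompt.txt`,
    claudeInvocation: `cat /tmp/prompt/prompt.txt | claude ${claudeArgsEscaped} | tee ${podLogPath}`,
  };
}
```

Because the IDs are restricted to letters, digits, and hyphens, the unquoted path is safe to interpolate into the shell command. One thing worth confirming against the existing manifest: in a plain pipeline the shell reports the exit status of the last command (here tee) unless pipefail is enabled, so check how the container's exit code is derived.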

--- Adapter (src/server/execute.ts) ---

  1. DELETE the LogLineDedupFilter import.

  2. DELETE constants: MAX_LOG_RECONNECT_ATTEMPTS, LOG_STREAM_RECONNECT_DELAY_MS,
    LOG_STREAM_BAIL_TIMEOUT_MS.

  3. DELETE functions: streamPodLogs, streamPodLogsOnce, readPodLogs.
    Also delete any helpers exclusively used by them.

  4. DELETE the bail timer machinery (bailTimer, bailResolve, bailPromise,
    stopPoller).

  5. DELETE the one-shot fallback path inside the main execute function:
    the block computing hasResultEvent, needsOneShot, and the
    readPodLogs fallback call.

  6. DELETE the sinceSeconds reconnect-window logic.

  7. ADD a new function tailPodLogFile in execute.ts (or a new file
    src/server/file-log-tailer.ts if you prefer — but keep it simple; inline
    is fine). Signature:

    interface TailOptions {
      onLog: AdapterExecutionContext["onLog"];
      stopSignal: { stopped: boolean };
    }

    async function tailPodLogFile(
      filePath: string,
      opts: TailOptions,
    ): Promise<string> { ... }
    

    Behavior:

    • Wait up to 30 seconds for the file to exist. Poll with
      fs.promises.stat every 250ms. If the file doesn't appear in 30s,
      throw an Error: Pod log file never appeared at ${filePath}.
    • Once it exists, open with fs.promises.open(filePath, 'r').
    • Track a byte offset starting at 0.
    • Poll loop: every 250ms active, backed off to 1000ms if the file
      hasn't grown for 5 consecutive polls (reset to 250ms on any
      growth). For each poll:
      a. stat the file, compare size to offset
      b. if size > offset, read bytes from [offset, size) into a Buffer
      c. update offset = size
      d. concatenate any pending partial line from previous poll with
      the new buffer, split on '\n'
      e. the last element of the split is either the new pending
      partial line (if the buffer didn't end with '\n') or empty
      f. for every complete line, call onLog("stdout", line + "\n")
      and append to an in-memory accumulator (string)
    • Exit when opts.stopSignal.stopped === true. Before returning, do
      ONE final read-to-EOF to drain any tail bytes (same logic as
      above). Close the file handle. Return the accumulator as a string.

    Use fs.promises.open / FileHandle.read / FileHandle.close. Do not
    use fs.watch or chokidar.
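The behavior above can be sketched roughly as follows. This is an illustration under stated assumptions, not the required implementation: the OnLog type is an assumed shape standing in for AdapterExecutionContext["onLog"], the constant names are hypothetical, and emitting a final unterminated line after the drain is an assumption the instructions do not cover.

```typescript
import { promises as fs } from "node:fs";

// Assumed shape; the real type is AdapterExecutionContext["onLog"].
type OnLog = (stream: "stdout" | "stderr", chunk: string) => void | Promise<void>;

interface TailOptions {
  onLog: OnLog;
  stopSignal: { stopped: boolean };
}

// Hypothetical constant names for the timings given in the instructions.
const ACTIVE_POLL_MS = 250;
const IDLE_POLL_MS = 1000;
const IDLE_POLLS_BEFORE_BACKOFF = 5;
const APPEAR_TIMEOUT_MS = 30_000;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

export async function tailPodLogFile(filePath: string, opts: TailOptions): Promise<string> {
  // Phase 1: poll until the pod has created the file, or give up after 30s.
  const deadline = Date.now() + APPEAR_TIMEOUT_MS;
  for (;;) {
    try {
      await fs.stat(filePath);
      break;
    } catch {
      // Any stat failure is treated as "not there yet" in this sketch.
      if (Date.now() >= deadline) throw new Error(`Pod log file never appeared at ${filePath}`);
      await sleep(ACTIVE_POLL_MS);
    }
  }

  const handle = await fs.open(filePath, "r");
  let offset = 0;
  let pending: Buffer = Buffer.alloc(0); // bytes after the last '\n' seen so far
  let accumulated = "";
  let idlePolls = 0;

  // Reads [offset, size), emits every complete line, and reports whether the file grew.
  const drain = async (): Promise<boolean> => {
    const { size } = await handle.stat();
    if (size <= offset) return false;
    const chunk = Buffer.alloc(size - offset);
    const { bytesRead } = await handle.read(chunk, 0, chunk.length, offset);
    offset += bytesRead;
    let data: Buffer = Buffer.concat([pending, chunk.subarray(0, bytesRead)]);
    let newline = data.indexOf(0x0a);
    while (newline !== -1) {
      const line = data.subarray(0, newline).toString("utf8") + "\n";
      accumulated += line;
      await opts.onLog("stdout", line);
      data = data.subarray(newline + 1);
      newline = data.indexOf(0x0a);
    }
    pending = data;
    return bytesRead > 0;
  };

  try {
    while (!opts.stopSignal.stopped) {
      const grew = await drain();
      idlePolls = grew ? 0 : idlePolls + 1;
      await sleep(idlePolls >= IDLE_POLLS_BEFORE_BACKOFF ? IDLE_POLL_MS : ACTIVE_POLL_MS);
    }
    await drain(); // one final read-to-EOF after the stop signal
    // Assumption: a trailing line without '\n' is emitted rather than dropped.
    if (pending.length > 0) {
      const line = pending.toString("utf8") + "\n";
      accumulated += line;
      await opts.onLog("stdout", line);
    }
    return accumulated;
  } finally {
    await handle.close();
  }
}
```

Two choices here differ slightly from the letter of the steps. The pending partial line is kept as bytes and split on the newline byte, so a multi-byte UTF-8 character that straddles two polls is decoded whole. The offset advances by the number of bytes actually read rather than jumping to the stat size, which should be equivalent in the normal case.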

  8. REPLACE the existing log-streaming section of execute with a call to
    tailPodLogFile. The file path comes from buildJobManifest's
    return value (add podLogPath there as noted above). The pattern is
    approximately:

    const { job, jobName, namespace, claudeArgs, prompt, podLogPath } = buildJobManifest(...);
    // ... create secret, create job, wait for pod ...
    const stopSignal = { stopped: false };
    const [tailResult, completionResult] = await Promise.allSettled([
      tailPodLogFile(podLogPath, { onLog, stopSignal }),
      // finally (not then) so the tailer also stops if waitForJobCompletion rejects
      waitForJobCompletion(namespace, jobName, ...).finally(() => { stopSignal.stopped = true; }),
    ]);
    const stdout = tailResult.status === "fulfilled" ? tailResult.value : "";
    

    The existing Promise.allSettled pattern is in the code today — mirror
    its shape. Keep waitForJobCompletion unchanged.
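A toy sketch of the stop-signal coordination, with stand-ins for both functions so it runs without a cluster. fakeTail, fakeCompletion, and runBoth are hypothetical names; only the Promise.allSettled shape mirrors the pattern. The flag is set in a finally handler here, on the assumption that the tailer should stop even when the completion wait rejects, since otherwise allSettled would never resolve.

```typescript
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical stand-in for tailPodLogFile: loops until the shared flag flips.
async function fakeTail(stopSignal: { stopped: boolean }): Promise<string> {
  let polls = 0;
  while (!stopSignal.stopped) {
    polls += 1;
    await sleep(10);
  }
  return `polled ${polls} times`;
}

// Hypothetical stand-in for waitForJobCompletion: resolves or rejects after 50ms.
async function fakeCompletion(fail: boolean): Promise<string> {
  await sleep(50);
  if (fail) throw new Error("job watch failed");
  return "succeeded";
}

export async function runBoth(fail: boolean) {
  const stopSignal = { stopped: false };
  const [tailResult, completionResult] = await Promise.allSettled([
    fakeTail(stopSignal),
    // finally runs on both outcomes, so the tail loop always gets its exit signal.
    fakeCompletion(fail).finally(() => {
      stopSignal.stopped = true;
    }),
  ]);
  return { tailResult, completionResult };
}
```

With a then handler in place of finally, the runBoth(true) case would leave fakeTail polling indefinitely.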

  9. ADD log file cleanup to cleanupJob. After successful Job deletion,
    best-effort delete the log file:
    try { await fs.promises.unlink(podLogPath); } catch { /* non-fatal */ }
    Skip the unlink if retainJobs === true.

    cleanupJob will need podLogPath passed in; thread it through from
    the caller.
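The cleanup rule can be sketched as a small helper. removePodLogFile is a hypothetical name, and its boolean return value exists only to make the sketch easy to check; the instructions only require the unlink to live inside cleanupJob after a successful Job deletion.

```typescript
import { promises as fs } from "node:fs";

// Hypothetical helper: returns true only when the file was actually removed.
export async function removePodLogFile(podLogPath: string, retainJobs: boolean): Promise<boolean> {
  if (retainJobs) return false; // keep the log alongside the retained Job
  try {
    await fs.unlink(podLogPath);
    return true;
  } catch {
    return false; // non-fatal: the file may already be gone
  }
}
```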

--- Config schema (src/server/config-schema.ts) ---

DELETE the enableRtk and rtkMaxOutputBytes field definitions.

--- Documentation (src/index.ts) ---

DELETE any lines in agentConfigurationDoc referring to enableRtk,
rtkMaxOutputBytes, or RTK generally. Search for "rtk" (case-insensitive)
and remove matching lines/sections.

--- Delete entire files ---

- src/server/log-dedup.ts                                                                                                                                                                      
- src/server/log-dedup.test.ts                          

--- Tests ---

- Delete any test files that test RTK exclusively. Search                                                                                                                                      
  src/server/*.test.ts for "rtk" (case-insensitive); delete tests that
  specifically cover RTK behavior. Non-RTK tests in the same file stay.                                                                                                                        
- Delete any execute.test.ts tests covering streamPodLogsOnce or the                                                                                                                           
  bail timer. Search for "streamPodLogsOnce", "bail timer",                                                                                                                                    
  "LogLineDedupFilter" and remove matching describe/it blocks.                                                                                                                                 
- Add src/server/file-log-tailer.test.ts (if you extracted to a module)                                                                                                                        
  or add test cases to execute.test.ts. Cover:                                                                                                                                                 
    1. File appears within 30s, content is tailed line-by-line                                                                                                                                 
    2. File never appears, function throws with the expected message                                                                                                                           
    3. Partial trailing line is buffered and emitted on next poll                                                                                                                              
    4. Stop signal exits the loop, final drain reads remaining bytes                                                                                                                           
    5. Adaptive backoff: idle polls slow down, active polls speed up                                                                                                                           
  Use vitest fake timers (vi.useFakeTimers) and a tmpdir via                                                                                                                                   
  `fs.mkdtempSync(path.join(os.tmpdir(), 'tailer-'))`.                                                                                                                                         

=============================================================================
TESTING

After all changes:
1. npm run typecheck — must pass
2. npm test — must pass. Record the new passing count.

Do NOT run the adapter end-to-end. Do NOT require a k8s cluster.

=============================================================================
BRANCH, COMMIT, PUSH, PR

  1. Create a new branch off master:
    git checkout master && git pull && git checkout -b feat/filesystem-log-tail

  2. Make all the changes above. Commit as ONE commit (this is a coordinated
    change — init, adapter, tests belong together). Commit message:

    feat: replace k8s log API streaming with filesystem tailing                                                                                                                               
    
    The k8s log follow stream drops every ~3 seconds at production scale,                                                                                                                     
    exhausting the 50-attempt reconnect cap within 2.5 minutes. Runs
    longer than that lose live UI output and eventually fail.                                                                                                                                 
    
    Replace the k8s log streaming path entirely with filesystem tailing:                                                                                                                      
    the pod tees claude's stdout to                                                                                                                                                           
    /paperclip/instances/default/run-logs/<companyId>/<agentId>/<runId>.pod.ndjson                                                                                                            
    on the shared PVC, and the adapter tails that file from the Paperclip                                                                                                                     
    server process. kubectl logs -f still works (tee preserves stdout).                                                                                                                       
    
    Deletes:                                                                                                                                                                                  
    - LogLineDedupFilter and all reconnect logic (no more replay window)                                                                                                                      
    - RTK tool-output truncation feature (deprecated, unused at scale)                                                                                                                        
    - One-shot readPodLogs fallback                                                                                                                                                           
    - sinceSeconds reconnect-window machinery                                                                                                                                                 
    
    Adds:                                                                                                                                                                                     
    - tailPodLogFile: adaptive 250ms/1s poll loop with partial-line                                                                                                                           
      buffering and tail-drain on stopSignal                                                                                                                                                  
    - Log file cleanup tied to retainJobs                                                                                                                                                     
    - Path-component sanitization (companyId/agentId/runId must match                                                                                                                         
      [a-zA-Z0-9-]+)                                                                                                                                                                          
    
    Co-Authored-By: Claude Sonnet <noreply@anthropic.com>                                                                                                                                     
    
  3. Push:
    git push -u origin feat/filesystem-log-tail

  4. Open a PR against master with gh pr create:
    Title: feat: replace k8s log API streaming with filesystem tailing
    Body (use a heredoc):

    ## Summary                                                                                                                                                                                
    - Pod tees claude stdout to PVC; adapter tails the file directly                                                                                                                          
    - Eliminates k8s log API dependency for streaming (drops every ~3s                                                                                                                        
      at scale)                                                                                                                                                                               
    - Deletes LogLineDedupFilter, reconnect logic, one-shot fallback,                                                                                                                         
      RTK feature entirely                                                                                                                                                                    
    
    ## Why                                                                                                                                                                                    
    At production scale (144 concurrent runs across 12 companies × 12                                                                                                                         
    agents), the k8s log follow stream was dropping within seconds of                                                                                                                         
    each connect, exhausting the 50-attempt reconnect cap. Runs longer                                                                                                                        
    than 2.5 minutes lost live output; combined with a separate reaper                                                                                                                        
    bug (fixed in revitalize), runs over 5 minutes failed outright.                                                                                                                           
    
    ## Path                                                                                                                                                                                   
    `/paperclip/instances/default/run-logs/<companyId>/<agentId>/<runId>.pod.ndjson`                                                                                                          
    — the `.pod.ndjson` suffix distinguishes the pod-written file from                                                                                                                        
    revitalize's server-side `<runId>.ndjson` log store.                                                                                                                                      
    
    ## Breaking                                                                                                                                                                               
    Old Job manifests (pre-tee) are incompatible — the tailer will fail                                                                                                                       
    after its 30s "file missing" window. Rolling cutover required: any                                                                                                                        
    in-flight runs at deploy time will surface `k8s_pod_log_file_missing`                                                                                                                     
    and need operator retry.                                                                                                                                                                  
    
    ## Test plan                                                                                                                                                                              
    - [ ] npm test passes                                                                                                                                                                     
    - [ ] Manual: deploy to cluster, run a >5min agent, confirm live UI                                                                                                                       
          output and no reaper fire                                                                                                                                                           
    - [ ] Manual: verify kubectl logs -f still works on the Job pod                                                                                                                           
    - [ ] Manual: confirm log file is cleaned up when Job cleanup runs                                                                                                                        
          (retainJobs=false) and preserved when retainJobs=true                                                                                                                               
    

=============================================================================
WRAPPING UP

Report back with:
1. Branch name and commit hash
2. PR URL
3. Final test count (e.g. "368 tests passing" — number will drop vs 379
because you deleted tests)
4. Line count of execute.ts before and after (should drop significantly)
5. Any deviation from these instructions, with reason

If ANY of the following happens, STOP and report instead of improvising:
- A file path or code pattern doesn't match what's described (e.g. the
mainCommand conditional has changed)
- A function you're supposed to delete has other callers you didn't
expect
- A test you're supposed to keep depends on something you deleted
- Typecheck fails and the fix is non-obvious

Do NOT push to master. Do NOT tag a version. Do NOT bump package.json
version — leave it as-is.

You are implementing a change to a Paperclip adapter plugin. The repo is at /Users/Repositories/paperclip-adapter-claude-k8s on branch master. Work on a new branch off master — do NOT commit directly to master. Before you start, read these files fully: - src/server/execute.ts (large; this is the main file you'll edit) - src/server/job-manifest.ts - src/server/log-dedup.ts (you will delete this) - src/server/parse.ts - src/server/config-schema.ts - src/index.ts Run `npm install` and then `npm test`. Confirm it's green (should be 379 tests passing). Do NOT run `npm run build` — CI handles that. ============================================================================= WHY WE ARE DOING THIS ============================================================================= Today the adapter reads pod logs via the Kubernetes log API (follow mode). At production scale this stream drops every few seconds, the adapter hits the 50-reconnect cap after ~2.5 minutes, and long-running agents fail. The fix is to have the pod's claude command `tee` its stdout to a file on the shared PVC, and have the adapter tail that file directly from the Paperclip server process. The PVC is mounted at /paperclip in both pods so the file is visible on both sides. We are NOT going to: - wrap the claude binary - use Claude hooks - add a sidecar - change revitalize (the consumer app) - keep the k8s log API as a fallback We ARE going to: - replace k8s log streaming with filesystem tailing entirely - delete all reconnect logic and the log-dedup filter - delete the RTK tool-output truncation feature entirely - keep `kubectl logs -f` working (tee preserves stdout) ============================================================================= SCOPE OF CHANGES ============================================================================= --- Job manifest (src/server/job-manifest.ts) --- 1. DELETE the `buildRtkSetupCommands` function entirely. 2. 
DELETE the `enableRtk` and `rtkMaxOutputBytes` config reads inside `buildJobManifest`. 3. DELETE the mainCommand conditional that prepends RTK setup: const mainCommand = enableRtk ? `${buildRtkSetupCommands(rtkMaxOutputBytes)} && ${claudeInvocation}` : claudeInvocation; Replace with just `claudeInvocation`. 4. MODIFY `claudeInvocation` to add tee: Before (approximately): const claudeInvocation = `cat /tmp/prompt/prompt.txt | claude ${claudeArgsEscaped}`; After: const podLogPath = `/paperclip/instances/default/run-logs/${companyId}/${agentId}/${runId}.pod.ndjson`; const claudeInvocation = `cat /tmp/prompt/prompt.txt | claude ${claudeArgsEscaped} | tee ${podLogPath}`; `companyId`, `agentId`, and `runId` come from `ctx` (search surrounding code — they're already in scope in buildJobManifest via destructuring). 5. MODIFY the init container command to create the parent directory before the main container starts. The init container today writes the prompt file. Amend its command to also `mkdir -p` the log directory: const initCommand = `mkdir -p /paperclip/instances/default/run-logs/${companyId}/${agentId} && printf '%s' "$PROMPT" > /tmp/prompt/prompt.txt`; (Your existing init command may differ slightly — keep its behavior, just prepend the mkdir.) 6. EXPORT the log path builder as a helper so execute.ts can compute the same path without duplicating the template: export function buildPodLogPath(companyId: string, agentId: string, runId: string): string { return `/paperclip/instances/default/run-logs/${companyId}/${agentId}/${runId}.pod.ndjson`; } Return this path from `buildJobManifest` alongside the other fields in `JobBuildResult` (add `podLogPath: string` to the interface). 7. ID SANITIZATION (critical): before using companyId/agentId/runId in the path, validate they match `^[a-zA-Z0-9-]+$`. 
If any of them doesn't, throw an Error with message: `Invalid ${field} for log path: ${value}` The existing code probably does not validate; add a helper at the top of job-manifest.ts: function assertSafePathComponent(field: string, value: string): void { if (!/^[a-zA-Z0-9-]+$/.test(value)) { throw new Error(`Invalid ${field} for log path: ${value}`); } } Call it for all three before buildPodLogPath. --- Adapter (src/server/execute.ts) --- 1. DELETE the `LogLineDedupFilter` import. 2. DELETE constants: `MAX_LOG_RECONNECT_ATTEMPTS`, `LOG_STREAM_RECONNECT_DELAY_MS`, `LOG_STREAM_BAIL_TIMEOUT_MS`. 3. DELETE functions: `streamPodLogs`, `streamPodLogsOnce`, `readPodLogs`. Also delete any helpers exclusively used by them. 4. DELETE the bail timer machinery (bailTimer, bailResolve, bailPromise, stopPoller). 5. DELETE the one-shot fallback path inside the main `execute` function: the block computing `hasResultEvent`, `needsOneShot`, and the `readPodLogs` fallback call. 6. DELETE the `sinceSeconds` reconnect-window logic. 7. ADD a new function `tailPodLogFile` in execute.ts (or a new file src/server/file-log-tailer.ts if you prefer — but keep it simple; inline is fine). Signature: interface TailOptions { onLog: AdapterExecutionContext["onLog"]; stopSignal: { stopped: boolean }; } async function tailPodLogFile( filePath: string, opts: TailOptions, ): Promise<string> { ... } Behavior: - Wait up to 30 seconds for the file to exist. Poll with `fs.promises.stat` every 250ms. If the file doesn't appear in 30s, throw an Error: `Pod log file never appeared at ${filePath}`. - Once it exists, open with `fs.promises.open(filePath, 'r')`. - Track a byte offset starting at 0. - Poll loop: every 250ms active, backed off to 1000ms if the file hasn't grown for 5 consecutive polls (reset to 250ms on any growth). For each poll: a. stat the file, compare size to offset b. if size > offset, read bytes from [offset, size) into a Buffer c. update offset = size d. 
concatenate any pending partial line from previous poll with the new buffer, split on '\n' e. the last element of the split is either the new pending partial line (if the buffer didn't end with '\n') or empty f. for every complete line, call `onLog("stdout", line + "\n")` and append to an in-memory accumulator (string) - Exit when `opts.stopSignal.stopped === true`. Before returning, do ONE final read-to-EOF to drain any tail bytes (same logic as above). Close the file handle. Return the accumulator as a string. Use `fs.promises.open` / `FileHandle.read` / `FileHandle.close`. Do not use `fs.watch` or `chokidar`. 8. REPLACE the existing log-streaming section of `execute` with a call to `tailPodLogFile`. The file path comes from `buildJobManifest`'s return value (add `podLogPath` there as noted above). The pattern is approximately: const { job, jobName, namespace, claudeArgs, prompt, podLogPath } = buildJobManifest(...); // ... create secret, create job, wait for pod ... const stopSignal = { stopped: false }; const [tailResult, completionResult] = await Promise.allSettled([ tailPodLogFile(podLogPath, { onLog, stopSignal }), waitForJobCompletion(namespace, jobName, ...).then(r => { stopSignal.stopped = true; return r; }), ]); const stdout = tailResult.status === "fulfilled" ? tailResult.value : ""; The existing `Promise.allSettled` pattern is in the code today — mirror its shape. Keep `waitForJobCompletion` unchanged. 9. ADD log file cleanup to `cleanupJob`. After successful Job deletion, best-effort delete the log file: try { await fs.promises.unlink(podLogPath); } catch { /* non-fatal */ } Skip the unlink if `retainJobs === true`. `cleanupJob` will need `podLogPath` passed in; thread it through from the caller. --- Config schema (src/server/config-schema.ts) --- DELETE the `enableRtk` and `rtkMaxOutputBytes` field definitions. --- Documentation (src/index.ts) --- DELETE any lines in `agentConfigurationDoc` referring to enableRtk, rtkMaxOutputBytes, or RTK generally. 
  Search for "rtk" (case-insensitive) and remove matching lines/sections.

--- Delete entire files ---

  - src/server/log-dedup.ts
  - src/server/log-dedup.test.ts

--- Tests ---

  - Delete any test files that test RTK exclusively. Search
    src/server/*.test.ts for "rtk" (case-insensitive); delete tests that
    specifically cover RTK behavior. Non-RTK tests in the same file stay.

  - Delete any execute.test.ts tests covering streamPodLogsOnce or the bail
    timer. Search for "streamPodLogsOnce", "bail timer",
    "LogLineDedupFilter" and remove matching describe/it blocks.

  - Add src/server/file-log-tailer.test.ts (if you extracted to a module) or
    add test cases to execute.test.ts. Cover:
      1. File appears within 30s, content is tailed line-by-line
      2. File never appears, function throws with the expected message
      3. Partial trailing line is buffered and emitted on next poll
      4. Stop signal exits the loop, final drain reads remaining bytes
      5. Adaptive backoff: idle polls slow down, active polls speed up

    Use vitest fake timers (vi.useFakeTimers) and a tmpdir via
    `fs.mkdtempSync(path.join(os.tmpdir(), 'tailer-'))`.

=============================================================================
TESTING
=============================================================================

After all changes:

  1. `npm run typecheck`: must pass
  2. `npm test`: must pass. Record the new passing count.

Do NOT run the adapter end-to-end. Do NOT require a k8s cluster.

=============================================================================
BRANCH, COMMIT, PUSH, PR
=============================================================================

  1. Create a new branch off master:

       git checkout master && git pull && git checkout -b feat/filesystem-log-tail

  2. Make all the changes above. Commit as ONE commit (this is a coordinated
     change: init, adapter, tests belong together).
     Commit message:

       feat: replace k8s log API streaming with filesystem tailing

       The k8s log follow stream drops every ~3 seconds at production scale,
       exhausting the 50-attempt reconnect cap within 2.5 minutes. Runs
       longer than that lose live UI output and eventually fail.

       Replace the k8s log streaming path entirely with filesystem tailing:
       the pod tees claude's stdout to
       /paperclip/instances/default/run-logs/<companyId>/<agentId>/<runId>.pod.ndjson
       on the shared PVC, and the adapter tails that file from the Paperclip
       server process. kubectl logs -f still works (tee preserves stdout).

       Deletes:
       - LogLineDedupFilter and all reconnect logic (no more replay window)
       - RTK tool-output truncation feature (deprecated, unused at scale)
       - One-shot readPodLogs fallback
       - sinceSeconds reconnect-window machinery

       Adds:
       - tailPodLogFile: adaptive 250ms/1s poll loop with partial-line
         buffering and tail-drain on stopSignal
       - Log file cleanup tied to retainJobs
       - Path-component sanitization (companyId/agentId/runId must match
         [a-zA-Z0-9-]+)

       Co-Authored-By: Claude Sonnet <noreply@anthropic.com>

  3. Push:

       git push -u origin feat/filesystem-log-tail

  4. Open a PR against master with `gh pr create`:

     Title: `feat: replace k8s log API streaming with filesystem tailing`

     Body (use a heredoc):

       ## Summary
       - Pod tees claude stdout to PVC; adapter tails the file directly
       - Eliminates k8s log API dependency for streaming (drops every ~3s at
         scale)
       - Deletes LogLineDedupFilter, reconnect logic, one-shot fallback, RTK
         feature entirely

       ## Why
       At production scale (144 concurrent runs across 12 companies × 12
       agents), the k8s log follow stream was dropping within seconds of
       each connect, exhausting the 50-attempt reconnect cap. Runs longer
       than 2.5 minutes lost live output; combined with a separate reaper
       bug (fixed in revitalize), runs over 5 minutes failed outright.

       ## Path
       `/paperclip/instances/default/run-logs/<companyId>/<agentId>/<runId>.pod.ndjson`:
       the `.pod.ndjson` suffix distinguishes the pod-written file from
       revitalize's server-side `<runId>.ndjson` log store.

       ## Breaking
       Old Job manifests (pre-tee) are incompatible: the tailer will fail
       after its 30s "file missing" window. Rolling cutover required: any
       in-flight runs at deploy time will surface `k8s_pod_log_file_missing`
       and need operator retry.

       ## Test plan
       - [ ] npm test passes
       - [ ] Manual: deploy to cluster, run a >5min agent, confirm live UI
             output and no reaper fire
       - [ ] Manual: verify kubectl logs -f still works on the Job pod
       - [ ] Manual: confirm log file is cleaned up when Job cleanup runs
             (retainJobs=false) and preserved when retainJobs=true

=============================================================================
WRAPPING UP
=============================================================================

Report back with:

  1. Branch name and commit hash
  2. PR URL
  3. Final test count (e.g. "368 tests passing"; number will drop vs 379
     because you deleted tests)
  4. Line count of execute.ts before and after (should drop significantly)
  5. Any deviation from these instructions, with reason

If ANY of the following happens, STOP and report instead of improvising:

  - A file path doesn't match what's described (e.g. the mainCommand pattern
    has changed)
  - A function you're supposed to delete has other callers you didn't expect
  - A test you're supposed to keep depends on something you deleted
  - Typecheck fails and the fix is non-obvious

Do NOT push to master. Do NOT tag a version. Do NOT bump package.json
version; leave it as-is.
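
For orientation only, the tail loop described above (wait for the file, poll, buffer the partial line, drain once on stop) could be sketched roughly as follows. This is a minimal sketch, not the adapter's code. The option names `appearTimeoutMs`, `activePollMs` and `idlePollMs`, the helper `openWhenPresent`, the exact error text, the two-state poll rule, and the choice to flush an unterminated final line are all assumptions made for illustration.

```typescript
import { open } from "node:fs/promises";
import type { FileHandle } from "node:fs/promises";
import { StringDecoder } from "node:string_decoder";

type OnLog = (stream: "stdout" | "stderr", chunk: string) => void | Promise<void>;

interface TailOptions {
  onLog: OnLog;
  stopSignal: { stopped: boolean };
  appearTimeoutMs?: number; // assumed knob, defaults to the 30s window
  activePollMs?: number;    // assumed knob, defaults to 250ms
  idlePollMs?: number;      // assumed knob, defaults to 1s
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: retry open() until the pod has created the file.
async function openWhenPresent(path: string, timeoutMs: number, retryMs: number): Promise<FileHandle> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    try {
      return await open(path, "r"); // same function as fs.promises.open
    } catch (err) {
      if ((err as NodeJS.ErrnoException).code !== "ENOENT") throw err;
      if (Date.now() >= deadline) {
        throw new Error(`k8s_pod_log_file_missing: ${path}`); // message text is assumed
      }
      await sleep(retryMs);
    }
  }
}

export async function tailPodLogFile(path: string, opts: TailOptions): Promise<string> {
  const { onLog, stopSignal } = opts;
  const activePollMs = opts.activePollMs ?? 250;
  const idlePollMs = opts.idlePollMs ?? 1_000;
  const file = await openWhenPresent(path, opts.appearTimeoutMs ?? 30_000, activePollMs);

  const decoder = new StringDecoder("utf8"); // keeps split multi-byte characters intact
  const chunk = Buffer.alloc(64 * 1024);
  let position = 0;     // byte offset of the next unread byte
  let pending = "";     // partial line carried over between polls
  let accumulated = ""; // everything emitted so far

  // Read from `position` to the current EOF; resolves true if any bytes arrived.
  const drain = async (): Promise<boolean> => {
    let sawBytes = false;
    for (;;) {
      const { bytesRead } = await file.read(chunk, 0, chunk.length, position);
      if (bytesRead === 0) return sawBytes;
      sawBytes = true;
      position += bytesRead;
      const parts = (pending + decoder.write(chunk.subarray(0, bytesRead))).split("\n");
      pending = parts.pop() ?? ""; // last element is the new partial line, or ""
      for (const line of parts) {
        accumulated += line + "\n";
        await onLog("stdout", line + "\n");
      }
    }
  };

  try {
    while (!stopSignal.stopped) {
      const active = await drain();
      await sleep(active ? activePollMs : idlePollMs); // assumed two-state backoff
    }
    await drain(); // the ONE final read-to-EOF
    const rest = pending + decoder.end();
    if (rest.length > 0) { // assumption: emit an unterminated last line as-is
      accumulated += rest;
      await onLog("stdout", rest);
    }
  } finally {
    await file.close();
  }
  return accumulated;
}
```

Tracking an explicit byte `position` and passing it to `FileHandle.read` keeps each poll independent of the handle's internal cursor, which is one way to make the final drain reuse the same code path as the regular polls.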
Reference: farhoodlabs/paperclip-adapter-claude-k8s#10