Reaper Cancel at 5mins #6

Closed
opened 2026-04-27 20:26:49 +00:00 by cpfarhood · 0 comments
cpfarhood commented 2026-04-27 20:26:49 +00:00 (Migrated from github.com)

You are fixing a bug in Paperclip where long-running adapter runs are reaped
after exactly 5 minutes, even though the adapter is actively producing
output. The repo is at /Users/Repositories/revitalize. Work on a new branch
off dev — do NOT commit directly to dev.

=== THE BUG ===

External adapters that declare hasOutOfProcessLiveness: true (e.g.
paperclip-adapter-claude-k8s) don't call onSpawn during their run. The
reaper at server/src/services/heartbeat.ts:3876-3942 decides liveness
for these adapters by comparing heartbeatRuns.updatedAt against a
5-minute staleness threshold.

Only persistRunProcessMetadata() (line 2802) refreshes updatedAt, and
it is only called from the onSpawn handler (line 5676). The onLog
handler (line 5545) writes to the log store and publishes SSE events but
DOES NOT touch updatedAt.

Result: for out-of-process adapters that don't call onSpawn,
updatedAt is frozen at run start, and the reaper guaranteed-fires at
exactly 5 minutes regardless of actual activity. Any run longer than
5 minutes fails with adapter_liveness_lost.

=== THE FIX ===

Treat log output as proof of life for hasOutOfProcessLiveness adapters.
When the adapter calls onLog, refresh heartbeatRuns.updatedAt so the
reaper's staleness window resets.

Apply this change in server/src/services/heartbeat.ts.

Find the onLog handler definition around line 5545 (search for
const onLog = async (stream: "stdout" | "stderr", chunk: string)).

Inside that handler, AFTER publishLiveEvent(...) (around line 5577),
add a DB update that sets updatedAt: new Date() on the current run row,
but ONLY for adapters with out-of-process liveness AND only when the
last refresh was more than REFRESH_THROTTLE_MS ago.

=== THROTTLING (IMPORTANT) ===

Adapter runs produce ~1 log chunk/second. Without throttling, every run
would write to heartbeatRuns once per second — a hot row contention
point and significant WAL churn.

Throttle to once every 60 seconds per run. Reaper window is 5 minutes, so
a 60-second refresh cadence gives 5× margin. Implementation:

  // Module-scope (near the top of heartbeat.ts, alongside other                                                                                                                               
  // module-scope state like `activeRunExecutions`).                                                                                                                                           
  const REFRESH_UPDATED_AT_THROTTLE_MS = 60_000;                                                                                                                                               
  const lastUpdatedAtRefresh = new Map<string, number>();                                                                                                                                      

Inside onLog, after publishLiveEvent:

  if (agent.adapterType && adapterHasOutOfProcessLiveness(agent.adapterType)) {                                                                                                                
    const now = Date.now();                             
    const last = lastUpdatedAtRefresh.get(run.id) ?? 0;                                                                                                                                        
    if (now - last >= REFRESH_UPDATED_AT_THROTTLE_MS) {                                                                                                                                        
      lastUpdatedAtRefresh.set(run.id, now);                                                                                                                                                   
      await db                                                                                                                                                                                 
        .update(heartbeatRuns)                                                                                                                                                                 
        .set({ updatedAt: new Date() })                                                                                                                                                        
        .where(eq(heartbeatRuns.id, run.id));                                                                                                                                                  
    }                                                                                                                                                                                          
  }                                                                                                                                                                                            

Also: when a run ends, clean up the map entry to prevent unbounded
growth. Find where activeRunExecutions.delete(run.id) is called (around
line 6061) and add lastUpdatedAtRefresh.delete(run.id) immediately
after it.

=== IMPORTS / SYMBOLS TO VERIFY ===

Before editing, check:

  • db, heartbeatRuns, eq are already imported/in scope. They are —
    used on lines 2808, 2815, and elsewhere. Do not re-import.
  • adapterHasOutOfProcessLiveness is already defined in this file.
    Don't re-define.
  • run and agent are in the closure where onLog is defined. Confirm
    by reading lines 5500-5545 before the handler.

=== DO NOT ===

  • Do NOT use a smaller throttle than 60 seconds. If you think the reaper
    needs finer granularity, you are misreading the reaper — its threshold
    is 300 seconds.
  • Do NOT add exponential backoff or any other complexity to the
    throttle. A fixed interval is correct.
  • Do NOT change the reaper. The reaper is correct; only the refresh path
    is broken.
  • Do NOT change any adapter packages. The fix is entirely in revitalize.
  • Do NOT touch persistRunProcessMetadata — it still handles onSpawn
    correctly for in-process adapters.
  • Do NOT add new columns or migrations.
  • Do NOT refactor onLog beyond adding this block.
  • Do NOT make the DB write fire-and-forget with .catch(() => {}). If
    the update fails, it should surface.

=== TESTING ===

  1. Run the existing test suite using the repo's standard test command
    (check package.json scripts — likely pnpm test or
    pnpm --filter @paperclipai/server test). All existing tests must
    still pass.

  2. Find the test file covering reapOrphanedRuns or the heartbeat
    service. Look in server/src/__tests__/ or
    server/src/services/__tests__/. Add ONE new test that verifies:
    given a hasOutOfProcessLiveness adapter, calling onLog causes
    updatedAt to advance in the DB (within throttle window). If the
    test harness already has a mock out-of-process adapter, reuse it.
    If adding a test would require >30 lines of scaffolding, skip it
    and note in the PR description why.

  3. Do NOT run the adapter end-to-end — no k8s access needed.

=== BRANCH, COMMIT, PUSH, PR ===

  1. Create a new branch off dev:
    git checkout dev && git pull && git checkout -b fix/heartbeat-updatedat-refresh-on-log

  2. Make the changes and commit. Commit message:

    fix(heartbeat): refresh updatedAt on adapter log output                                                                                                                                   
    
    Out-of-process adapters (hasOutOfProcessLiveness: true) rely on the                                                                                                                       
    reaper's staleness window against heartbeatRuns.updatedAt. Previously
    only onSpawn refreshed updatedAt, so adapters that don't call onSpawn                                                                                                                     
    (e.g. paperclip-adapter-claude-k8s since v0.1.38) had updatedAt                                                                                                                           
    frozen at run start, causing the reaper to fire at exactly 5 minutes                                                                                                                      
    regardless of actual activity.                                                                                                                                                            
    
    Treat log output as proof of life: onLog now refreshes updatedAt for                                                                                                                      
    out-of-process adapters. Throttled to once per 60s per run to avoid                                                                                                                       
    hammering heartbeatRuns with one UPDATE per log chunk (adapter runs                                                                                                                       
    can emit ~1 chunk/second). 60s cadence gives 5× margin vs the                                                                                                                             
    5-minute reaper window.                                                                                                                                                                   
    
    Co-Authored-By: Claude Sonnet <noreply@anthropic.com>                                                                                                                                     
    
  3. Push the branch:
    git push -u origin fix/heartbeat-updatedat-refresh-on-log

  4. Open a PR against dev using gh pr create. Title:

    fix(heartbeat): refresh updatedAt on adapter log output                                                                                                                                   
    

    PR body — use a heredoc with this structure:

    ## Summary                                                                                                                                                                                
    - Fixes reaper firing at exactly 5 minutes for out-of-process adapters
      that don't call `onSpawn` (chiefly paperclip-adapter-claude-k8s)                                                                                                                        
    - `onLog` now refreshes `heartbeatRuns.updatedAt` for adapters with                                                                                                                       
      `hasOutOfProcessLiveness: true`, throttled to once per 60s per run                                                                                                                      
    - Reaper logic unchanged; only the refresh path is fixed                                                                                                                                  
    
    ## Root cause                                                                                                                                                                             
    Adapter dropped periodic `onSpawn` in v0.1.38 (commit c55d6c6) after                                                                                                                      
    declaring `hasOutOfProcessLiveness: true`, assuming that flag changed                                                                                                                     
    the reaper's liveness mechanism. It doesn't — the flag just skips                                                                                                                         
    the PID check; `updatedAt` is still the signal. Nothing refreshes                                                                                                                         
    `updatedAt` from the adapter-side callbacks, so it froze at run                                                                                                                           
    start and every long run hit the reaper.                                                                                                                                                  
    
    Evidence: a run on 2026-04-27 with `started_at=19:54:35` produced                                                                                                                         
    ~180 log chunks over 5 minutes, reached a successful Claude                                                                                                                               
    `type:"result"` event at 19:59:52, but was marked                                                                                                                                         
    `adapter_liveness_lost` at 19:59:49 — 3 seconds before it would                                                                                                                           
    have finished successfully.                                                                                                                                                               
    
    ## Test plan                                                                                                                                                                              
    - [ ] Existing heartbeat tests pass                                                                                                                                                       
    - [ ] New test: `onLog` from a `hasOutOfProcessLiveness` adapter                                                                                                                          
          advances `updatedAt` in DB                                                                                                                                                          
    - [ ] Manually verify: long-running claude_k8s run no longer reaped                                                                                                                       
          (deploy to a cluster, run an agent with a task taking >5 min)                                                                                                                       
    

=== WRAPPING UP ===

Report back with:

  1. The commit hash and branch name.
  2. The PR URL.
  3. Whether existing tests passed (exact count).
  4. Whether you added a new test (yes/no, and if no, why).
  5. The exact line ranges of the edits (for reviewer).

If anything in the instructions doesn't match what you find in the code
(wrong line numbers, renamed symbols, missing helpers), STOP and report
what you found. Do not improvise.

You are fixing a bug in Paperclip where long-running adapter runs are reaped after exactly 5 minutes, even though the adapter is actively producing output. The repo is at /Users/Repositories/revitalize. Work on a new branch off `dev` — do NOT commit directly to `dev`. === THE BUG === External adapters that declare `hasOutOfProcessLiveness: true` (e.g. paperclip-adapter-claude-k8s) don't call `onSpawn` during their run. The reaper at `server/src/services/heartbeat.ts:3876-3942` decides liveness for these adapters by comparing `heartbeatRuns.updatedAt` against a 5-minute staleness threshold. Only `persistRunProcessMetadata()` (line 2802) refreshes `updatedAt`, and it is only called from the `onSpawn` handler (line 5676). The `onLog` handler (line 5545) writes to the log store and publishes SSE events but DOES NOT touch `updatedAt`. Result: for out-of-process adapters that don't call `onSpawn`, `updatedAt` is frozen at run start, and the reaper guaranteed-fires at exactly 5 minutes regardless of actual activity. Any run longer than 5 minutes fails with `adapter_liveness_lost`. === THE FIX === Treat log output as proof of life for `hasOutOfProcessLiveness` adapters. When the adapter calls `onLog`, refresh `heartbeatRuns.updatedAt` so the reaper's staleness window resets. Apply this change in `server/src/services/heartbeat.ts`. Find the `onLog` handler definition around line 5545 (search for `const onLog = async (stream: "stdout" | "stderr", chunk: string)`). Inside that handler, AFTER `publishLiveEvent(...)` (around line 5577), add a DB update that sets `updatedAt: new Date()` on the current run row, but ONLY for adapters with out-of-process liveness AND only when the last refresh was more than REFRESH_THROTTLE_MS ago. === THROTTLING (IMPORTANT) === Adapter runs produce ~1 log chunk/second. Without throttling, every run would write to `heartbeatRuns` once per second — a hot row contention point and significant WAL churn. Throttle to once every 60 seconds per run. Reaper window is 5 minutes, so a 60-second refresh cadence gives 5× margin. Implementation: // Module-scope (near the top of heartbeat.ts, alongside other // module-scope state like `activeRunExecutions`). const REFRESH_UPDATED_AT_THROTTLE_MS = 60_000; const lastUpdatedAtRefresh = new Map<string, number>(); Inside `onLog`, after `publishLiveEvent`: if (agent.adapterType && adapterHasOutOfProcessLiveness(agent.adapterType)) { const now = Date.now(); const last = lastUpdatedAtRefresh.get(run.id) ?? 0; if (now - last >= REFRESH_UPDATED_AT_THROTTLE_MS) { lastUpdatedAtRefresh.set(run.id, now); await db .update(heartbeatRuns) .set({ updatedAt: new Date() }) .where(eq(heartbeatRuns.id, run.id)); } } Also: when a run ends, clean up the map entry to prevent unbounded growth. Find where `activeRunExecutions.delete(run.id)` is called (around line 6061) and add `lastUpdatedAtRefresh.delete(run.id)` immediately after it. === IMPORTS / SYMBOLS TO VERIFY === Before editing, check: - `db`, `heartbeatRuns`, `eq` are already imported/in scope. They are — used on lines 2808, 2815, and elsewhere. Do not re-import. - `adapterHasOutOfProcessLiveness` is already defined in this file. Don't re-define. - `run` and `agent` are in the closure where `onLog` is defined. Confirm by reading lines 5500-5545 before the handler. === DO NOT === - Do NOT use a smaller throttle than 60 seconds. If you think the reaper needs finer granularity, you are misreading the reaper — its threshold is 300 seconds. - Do NOT add exponential backoff or any other complexity to the throttle. A fixed interval is correct. - Do NOT change the reaper. The reaper is correct; only the refresh path is broken. - Do NOT change any adapter packages. The fix is entirely in revitalize. - Do NOT touch `persistRunProcessMetadata` — it still handles `onSpawn` correctly for in-process adapters. - Do NOT add new columns or migrations. - Do NOT refactor `onLog` beyond adding this block. - Do NOT make the DB write fire-and-forget with `.catch(() => {})`. If the update fails, it should surface. === TESTING === 1. Run the existing test suite using the repo's standard test command (check `package.json` scripts — likely `pnpm test` or `pnpm --filter @paperclipai/server test`). All existing tests must still pass. 2. Find the test file covering `reapOrphanedRuns` or the heartbeat service. Look in `server/src/__tests__/` or `server/src/services/__tests__/`. Add ONE new test that verifies: given a `hasOutOfProcessLiveness` adapter, calling `onLog` causes `updatedAt` to advance in the DB (within throttle window). If the test harness already has a mock out-of-process adapter, reuse it. If adding a test would require >30 lines of scaffolding, skip it and note in the PR description why. 3. Do NOT run the adapter end-to-end — no k8s access needed. === BRANCH, COMMIT, PUSH, PR === 1. Create a new branch off `dev`: git checkout dev && git pull && git checkout -b fix/heartbeat-updatedat-refresh-on-log 2. Make the changes and commit. Commit message: fix(heartbeat): refresh updatedAt on adapter log output Out-of-process adapters (hasOutOfProcessLiveness: true) rely on the reaper's staleness window against heartbeatRuns.updatedAt. Previously only onSpawn refreshed updatedAt, so adapters that don't call onSpawn (e.g. paperclip-adapter-claude-k8s since v0.1.38) had updatedAt frozen at run start, causing the reaper to fire at exactly 5 minutes regardless of actual activity. Treat log output as proof of life: onLog now refreshes updatedAt for out-of-process adapters. Throttled to once per 60s per run to avoid hammering heartbeatRuns with one UPDATE per log chunk (adapter runs can emit ~1 chunk/second). 60s cadence gives 5× margin vs the 5-minute reaper window. Co-Authored-By: Claude Sonnet <noreply@anthropic.com> 3. Push the branch: git push -u origin fix/heartbeat-updatedat-refresh-on-log 4. Open a PR against `dev` using `gh pr create`. Title: fix(heartbeat): refresh updatedAt on adapter log output PR body — use a heredoc with this structure: ## Summary - Fixes reaper firing at exactly 5 minutes for out-of-process adapters that don't call `onSpawn` (chiefly paperclip-adapter-claude-k8s) - `onLog` now refreshes `heartbeatRuns.updatedAt` for adapters with `hasOutOfProcessLiveness: true`, throttled to once per 60s per run - Reaper logic unchanged; only the refresh path is fixed ## Root cause Adapter dropped periodic `onSpawn` in v0.1.38 (commit c55d6c6) after declaring `hasOutOfProcessLiveness: true`, assuming that flag changed the reaper's liveness mechanism. It doesn't — the flag just skips the PID check; `updatedAt` is still the signal. Nothing refreshes `updatedAt` from the adapter-side callbacks, so it froze at run start and every long run hit the reaper. Evidence: a run on 2026-04-27 with `started_at=19:54:35` produced ~180 log chunks over 5 minutes, reached a successful Claude `type:"result"` event at 19:59:52, but was marked `adapter_liveness_lost` at 19:59:49 — 3 seconds before it would have finished successfully. ## Test plan - [ ] Existing heartbeat tests pass - [ ] New test: `onLog` from a `hasOutOfProcessLiveness` adapter advances `updatedAt` in DB - [ ] Manually verify: long-running claude_k8s run no longer reaped (deploy to a cluster, run an agent with a task taking >5 min) === WRAPPING UP === Report back with: 1. The commit hash and branch name. 2. The PR URL. 3. Whether existing tests passed (exact count). 4. Whether you added a new test (yes/no, and if no, why). 5. The exact line ranges of the edits (for reviewer). If anything in the instructions doesn't match what you find in the code (wrong line numbers, renamed symbols, missing helpers), STOP and report what you found. Do not improvise.
Sign in to join this conversation.