Commit Graph

18 Commits

Author SHA1 Message Date
Chris Farhood e3af8aa83b fix(server): make tailPodLogFile exit on job completion + port c8429cf
- Run tailPodLogFile and waitForJobCompletion in parallel via Promise.allSettled;
  completion sets stopSignal.stopped so the tail loop drains and exits. Without
  this, tailPodLogFile loops forever — the only natural exit was fh.stat()
  throwing on file removal, which never happened during normal job completion.
- Restructure tail loop to read-then-sleep, with a final drain after stopSignal
  is set to capture bytes written between the last poll and terminal state.
- Port the c8429cf fix from paperclip-adapter-claude-k8s:
  * buildPodLogPath now writes to /paperclip/instances/default/data/run-logs/...
    to match the server PVC layout (the /data/ segment was missing).
  * Drop the mkdir -p ... && from both init container command variants — the
    PVC isn't mounted in the init container, so the mkdir was failing with
    exit code 1 and the && short-circuit prevented the prompt copy.
- Test infrastructure:
  * Hoisted fs/promises mock now uses importOriginal so readFile (used for
    skill bundle loading) hits the real implementation.
  * setMockJsonl() lets individual tests inject specific JSONL into the tail's
    read buffer (previously dead constants in the test file).
  * fh.read mock now writes into the caller's buffer instead of returning a
    separate one.
- Add src/server/test.test.ts covering testEnvironment (was 0% → 98.5% stmts).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:57:40 -04:00
Chris Farhood bc340bfcc9 fix: correct fs mock with vi.hoisted for proper per-test reset
The vi.mock("node:fs/promises") factory previously used a closure variable
that accumulated across tests despite vi.clearAllMocks(). Switched to
vi.hoisted() with an explicit resetFsMocks() called in beforeEach() so
the read offset counter is properly reset between tests.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-30 13:55:12 -04:00
Chris Farhood c71d0e5eec feat: replace K8s log streaming with PVC filesystem tailing
- Replaced streamPodLogs / streamPodLogsOnce / readPodLogs / waitForPodTermination
  with tailPodLogFile() that polls a shared PVC file path with adaptive cadence
  (250ms active, 1000ms idle after 5 consecutive empty polls)
- Added buildPodLogPath() export and podLogPath to JobBuildResult
- Added assertSafePathComponent with [a-zA-Z0-9-:] allowance for UUIDs
- Updated Job manifest to tee stdout to /paperclip/instances/default/run-logs/<companyId>/<agentId>/<runId>.pod.ndjson
- Added hasOutOfProcessLiveness: true to createServerAdapter (cast required)
- Deleted log-dedup.ts and log-dedup.test.ts entirely
- Removed all LogLineDedupFilter, Writable, and LOG_STREAM_* constants
- Removed completionResult.status workaround (completionWithGrace returns directly)
- Test infrastructure: mocked node:fs/promises to prevent unmocked fs.stat hangs

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-30 13:55:12 -04:00
Chris Farhood 2d057f085d refactor: remove PAPERCLIP_DEV_API_KEY runtime hack throughout
Cancel poll now uses ctx.authToken exclusively. Remove forwarding of
PAPERCLIP_DEV_API_KEY into job pods and all associated tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-27 07:24:14 -04:00
Chris Farhood 985d55e125 fix(cancel-poll): use ctx.authToken instead of process.env for cancel polling
The cancel poll was sending empty Authorization headers because
PAPERCLIP_API_KEY is not set on the Paperclip server pod. Use the
per-run authToken from ctx instead, which is the JWT issued by Paperclip
for this execution. PAPERCLIP_DEV_API_KEY still overrides for dev instances.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-27 07:11:47 -04:00
Chris Farhood 4fcd3b4547 fix test: stub PAPERCLIP_DEV_API_KEY before each cancel-poll test
The cancel-poll test sets PAPERCLIP_API_KEY='test-key' but the actual
PAPERCLIP_DEV_API_KEY was leaking through from the harness environment.
Since execute.ts prefers PAPERCLIP_DEV_API_KEY over PAPERCLIP_API_KEY,
the poll was sending the real dev key instead of 'test-key'.

Fix: add beforeEach to set PAPERCLIP_DEV_API_KEY='test-key', and afterEach
to clean both env vars.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-25 23:40:41 +00:00
Chris Farhood 798b80f2f2 test: push coverage to 90%+ on lines for all files except execute.ts (FAR-85)
Overall before: 80.36% lines / 79.06% statements
Overall after:  94.65% lines / 93.30% statements

Per-file lines coverage (all targets ≥90% except execute.ts):

| File              | Before | After  |
|-------------------|--------|--------|
| ui-parser.ts      | 93.63% | 99.09% |
| cli/format-event  | 59.85% | 99.27% |
| server/execute    | 81.47% | 89.64% |
| server/job-mfst   | 90.30% | 98.78% |
| server/k8s-client | 37.50% | 95.83% |
| server/log-dedup  | 97.77% | 97.77% |
| server/parse      | 89.85% | 98.55% |
| server/skills     | 100%   | 100%   |

New tests added:

- k8s-client.test.ts: getSelfPodInfo (env-var inheritance, secret volumes,
  PVC discovery, dnsConfig, all error paths) + kubeconfig file branch
- format-event.test.ts: parseStdoutLine (cli) — full event-type matrix,
  tool_use status branches, errorText fallback paths
- ui-parser.test.ts: errorText edge cases, empty event paths
- parse.test.ts: errorText fallback to data.message, name, code, JSON
- job-manifest.test.ts: workspace context env wiring, linkedIssueIds,
  paperclipWorkspaces/RuntimeServices JSON, authToken, inherited URLs,
  prompt-secret + data PVC + secret-volume mount paths
- execute.test.ts: parseModelProvider, completionWithGrace,
  instructionsFilePath read failure, ensureAgentDbPvc throw paths,
  large-prompt secret create failure, step-limit detection,
  waitForPod no-pod messaging, init-container ImagePullBackOff /
  CrashLoopBackOff, main-container CrashLoopBackOff, all-inits-done
  happy path, skill bundle source loading (SKILL.md + flat-file
  fallback), SIGTERM handler full body via vi.resetModules()

execute.ts remains at 89.64% lines — the residual gap is deep async/timer
paths inside streamAndAwaitJob (grace poller, keepalive ticker, log-stream
stop-signal/bail timer). Those need fake-timer scaffolding heavier than
this batch warrants; tracking separately.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-25 22:27:04 +00:00
Chris Farhood 3daf2dd676 fix: detect 404 from @kubernetes/client-node v1.x ApiException (FAR-85)
The v1.x ApiException exposes the HTTP status as `code`, not `statusCode`.
Both `isNotFound` (k8s-client) and `isK8s404` (execute) only checked
`statusCode`/`response.statusCode`, so 404s were never recognized:

- `getPvc` re-threw the 404 instead of returning null, which bubbled up
  through `ensureAgentDbPvc` as `k8s_job_create_failed` with the raw
  "persistentvolumeclaims X not found" body — the symptom in FAR-85.
- The PVC was never actually created, because the existence check threw
  before reaching `createPvc`.

Add `code === 404` to both predicates and a regression test for `isK8s404`.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-25 21:53:59 +00:00
Chris Farhood 139a387508 test: add ensureAgentDbPvc unit tests (FAR-63)
Seven direct unit tests for ensureAgentDbPvc covering ephemeral mode,
existing PVC (no create), PVC creation with storage class/capacity,
missing storage class error, default mode, and agent ID slug derivation.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-25 12:41:42 +00:00
Chris Farhood 46ce5cc599 feat: dedicated PVC per agent for OPENCODE_DB (FAR-63, Option B)
Replaces the Option A shared-PVC path implementation with a long-lived
dedicated PVC per agent, mounted at /opencode-db with OPENCODE_DB=/opencode-db.

Changes:
- k8s-client.ts: add getPvc/createPvc/deletePvc CoreV1Api helpers
- execute.ts: add ensureAgentDbPvc() that gets-or-creates a PVC named
  opencode-db-<agentId> before Job creation; pass agentDbClaimName through
  to buildJobManifest; return null for ephemeral mode (emptyDir used instead)
- job-manifest.ts: accept agentDbClaimName on JobBuildInput; mount dedicated
  PVC or emptyDir at /opencode-db; set OPENCODE_DB=/opencode-db; revert init
  container to simple form (no mkdir, no PVC mount)
- config-schema.ts: replace opencodeDbMode/opencodeDbPath with agentDbMode
  (dedicated_pvc|ephemeral, default dedicated_pvc), agentDbStorageClass
  (required for dedicated_pvc), agentDbStorageCapacity (default 1Gi)
- test.ts: add create/delete RBAC checks for persistentvolumeclaims
- pvc.test.ts: unit tests for ensureAgentDbPvc (7 cases incl. error paths)
- 289/289 tests pass; typecheck clean
- No agent-delete hook exists; opencode-db PVC janitor routine is a deferred
  follow-up task

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-25 12:38:54 +00:00
Chris Farhood 5fa9e1396e fix: poll issue status instead of heartbeat-run for cancel detection (FAR-60)
The cancel poller was calling GET /api/heartbeat-runs/{runId} which
returned 401 because the adapter key lacks access to the internal
heartbeat-runs endpoint. Switch to GET /api/issues/{issueId}, which
the adapter key can read. Also tighten the trigger condition from
status !== "running" to status === "cancelled" so that other terminal
states (done, blocked, etc.) do not abort the K8s job.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-25 12:25:41 +00:00
Chris Farhood 80d18005f9 fix: wait for concurrent job to finish instead of returning permanent blocked error (FAR-61)
When multiple tasks are assigned simultaneously, only one K8s job can run
at a time (shared PVC/session guard). Previously, all other tasks received
k8s_concurrent_run_blocked immediately and stayed blocked forever.

Now the guard retries once: wait for all blocking jobs to complete via
waitForJobCompletion, then re-check before proceeding to create a new job.
If the re-check still shows a running job, the error is returned as before.

The agentCreationMutex already serializes guard-check + job-create, so
tasks naturally queue up and execute one at a time without concurrent jobs.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-25 11:10:27 +00:00
Chris Farhood 2bd8107f1d fix: skills not bundled and resumeLastSession ignored (FAR-56, FAR-57)
Two bugs prevented skill content from reaching K8s Job prompts, and
resumeLastSession: false was silently ignored.

Skills fix (execute.ts, FAR-57):
- Add /paperclip/.claude/skills as additional candidate to
  readPaperclipRuntimeSkillEntries — the relative candidates in
  adapter-utils don't resolve to the PVC-mounted skills home
- Read entry.source/SKILL.md instead of entry.source (which is a
  directory path); fall back to source directly for file-based entries
- Mock readPaperclipRuntimeSkillEntries in execute.test.ts to prevent
  real SKILL.md reads from delaying fake-timer registration

Session fix (job-manifest.ts, FAR-56):
- Gate --session flag on asBoolean(config.resumeLastSession, true)
  so setting resumeLastSession: false actually stops session resumption
- Default true preserves existing behaviour for agents without config

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 10:11:47 +00:00
Chris Farhood 38ed261063 fix(test): widen capturedHandler cast to resolve TS2349 never narrowing (FAR-40)
TypeScript CFA does not trace the assignment inside the vi.spyOn
mockImplementation callback, so it narrows capturedHandler to null at
the if-check, making the body unreachable (never). Cast at the call
site breaks the false narrowing without changing runtime behaviour.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-25 00:31:27 +00:00
Chris Farhood 2b4049464c feat: per-agent mutex, fail-closed guard, SIGTERM cleanup (FAR-40)
- Add agentCreationMutex (Map<agentId, Promise>) that serializes
  guard-check + job-create per agent, eliminating the TOCTOU race where
  two concurrent execute() calls both pass the list-then-create check.
- Change catch {} on listNamespacedJob errors to return
  errorCode: "k8s_concurrency_guard_unreachable" (fail-closed) instead
  of silently bypassing the concurrency guard.
- Add ensureSigtermHandler() which tracks active Jobs in activeJobs Map
  and deletes all of them (plus prompt Secrets) on SIGTERM before exit.
- Track orphaned-job reattaches in activeJobs for consistent cleanup.
- Update execute.test.ts: change "proceeds on list error" test to assert
  k8s_concurrency_guard_unreachable; add mutex serialization test and
  SIGTERM handler registration tests.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-25 00:22:17 +00:00
Chris Farhood 0df00f7d95 test(execute): large-prompt Secret path coverage
- Add describe block "execute — large-prompt Secret path" with 5 cases:
  buildJobManifest called twice (promptSecretName on second call),
  Secret created before Job, ownerReference patched after Job creation,
  Secret deleted in finally block, Secret cleaned up on Job create failure
- Update vi.mock for job-manifest to export LARGE_PROMPT_THRESHOLD_BYTES
- Add createNamespacedSecret/deleteNamespacedSecret/patchNamespacedSecret
  to makeCoreApi for completeness
- Update makeBatchApi to return { metadata: { uid } } so ownerRef tests work

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-24 22:16:34 +00:00
Chris Farhood 61d2a42a66 feat: inherit valueFrom/envFrom env from Deployment; prefer paperclip container
- SelfPodInfo gains inheritedEnvValueFrom (V1EnvVar[]) and inheritedEnvFrom (V1EnvFromSource[])
- Container selection now prefers the container named "paperclip", falls back to first
- buildJobManifest appends valueFrom env vars (skipping names already overridden)
  and sets envFrom on the opencode container when present
- Tests updated: mock updated, 5 new cases covering secretKeyRef forwarding,
  dedup, envFrom passthrough, and empty-envFrom omission

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-24 22:12:31 +00:00
Chris Farhood d60afaebcd feat: pod-failure classification, partial stdout fallback, llm_api_error
- Replace getPodExitCode with getPodTerminatedInfo to capture exit code
  and reason (OOMKilled, Error, etc.) from terminated container state;
  pod failure description now surfaces in returned errorMessage
- Add partial-stdout fallback: readPodLogs is triggered when stdout is
  non-empty but contains no sessionId (missing session result), not just
  when stdout is fully empty
- Detect empty LLM response: when a session ran but produced 0 output
  tokens and no messages, return errorCode "llm_api_error"
- Add 13 new unit tests covering all three new paths

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-24 22:09:33 +00:00