Commit Graph

23 Commits

Author SHA1 Message Date
Chris Farhood 2d057f085d refactor: remove PAPERCLIP_DEV_API_KEY runtime hack throughout
Cancel poll now uses ctx.authToken exclusively. Remove forwarding of
PAPERCLIP_DEV_API_KEY into job pods and all associated tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-27 07:24:14 -04:00
Chris Farhood 985d55e125 fix(cancel-poll): use ctx.authToken instead of process.env for cancel polling
The cancel poll was sending empty Authorization headers because
PAPERCLIP_API_KEY is not set on the Paperclip server pod. Use the
per-run authToken from ctx instead, which is the JWT issued by Paperclip
for this execution. PAPERCLIP_DEV_API_KEY still overrides for dev instances.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-27 07:11:47 -04:00
Chris Farhood 2daedda537 Surface volume mount and PVC bind errors in waitForPod()
Handle MountVolumeFailed/ContainerCannotMount waiting reasons in pod
container status checks, throwing clear errors instead of silent failure.

Detect PVC/volume/bind/mount keywords in PodScheduled condition messages
and surface as 'PVC bind failed' error.

Fixes FAR-93: serviceAccountName missing error surfacing.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-26 01:47:31 +00:00
Chris Farhood 798b80f2f2 test: push coverage to 90%+ on lines for all files except execute.ts (FAR-85)
Overall before: 80.36% lines / 79.06% statements
Overall after:  94.65% lines / 93.30% statements

Per-file lines coverage (all targets ≥90% except execute.ts):

| File              | Before | After  |
|-------------------|--------|--------|
| ui-parser.ts      | 93.63% | 99.09% |
| cli/format-event  | 59.85% | 99.27% |
| server/execute    | 81.47% | 89.64% |
| server/job-mfst   | 90.30% | 98.78% |
| server/k8s-client | 37.50% | 95.83% |
| server/log-dedup  | 97.77% | 97.77% |
| server/parse      | 89.85% | 98.55% |
| server/skills     | 100%   | 100%   |

New tests added:

- k8s-client.test.ts: getSelfPodInfo (env-var inheritance, secret volumes,
  PVC discovery, dnsConfig, all error paths) + kubeconfig file branch
- format-event.test.ts: parseStdoutLine (cli) — full event-type matrix,
  tool_use status branches, errorText fallback paths
- ui-parser.test.ts: errorText edge cases, empty event paths
- parse.test.ts: errorText fallback to data.message, name, code, JSON
- job-manifest.test.ts: workspace context env wiring, linkedIssueIds,
  paperclipWorkspaces/RuntimeServices JSON, authToken, inherited URLs,
  prompt-secret + data PVC + secret-volume mount paths
- execute.test.ts: parseModelProvider, completionWithGrace,
  instructionsFilePath read failure, ensureAgentDbPvc throw paths,
  large-prompt secret create failure, step-limit detection,
  waitForPod no-pod messaging, init-container ImagePullBackOff /
  CrashLoopBackOff, main-container CrashLoopBackOff, all-inits-done
  happy path, skill bundle source loading (SKILL.md + flat-file
  fallback), SIGTERM handler full body via vi.resetModules()

execute.ts remains at 89.64% lines — the residual gap is deep async/timer
paths inside streamAndAwaitJob (grace poller, keepalive ticker, log-stream
stop-signal/bail timer). Those need fake-timer scaffolding heavier than
this batch warrants; tracking separately.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-25 22:27:04 +00:00
Chris Farhood 3daf2dd676 fix: detect 404 from @kubernetes/client-node v1.x ApiException (FAR-85)
The v1.x ApiException exposes the HTTP status as `code`, not `statusCode`.
Both `isNotFound` (k8s-client) and `isK8s404` (execute) only checked
`statusCode`/`response.statusCode`, so 404s were never recognized:

- `getPvc` re-threw the 404 instead of returning null, which bubbled up
  through `ensureAgentDbPvc` as `k8s_job_create_failed` with the raw
  "persistentvolumeclaims X not found" body — the symptom in FAR-85.
- The PVC was never actually created, because the existence check threw
  before reaching `createPvc`.

Add `code === 404` to both predicates and a regression test for `isK8s404`.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-25 21:53:59 +00:00
Chris Farhood 6accb424d4 fix: verify PVC exists after creation in ensureAgentDbPvc
Before creating a PVC, ensureAgentDbPvc checks if it exists and creates
it if not. However, the Kubernetes API may return a Success response
without actually creating the resource. This commit adds a verification
step after createPvc to confirm the PVC actually exists before returning.

Fixes FAR-84.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-25 21:28:29 +00:00
Chris Farhood 1de9b67fd3 fix: make log reconnect backoff interruptible by stop signal
The exponential backoff sleep in streamPodLogs used a single setTimeout
for the full delay (3s, 6s, 12s...). When logStopSignal.stopped was set
mid-backoff (e.g. by external cancel), the loop body could not check the
signal until the timer expired — causing the cancel test to time out when
the 12s backoff overlapped with the 15s cancel window.

Sleep in 200ms chunks so a stop signal can exit the backoff immediately.
Fixes the pre-existing CI timeout in execute.test.ts.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-25 20:37:49 +00:00
Chris Farhood 38b3916e11 fix: support PAPERCLIP_DEV_API_KEY for external cancel polling
External cancel polling in execute.ts used PAPERCLIP_API_KEY which is
a short-lived run JWT for the main Paperclip instance. In multi-instance
setups (dev vs main), the agent runs on the dev instance but the run JWT
is only valid on the main instance, causing 401 on every poll.

Now polls with PAPERCLIP_DEV_API_KEY if set, falling back to
PAPERCLIP_API_KEY. The dev key is inherited through job-manifest.ts
from the pod's inherited env.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-25 14:36:11 +00:00
Chris Farhood 4a95fc0f2d fix: exponential backoff on log stream reconnect (FAR-70)
Replace fixed 3s reconnect delay with exponential backoff (3s → 6s → 12s → 24s → capped at 30s) to avoid hammering the K8s API server during prolonged network blips while remaining responsive during brief disconnects.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-25 14:31:58 +00:00
Chris Farhood 46ce5cc599 feat: dedicated PVC per agent for OPENCODE_DB (FAR-63, Option B)
Replaces the Option A shared-PVC path implementation with a long-lived
dedicated PVC per agent, mounted at /opencode-db with OPENCODE_DB=/opencode-db.

Changes:
- k8s-client.ts: add getPvc/createPvc/deletePvc CoreV1Api helpers
- execute.ts: add ensureAgentDbPvc() that gets-or-creates a PVC named
  opencode-db-<agentId> before Job creation; pass agentDbClaimName through
  to buildJobManifest; return null for ephemeral mode (emptyDir used instead)
- job-manifest.ts: accept agentDbClaimName on JobBuildInput; mount dedicated
  PVC or emptyDir at /opencode-db; set OPENCODE_DB=/opencode-db; revert init
  container to simple form (no mkdir, no PVC mount)
- config-schema.ts: replace opencodeDbMode/opencodeDbPath with agentDbMode
  (dedicated_pvc|ephemeral, default dedicated_pvc), agentDbStorageClass
  (required for dedicated_pvc), agentDbStorageCapacity (default 1Gi)
- test.ts: add create/delete RBAC checks for persistentvolumeclaims
- pvc.test.ts: unit tests for ensureAgentDbPvc (7 cases incl. error paths)
- 289/289 tests pass; typecheck clean
- No agent-delete hook exists; opencode-db PVC janitor routine is a deferred
  follow-up task

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-25 12:38:54 +00:00
Chris Farhood 5fa9e1396e fix: poll issue status instead of heartbeat-run for cancel detection (FAR-60)
The cancel poller was calling GET /api/heartbeat-runs/{runId} which
returned 401 because the adapter key lacks access to the internal
heartbeat-runs endpoint. Switch to GET /api/issues/{issueId}, which
the adapter key can read. Also tighten the trigger condition from
status !== "running" to status === "cancelled" so that other terminal
states (done, blocked, etc.) do not abort the K8s job.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-25 12:25:41 +00:00
Chris Farhood 80d18005f9 fix: wait for concurrent job to finish instead of returning permanent blocked error (FAR-61)
When multiple tasks are assigned simultaneously, only one K8s job can run
at a time (shared PVC/session guard). Previously, all other tasks received
k8s_concurrent_run_blocked immediately and stayed blocked forever.

Now the guard retries once: wait for all blocking jobs to complete via
waitForJobCompletion, then re-check before proceeding to create a new job.
If the re-check still shows a running job, the error is returned as before.

The agentCreationMutex already serializes guard-check + job-create, so
tasks naturally queue up and execute one at a time without concurrent jobs.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-25 11:10:27 +00:00
Chris Farhood 2bd8107f1d fix: skills not bundled and resumeLastSession ignored (FAR-56, FAR-57)
Two bugs prevented skill content from reaching K8s Job prompts, and
resumeLastSession: false was silently ignored.

Skills fix (execute.ts, FAR-57):
- Add /paperclip/.claude/skills as additional candidate to
  readPaperclipRuntimeSkillEntries — the relative candidates in
  adapter-utils don't resolve to the PVC-mounted skills home
- Read entry.source/SKILL.md instead of entry.source (which is a
  directory path); fall back to source directly for file-based entries
- Mock readPaperclipRuntimeSkillEntries in execute.test.ts to prevent
  real SKILL.md reads from delaying fake-timer registration

Session fix (job-manifest.ts, FAR-56):
- Gate --session flag on asBoolean(config.resumeLastSession, true)
  so setting resumeLastSession: false actually stops session resumption
- Default true preserves existing behaviour for agents without config

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 10:11:47 +00:00
Chris Farhood 0e94e84e2c fix: grace poller checks pod liveness before giving up on log stream (FAR-52)
The K8s log client v1.x closes the follow-stream prematurely due to a
known upstream bug — causing the grace timer to fire 30 s after log
stream exit even when the container is still running.  The old behaviour
(`waitForPodTermination` with a hardcoded 120 s timeout) was too short
for agents whose opencode runs take several minutes, leading to premature
failure and issues stuck in `blocked`.

Fix: the grace poller now calls `readNamespacedPod` before resolving the
completion promise.  If the pod is still Running/Pending, it resets
`logExitTime` to defer the grace deadline.  A `graceCheckPending` guard
prevents concurrent checks.  A `graceMaxWaitMs` cap (= completionTimeoutMs
when set, 20 min otherwise) ensures we never wait forever for unlimited
jobs.  Version bumped to 0.1.21.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 01:40:21 +00:00
Chris Farhood 2625b8ffb3 fix: wait for pod termination before readPodLogs fallback (FAR-52)
The @kubernetes/client-node v1.x Log.follow stream closes prematurely
(known upstream TODO). Combined with Node.js buffering stdout to pipes,
the live log stream always returns empty. When the 30s grace timer fires
and the stream is empty, the container may still be running.

Add waitForPodTermination() to block in the empty-stdout fallback path
until the container actually exits (up to 120s), then read its complete
output with readNamespacedPodLog. This makes runs complete successfully
instead of looping indefinitely in in_progress.

Bump version to 0.1.20.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-25 01:21:40 +00:00
Chris Farhood 2b4049464c feat: per-agent mutex, fail-closed guard, SIGTERM cleanup (FAR-40)
- Add agentCreationMutex (Map<agentId, Promise>) that serializes
  guard-check + job-create per agent, eliminating the TOCTOU race where
  two concurrent execute() calls both pass the list-then-create check.
- Change catch {} on listNamespacedJob errors to return
  errorCode: "k8s_concurrency_guard_unreachable" (fail-closed) instead
  of silently bypassing the concurrency guard.
- Add ensureSigtermHandler() which tracks active Jobs in activeJobs Map
  and deletes all of them (plus prompt Secrets) on SIGTERM before exit.
- Track orphaned-job reattaches in activeJobs for consistent cleanup.
- Update execute.test.ts: change "proceeds on list error" test to assert
  k8s_concurrency_guard_unreachable; add mutex serialization test and
  SIGTERM handler registration tests.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-25 00:22:17 +00:00
Chris Farhood c05d1d7515 feat: log stream reconnect, dedup, bail, keepalive (FAR-38)
- Split streamPodLogs into streamPodLogsOnce (with bail timer + stopSignal)
  and streamPodLogs (reconnect loop, up to MAX_LOG_RECONNECT_ATTEMPTS=50)
- LogLineDedupFilter suppresses replayed JSONL events on reconnect, keyed
  by type+sessionID+part.id (OpenCode shape)
- Bail timer (LOG_STREAM_BAIL_TIMEOUT_MS=3s) forces writable.destroy() +
  promise resolution when stopSignal fires and logApi.log hangs
- Keepalive: emits '[paperclip] keepalive — job X running (Ns since last output)'
  every 15s during silent phases, with 2-consecutive-reading latch to avoid
  false-positive terminal detections
- completionGraced uses logExitTime + grace poller so log stream stop signal
  is set immediately when job condition resolves
- All 235 tests pass, tsc clean

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-24 22:14:36 +00:00
Chris Farhood 61d2a42a66 feat: inherit valueFrom/envFrom env from Deployment; prefer paperclip container
- SelfPodInfo gains inheritedEnvValueFrom (V1EnvVar[]) and inheritedEnvFrom (V1EnvFromSource[])
- Container selection now prefers the container named "paperclip", falls back to first
- buildJobManifest appends valueFrom env vars (skipping names already overridden)
  and sets envFrom on the opencode container when present
- Tests updated: mock updated, 5 new cases covering secretKeyRef forwarding,
  dedup, envFrom passthrough, and empty-envFrom omission

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-24 22:12:31 +00:00
Chris Farhood 84dc0f5930 fix: merge parsedError + podFailureDescription so OOMKilled surfaces in errorMessage
When both a JSONL error (e.g. "killed") and a pod terminated reason (e.g. "OOMKilled")
are present, join them with "; " so the richer pod classification is never silently
dropped by the parsedError short-circuit.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-24 22:10:42 +00:00
Chris Farhood d60afaebcd feat: pod-failure classification, partial stdout fallback, llm_api_error
- Replace getPodExitCode with getPodTerminatedInfo to capture exit code
  and reason (OOMKilled, Error, etc.) from terminated container state;
  pod failure description now surfaces in returned errorMessage
- Add partial-stdout fallback: readPodLogs is triggered when stdout is
  non-empty but contains no sessionId (missing session result), not just
  when stdout is fully empty
- Detect empty LLM response: when a session ran but produced 0 output
  tokens and no messages, return errorCode "llm_api_error"
- Add 13 new unit tests covering all three new paths

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-24 22:09:33 +00:00
Chris Farhood 13c2a3032b feat: UI parser kinds, nodeSelector textarea, step-limit session clear, per-line path redaction
- ui-parser: add thinking kind + handler for standalone thinking events, thinking
  blocks in assistant content arrays, and user-turn tool_result blocks
- job-manifest: parseKeyValueOrObject helper so nodeSelector (and labels) accept
  key=value textarea lines in addition to JSON objects
- parse: isOpenCodeStepLimitResult detects step_finish with max_turns / max_steps /
  step_limit reason
- execute: return clearSession:true when step limit reached so next run starts fresh;
  redactHomePathUserSegments moved to per-line to prevent paths split across chunks
- tests: ui-parser.test.ts (new), extended parse.test.ts and job-manifest.test.ts

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-24 22:01:35 +00:00
Pawla Abdul e53bcf2501 Replace local utility stubs with fork's adapter-utils imports
- Replace joinPromptSections, stringifyPaperclipWakePayload, and
  renderPaperclipWakePrompt with imports from adapter-utils/server-utils
  (the fork's renderPaperclipWakePrompt adds execution stage routing,
  resume delta sections, and full comment batch rendering)
- Replace local inferOpenAiCompatibleBiller with import from adapter-utils
- Declare sessionManagement using getAdapterSessionManagement("opencode_local")
  with fallback defaults for proper session compaction policy
- Add log redaction via redactHomePathUserSegments in streamPodLogs
- Bump peerDependency to >=0.3.1 and version to 0.1.14

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-14 11:00:33 +00:00
Chris Farhood be7c525063 Initial commit
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-11 23:08:05 -04:00