paperclip-adapter-opencode-k8s

farhoodlabs/paperclip-adapter-opencode-k8s

Author	SHA1	Message	Date
Chris Farhood	e3af8aa83b	fix(server): make tailPodLogFile exit on job completion + port c8429cf - Run tailPodLogFile and waitForJobCompletion in parallel via Promise.allSettled; completion sets stopSignal.stopped so the tail loop drains and exits. Without this, tailPodLogFile loops forever — the only natural exit was fh.stat() throwing on file removal, which never happened during normal job completion. - Restructure tail loop to read-then-sleep, with a final drain after stopSignal is set to capture bytes written between the last poll and terminal state. - Port the c8429cf fix from paperclip-adapter-claude-k8s: * buildPodLogPath now writes to /paperclip/instances/default/data/run-logs/... to match the server PVC layout (the /data/ segment was missing). * Drop the mkdir -p ... && from both init container command variants — the PVC isn't mounted in the init container, so the mkdir was failing with exit code 1 and the && short-circuit prevented the prompt copy. - Test infrastructure: * Hoisted fs/promises mock now uses importOriginal so readFile (used for skill bundle loading) hits the real implementation. * setMockJsonl() lets individual tests inject specific JSONL into the tail's read buffer (previously dead constants in the test file). * fh.read mock now writes into the caller's buffer instead of returning a separate one. - Add src/server/test.test.ts covering testEnvironment (was 0% → 98.5% stmts). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 14:57:40 -04:00
Chris Farhood	bc340bfcc9	fix: correct fs mock with vi.hoisted for proper per-test reset The vi.mock("node:fs/promises") factory previously used a closure variable that accumulated across tests despite vi.clearAllMocks(). Switched to vi.hoisted() with an explicit resetFsMocks() called in beforeEach() so the read offset counter is properly reset between tests. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-30 13:55:12 -04:00
Chris Farhood	c71d0e5eec	feat: replace K8s log streaming with PVC filesystem tailing - Replaced streamPodLogs / streamPodLogsOnce / readPodLogs / waitForPodTermination with tailPodLogFile() that polls a shared PVC file path with adaptive cadence (250ms active, 1000ms idle after 5 consecutive empty polls) - Added buildPodLogPath() export and podLogPath to JobBuildResult - Added assertSafePathComponent with [a-zA-Z0-9-:] allowance for UUIDs - Updated Job manifest to tee stdout to /paperclip/instances/default/run-logs/<companyId>/<agentId>/<runId>.pod.ndjson - Added hasOutOfProcessLiveness: true to createServerAdapter (cast required) - Deleted log-dedup.ts and log-dedup.test.ts entirely - Removed all LogLineDedupFilter, Writable, and LOG_STREAM_* constants - Removed completionResult.status workaround (completionWithGrace returns directly) - Test infrastructure: mocked node:fs/promises to prevent unmocked fs.stat hangs Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-30 13:55:12 -04:00
Chris Farhood	2d057f085d	refactor: remove PAPERCLIP_DEV_API_KEY runtime hack throughout Cancel poll now uses ctx.authToken exclusively. Remove forwarding of PAPERCLIP_DEV_API_KEY into job pods and all associated tests. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-27 07:24:14 -04:00
Chris Farhood	985d55e125	fix(cancel-poll): use ctx.authToken instead of process.env for cancel polling The cancel poll was sending empty Authorization headers because PAPERCLIP_API_KEY is not set on the Paperclip server pod. Use the per-run authToken from ctx instead, which is the JWT issued by Paperclip for this execution. PAPERCLIP_DEV_API_KEY still overrides for dev instances. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-27 07:11:47 -04:00
Chris Farhood	4fcd3b4547	fix test: stub PAPERCLIP_DEV_API_KEY before each cancel-poll test The cancel-poll test sets PAPERCLIP_API_KEY='test-key' but the actual PAPERCLIP_DEV_API_KEY was leaking through from the harness environment. Since execute.ts prefers PAPERCLIP_DEV_API_KEY over PAPERCLIP_API_KEY, the poll was sending the real dev key instead of 'test-key'. Fix: add beforeEach to set PAPERCLIP_DEV_API_KEY='test-key', and afterEach to clean both env vars. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-25 23:40:41 +00:00
Chris Farhood	798b80f2f2	test: push coverage to 90%+ on lines for all files except execute.ts (FAR-85) Overall before: 80.36% lines / 79.06% statements Overall after: 94.65% lines / 93.30% statements Per-file lines coverage (all targets ≥90% except execute.ts): \| File \| Before \| After \| \|-------------------\|--------\|--------\| \| ui-parser.ts \| 93.63% \| 99.09% \| \| cli/format-event \| 59.85% \| 99.27% \| \| server/execute \| 81.47% \| 89.64% \| \| server/job-mfst \| 90.30% \| 98.78% \| \| server/k8s-client \| 37.50% \| 95.83% \| \| server/log-dedup \| 97.77% \| 97.77% \| \| server/parse \| 89.85% \| 98.55% \| \| server/skills \| 100% \| 100% \| New tests added: - k8s-client.test.ts: getSelfPodInfo (env-var inheritance, secret volumes, PVC discovery, dnsConfig, all error paths) + kubeconfig file branch - format-event.test.ts: parseStdoutLine (cli) — full event-type matrix, tool_use status branches, errorText fallback paths - ui-parser.test.ts: errorText edge cases, empty event paths - parse.test.ts: errorText fallback to data.message, name, code, JSON - job-manifest.test.ts: workspace context env wiring, linkedIssueIds, paperclipWorkspaces/RuntimeServices JSON, authToken, inherited URLs, prompt-secret + data PVC + secret-volume mount paths - execute.test.ts: parseModelProvider, completionWithGrace, instructionsFilePath read failure, ensureAgentDbPvc throw paths, large-prompt secret create failure, step-limit detection, waitForPod no-pod messaging, init-container ImagePullBackOff / CrashLoopBackOff, main-container CrashLoopBackOff, all-inits-done happy path, skill bundle source loading (SKILL.md + flat-file fallback), SIGTERM handler full body via vi.resetModules() execute.ts remains at 89.64% lines — the residual gap is deep async/timer paths inside streamAndAwaitJob (grace poller, keepalive ticker, log-stream stop-signal/bail timer). Those need fake-timer scaffolding heavier than this batch warrants; tracking separately. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-25 22:27:04 +00:00
Chris Farhood	3daf2dd676	fix: detect 404 from @kubernetes/client-node v1.x ApiException (FAR-85) The v1.x ApiException exposes the HTTP status as `code`, not `statusCode`. Both `isNotFound` (k8s-client) and `isK8s404` (execute) only checked `statusCode`/`response.statusCode`, so 404s were never recognized: - `getPvc` re-threw the 404 instead of returning null, which bubbled up through `ensureAgentDbPvc` as `k8s_job_create_failed` with the raw "persistentvolumeclaims X not found" body — the symptom in FAR-85. - The PVC was never actually created, because the existence check threw before reaching `createPvc`. Add `code === 404` to both predicates and a regression test for `isK8s404`. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-25 21:53:59 +00:00
Chris Farhood	139a387508	test: add ensureAgentDbPvc unit tests (FAR-63) Seven direct unit tests for ensureAgentDbPvc covering ephemeral mode, existing PVC (no create), PVC creation with storage class/capacity, missing storage class error, default mode, and agent ID slug derivation. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-25 12:41:42 +00:00
Chris Farhood	46ce5cc599	feat: dedicated PVC per agent for OPENCODE_DB (FAR-63, Option B) Replaces the Option A shared-PVC path implementation with a long-lived dedicated PVC per agent, mounted at /opencode-db with OPENCODE_DB=/opencode-db. Changes: - k8s-client.ts: add getPvc/createPvc/deletePvc CoreV1Api helpers - execute.ts: add ensureAgentDbPvc() that gets-or-creates a PVC named opencode-db-<agentId> before Job creation; pass agentDbClaimName through to buildJobManifest; return null for ephemeral mode (emptyDir used instead) - job-manifest.ts: accept agentDbClaimName on JobBuildInput; mount dedicated PVC or emptyDir at /opencode-db; set OPENCODE_DB=/opencode-db; revert init container to simple form (no mkdir, no PVC mount) - config-schema.ts: replace opencodeDbMode/opencodeDbPath with agentDbMode (dedicated_pvc\|ephemeral, default dedicated_pvc), agentDbStorageClass (required for dedicated_pvc), agentDbStorageCapacity (default 1Gi) - test.ts: add create/delete RBAC checks for persistentvolumeclaims - pvc.test.ts: unit tests for ensureAgentDbPvc (7 cases incl. error paths) - 289/289 tests pass; typecheck clean - No agent-delete hook exists; opencode-db PVC janitor routine is a deferred follow-up task Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-25 12:38:54 +00:00
Chris Farhood	5fa9e1396e	fix: poll issue status instead of heartbeat-run for cancel detection (FAR-60) The cancel poller was calling GET /api/heartbeat-runs/{runId} which returned 401 because the adapter key lacks access to the internal heartbeat-runs endpoint. Switch to GET /api/issues/{issueId}, which the adapter key can read. Also tighten the trigger condition from status !== "running" to status === "cancelled" so that other terminal states (done, blocked, etc.) do not abort the K8s job. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-25 12:25:41 +00:00
Chris Farhood	80d18005f9	fix: wait for concurrent job to finish instead of returning permanent blocked error (FAR-61) When multiple tasks are assigned simultaneously, only one K8s job can run at a time (shared PVC/session guard). Previously, all other tasks received k8s_concurrent_run_blocked immediately and stayed blocked forever. Now the guard retries once: wait for all blocking jobs to complete via waitForJobCompletion, then re-check before proceeding to create a new job. If the re-check still shows a running job, the error is returned as before. The agentCreationMutex already serializes guard-check + job-create, so tasks naturally queue up and execute one at a time without concurrent jobs. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-25 11:10:27 +00:00
Chris Farhood	2bd8107f1d	fix: skills not bundled and resumeLastSession ignored (FAR-56, FAR-57) Two bugs prevented skill content from reaching K8s Job prompts, and resumeLastSession: false was silently ignored. Skills fix (execute.ts, FAR-57): - Add /paperclip/.claude/skills as additional candidate to readPaperclipRuntimeSkillEntries — the relative candidates in adapter-utils don't resolve to the PVC-mounted skills home - Read entry.source/SKILL.md instead of entry.source (which is a directory path); fall back to source directly for file-based entries - Mock readPaperclipRuntimeSkillEntries in execute.test.ts to prevent real SKILL.md reads from delaying fake-timer registration Session fix (job-manifest.ts, FAR-56): - Gate --session flag on asBoolean(config.resumeLastSession, true) so setting resumeLastSession: false actually stops session resumption - Default true preserves existing behaviour for agents without config Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 10:11:47 +00:00
Chris Farhood	38ed261063	fix(test): widen capturedHandler cast to resolve TS2349 never narrowing (FAR-40) TypeScript CFA does not trace the assignment inside the vi.spyOn mockImplementation callback, so it narrows capturedHandler to null at the if-check, making the body unreachable (never). Cast at the call site breaks the false narrowing without changing runtime behaviour. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-25 00:31:27 +00:00
Chris Farhood	2b4049464c	feat: per-agent mutex, fail-closed guard, SIGTERM cleanup (FAR-40) - Add agentCreationMutex (Map<agentId, Promise>) that serializes guard-check + job-create per agent, eliminating the TOCTOU race where two concurrent execute() calls both pass the list-then-create check. - Change catch {} on listNamespacedJob errors to return errorCode: "k8s_concurrency_guard_unreachable" (fail-closed) instead of silently bypassing the concurrency guard. - Add ensureSigtermHandler() which tracks active Jobs in activeJobs Map and deletes all of them (plus prompt Secrets) on SIGTERM before exit. - Track orphaned-job reattaches in activeJobs for consistent cleanup. - Update execute.test.ts: change "proceeds on list error" test to assert k8s_concurrency_guard_unreachable; add mutex serialization test and SIGTERM handler registration tests. Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-25 00:22:17 +00:00
Chris Farhood	0df00f7d95	test(execute): large-prompt Secret path coverage - Add describe block "execute — large-prompt Secret path" with 5 cases: buildJobManifest called twice (promptSecretName on second call), Secret created before Job, ownerReference patched after Job creation, Secret deleted in finally block, Secret cleaned up on Job create failure - Update vi.mock for job-manifest to export LARGE_PROMPT_THRESHOLD_BYTES - Add createNamespacedSecret/deleteNamespacedSecret/patchNamespacedSecret to makeCoreApi for completeness - Update makeBatchApi to return { metadata: { uid } } so ownerRef tests work Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-24 22:16:34 +00:00
Chris Farhood	61d2a42a66	feat: inherit valueFrom/envFrom env from Deployment; prefer paperclip container - SelfPodInfo gains inheritedEnvValueFrom (V1EnvVar[]) and inheritedEnvFrom (V1EnvFromSource[]) - Container selection now prefers the container named "paperclip", falls back to first - buildJobManifest appends valueFrom env vars (skipping names already overridden) and sets envFrom on the opencode container when present - Tests updated: mock updated, 5 new cases covering secretKeyRef forwarding, dedup, envFrom passthrough, and empty-envFrom omission Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-24 22:12:31 +00:00
Chris Farhood	d60afaebcd	feat: pod-failure classification, partial stdout fallback, llm_api_error - Replace getPodExitCode with getPodTerminatedInfo to capture exit code and reason (OOMKilled, Error, etc.) from terminated container state; pod failure description now surfaces in returned errorMessage - Add partial-stdout fallback: readPodLogs is triggered when stdout is non-empty but contains no sessionId (missing session result), not just when stdout is fully empty - Detect empty LLM response: when a session ran but produced 0 output tokens and no messages, return errorCode "llm_api_error" - Add 13 new unit tests covering all three new paths Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-04-24 22:09:33 +00:00
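
Note on c71d0e5eec: the commit message describes tailPodLogFile() polling with an adaptive cadence (250ms while active, 1000ms once 5 consecutive polls come back empty). The sketch below illustrates only that cadence rule. It is not the repository's code: the names `PollCadence`, `ACTIVE_POLL_MS`, `IDLE_POLL_MS`, and `IDLE_AFTER_EMPTY_POLLS` are hypothetical, and the exact boundary (switching to the idle delay on the fifth empty poll rather than the sixth) is an assumption.

```typescript
// Values come from the c71d0e5eec commit message; all names are hypothetical.
const ACTIVE_POLL_MS = 250;
const IDLE_POLL_MS = 1000;
const IDLE_AFTER_EMPTY_POLLS = 5;

class PollCadence {
  // Number of consecutive polls that returned no new bytes.
  private emptyPolls = 0;

  /** Record how many bytes one poll read and return the delay before the next poll. */
  next(bytesRead: number): number {
    // Any new data resets the streak; an empty poll extends it.
    this.emptyPolls = bytesRead > 0 ? 0 : this.emptyPolls + 1;
    // Assumption: the idle delay starts with the fifth consecutive empty poll.
    return this.emptyPolls >= IDLE_AFTER_EMPTY_POLLS ? IDLE_POLL_MS : ACTIVE_POLL_MS;
  }
}
```

A tail loop could call `next()` after each read and sleep for the returned number of milliseconds, so a quiet log file is polled less often and polling speeds up again as soon as new bytes arrive.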

18 Commits
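
Note on 3daf2dd676: the commit message says 404s went undetected because the predicates checked only `statusCode` / `response.statusCode`, while the v1.x ApiException carries the status in `code`. A minimal sketch of a predicate that accepts all three layouts follows. The interface `K8sErrorLike` and the function `isNotFoundSketch` are hypothetical names for illustration, not the repository's `isNotFound` or `isK8s404`.

```typescript
// Hypothetical error shape covering the layouts named in the commit message.
interface K8sErrorLike {
  code?: number; // v1.x ApiException, per the commit message
  statusCode?: number; // older top-level layout
  response?: { statusCode?: number }; // older nested layout
}

function isNotFoundSketch(err: unknown): boolean {
  // Non-object throwables (strings, undefined, null) can never be a 404.
  if (typeof err !== "object" || err === null) return false;
  const e = err as K8sErrorLike;
  return e.code === 404 || e.statusCode === 404 || e.response?.statusCode === 404;
}
```

With a predicate of this shape, a get-or-create helper can map a 404 from the lookup to "does not exist yet" and go on to create the resource, while re-throwing any other error.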