fix: P0+P1 correctness fixes (FAR-107 PR 1-2/3) #3

Merged
farhoodliquor-paperclip[bot] merged 3 commits from fix/p0-correctness-far107 into master 2026-04-20 19:41:17 +00:00

3 Commits

Author SHA1 Message Date
Test User b45cc29787 chore: bump version to 0.1.25 for PR #3
Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-20 19:40:26 +00:00
Test User 1e517bb9bb fix: P1 correctness and operational fixes from FAR-104/FAR-105 analysis
5. Cap log stream reconnect attempts at 50 — prevents infinite
   reconnect loops during sustained API partitions.

6. Fire keepalive refresh earlier — tick 1 + every 12 ticks (~3min)
   instead of every 16 ticks (~4min), providing better safety margin
   under the 5-minute reaper window.

7. Catch rejections from onLog inside keepalive — add .catch(() => {})
   to prevent unhandledRejection on SSE backpressure.

8. Prevent sanitized-name collisions — extend slugs to 16 chars each,
   add a 6-char SHA-256 hash suffix, shorten prefix to `ac-` to stay
   well within the 63-char DNS label limit.

10. Fix config-hint parity for nodeSelector and labels — parse both
    `key=value` multiline text and JSON objects, matching what the
    textarea hint promises.

11. Large-prompt fallback via Secret — prompts >256 KiB are staged as a
    K8s Secret and mounted as a volume instead of passed via env var,
    protecting against the ~1 MiB PodSpec limit.

13. Track last-seen log timestamp on reconnect: anchor sinceSeconds at
    the last received log line instead of stream start, fixing the
    duplicate log lines reported in FAR-105. Belt-and-braces: dedupe
    assistantTexts at the parser boundary in parse.ts.
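
As an illustration of item 13, the reconnect anchor could be computed roughly as below. This is a minimal sketch, not the code in this PR: `sinceSecondsForReconnect` and its parameters are hypothetical names, and the clamp to a minimum of 1 assumes the log endpoint expects a positive integer. Rounding up can replay up to a second of already-seen lines, which is presumably why a parser-side dedupe is still useful.

```typescript
// Hypothetical helper: derive sinceSeconds for a log-stream reconnect.
// lastSeenMs is the wall-clock time (ms) of the last log line received,
// or null if the stream has not produced any line yet.
export function sinceSecondsForReconnect(
  lastSeenMs: number | null,
  streamStartMs: number,
  nowMs: number,
): number {
  // Prefer the last-seen line; fall back to stream start only when
  // nothing has been received so far.
  const anchorMs = lastSeenMs ?? streamStartMs;
  // Round up so no line is skipped; this may replay a few lines.
  const elapsedSeconds = Math.ceil((nowMs - anchorMs) / 1000);
  // Assumption: the API expects a positive integer.
  return Math.max(1, elapsedSeconds);
}
```

With a stream started at t=0, a last line at t=10 s, and a reconnect at t=12.5 s, this requests 3 seconds of history rather than 13.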

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-20 19:05:07 +00:00
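
The naming scheme from item 8 can be sketched as follows. This is an illustration under stated assumptions rather than the PR's implementation: the function names, the choice of two inputs (`agentId`, `runId`), and the exact sanitization regex are all hypothetical. Only the `ac-` prefix, the 16-character slugs, the 6-character SHA-256 suffix, and the 63-character limit come from the commit message.

```typescript
import { createHash } from "node:crypto";

// Hypothetical sanitizer: lowercase, collapse runs of non-alphanumerics
// into "-", trim dashes, then truncate to maxLen characters.
function slug(input: string, maxLen: number): string {
  return input
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "")
    .slice(0, maxLen)
    .replace(/-+$/g, ""); // truncation may leave a trailing dash
}

// Hypothetical name builder: "ac-" + two slugs + 6-char hash suffix.
// The hash is taken over the raw inputs, so two inputs that sanitize to
// the same slug are still very likely to get different names.
export function jobName(agentId: string, runId: string): string {
  const hash = createHash("sha256")
    .update(`${agentId}\u0000${runId}`)
    .digest("hex")
    .slice(0, 6);
  return ["ac", slug(agentId, 16), slug(runId, 16), hash]
    .filter((part) => part.length > 0)
    .join("-");
}
```

Under these assumptions the longest possible name is 2 + 1 + 16 + 1 + 16 + 1 + 6 = 43 characters, which leaves headroom under the 63-character DNS label limit.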
Test User d74b6d34b3 fix: P0 correctness fixes from FAR-104/FAR-105 analysis
1. Inherit envFrom and env.valueFrom from self pod: secrets wired via
   valueFrom.secretKeyRef or envFrom.secretRef are now forwarded to Job
   pods. Previously, credentials supplied through these K8s-idiomatic
   secret patterns (e.g. ANTHROPIC_API_KEY via Secret) were silently
   dropped.

2. Distinguish 404 vs transient errors in keepalive — only mark the
   keepalive as terminal on 404 (Job deleted). Transient 5xx/connection
   errors are logged and retried on the next tick, preventing premature
   reaper kills during API instability.

3. Fail closed on concurrency-guard read failure — a failing
   listNamespacedJob now returns k8s_concurrency_guard_unreachable
   instead of silently proceeding, protecting against zombie Jobs on
   shared PVCs.

4. Bound the waitForJobCompletion re-check — pass a 60s timeout instead
   of polling forever, preventing indefinite hangs when the K8s API is
   degraded.
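
To make item 1 concrete, the forwarding step might look like the sketch below. The types are simplified stand-ins for the Kubernetes container fields, and `inheritEnv`, the override parameter, and the Secret names in the usage note are hypothetical. The actual change may merge or filter entries differently.

```typescript
// Simplified stand-ins for the Kubernetes EnvVar / EnvFromSource shapes.
interface EnvVar {
  name: string;
  value?: string;
  valueFrom?: Record<string, unknown>;
}
interface EnvFromSource {
  prefix?: string;
  secretRef?: { name: string };
  configMapRef?: { name: string };
}
interface ContainerEnv {
  env?: EnvVar[];
  envFrom?: EnvFromSource[];
}

// Hypothetical helper: copy env (including valueFrom entries) and envFrom
// from the self pod's container, letting explicit overrides win by name.
export function inheritEnv(
  self: ContainerEnv,
  overrides: EnvVar[] = [],
): Required<ContainerEnv> {
  const byName = new Map<string, EnvVar>();
  // Keep whole entries, so secretKeyRef-style references are preserved
  // instead of only literal `value` strings.
  for (const entry of self.env ?? []) byName.set(entry.name, { ...entry });
  for (const entry of overrides) byName.set(entry.name, { ...entry });
  return {
    env: [...byName.values()],
    envFrom: (self.envFrom ?? []).map((source) => ({ ...source })),
  };
}
```

For example, a self pod that reads ANTHROPIC_API_KEY through `valueFrom.secretKeyRef` from a hypothetical Secret would yield a Job container spec that carries the same reference, not an empty value.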

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-20 18:57:16 +00:00
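
The 404-versus-transient rule from item 2 can be sketched as a small classifier. This is a hedged illustration, not the PR's code: `classifyKeepaliveError` is a hypothetical name, and the error fields probed here (`statusCode`, `code`, `response.statusCode`) are assumptions, since the shape of a thrown error depends on the Kubernetes client and its version.

```typescript
type KeepaliveOutcome = "terminal" | "retry";

// Assumed error shapes; adjust to whatever the client in use throws.
function extractStatus(err: unknown): number | undefined {
  if (typeof err !== "object" || err === null) return undefined;
  const e = err as {
    statusCode?: unknown;
    code?: unknown;
    response?: { statusCode?: unknown } | null;
  };
  for (const candidate of [e.statusCode, e.code, e.response?.statusCode]) {
    if (typeof candidate === "number") return candidate;
  }
  return undefined;
}

// Only a 404 (Job deleted) ends the keepalive. Everything else, including
// 5xx responses and connection errors without a status, is retried on the
// next tick.
export function classifyKeepaliveError(err: unknown): KeepaliveOutcome {
  return extractStatus(err) === 404 ? "terminal" : "retry";
}
```

A caller would log the error in the "retry" case and leave the timer running, and stop the keepalive only on "terminal".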