.github/company/agents/savannah-savings/memory/2026-04-04.md
Pawla Abdul 3032f2fc0e chore: sync company/ export snapshot with current configuration
- Removes rollback-rhonda (decommissioned agent)
- Adds deal-dottie agent files (AGENTS.md, mcp.json)
- Updates .paperclip.yaml: removes rollback-rhonda, adds deal-dottie
- Updates skills directory to match current export
- Updates all active agent AGENTS.md files and memory/life files

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-04-06 08:59:29 +00:00


# 2026-04-04
## Heartbeat 1 — UAT TLS Cert Investigation
- Woken for CAR-472 (UAT Regression blocked). Deal Dottie failed UAT regression twice due to `ERR_CERT_COMMON_NAME_INVALID`.
- Investigated cert on `cartsnitch.uat.farh.net:443`:
  - Issuer: Let's Encrypt R13
  - CN: `*.farh.net`
  - SANs: `*.dev.farh.net`, `*.farh.net`, `farh.net`
  - **Missing: `*.uat.farh.net`**. A wildcard matches exactly one subdomain label, so `*.farh.net` does not cover `cartsnitch.uat.farh.net`
- No cert-manager in `cartsnitch/infra` repo — TLS is fully board-managed
- Updated CAR-472 to blocked with root cause, escalated to CEO for board action
- CAR-447 (UAT env setup) also blocked on this — couldn't comment due to run ownership conflict from prior run
- CAR-80 (email receipt ingestion) code-complete, still waiting on UAT — no new context, skipped update
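
The single-label wildcard rule behind the `ERR_CERT_COMMON_NAME_INVALID` finding can be sketched as below. This is a simplified illustration, not a full RFC 6125 matcher: it only handles a wildcard as the entire left-most label, and the helper names `wildcard_matches` and `covered` are hypothetical.

```python
# Simplified sketch: a left-most "*" label matches exactly one hostname label.
SANS = ["*.dev.farh.net", "*.farh.net", "farh.net"]  # SANs observed on the cert


def wildcard_matches(pattern: str, hostname: str) -> bool:
    """Return True if a certificate name covers the hostname."""
    p_labels = pattern.lower().split(".")
    h_labels = hostname.lower().split(".")
    if len(p_labels) != len(h_labels):
        # A wildcard never spans more than one label.
        return False
    if p_labels[0] == "*":
        return p_labels[1:] == h_labels[1:]
    return p_labels == h_labels


def covered(hostname: str, sans=SANS) -> bool:
    """Return True if any SAN entry covers the hostname."""
    return any(wildcard_matches(name, hostname) for name in sans)


print(covered("cartsnitch.dev.farh.net"))  # covered by *.dev.farh.net
print(covered("cartsnitch.uat.farh.net"))  # not covered by any SAN
```

Under these assumptions, adding `*.uat.farh.net` to the SAN list is what makes the UAT hostname match.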
## 04:04 UTC — Heartbeat (timer)
- **UAT TLS cert blocker resolved.** `*.uat.farh.net` wildcard cert now live (Let's Encrypt R12). `cartsnitch.uat.farh.net` returns HTTP 200 with valid SSL.
- Reassigned **CAR-472** (UAT regression for PR #114 — common `email_inbound_token` sync) back to Deal Dottie for retry.
- **CAR-80** (email receipt ingestion) remains `in_progress`, code-complete on `main`. Awaiting UAT regression pass + security review before production promotion.
- No other assigned work this heartbeat.
## ~04:09 UTC — Heartbeat (assignment: CAR-472 returned)
- **CAR-472 returned to CTO.** Deal Dottie's UAT regression attempt found auth 503 ("no healthy upstream") on ALL auth endpoints. All features blocked because nobody can log in.
- **Investigation results:**
  - UAT: both API and auth return 503. Frontend (nginx) serves fine. Only backends are down.
  - Dev: everything works (auth 200, API 404-as-expected).
  - Prod: API works, but auth is ALSO 503.
  - **UAT root cause:** CNPG database cluster likely never initialized in `cartsnitch-uat` namespace. Without DB, `secret-generator` Job can't create `cartsnitch-secrets`, so all backend pods crash. UAT Flux Kustomization was only recently added (PR #111).
  - **Prod auth root cause:** Base auth image tag `2026.03.30.4` doesn't exist in GHCR. Auth images started from `2026.04.01.x`. Prod overlay has no auth image override.
  - **Code bug:** `auth/src/auth.ts` `trustedOrigins` missing `https://cartsnitch.uat.farh.net`.
  - Auth build failure in latest CI (run 23960017574) was a transient Docker Hub TLS timeout, not a code issue.
- **Created tasks:**
  - CAR-474 → Betty: add UAT hostname to auth trustedOrigins
  - CAR-475 → Betty: fix prod auth image tag + base image reference
- **CAR-472 set to blocked.** Escalated to CEO for board investigation of CNPG in `cartsnitch-uat`.
## ~04:17 UTC — Heartbeat (assignment: CAR-471)
- **Woken for CAR-471** (UAT Regression for PR #114). Deal Dottie reported UAT FAIL — auth 503 on all endpoints.
- **Deep investigation of UAT namespace:**
  - 5 pods in `CreateContainerConfigError`: auth, api, email-worker, receiptwitness, pg-initdb
  - **Root cause chain:**
    1. `cartsnitch-pg-credentials` secret missing → CNPG can't bootstrap Postgres (initdb stuck 6h)
    2. No Postgres → `secret-generator` job fails → `cartsnitch-secrets` never created
    3. No `cartsnitch-secrets` → all backend pods fail
    4. UAT sealed secrets (receiptwitness-resend, receiptwitness-mailgun) encrypted for `cartsnitch-dev` namespace → can't decrypt in `cartsnitch-uat`
  - Only frontend + dragonfly pods running
- **Created CAR-476** → Betty (critical): create `cartsnitch-pg-credentials` SealedSecret for UAT, re-seal receiptwitness secrets for correct namespace, update kustomization.yaml
- **CAR-471 and CAR-472** both set to `blocked` pending CAR-476
- **Reviewed and merged PR #115** (auth trustedOrigins fix) — QA approved, clean 1-line change
- **Promoted to UAT** via PR #116 (dev→uat)
- **Reviewed and merged infra PR #112** (prod auth image tag fix) — QA approved
- **CAR-474 and CAR-475** marked done
## ~04:28 UTC — Heartbeat (assignment: CAR-474)
- **Woken for CAR-474** but already done from prior heartbeat. No action needed.
- **CAR-476** (sealed secrets fix): Betty opened infra PR #113, handed off to Charlie for QA review. PR is open and mergeable.
- **CAR-471, CAR-472** remain blocked on CAR-476. Blocked-task dedup applies — no new comments, skipped.
- **CAR-475** (prod auth image fix): done, infra PR #112 merged.
- **CAR-80**: still in_progress, code-complete, awaiting UAT regression.
- **GitHub triage:** no untracked issues or PRs. All items tracked.
- **Next action:** Merge infra PR #113 after Charlie's QA approval, then unblock CAR-471/472 for Deal Dottie.
## ~04:33 UTC — Heartbeat (assignment: CAR-475)
- **Woken for CAR-475** (prod auth image fix) — already done.
- **Reviewed and merged infra PR #113** (UAT sealed secrets) — Charlie QA-approved. CTO approved and merged.
- **UAT recovery operations:**
  - CNPG Postgres bootstrapped successfully (cartsnitch-pg-1 Running)
  - Flux kustomization stuck on `cilium-config` dependency; manually re-ran secret-generator job
  - `cartsnitch-secrets` created (5 keys)
  - Restarted all backend deployments
  - **Auth, email-worker, receiptwitness, frontend: all Running**
- **API pod still failing:** `alembic-migrate` init container error: `No 'script_location' key found in configuration`
  - **Root cause:** API Dockerfile (`api/Dockerfile`) doesn't copy `alembic.ini` or `alembic/` directory into prod image (regression from monorepo migration)
  - Same issue present in dev (newer `2026.04.03.8` image pods crash, older `2026.04.03` pod still running)
- **Created CAR-477** → Betty (critical): fix API Dockerfile to include alembic config and migrations
- **CAR-471, CAR-472** remain blocked — now on CAR-477 (API pod fix) instead of CAR-476 (sealed secrets, now done)
- **GitHub triage:** no untracked items
- **Next action:** Once CAR-477 PR merges through QA → CTO → dev → uat, restart API pods and reassign UAT regressions to Deal Dottie
## Heartbeat ~04:45 UTC
- Woke for CAR-476 (sealed secrets fix) — already done from prior heartbeat
- Investigated UAT API pod crash: alembic-migrate init container missing config in Docker image
- PR #117 already open by Betty (fix: COPY alembic.ini and alembic/ into prod stage)
- QA approved, CTO reviewed and merged to dev
- Promoted dev→uat via PR #118
- Unblocked CAR-472, reassigned to Deal Dottie for full UAT regression
- CAR-80 still in holding pattern — code-complete, waiting on UAT regression + security review
## Heartbeat ~04:48 UTC
- Woke for CAR-477 (alembic Dockerfile fix) — QA passed PR #117, CTO approval already on record
- PR #117 already merged to dev, PR #118 already promoted to uat — both done in prior heartbeat
- **KEY DISCOVERY: CI pipeline only builds from `main` branch.** The SDLC dev→uat→main flow was never wired into CI. `uat` is 8 commits ahead of `main` with zero image builds. All recent merges to dev/uat are invisible to the deployed environments.
- API still running stale image `2026.04.03.8` (built from main, missing alembic fix)
- Auth returns 500 on sign-up (likely cascading from API/DB being down)
- This is the real root cause of all UAT failures — not individual code bugs
- **Created CAR-479** → Betty: fix CI workflow to build and deploy from dev and uat branches
- **Created CAR-478** → Deal Dottie: UAT regression for alembic fix (immediately set to blocked on CAR-479)
- **CAR-477** marked done
- **CAR-472** updated with root cause analysis, set to blocked on CAR-479
- **CAR-80** updated — still code-complete, all UAT regressions blocked on CAR-479
- **Critical path:** CAR-479 (CI fix) → merge to dev → CI builds from dev → promote to uat → CI builds from uat → UAT images deploy → Deal Dottie runs regression
## Heartbeat ~04:55 UTC
- Woke for CAR-472 (blocked). CAR-478 also blocked. Both on CAR-479 (CI fix).
- Betty opened PR #119 — CI workflow fix for dev/uat branch builds.
- Completed CTO review of PR #119 — approved. Clean, correct changes.
- Created CAR-480 for Charlie to QA review PR #119.
- Deduped blocked comments on CAR-472 and CAR-478 — no new context.
- Next: once Charlie approves, merge PR #119 to dev, promote to uat, create regression task for Dottie.
## Heartbeat ~05:25 UTC
- Woke for CAR-478 (UAT regression, blocked). No new comments since last blocked update — dedup, skipped.
- **CAR-482** (P0: CI sha_tag mismatch) was assigned to me by CEO. Engineering work — delegated to Betty with atomic instructions:
- Fix: change `type=sha,prefix=sha-` to `type=sha,prefix=sha-,format=long` in all four build jobs in `.github/workflows/ci.yml`
- Branch from `dev`, PR against `dev`
- **CAR-80** (email receipt ingestion): in_progress, code-complete, blocked on CAR-482 → CAR-478 chain. Last comment still current.
- No open PRs on cartsnitch/cartsnitch or cartsnitch/infra — Betty hasn't started yet.
- **Critical path:** Betty fixes CAR-482 → QA → CTO merge → promote to uat → Dottie runs CAR-478 regression
## Heartbeat ~05:37 UTC
- Woke for CAR-482 (P0 sha_tag fix). Betty opened PR #121, Charlie QA-approved.
- **CTO reviewed and approved PR #121** — all four build jobs have `format=long`. CI green.
- **Merged PR #121 to dev.**
- **Promoted dev→uat** via PR #122 — merged.
- **Unblocked CAR-478** — reassigned to Deal Dottie with updated context (includes sha_tag fix).
- **CAR-482 marked done.**
- Critical path now: CI builds from uat branch → images deployed → Dottie runs CAR-478 full regression
## Heartbeat ~06:05 UTC
- Woke for CAR-478. Deal Dottie sent it back with a note about Flux reconciliation delay. Updated CAR-80 status: still code-complete, awaiting UAT.
- No new PRs or issues to triage.
## Heartbeat ~06:10 UTC
- **Woke for CAR-478** — Deal Dottie UAT FAIL: auth 500 on `/auth/sign-up/email` and `/auth/sign-in/email`. Site loads, pages render, but auth broken. Different from prior 503.
- **Root cause:** Migration `005_add_email_inbound_token` adds `email_inbound_token` as NOT NULL without a PostgreSQL `server_default`. Better-Auth creates users via raw pg INSERT (bypasses SQLAlchemy ORM defaults) → NOT NULL constraint violation → 500.
- **Created CAR-483** → Betty (critical): new migration 006 to add `server_default` using `gen_random_bytes(16)` encoded as URL-safe base64, plus update `user.py` model.
- **CAR-478** set to `blocked` on CAR-483.
- **CAR-80** updated with new blocker chain.
- **GitHub triage:** No open issues or PRs on cartsnitch/cartsnitch or cartsnitch/infra.
- **Critical path:** Betty CAR-483 → QA → CTO merge → promote to UAT → Dottie regression → Steve security → CEO prod merge
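
The difference between an ORM-side default and a database-side default can be reproduced in miniature. The sketch below uses the standard-library `sqlite3` module as a stand-in for PostgreSQL, so the default expression (`randomblob`) and the trimmed-down `users` schema are assumptions for illustration only; the real migration uses `gen_random_bytes(16)`.

```python
import sqlite3

# Hypothetical minimal schemas; only email_inbound_token mirrors the log.
WITHOUT_DEFAULT = """
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    email TEXT NOT NULL,
    email_inbound_token TEXT NOT NULL
)"""

WITH_DEFAULT = """
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    email TEXT NOT NULL,
    email_inbound_token TEXT NOT NULL DEFAULT (lower(hex(randomblob(16))))
)"""


def try_raw_insert(create_sql: str) -> str:
    """Insert a row the way a non-ORM client would: without the token column."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(create_sql)
        try:
            conn.execute("INSERT INTO users (email) VALUES (?)", ("a@example.com",))
        except sqlite3.IntegrityError as exc:
            return f"failed: {exc}"
        token = conn.execute("SELECT email_inbound_token FROM users").fetchone()[0]
        return f"ok: token length {len(token)}"
    finally:
        conn.close()


print(try_raw_insert(WITHOUT_DEFAULT))  # raw INSERT violates NOT NULL
print(try_raw_insert(WITH_DEFAULT))     # database fills the column itself
```

A default declared only on the Python model never reaches the database, which is why a client that writes rows directly needs the default in the schema.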
## Heartbeat ~06:29 UTC — CAR-484 (UAT regression returned by Dottie)
- **Woken for CAR-484** — Deal Dottie UAT FAIL: sign-up still returns 500.
- **Root cause investigation:**
  - Auth pod logs: `relation "users" does not exist` (tables never created)
  - API pod: `Init:CrashLoopBackOff`, alembic-migrate init container crashing
  - alembic error: `ValueError: invalid interpolation syntax` at position 28 in DB URL
  - **Root cause:** The URL-encoded CNPG password in the DB URL contains `%` characters (e.g. `%2B`). Python's `configparser.BasicInterpolation`, reached through alembic's `config.set_main_option()`, interprets `%` as interpolation syntax → crash
  - Both `api/alembic/env.py` and `common/alembic/env.py` have this bug
  - The migration 006 fix (server_default) was correct but never had a chance to run
- **Created CAR-485** → Betty (critical): escape `%` as `%%` via `db_url.replace("%", "%%")` before passing the URL to `config.set_main_option()` in both env.py files
- **CAR-484** set to `blocked` on CAR-485
- **Critical path:** Betty CAR-485 → QA → CTO merge → promote to UAT → alembic runs → tables created → Dottie regression
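
The interpolation failure and the `%%` escape can be reproduced with `configparser` alone. This is a minimal sketch: the URL is a made-up example, and the `alembic` section plus `sqlalchemy.url` option are used here only to resemble what alembic stores, not to call alembic itself.

```python
import configparser


def set_url(url: str) -> str:
    """Store a URL in a ConfigParser (BasicInterpolation by default) and read it back."""
    parser = configparser.ConfigParser()
    parser.add_section("alembic")
    parser.set("alembic", "sqlalchemy.url", url)
    return parser.get("alembic", "sqlalchemy.url")


# Hypothetical URL whose password contains a URL-encoded "+" (%2B).
db_url = "postgresql://app:pa%2Bss@db.example:5432/app"

try:
    set_url(db_url)
    raw_result = "accepted"
except ValueError as exc:
    raw_result = f"rejected: {exc}"

# Doubling each "%" makes the value valid; reading it back restores single "%".
escaped_result = set_url(db_url.replace("%", "%%"))

print(raw_result)
print(escaped_result == db_url)
```

The escaped value round-trips to the original URL, so the database driver still receives the password unchanged.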
## Heartbeat — 06:37 UTC
- Woke for CAR-485 (issue_assigned) — alembic percent escape fix
- Betty wrote fix, Charlie QA'd and approved PR #125
- CTO reviewed and approved PR #125: correct fix for configparser % interpolation in alembic env.py
- Merged PR #125 to dev
- Created and merged PR #126 (dev→uat promotion)
- Created CAR-486: UAT regression task for Deal Dottie (critical)
- Updated CAR-484: unblocked, awaiting UAT regression
- Updated CAR-478: commented with latest status
- All blocked on Deal Dottie's UAT regression (CAR-486)
## Heartbeat — 06:41 UTC
- Woke for CAR-486 (issue_assigned) — Deal Dottie UAT FAIL: sign-up still 500
- **Root cause: premature test.** CI run #23973377745 (UAT build for PR #126) had `build-and-push-*` jobs queued waiting for runners. Dottie tested against old deployment without the percent escape fix.
- **Freed runners:** Cancelled stale PR branch run (#23973303092, lighthouse on merged branch) and superseded dev run (#23973372216). `build-and-push-api` now `in_progress`.
- **CAR-486** and **CAR-484** both set to `blocked` on CI deployment completing
- Once CI finishes building + deploying, need to reassign CAR-486 to Dottie for retry
- **Critical path:** CI build completes → deploy-uat updates infra → Flux reconciles → Dottie re-runs regression
## Heartbeat ~06:55 UTC — Timer
- CI run #23973377745 completed successfully on uat. Image sha `6f8e5a9` deployed to UAT.
- **Alembic percent escape fix working** — no more `ValueError: invalid interpolation syntax`
- **New error:** `ImportError: libpq.so.5: cannot open shared object file` in API pod
- **Root cause:** Multi-stage Dockerfile: `libpq-dev` in build stage for psycopg2 compilation, but prod stage (`python:3.12-slim`) missing runtime library `libpq5`
- Auth, email-worker, receiptwitness, frontend all Running. Only API broken.
- **Created CAR-487** → Betty (critical): add `RUN apt-get install libpq5` to API Dockerfile prod stage
- **CAR-486** blocked on CAR-487
- **CAR-484, CAR-478** — no new context, dedup applies
- **CAR-80** — still code-complete, blocked on UAT regression chain
- **Critical path:** Betty CAR-487 → QA → CTO merge → promote to uat → API pods recover → Dottie regression
## Heartbeat ~14:52 UTC — Timer
- All tasks still blocked. CAR-487 (libpq5 fix) is `in_review` assigned to Charlie.
- Betty opened PR #127 ~4 hours ago, CI all green, single-line diff confirmed correct.
- Charlie has only CAR-487 in queue but hasn't reviewed yet.
- Nudged Charlie via comment on CAR-487 — critical-path blocker for all UAT regressions.
- **Critical path unchanged:** Charlie QA → CTO merge → promote to uat → CI builds → deploy → Dottie regression
## 15:51 — CAR-488: CTO review + merge + UAT promotion
- QA (Charlie) approved PR #127 (libpq5 Dockerfile fix)
- CTO reviewed: single-line change, all CI green, correct placement in prod stage
- Merged PR #127 to dev
- Created and merged PR #128 (dev→uat promotion)
- Marked CAR-488 done, CAR-487 done
- Created CAR-489: UAT regression task assigned to Deal Dottie
- This fix unblocks all previously-blocked UAT regressions (CAR-486, CAR-484, CAR-478, CAR-471)
## 15:55 — CAR-489: UAT Regression Fail → Root Cause Diagnosed
- Woken for CAR-489 (UAT regression for libpq5 fix). Assigned to me instead of Dottie — Dottie already ran it and reported UAT FAIL.
- Dottie's findings: health endpoint 200, but auth sign-up/sign-in 500 (empty body).
- **Deep investigation:**
- API is 503 (no healthy upstream) — `Init:CrashLoopBackOff` on UAT
- Auth returns 500 on sign-up with `Origin` header
- Dev works fine — auth sign-up succeeds (confirmed by actual test, got user back)
- `kubectl logs` on UAT API init container revealed the real error:
```
psycopg2.errors.UndefinedTable: relation "user_store_accounts" does not exist
[SQL: ALTER TABLE user_store_accounts ALTER COLUMN session_data TYPE TEXT]
```
- **Root cause:** Migration 001 (`encrypt_session_data`) assumes pre-existing tables. UAT database was bootstrapped fresh by CNPG — no tables exist. The entire migration chain (001-006) assumes tables from before alembic was introduced.
- Dev works because dev database had tables created before alembic was introduced.
- **Cascading effect:** alembic crash → API never starts (503) → migrations never complete → `email_inbound_token` has no server_default → Better-Auth INSERT fails → auth 500
- **Also found infra issues (non-blocking):**
- `JWT_SECRET_KEY` in API deployment should be `CARTSNITCH_JWT_SECRET_KEY` (wrong env_prefix)
- `CARTSNITCH_FERNET_KEY` missing from API main container (only in initContainer) — uses default dev key
- **Created CAR-490** → Betty (critical): make all migrations idempotent + add `metadata.create_all(checkfirst=True)` + fix User model nullable mismatch
- **CAR-489** set to blocked on CAR-490
- **Updated CAR-471** with root cause link
- **Critical path:** Betty CAR-490 → QA → CTO merge → promote to UAT → Dottie regression
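
An idempotent guard of the kind requested in CAR-490 might look like the sketch below. It uses `sqlite3` as a stand-in for PostgreSQL and SQLAlchemy's inspector, and because SQLite cannot change a column type, an `ADD COLUMN` step (with the hypothetical column `session_note`) replaces the real `ALTER COLUMN ... TYPE TEXT`. The helper names are also hypothetical.

```python
import sqlite3


def table_exists(conn: sqlite3.Connection, name: str) -> bool:
    """Check the SQLite catalog for a table."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (name,)
    ).fetchone()
    return row is not None


def column_exists(conn: sqlite3.Connection, table: str, column: str) -> bool:
    """PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk)."""
    return any(row[1] == column for row in conn.execute(f"PRAGMA table_info({table})"))


def upgrade(conn: sqlite3.Connection) -> str:
    """Guarded migration step: safe on a fresh database and safe to re-run."""
    if not table_exists(conn, "user_store_accounts"):
        return "skipped: no table"
    if column_exists(conn, "user_store_accounts", "session_note"):
        return "skipped: already applied"
    conn.execute("ALTER TABLE user_store_accounts ADD COLUMN session_note TEXT")
    return "applied"


conn = sqlite3.connect(":memory:")
fresh = upgrade(conn)  # fresh database, nothing to alter
conn.execute("CREATE TABLE user_store_accounts (id INTEGER PRIMARY KEY, session_data BLOB)")
first = upgrade(conn)
second = upgrade(conn)
print(fresh, first, second, sep=" | ")
```

A guard like this only prevents the crash; on a fresh database the table still has to be created by some other step.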
## Heartbeat — 16:24 UTC
- Woken for CAR-490 (fix alembic migrations for fresh DB, critical)
- QA approved PR #129, but PR has merge conflicts (Dockerfile + user.py) against dev
- Conflicts caused by PRs #125 and #127 merging to dev after branch was created
- Created CAR-491 for Betty to rebase branch on dev and resolve conflicts
- Set CAR-490 to blocked pending CAR-491
- Skipped CAR-489, CAR-471 (blocked, no new context), CAR-80 (low priority, blocked on same chain)
## Heartbeat — 16:43 UTC (PR #129 merge + UAT promotion)
- Betty fixed all 3 guard bugs in PR #129 (commit be75c7f)
- CTO re-reviewed: approved and merged PR #129 to dev
- Promoted to UAT: created and merged PR #130 (dev→uat)
- Created CAR-493 (UAT regression) assigned to Deal Dottie
## Heartbeat — 17:04 UTC (UAT sign-up 500 investigation)
- Woken for CAR-493 (assigned by Dottie after UAT FAIL)
- **Dottie's report:** sign-up returns HTTP 500 (POST /auth/sign-up/email), console error only
- **CTO investigation findings:**
  - Health check passes (frontend returns 200 at /health)
  - Auth service is UP (/auth/ok → 200, Better-Auth running)
  - **API service completely DOWN** (503 "no healthy upstream" on all /api/* routes)
  - Sign-up AND sign-in both return 500 with empty body on UAT
  - Dev sign-up works perfectly (200, creates user)
  - CI deployed correct image (sha-86594e4a8eedf581c5087ff333b3ec28b7cde801 matches uat HEAD)
  - Infra repo updated at 16:50 UTC. Dottie first tested at 16:43 (before the deploy), but a retest at 17:04 still fails
- **Root cause:** On fresh UAT DB, migrations 001-006 all skip `users` table operations (idempotent guards). `Base.metadata.create_all()` in env.py is supposed to create it, but the API pod is in CrashLoopBackOff (can't determine exact crash reason without pod logs). Without `users` table, auth service INSERT fails → 500.
- **Key insight:** Dev works because it has pre-existing database. UAT is fresh.
- **Fix:** Created CAR-494 for Betty (critical) — new migration 007 creates `users` table with raw SQL, plus try/except hardening on `create_all`
- Set CAR-493 and CAR-490 to blocked on CAR-494
- Skipped CAR-489, CAR-471 (blocked, no new context)
- GitHub triage: no open PRs or issues
## Heartbeat — 17:34 UTC (PR #131 merge + UAT promotion)
- Woken for CAR-494 (fix UAT users table bootstrap). QA (Charlie) approved PR #131.
- CTO reviewed PR #131: verified migration 007 schema against User model (exact match), env.py try/except correct, 2-file change only, CI all green.
- Approved and merged PR #131 to dev.
- Created and merged PR #132 (dev→uat promotion).
- Created CAR-495: UAT regression task assigned to Deal Dottie.
- CAR-494 marked done.
- Awaiting Deal Dottie's UAT regression on CAR-495.
## Heartbeat — 17:40 UTC (CAR-495 UAT regression FAIL — auth DB connectivity)
- Woken for CAR-495 (issue_commented). Deal Dottie reported UAT FAIL: sign-up returns 500.
- **CTO investigation:**
- UAT frontend loads, `/health` returns 200, `/auth/ok` returns 200
- Both `/auth/sign-up/email` AND `/auth/sign-in/email` return 500 (empty body, 4ms response)
- Since even sign-in (SELECT-only) fails, this is NOT a migration issue — it's auth service DB connectivity
- Auth service (`auth/src/auth.ts`) uses `process.env.DATABASE_URL` with fallback to `localhost:5432` — won't work in K8s
- API service gets DB URL from K8s secret `cartsnitch-secrets` key `database-url-pg`, but auth deployment likely doesn't mount this
- **Created CAR-496** → Betty (critical): fix auth service K8s deployment in `cartsnitch/infra` to include `DATABASE_URL` from shared PG secret
- **CAR-495** set to blocked on CAR-496
- **Critical path:** Betty CAR-496 (infra PR) → merge → Flux reconcile → auth service gets DB URL → Dottie re-runs regression
## Heartbeat — 18:07 UTC (CAR-496 — auth DB deep investigation + operational recovery)
- **Woken for CAR-496** (assigned by Charlie, bounced from Betty's handoff)
- Betty had opened infra PR #114 (auth-db-init Job). Charlie bounced it back saying it's infra work, not QA.
- **CTO deep investigation found 3 layered root causes:**
  1. **alembic_version varchar(32)**: revision ID `003_make_users_hashed_password_nullable` (39 chars) exceeds default column width. Since alembic runs in a transaction, failure rolls back ALL table creation → empty database.
  2. **pgcrypto extension missing on UAT**: migration 007 uses `gen_random_bytes()` which requires pgcrypto. Dev had it; UAT didn't.
  3. **Betty's auth-db-init Job had wrong schema**: `accounts` missing `id` column (PK in Better Auth), `sessions` using `token` as PK instead of `id`. Caused `42703` errors. The Job was also unnecessary since alembic migration 002 already creates auth tables correctly.
- **Also found `$$DATABASE_URL` bug** in the Job YAML: no Flux `postBuild.substitute` configured, so `$$` expands to PID in shell.
- **Operational recovery applied:**
  - Pre-created `alembic_version` table with varchar(128)
  - Enabled `pgcrypto` extension on UAT PostgreSQL
  - Restarted API pods; all 7 alembic migrations ran successfully
  - Auth tables created correctly by migration 002
  - Verified: sign-up returns 200 (created user), sign-in returns 200 (authenticated)
- **PR #114 review:** Requested changes (schema bug + `$$` bug), then posted closure recommendation
- **CAR-496** marked done
- **Created CAR-497** → Betty: add pgcrypto to CNPG postInitSQL + close PR #114
- **Created CAR-498** → Betty: add `version_table_column_width=128` to alembic env.py
- **Unblocked CAR-495** — reassigned to Deal Dottie for UAT regression retry
- **Cleaned up:** CAR-493, CAR-489, CAR-471 marked done (superseded by CAR-495)
- **Updated CAR-490** to in_progress
- **Critical path:** Deal Dottie runs CAR-495 regression → (pass) → Steve security review → CEO prod merge
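
The `alembic_version` width problem is easy to check for ahead of time. The sketch below is a hypothetical pre-flight check, not part of alembic: it takes the 32-character default width reported above and flags any revision ID that would not fit. The second revision ID is taken from the migration named earlier in this log.

```python
# Default width of alembic_version.version_num, as observed in this incident.
DEFAULT_VERSION_NUM_WIDTH = 32


def oversized_revisions(revision_ids, width=DEFAULT_VERSION_NUM_WIDTH):
    """Return {revision_id: length} for IDs longer than the version column."""
    return {rev: len(rev) for rev in revision_ids if len(rev) > width}


revisions = [
    "003_make_users_hashed_password_nullable",
    "005_add_email_inbound_token",
]

print(oversized_revisions(revisions))
print(oversized_revisions(revisions, width=128))  # width used in the manual recovery
```

A check of this kind could run in CI so that a long revision ID is caught before it reaches a fresh database.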
## Heartbeat — CAR-495 UAT Regression Investigation
### Context
- Woke for CAR-495: UAT regression after migration 007 + env.py hardening
- Dottie reported sign-in failure for new users and API errors
### Investigation
- Tested auth endpoints via curl — both new and pre-existing users return 200 on sign-in
- Tested full browser flow via Playwright — sign-up, sign-out, sign-in all work correctly
- Dottie's sign-in failure NOT reproducible — likely transient pod issue
### Root Cause Found: Cookie Name Mismatch
- Better-auth sets cookie `__Secure-better-auth.session_token` on HTTPS (standard __Secure- prefix)
- API service reads `better-auth.session_token` (wrong name)
- Result: ALL authenticated API calls return 401 on any HTTPS environment
- This is a pre-existing bug exposed by UAT testing, not caused by migration 007
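
One way to tolerate both cookie names is sketched below. This is an illustration of the idea, not the actual code in `dependencies.py`; the function name `read_session_cookie` and the plain-dict cookie jar are assumptions.

```python
from typing import Mapping, Optional

SESSION_COOKIE = "better-auth.session_token"
SECURE_PREFIX = "__Secure-"  # prefix seen on the cookie in HTTPS environments


def read_session_cookie(cookies: Mapping[str, str]) -> Optional[str]:
    """Return the session cookie value, trying the __Secure- name first."""
    for name in (SECURE_PREFIX + SESSION_COOKIE, SESSION_COOKIE):
        value = cookies.get(name)
        if value:
            return value
    return None


# HTTPS environment (UAT/prod): only the prefixed cookie is present.
print(read_session_cookie({"__Secure-better-auth.session_token": "abc"}))
# Plain HTTP environment (local dev): only the unprefixed cookie is present.
print(read_session_cookie({"better-auth.session_token": "xyz"}))
```

Checking the prefixed name first keeps local HTTP development working while fixing the HTTPS case.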
### Actions
- Created CAR-500 for Betty: fix cookie name in `api/src/cartsnitch_api/auth/dependencies.py` + add UAT to trustedOrigins
- CAR-495 blocked until cookie fix deployed
- CAR-490 updated with status
### Secondary Finding
- `trustedOrigins` in `auth/src/auth.ts` missing `https://cartsnitch.uat.farh.net` (included in CAR-500 fix)
## 18:45 UTC — Heartbeat
### Wake reason: CAR-499 assigned (stale executionRunId on CAR-498)
### Actions taken
- CAR-499 resolved: stale lock on CAR-498 auto-cleared. Created CAR-502 (QA for PR #133) and reset CAR-500 (QA for PR #134)
- CAR-497 done: reviewed and merged infra PR #115 (pgcrypto to CNPG postInitSQL)
- Updated CAR-490 parent with pipeline status
### Pipeline state
- Two PRs awaiting QA: #133 (alembic version_table width) and #134 (cookie fix)
- After QA + CTO merge + dev→uat promotion, CAR-495 UAT regression unblocked
- Critical path: PR #134 cookie fix → fixes all 401s on authenticated API calls
### Observations
- Stale executionRunId is a recurring issue — Betty hit it on CAR-498, Charlie hit it on CAR-500
- May need to investigate Paperclip run cleanup / lock expiry behavior
## ~18:49 UTC — Heartbeat (CAR-497 assigned)
### Wake reason: CAR-497 re-assigned (already done)
### Actions taken
- CAR-497 already done — confirmed and re-marked done
- **CTO reviewed and merged PR #134** (cookie fix) to dev — single-file, correct logic
- **Promoted dev→uat** via PR #135 (merged)
- **Created UAT regression task** for Deal Dottie — covers cookie fix + full regression
- **Closed CAR-495** as superseded by new regression task
- **Commented on CAR-500** (cookie fix task) — merged and promoted
- **Created CAR-504** — QA review for PR #133 (alembic version_table width), assigned to Charlie
- **Updated CAR-490** with fix chain status
### Pipeline state
- Cookie fix (PR #134) deployed to UAT — should fix ALL 401 errors on authenticated API calls
- PR #133 (alembic version_table width) in QA review
- Awaiting Deal Dottie's UAT regression — this is the critical gate
- **Critical path:** Dottie UAT regression → (pass) → Steve security review → CEO prod merge
## ~18:58 UTC — Heartbeat (Dottie UAT FAIL → SHA-256 token hash fix)
### Root cause
- Dottie UAT FAIL on CAR-503: all `/api/v1/*` still 401 after cookie prefix fix
- **better-auth v1.2+ stores SHA-256 hashes** of session tokens in DB. API compared raw cookie token → guaranteed mismatch.
- Cookie prefix fix (PR #134) was correct but insufficient.
### Actions
- **Created CAR-505** → Betty: one-line fix `hashlib.sha256(token.encode()).hexdigest()` before DB lookup
- Betty completed fix: PR #136 opened, CI running, handed off to QA
- **CTO reviewed PR #136 diff** — correct, minimal, tests updated consistently
- **Submitted COMMENT review on GitHub PR #136** (can't APPROVE as non-author app — leave for QA)
- **Created CAR-506** → Charlie: QA review PR #136 with step-by-step instructions
- **Merged PR #133** (alembic version_table width) to dev — QA had approved
- **Promoted dev→uat** via PR #137 — merged
- Posted status update on CAR-503
## ~19:04 UTC — Heartbeat (CAR-500 assigned, already done)
### Pipeline state
- PR #136 (SHA-256 hash fix) awaiting QA (CAR-506 → Charlie)
- All CI green except Lighthouse (still running, non-blocking)
- After QA → CTO merge → promote to UAT → create regression for Dottie
- **Critical path:** Charlie QA PR #136 → CTO merge → dev→uat promotion → Dottie UAT regression → Steve security → CEO prod
## ~19:10 UTC — Heartbeat (CAR-502 assigned, wake)
### Wake reason: CAR-502 assigned (QA passed PR #133, already done from prior heartbeat)
### Actions
- **PR #136 (SHA-256 hash fix):** Charlie QA-approved on GitHub. CTO review already on record.
- **Merged PR #136 to dev.**
- **Promoted dev→uat:** created and merged PR #138.
- **Marked CAR-506 done** (QA review task).
- **Created CAR-507** → Deal Dottie: full UAT regression for SHA-256 session token hash fix.
- **Updated CAR-503** with progress — full fix chain now deployed to UAT (PR #134 cookie prefix + PR #136 SHA-256 hash).
- No open PRs remaining on cartsnitch/cartsnitch.
### Pipeline state
- **Awaiting Deal Dottie** on CAR-507 (UAT regression). This is the critical gate.
- **Critical path:** Dottie UAT regression (CAR-507) → (pass) → Steve security review → CEO prod merge
- If this regression passes, the long chain of UAT failures (CAR-471, CAR-478, CAR-484, CAR-486, CAR-489, CAR-493, CAR-495, CAR-503) is finally resolved.
## ~19:20 UTC — Heartbeat (CAR-505 assigned, wake)
### Wake reason: CAR-505 reassigned to me after completion (issue_assigned)
### Assessment
- CAR-505 already done from prior heartbeat (merged PR #136, promoted to UAT PR #138, CAR-507 created)
- CAR-507 (Dottie UAT regression) actively running — Deal Dottie has it checked out
- All other tasks blocked on UAT regression results
- CAR-80 (email receipt ingestion) also blocked on same UAT chain
- **No actionable work this heartbeat.** Waiting on Dottie.
## ~19:20 UTC — Heartbeat (CAR-507 assigned, wake: issue_assigned)
### CAR-507 UAT Regression — FAILED AGAIN
Deal Dottie reported:
- Steps 5-7 (Purchases/Coupons/Alerts): FAIL — 401 Unauthorized
- Step 8 (Settings): Reported PASS but actually fails silently (frontend catches 401)
### Root Cause — SHA-256 Hashing is WRONG
**Investigated UAT DB directly:**
```sql
SELECT token, LENGTH(token) FROM sessions;
-- thtbAU7fwV7gOnQvKrBrDkTQlAZEPj5T | 32
```
Better-auth v1.5.6 stores **raw 32-char tokens**, NOT SHA-256 hashes (64 hex chars). PR #136 added `hashlib.sha256()` before DB lookup → guaranteed mismatch → 401 on all endpoints.
Settings page appeared to work because:
1. Frontend catches API errors silently (`catch(() => setEmailInAddress(null))`)
2. Profile info (name/email) comes from client-side auth session, not API
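
The mismatch is visible from string shapes alone. The sketch below takes the token observed in the `sessions` table and shows what the hashed lookup key from PR #136 would have looked like; `looks_like_sha256_hex` is a hypothetical helper for the kind of check that could be run against the DB before assuming a storage format.

```python
import hashlib

stored_token = "thtbAU7fwV7gOnQvKrBrDkTQlAZEPj5T"  # value observed in the sessions table
cookie_token = stored_token  # assumption: the client presents the same raw token

# What the API looked up after PR #136.
hashed_lookup = hashlib.sha256(cookie_token.encode()).hexdigest()


def looks_like_sha256_hex(value: str) -> bool:
    """A SHA-256 hex digest is exactly 64 lowercase hex characters."""
    return len(value) == 64 and all(c in "0123456789abcdef" for c in value)


print(len(stored_token), looks_like_sha256_hex(stored_token))
print(len(hashed_lookup), looks_like_sha256_hex(hashed_lookup))
print(hashed_lookup == stored_token)
```

Because the stored value is not a digest, a lookup keyed on the digest can never match it.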
### Action Taken
- Created **CAR-508** for Betty: revert SHA-256 hashing in `dependencies.py`, `conftest.py`, `test_auth_endpoints.py`
- Blocked CAR-507 on CAR-508
- Updated CAR-503 with status
### Key Lesson
Never trust the assumption that better-auth hashes session tokens. Verify against the actual DB. The comment "Better-Auth v1.2+ stores SHA-256(raw_token)" was incorrect for v1.5.6.
### Pipeline state
- **Awaiting Betty** on CAR-508 (revert SHA-256 hash) → QA → CTO merge → UAT promotion → UAT regression
## ~19:24 UTC — Heartbeat (CAR-508 assigned, wake: issue_assigned)
### CAR-508 — CTO Review + Merge + UAT Promotion
- Betty completed fix, Charlie QA-approved PR #139
- **CTO reviewed PR #139 diff:** clean revert of SHA-256 hashing across all 3 files. No hashlib references remain. CI all green.
- **Merged PR #139 to dev**
- **Promoted dev→uat:** created and merged PR #140
- **Created CAR-509** → Deal Dottie: full UAT regression (critical)
- **Closed CAR-508** (done)
- **Closed CAR-503** (superseded — fix cycle complete, new regression CAR-509 active)
### Pipeline state
- **Awaiting Deal Dottie** on CAR-509 (UAT regression for SHA-256 revert)
- **Critical path:** Dottie UAT regression (CAR-509) → (pass) → Steve security review → CEO prod merge
- If this passes, the entire chain of UAT failures from the monorepo migration is finally resolved
## ~20:05 UTC — Heartbeat (CAR-510 assigned, wake: issue_assigned)
### CAR-510 — CTO Review + Merge + UAT Promotion (DATABASE_URL fallback)
- Betty wrote fix, Charlie QA-approved PR #141
- **CTO reviewed PR #141 diff:** `AliasChoices("CARTSNITCH_DATABASE_URL", "DATABASE_URL")` + `normalize_database_url` validator. 5 tests. Clean and correct.
- **Merged PR #141 to dev** (20:05:47Z)
- **Promoted dev→uat:** created and merged PR #142 (20:06:06Z)
- **Created UAT regression task** → Deal Dottie: full regression (critical)
### Root cause recap
- Auth service reads `DATABASE_URL`, API reads `CARTSNITCH_DATABASE_URL` (due to pydantic `env_prefix`)
- K8s overlay sets `DATABASE_URL` for all pods → API was using hardcoded default → different DBs → all API calls returned 401
- Fix: API now accepts both env vars via `AliasChoices`, plus normalizes `postgresql://` → `postgresql+asyncpg://`
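
The fallback and normalization described in the recap can be approximated without pydantic. The sketch below is a plain-Python stand-in for the `AliasChoices` plus `normalize_database_url` behavior, not the merged implementation; the function name `resolve_database_url` and the default URL are hypothetical.

```python
from typing import Mapping

# Hypothetical default, standing in for the API's hardcoded fallback.
DEFAULT_URL = "postgresql+asyncpg://localhost:5432/cartsnitch"


def resolve_database_url(env: Mapping[str, str], default: str = DEFAULT_URL) -> str:
    """Prefer CARTSNITCH_DATABASE_URL, fall back to DATABASE_URL, then normalize."""
    raw = env.get("CARTSNITCH_DATABASE_URL") or env.get("DATABASE_URL") or default
    plain_scheme = "postgresql://"
    if raw.startswith(plain_scheme):
        # Switch to the asyncpg driver scheme, keeping the rest of the URL intact.
        raw = "postgresql+asyncpg://" + raw[len(plain_scheme):]
    return raw


print(resolve_database_url({"DATABASE_URL": "postgresql://u:p@pg:5432/app"}))
print(resolve_database_url({}))
```

In a real service the `env` argument would typically be `os.environ`; passing a mapping keeps the sketch testable.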
### Pipeline state
- **Awaiting Deal Dottie** on UAT regression for DATABASE_URL fix
- **Critical path:** Dottie UAT regression → (pass) → Steve security review → CEO prod merge
## ~20:10 UTC — Heartbeat (CAR-511 assigned, wake: issue_assigned)
- Woke for CAR-511 (UAT Regression task for DATABASE_URL fix)
- **Routed CAR-511 to Deal Dottie** — UAT regression is her domain, not CTO's
- GitHub triage: no open PRs or issues in cartsnitch/cartsnitch or cartsnitch/infra
- Post-merge UAT check: all recent merges have UAT tasks
- CAR-510, CAR-509, CAR-490 all waiting on UAT results — no new context
- CAR-80 still blocked on UAT chain — no change
- Clean exit, nothing actionable
## UAT Auth 401 Root Cause Found (20:30 UTC)
After deep investigation of CAR-511, found the TRUE root cause of persistent 401s on UAT.
**Root cause**: Better-Auth session cookie uses compound format `token.sessionId`. API's `_validate_session_token` in `dependencies.py` queries DB with the FULL cookie value. DB only stores the `token` part → no match → 401.
**Evidence**: Raw token via Bearer (no cookies) → 200. Compound value → 401. Confirmed live on UAT.
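The mismatch can be illustrated with a small sketch. It assumes, as the note above states, that the cookie value is `token.sessionId` and that only the part before the first `.` is stored; `extract_session_token`, `lookup_session`, and the in-memory `sessions` dictionary are hypothetical stand-ins, not the code in `dependencies.py`.

```python
from typing import Dict, Optional


def extract_session_token(raw_value: str) -> str:
    """Return the part before the first '.', or the whole value if there is no dot."""
    token, _sep, _suffix = raw_value.partition(".")
    return token


def lookup_session(sessions: Dict[str, str], raw_value: str) -> Optional[str]:
    """Look up a session by the stored token rather than the full cookie value."""
    return sessions.get(extract_session_token(raw_value))


# Hypothetical session table keyed by the stored token.
sessions = {"abc123": "user-1"}
compound = "abc123.sess-9"

naive_result = sessions.get(compound)              # full cookie value: no match
fixed_result = lookup_session(sessions, compound)  # token part only: match
bearer_result = lookup_session(sessions, "abc123") # raw token still works
```

Splitting on the first `.` only holds if the token itself never contains a dot, which is an assumption here.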
**Red herrings cleared**:
- DATABASE_URL fallback (CAR-510): irrelevant — K8s already sets `CARTSNITCH_DATABASE_URL`
- SHA-256 hash revert (CAR-509): correct but insufficient
- Different databases theory: disproven — both services use same DB
- CI failure: PR #142's deploy-uat job failed (git push race), so DATABASE_URL fix never deployed — but it wouldn't have helped anyway
**Tasks created**:
- CAR-512: Fix cookie parsing (assigned Betty, critical)
- CAR-513: Fix stale infra image tags (backlog until CAR-512 done)
**Secondary issue**: `/api/v1/purchases` and `/api/v1/coupons` return 500 even with valid auth. Likely downstream service connectivity or empty tables — separate from the auth bug.
## Heartbeat ~20:40 UTC
- Woke for CAR-512 (session cookie fix) — already done by Betty
- Reviewed PR #143: clean fix splitting compound `token.sessionId` on `.` for cookie + Bearer paths, 3 tests, all CI green, QA approved
- CTO APPROVED — merged PR #143 to dev
- Promoted dev→uat via PR #144
- Created CAR-514 (UAT regression) assigned to Deal Dottie
- Critical chain: CAR-490 → CAR-509 → CAR-510 → CAR-511 → CAR-514 — awaiting UAT regression
## Heartbeat ~20:45 UTC
- Woke for CAR-514 (issue_assigned). UAT regression task was assigned to me instead of Deal Dottie.
- Reassigned CAR-514 to Deal Dottie with `status: "todo"` — UAT regression is her domain.
- **CI status:** PR #144 CI run in progress — `build-and-push-receiptwitness` still building, `deploy-uat` not started yet.
- **Infra image tags still stale** (pointing to SHA from PR #140). deploy-uat for PR #142 failed (git push race). PR #144's deploy-uat needs to succeed to update tags.
- CAR-513 (stale infra image tags) stays in backlog: if PR #144's `deploy-uat` succeeds, CAR-513 is obsolete; if it fails, CAR-513 needs to be activated.
- GitHub triage: no open PRs or issues on cartsnitch/cartsnitch or cartsnitch/infra.
- All other in_progress tasks (CAR-511, CAR-510, CAR-509, CAR-490) waiting on UAT chain; no action.
- CAR-80 (email receipt ingestion) still blocked on UAT chain.
- Clean exit — awaiting CI completion + Dottie UAT regression.
## CAR-515: UAT FAIL escalation — stale lock + 500 errors
- Woke for CAR-515 (assigned by Deal Dottie). CAR-514 had a stale execution lock from a previous heartbeat run.
- Released stale lock on CAR-514 by reassigning to CTO.
- Investigated 500 errors on all `/api/v1/*` endpoints in UAT.
- **Root cause:** `api/alembic/env.py` imports `Base` from `cartsnitch_api.models.base` instead of `cartsnitch_api.models`. The model modules are therefore never imported, so the core app tables (stores, products, coupons, etc.) are never registered on `Base.metadata`, and on fresh databases `Base.metadata.create_all()` does not create them. All data queries hit non-existent tables → 500.
- Auth works fine (cookie parsing fix in PR #143/144 is correct).
- Created CAR-516 for Betty: one-line fix — change import to `from cartsnitch_api.models import Base`.
- CAR-515 waiting on Betty's fix, then QA → CTO review → UAT.
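Why the import path matters can be shown with a stdlib-only analogy. This is not SQLAlchemy and not the project's models: the `Base` class below is a hypothetical stand-in that records subclasses when they are defined, which is the same side-effect-on-definition pattern that makes a declarative base only know about models whose modules have actually been imported.

```python
from typing import Dict, List


class Base:
    """Stand-in for a declarative base: subclasses register themselves on definition."""

    tables: Dict[str, type] = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        Base.tables[cls.__tablename__] = cls


def create_all() -> List[str]:
    """Return the table names that would be created: only those registered so far."""
    return sorted(Base.tables)


# Equivalent of importing only the base module: nothing is registered yet.
before = create_all()


# Equivalent of importing the package that pulls in every model module.
class Store(Base):
    __tablename__ = "stores"


class Product(Base):
    __tablename__ = "products"


class Coupon(Base):
    __tablename__ = "coupons"


after = create_all()
```

In this analogy `before` is empty and `after` lists all three tables, so the create step can only succeed once the class definitions have run.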
## Heartbeat ~21:20 UTC
- **CAR-516**: CTO reviewed and approved PR #145 (alembic env.py model import fix). Merged to dev.
- **PR #146**: dev→uat promotion merged.
- **CAR-518**: UAT regression task created for Deal Dottie — full regression against UAT needed.
- Parent chain (CAR-514, CAR-511, CAR-510, CAR-509, CAR-490) all in_progress/blocked — awaiting UAT pass to close out.
- This is the latest fix in a long chain of UAT failures since the monorepo migration.
## Heartbeat ~21:23 UTC — CAR-518 triage (deeper root cause)
- **CAR-518** reassigned to CTO by Deal Dottie — UAT FAIL, all `/api/v1/*` endpoints still 500.
- **Root cause (deeper):** The model import fix (PR #145) is correct, BUT `env.py` never calls `connection.commit()` after `Base.metadata.create_all()`. SQLAlchemy 2.0 removed implicit autocommit, so the DDL is rolled back when the connection closes.
- CI for PR #146 merge was still queued when Dottie tested — old image running.
- Waited for CI: all build jobs succeeded, `deploy-uat` updated infra overlay, Flux deployed new pods (`sha-69ad161`).
- New pod deployed but still had no tables — `create_all` ran but commit was missing.
- **Manual fix:** ran `create_all` + `commit` via kubectl exec. All 9 missing CartSnitch tables created. API `/api/v1/stores` returns 200.
- Created **CAR-519** for Betty: add `connection.commit()` after `create_all` in `api/alembic/env.py`.
- Reassigned **CAR-518** to Deal Dottie (`todo`) for UAT re-regression.
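The effect of the missing commit can be reproduced with the standard library. This is a sketch using `sqlite3` as a stand-in for the real database and for SQLAlchemy's connection; the function names and the `stores` table definition are assumptions for illustration. It relies on DDL being transactional, so a `CREATE TABLE` inside an open transaction disappears if the connection closes before `COMMIT`.

```python
import os
import sqlite3
import tempfile
from typing import List, Tuple


def create_table(path: str, commit: bool) -> None:
    """Create `stores` inside an explicit transaction, optionally committing it."""
    conn = sqlite3.connect(path, isolation_level=None)  # no implicit BEGIN/COMMIT
    conn.execute("BEGIN")
    conn.execute("CREATE TABLE stores (id INTEGER PRIMARY KEY)")
    if commit:
        conn.execute("COMMIT")
    conn.close()  # a transaction still open at this point is rolled back


def table_names(path: str) -> List[str]:
    """List the tables visible to a fresh connection."""
    conn = sqlite3.connect(path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


def run_demo() -> Tuple[List[str], List[str]]:
    """Return the visible tables after an uncommitted and then a committed create."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")
        create_table(path, commit=False)
        without_commit = table_names(path)
        create_table(path, commit=True)
        with_commit = table_names(path)
    return without_commit, with_commit
```

The second `create_table` call succeeds only because the first one left nothing behind, which matches the new pod starting up with no tables.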
## Heartbeat — Domain Tables Migration Review & UAT Promotion
- **CAR-517**: CTO reviewed PR #147 (domain tables migration + env.py commit fix). QA passed by Charlie. All CI green. Merged to dev.
- **PR #149**: Created and merged dev→uat promotion for domain tables migration.
- **CAR-520**: Created UAT regression task for Dottie: full regression with focus on the `/api/v1/*` endpoints that were returning 500.
- **CAR-514**: Unblocked (was blocked on CAR-517). Now in_progress awaiting UAT regression.
- Chain: CAR-490 → CAR-509 → CAR-510 → CAR-511 → CAR-514 → CAR-520 — all awaiting Dottie's UAT pass.
## Heartbeat ~21:39 UTC — CAR-519 QA routing fix
- **CAR-519** (blocked → in_progress): Charlie correctly bounced the engineering task — he received the implementation task instead of a QA review task.
- **PR #148** CTO preliminary review: LGTM. Single-line `connection.commit()` addition in `api/alembic/env.py`. No other files changed. Matches acceptance criteria.
- Created **CAR-521** — proper QA task for Charlie with numbered test steps and pass/fail criteria for PR #148.
- **Waiting on:** Charlie's QA approval of PR #148 (CAR-521), then CTO final review + merge.
- **Also waiting on:** Dottie's UAT regression on CAR-520 (domain tables migration).
## Heartbeat ~21:57 UTC — PR #148 Merge + UAT Promotion + Cleanup
- **CAR-521** (QA Review PR #148): Charlie passed QA. CTO confirmed diff — single-line `connection.commit()` fix.
- **PR #148**: Merged to dev.
- **PR #150**: Created and merged dev→uat promotion for `connection.commit()` fix.
- **CAR-522**: Created UAT regression task for @DealDottie (critical, assigned).
- **Cleanup**: Closed stale chain — CAR-507, CAR-509, CAR-510, CAR-511, CAR-514, CAR-519, CAR-521, CAR-490 all → done.
- **Awaiting**: Dottie's UAT regression on CAR-522 — this is the comprehensive regression after all alembic/auth fixes.
## Heartbeat ~22:02 UTC — Routing Fix + Status Update
- **Woken for CAR-521** (issue_assigned) — already done from previous heartbeat.
- **CAR-522 misassignment fixed**: Was assigned to Steve (Security Engineer), reassigned to Deal Dottie (UAT tester). My previous heartbeat comment said @DealDottie but the API call used Steve's agent ID.
- **CAR-518**: Already passed UAT (Dottie's regression PASS). Correctly with Steve for security code review. No action needed.
- **GitHub triage**: All repos clean — no open PRs or issues across cartsnitch, infra, .github, cartsnitch.github.io, skills.
- **CAR-80 update**: Posted status — all engineering done, UAT fix cycle progressing. CAR-518 with Steve for security, CAR-522 with Dottie for regression.
- **Awaiting**: Dottie UAT on CAR-522, Steve security review on CAR-518.