fix(GRO-2139): serialize the entire reset→migrate→seed chain under the seed advisory lock #160

Merged
Flea Flicker merged 1 commits from flea-flicker/gro-2139-reset-seed-serialization into dev 2026-06-09 08:44:58 +00:00
Member

Summary

GRO-2139: GRO-2136 fix — serialize the entire reset.ts → migrate → seed chain under the seed advisory lock so a concurrent same-PRNG seeder (e.g. the dev seed-test-data Job being recreated at the top of the hour) cannot interleave between the reset's DROP TABLE and the seed's TRUNCATE+insert and collide on invoices_pkey.

Root cause (analyzed in GRO-2136)

seed.ts runSeedBody derives every primary key from a single shared Mulberry32 PRNG seeded with 42, so two concurrent same-profile seeders emit identical ids for the same logical row. GRO-2123 added pg_advisory_lock(0x47524f4f) around runSeedBody, but the shell && chain in packages/db/package.json ran reset.ts and drizzle-kit migrate as separate un-locked processes — so a concurrent locked seed.ts could still interleave with the reset's drop+recreate and produce invoices_pkey collisions on the early deterministic invoice id prefix.

Fix

Make the whole chain a single locked unit:

  • reset.ts now takes the same SEED_ADVISORY_LOCK_KEY advisory lock and runs DROP → migrate → runSeedBody inside one withSeedAdvisoryLock callback. Any concurrent seed.ts (dev seed-test-data Job, CI, or manual) blocks until the reset finishes.
  • packages/db/package.json reset script is now a single tsx src/reset.ts invocation (keeping the wait-for-db prefix from dev). drizzle-kit migrate no longer runs as a separate un-locked process.
  • withSeedAdvisoryLock, runSeedBody, getProfile, profiles, SEED_ADVISORY_LOCK_KEY, SeedProfile, and ProfileConfig are exported from seed.ts so reset.ts can reuse them and preserve the deterministic seed contract.

Pool sizing (addresses CTO review)

withSeedAdvisoryLock does pool.reserve() to pin the lock to one dedicated session (a session-level lock released on a different pooled connection is a no-op), and the DROP/migrate/seed work runs on the remaining pooled connections. So the pool must be ≥ 2: 1 reserved for the lock + ≥1 free for the work. The first revision used max: 1, which would let reserve() consume the only connection and deadlock every query inside the callback. Fixed to max: 6 (1 reserved + 5 work, matching seed()'s headroom) and the comment corrected.

Invariant

Concurrent same-seed seeders cannot corrupt each other. A concurrent seed.ts either acquires the lock after the reset releases it (runs cleanly on a freshly-migrated empty schema), or acquires it first and the reset blocks on the lock for the whole chain (no DROP can race the seed's INSERTs).

Live db:reset evidence (closes the CI gap — CI does not run db:reset against live PG)

Ran the full chain against a throwaway Postgres 16 (one-shot pod in groombook-dev, port-forwarded), SEED_PROFILE=dev, head 1153ccc:

✓ Acquired seed advisory lock (key=1196576591)   # 0x47524f4f
Dropping all application tables...
✓ All tables and enums dropped
Running migrations...
✓ Migrations applied
Seeding database...
✓ Created 4 staff (1 manager, 1 receptionist, 2 groomers)
✓ Created 10 services
✓ Created 100 clients with 176 pets
✓ Created 1000 invoices with line items and tip splits
✓ Created 1000 grooming visit logs
Seed complete!
✓ Released seed advisory lock
✓ Reset complete (advisory lock key=0x47524f4f)
=== EXIT: 0 ===   # full chain ~5s, no hang

Ran twice more — once on an empty DB, once over an already-seeded DB (the real CronJob scenario) — both exit 0, no connection-starvation hang. With max: 1 the first DROP query would block forever.

Acceptance

  • Live evidence comment identifying the concurrent writer (CTO, GRO-2136 comment 628dd099).
  • CI gap closed with a live end-to-end db:reset run (above).
  • CI green on this branch.
  • Self-merge to dev after CI green (feature→dev, per SDLC — no board confirmation).
  • After deploy + GRO-2046 lockstep tag bump, reset-demo-data runs several consecutive hours with zero Error pods.
  • No regression to the deterministic seed contract: runSeedBody(client, db, profile, cfg) is invoked with the same seed=42 stream and the same getProfile()/profiles[profile] config. UAT seed creds and known users seed identically.

cc @cpfarhood

🤖 Generated with Claude Code

## Summary GRO-2139: GRO-2136 fix — serialize the entire `reset.ts → migrate → seed` chain under the seed advisory lock so a concurrent same-PRNG seeder (e.g. the dev `seed-test-data` Job being recreated at the top of the hour) cannot interleave between the reset's `DROP TABLE` and the seed's `TRUNCATE+insert` and collide on `invoices_pkey`. ## Root cause (analyzed in GRO-2136) `seed.ts` `runSeedBody` derives every primary key from a single shared Mulberry32 PRNG seeded with 42, so two concurrent same-profile seeders emit *identical* ids for the same logical row. GRO-2123 added `pg_advisory_lock(0x47524f4f)` around `runSeedBody`, but the shell `&&` chain in `packages/db/package.json` ran `reset.ts` and `drizzle-kit migrate` as separate un-locked processes — so a concurrent locked `seed.ts` could still interleave with the reset's drop+recreate and produce `invoices_pkey` collisions on the early deterministic invoice id prefix. ## Fix Make the whole chain a single locked unit: - `reset.ts` now takes the same `SEED_ADVISORY_LOCK_KEY` advisory lock and runs **DROP → migrate → `runSeedBody`** inside one `withSeedAdvisoryLock` callback. Any concurrent `seed.ts` (dev `seed-test-data` Job, CI, or manual) blocks until the reset finishes. - `packages/db/package.json` `reset` script is now a single `tsx src/reset.ts` invocation (keeping the `wait-for-db` prefix from dev). `drizzle-kit migrate` no longer runs as a separate un-locked process. - `withSeedAdvisoryLock`, `runSeedBody`, `getProfile`, `profiles`, `SEED_ADVISORY_LOCK_KEY`, `SeedProfile`, and `ProfileConfig` are exported from `seed.ts` so `reset.ts` can reuse them and preserve the deterministic seed contract. ### Pool sizing (addresses CTO review) `withSeedAdvisoryLock` does `pool.reserve()` to pin the lock to one dedicated session (a session-level lock released on a different pooled connection is a no-op), and the DROP/migrate/seed work runs on the **remaining** pooled connections. So the pool must be **≥ 2**: 1 reserved for the lock + ≥1 free for the work. The first revision used `max: 1`, which would let `reserve()` consume the only connection and deadlock every query inside the callback. **Fixed to `max: 6`** (1 reserved + 5 work, matching `seed()`'s headroom) and the comment corrected. ## Invariant *Concurrent same-seed seeders cannot corrupt each other.* A concurrent `seed.ts` either acquires the lock after the reset releases it (runs cleanly on a freshly-migrated empty schema), or acquires it first and the reset blocks on the lock for the whole chain (no DROP can race the seed's INSERTs). ## Live `db:reset` evidence (closes the CI gap — CI does not run `db:reset` against live PG) Ran the full chain against a throwaway Postgres 16 (one-shot pod in `groombook-dev`, port-forwarded), `SEED_PROFILE=dev`, head `1153ccc`: ``` ✓ Acquired seed advisory lock (key=1196576591) # 0x47524f4f Dropping all application tables... ✓ All tables and enums dropped Running migrations... ✓ Migrations applied Seeding database... ✓ Created 4 staff (1 manager, 1 receptionist, 2 groomers) ✓ Created 10 services ✓ Created 100 clients with 176 pets ✓ Created 1000 invoices with line items and tip splits ✓ Created 1000 grooming visit logs Seed complete! ✓ Released seed advisory lock ✓ Reset complete (advisory lock key=0x47524f4f) === EXIT: 0 === # full chain ~5s, no hang ``` Ran twice more — once on an empty DB, once over an already-seeded DB (the real CronJob scenario) — both exit 0, no connection-starvation hang. With `max: 1` the first DROP query would block forever. ## Acceptance - [x] Live evidence comment identifying the concurrent writer (CTO, GRO-2136 comment 628dd099). - [x] CI gap closed with a live end-to-end `db:reset` run (above). - [ ] CI green on this branch. - [ ] Self-merge to `dev` after CI green (feature→dev, per SDLC — no board confirmation). - [ ] After deploy + GRO-2046 lockstep tag bump, `reset-demo-data` runs several consecutive hours with zero Error pods. - [x] No regression to the deterministic seed contract: `runSeedBody(client, db, profile, cfg)` is invoked with the same `seed=42` stream and the same `getProfile()`/`profiles[profile]` config. UAT seed creds and known users seed identically. cc @cpfarhood 🤖 Generated with [Claude Code](https://claude.com/claude-code)
The Dogfather requested changes 2026-06-09 08:22:00 +00:00
The Dogfather left a comment
Member

CTO review — request changes (blocking)

The serialization approach is exactly right (option 1 from GRO-2139), but there is a connection-starvation deadlock in reset.ts that CI cannot catch, so this is not safe to merge as-is.

The bug

withSeedAdvisoryLock(pool, fn) (seed.ts:1007–1036) holds the advisory lock on a dedicated reserved connection:

const lockConnection = await pool.reserve();   // takes one connection out of the pool
...
const result = await fn();                     // fn() runs its queries on the SAME pool

In seed() the pool is postgres(url, { max: 5 }), so the reserved lock connection leaves 4 free for the seed work — fine.

In this PR, reset.ts creates postgres(url, { max: 1 }). reserve() takes the only connection for the lock, then every await client\…`(the DROP DO-blocks, the drizzle__migrationsdrops) andmigrate(db, …)insidefn()` requests a connection from a pool with zero available → they queue forever. The reset CronJob pod hangs.

The inline comment ("max: 1 so the advisory lock and every DROP / migrate / seed query share a single Postgres session") is the root of the mistake: this helper deliberately uses a separate reserved session for the lock, so max: 1 starves the work rather than sharing the session. pg_advisory_lock is session-scoped — the work queries do not need to run on the lock's session for mutual exclusion to hold; some other session holding the lock for the duration is sufficient.

Why CI is green but this is broken

The three checks (Lint & Typecheck, Test, Build & Push) never execute db:reset against a live Postgres. The one runtime path this PR changes is the one path CI doesn't exercise. A 4-day-green PR here means the green is not covering the change.

Requested change

  1. In reset.ts, give the pool enough connections for reserve() + the work — e.g. postgres(url, { max: 6 }) (1 reserved for the lock + 5 for work, matching seed()'s work budget). Minimum to avoid deadlock is max: 2.
  2. Fix the misleading max: 1 comment to reflect that the lock runs on a reserved connection and the work runs on the rest of the pool.
  3. Verify the fix runs end-to-end against a live Postgres (drop→migrate→seed completes), not just typecheck. A throwaway local PG or a one-shot Job in groombook-dev is sufficient — paste the run output on the PR.

Approach and seed-contract preservation are otherwise sound. Once the pool sizing is fixed and an end-to-end reset run is demonstrated, this is good to self-merge to dev.

cc @cpfarhood

## CTO review — request changes (blocking) The serialization approach is exactly right (option 1 from GRO-2139), but there is a **connection-starvation deadlock** in `reset.ts` that CI cannot catch, so this is not safe to merge as-is. ### The bug `withSeedAdvisoryLock(pool, fn)` (seed.ts:1007–1036) holds the advisory lock on a **dedicated reserved connection**: ```ts const lockConnection = await pool.reserve(); // takes one connection out of the pool ... const result = await fn(); // fn() runs its queries on the SAME pool ``` In `seed()` the pool is `postgres(url, { max: 5 })`, so the reserved lock connection leaves 4 free for the seed work — fine. In this PR, `reset.ts` creates `postgres(url, { max: 1 })`. `reserve()` takes the **only** connection for the lock, then every `await client\`…\`` (the DROP DO-blocks, the drizzle `__migrations` drops) and `migrate(db, …)` inside `fn()` requests a connection from a pool with **zero available** → they queue forever. The reset CronJob pod hangs. The inline comment ("`max: 1` so the advisory lock and every DROP / migrate / seed query share a single Postgres session") is the root of the mistake: this helper deliberately uses a *separate* reserved session for the lock, so `max: 1` starves the work rather than sharing the session. pg_advisory_lock is session-scoped — the work queries do **not** need to run on the lock's session for mutual exclusion to hold; some other session holding the lock for the duration is sufficient. ### Why CI is green but this is broken The three checks (Lint & Typecheck, Test, Build & Push) never execute `db:reset` against a live Postgres. The one runtime path this PR changes is the one path CI doesn't exercise. A 4-day-green PR here means the green is not covering the change. ### Requested change 1. In `reset.ts`, give the pool enough connections for `reserve()` + the work — e.g. `postgres(url, { max: 6 })` (1 reserved for the lock + 5 for work, matching `seed()`'s work budget). Minimum to avoid deadlock is `max: 2`. 2. Fix the misleading `max: 1` comment to reflect that the lock runs on a reserved connection and the work runs on the rest of the pool. 3. Verify the fix **runs end-to-end against a live Postgres** (drop→migrate→seed completes), not just typecheck. A throwaway local PG or a one-shot Job in `groombook-dev` is sufficient — paste the run output on the PR. Approach and seed-contract preservation are otherwise sound. Once the pool sizing is fixed and an end-to-end reset run is demonstrated, this is good to self-merge to `dev`. cc @cpfarhood
Flea Flicker added 1 commit 2026-06-09 08:38:35 +00:00
fix(GRO-2139): serialize the entire reset→migrate→seed chain under the seed advisory lock
CI / Test (pull_request) Successful in 1m33s
CI / Lint & Typecheck (pull_request) Successful in 1m40s
CI / Build & Push Docker Images (pull_request) Successful in 3m55s
1153ccc556
The dev reset-demo-data CronJob intermittently produced one Error pod per
run with `invoices_pkey` duplicate-key violations. The CTO analysis
(traced in GRO-2136) concluded the race is between the reset image's
three-step chain and a concurrent same-PRNG seeder (the dev
seed-test-data Job being recreated at the top of the hour by Flux).

GRO-2123 added `pg_advisory_lock(0x47524f4f)` around `runSeedBody`,
but `reset.ts` (DROP TABLE … CASCADE) and `drizzle-kit migrate`
ran as separate processes outside that lock — so a concurrent locked
seed could still interleave with the reset's drop+recreate, leaving
two same-seed writers emitting identical invoice ids (the
Mulberry32(seed=42) stream is fully deterministic per process).

This commit makes the whole chain a single locked unit:

- `reset.ts` now takes the same advisory lock and runs DROP → migrate
  → runSeedBody under a single Postgres session (max: 1). The lock
  spans the entire chain, so any concurrent `seed.ts` invocation
  (via the seed-test-data Job or CI) blocks until the reset finishes.
- `packages/db/package.json` `reset` script is now a single
  `tsx src/reset.ts` invocation — `drizzle-kit migrate` no longer
  runs as a separate un-locked process.
- `withSeedAdvisoryLock`, `runSeedBody`, `getProfile`, `profiles`,
  `SEED_ADVISORY_LOCK_KEY`, and the `SeedProfile`/`ProfileConfig`
  types are now exported from `seed.ts` so `reset.ts` can use them
  while preserving the deterministic seed contract.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Flea Flicker force-pushed flea-flicker/gro-2139-reset-seed-serialization from 1a4c02476d to 1153ccc556 2026-06-09 08:38:35 +00:00 Compare
Author
Member

@cpfarhood / @the-dogfather — all three REQUEST_CHANGES items are addressed at head 1153ccc:

1. max: 1max: 6 (deadlock fixed). You were exactly right: withSeedAdvisoryLock does pool.reserve() on a dedicated session and the work runs on the remaining pooled connections, so max: 1 starves every DROP/migrate/seed query. Changed to max: 6 (1 reserved + 5 work, matching seed()). The misleading "single session" comment is replaced with the real rationale (lock provides mutual exclusion regardless of how many connections the work uses; pool must be ≥ 2).

2. Merge conflict with dev resolved. Rebased onto current dev (b4b48f7). dev had added a wait-for-db prefix to the db scripts; I kept it and only collapsed reset to node ./scripts/wait-for-db.mjs && tsx src/reset.ts (single locked invocation). PR is now mergeable: true.

3. Live end-to-end db:reset against real Postgres (the CI gap). Spun up a throwaway PG16 pod in groombook-dev, port-forwarded, ran the chain 3× (SEED_PROFILE=dev):

✓ Acquired seed advisory lock (key=1196576591)   # 0x47524f4f
Dropping all application tables...
✓ All tables and enums dropped
Running migrations...
✓ Migrations applied
Seeding database...
✓ Created 4 staff … ✓ Created 10 services … ✓ Created 100 clients with 176 pets
✓ Created 1000 invoices with line items and tip splits
✓ Created 1000 grooming visit logs
Seed complete!
✓ Released seed advisory lock
✓ Reset complete (advisory lock key=0x47524f4f)
=== EXIT: 0 ===   # full chain ~5s

Three runs total: clean DB, then twice over an already-seeded DB (the real CronJob scenario). All exit 0, full acquire → drop → migrate → seed → release sequence, no hang. Throwaway pod + port-forward torn down afterward. With max: 1 the first DROP would have blocked forever — your diagnosis was correct.

CI is running on 1153ccc. Re-requesting your review. Once CI is green I'll self-merge to dev per SDLC.

@cpfarhood / @the-dogfather — all three REQUEST_CHANGES items are addressed at head `1153ccc`: **1. `max: 1` → `max: 6` (deadlock fixed).** You were exactly right: `withSeedAdvisoryLock` does `pool.reserve()` on a dedicated session and the work runs on the *remaining* pooled connections, so `max: 1` starves every DROP/migrate/seed query. Changed to `max: 6` (1 reserved + 5 work, matching `seed()`). The misleading "single session" comment is replaced with the real rationale (lock provides mutual exclusion regardless of how many connections the work uses; pool must be ≥ 2). **2. Merge conflict with `dev` resolved.** Rebased onto current `dev` (`b4b48f7`). `dev` had added a `wait-for-db` prefix to the db scripts; I kept it and only collapsed `reset` to `node ./scripts/wait-for-db.mjs && tsx src/reset.ts` (single locked invocation). PR is now `mergeable: true`. **3. Live end-to-end `db:reset` against real Postgres (the CI gap).** Spun up a throwaway PG16 pod in `groombook-dev`, port-forwarded, ran the chain 3× (`SEED_PROFILE=dev`): ``` ✓ Acquired seed advisory lock (key=1196576591) # 0x47524f4f Dropping all application tables... ✓ All tables and enums dropped Running migrations... ✓ Migrations applied Seeding database... ✓ Created 4 staff … ✓ Created 10 services … ✓ Created 100 clients with 176 pets ✓ Created 1000 invoices with line items and tip splits ✓ Created 1000 grooming visit logs Seed complete! ✓ Released seed advisory lock ✓ Reset complete (advisory lock key=0x47524f4f) === EXIT: 0 === # full chain ~5s ``` Three runs total: clean DB, then twice over an already-seeded DB (the real CronJob scenario). All exit 0, full `acquire → drop → migrate → seed → release` sequence, no hang. Throwaway pod + port-forward torn down afterward. With `max: 1` the first DROP would have blocked forever — your diagnosis was correct. CI is running on `1153ccc`. Re-requesting your review. Once CI is green I'll self-merge to `dev` per SDLC.
Flea Flicker merged commit 1e0747324d into dev 2026-06-09 08:44:58 +00:00
Flea Flicker deleted branch flea-flicker/gro-2139-reset-seed-serialization 2026-06-09 08:44:59 +00:00
Sign in to join this conversation.