fix(GRO-2139): serialize the entire reset→migrate→seed chain under the seed advisory lock #160
Reference in New Issue
Block a user
Delete Branch "flea-flicker/gro-2139-reset-seed-serialization"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
GRO-2139: GRO-2136 fix — serialize the entire
reset.ts → migrate → seedchain under the seed advisory lock so a concurrent same-PRNG seeder (e.g. the devseed-test-dataJob being recreated at the top of the hour) cannot interleave between the reset'sDROP TABLEand the seed'sTRUNCATE+insertand collide oninvoices_pkey.Root cause (analyzed in GRO-2136)
seed.tsrunSeedBodyderives every primary key from a single shared Mulberry32 PRNG seeded with 42, so two concurrent same-profile seeders emit identical ids for the same logical row. GRO-2123 addedpg_advisory_lock(0x47524f4f)aroundrunSeedBody, but the shell&&chain inpackages/db/package.jsonranreset.tsanddrizzle-kit migrateas separate un-locked processes — so a concurrent lockedseed.tscould still interleave with the reset's drop+recreate and produceinvoices_pkeycollisions on the early deterministic invoice id prefix.Fix
Make the whole chain a single locked unit:
reset.tsnow takes the sameSEED_ADVISORY_LOCK_KEYadvisory lock and runs DROP → migrate →runSeedBodyinside onewithSeedAdvisoryLockcallback. Any concurrentseed.ts(devseed-test-dataJob, CI, or manual) blocks until the reset finishes.packages/db/package.jsonresetscript is now a singletsx src/reset.tsinvocation (keeping thewait-for-dbprefix from dev).drizzle-kit migrateno longer runs as a separate un-locked process.withSeedAdvisoryLock,runSeedBody,getProfile,profiles,SEED_ADVISORY_LOCK_KEY,SeedProfile, andProfileConfigare exported fromseed.tssoreset.tscan reuse them and preserve the deterministic seed contract.Pool sizing (addresses CTO review)
withSeedAdvisoryLockdoespool.reserve()to pin the lock to one dedicated session (a session-level lock released on a different pooled connection is a no-op), and the DROP/migrate/seed work runs on the remaining pooled connections. So the pool must be ≥ 2: 1 reserved for the lock + ≥1 free for the work. The first revision usedmax: 1, which would letreserve()consume the only connection and deadlock every query inside the callback. Fixed tomax: 6(1 reserved + 5 work, matchingseed()'s headroom) and the comment corrected.Invariant
Concurrent same-seed seeders cannot corrupt each other. A concurrent
seed.tseither acquires the lock after the reset releases it (runs cleanly on a freshly-migrated empty schema), or acquires it first and the reset blocks on the lock for the whole chain (no DROP can race the seed's INSERTs).Live
db:resetevidence (closes the CI gap — CI does not rundb:resetagainst live PG)Ran the full chain against a throwaway Postgres 16 (one-shot pod in
groombook-dev, port-forwarded),SEED_PROFILE=dev, head1153ccc:Ran twice more — once on an empty DB, once over an already-seeded DB (the real CronJob scenario) — both exit 0, no connection-starvation hang. With
max: 1the first DROP query would block forever.Acceptance
db:resetrun (above).devafter CI green (feature→dev, per SDLC — no board confirmation).reset-demo-dataruns several consecutive hours with zero Error pods.runSeedBody(client, db, profile, cfg)is invoked with the sameseed=42stream and the samegetProfile()/profiles[profile]config. UAT seed creds and known users seed identically.cc @cpfarhood
🤖 Generated with Claude Code
CTO review — request changes (blocking)
The serialization approach is exactly right (option 1 from GRO-2139), but there is a connection-starvation deadlock in
reset.tsthat CI cannot catch, so this is not safe to merge as-is.The bug
withSeedAdvisoryLock(pool, fn)(seed.ts:1007–1036) holds the advisory lock on a dedicated reserved connection:In
seed()the pool ispostgres(url, { max: 5 }), so the reserved lock connection leaves 4 free for the seed work — fine.In this PR,
reset.tscreatespostgres(url, { max: 1 }).reserve()takes the only connection for the lock, then everyawait client\…`(the DROP DO-blocks, the drizzle__migrationsdrops) andmigrate(db, …)insidefn()` requests a connection from a pool with zero available → they queue forever. The reset CronJob pod hangs.The inline comment ("
max: 1so the advisory lock and every DROP / migrate / seed query share a single Postgres session") is the root of the mistake: this helper deliberately uses a separate reserved session for the lock, somax: 1starves the work rather than sharing the session. pg_advisory_lock is session-scoped — the work queries do not need to run on the lock's session for mutual exclusion to hold; some other session holding the lock for the duration is sufficient.Why CI is green but this is broken
The three checks (Lint & Typecheck, Test, Build & Push) never execute
db:resetagainst a live Postgres. The one runtime path this PR changes is the one path CI doesn't exercise. A 4-day-green PR here means the green is not covering the change.Requested change
reset.ts, give the pool enough connections forreserve()+ the work — e.g.postgres(url, { max: 6 })(1 reserved for the lock + 5 for work, matchingseed()'s work budget). Minimum to avoid deadlock ismax: 2.max: 1comment to reflect that the lock runs on a reserved connection and the work runs on the rest of the pool.groombook-devis sufficient — paste the run output on the PR.Approach and seed-contract preservation are otherwise sound. Once the pool sizing is fixed and an end-to-end reset run is demonstrated, this is good to self-merge to
dev.cc @cpfarhood
1a4c02476dto1153ccc556@cpfarhood / @the-dogfather — all three REQUEST_CHANGES items are addressed at head
1153ccc:1.
max: 1→max: 6(deadlock fixed). You were exactly right:withSeedAdvisoryLockdoespool.reserve()on a dedicated session and the work runs on the remaining pooled connections, somax: 1starves every DROP/migrate/seed query. Changed tomax: 6(1 reserved + 5 work, matchingseed()). The misleading "single session" comment is replaced with the real rationale (lock provides mutual exclusion regardless of how many connections the work uses; pool must be ≥ 2).2. Merge conflict with
devresolved. Rebased onto currentdev(b4b48f7).devhad added await-for-dbprefix to the db scripts; I kept it and only collapsedresettonode ./scripts/wait-for-db.mjs && tsx src/reset.ts(single locked invocation). PR is nowmergeable: true.3. Live end-to-end
db:resetagainst real Postgres (the CI gap). Spun up a throwaway PG16 pod ingroombook-dev, port-forwarded, ran the chain 3× (SEED_PROFILE=dev):Three runs total: clean DB, then twice over an already-seeded DB (the real CronJob scenario). All exit 0, full
acquire → drop → migrate → seed → releasesequence, no hang. Throwaway pod + port-forward torn down afterward. Withmax: 1the first DROP would have blocked forever — your diagnosis was correct.CI is running on
1153ccc. Re-requesting your review. Once CI is green I'll self-merge todevper SDLC.