Compare commits

...

4 Commits

Author SHA1 Message Date
Flea Flicker 323f6d6bcb fix(db): wait for/retry DB DNS resolution before drizzle-kit migrate (GRO-2163)
CI / Test (pull_request) Successful in 10s
CI / Lint & Typecheck (pull_request) Successful in 16s
CI / Build & Push Docker Images (pull_request) Failing after 30m28s
A fresh migrate-schema pod occasionally hits a transient CoreDNS miss
(EAI_AGAIN) on groombook-postgres-rw.<ns>.svc on its first attempt.
With backoffLimit: 2 the retry pod usually wins, but three unlucky
attempts in a row trips BackoffLimitExceeded and the Job is recreated
on every Flux reconcile (3+ Completed events observed in 8 min in uat).

Add packages/db/scripts/wait-for-db.mjs: a tiny no-deps Node 22 script
that parses DATABASE_URL, resolves the hostname via node:dns.promises
with exponential backoff (12 attempts, ~30s total) and only exits 0
once a real IP is returned. EAI_AGAIN / ENOTFOUND / EAI_NODATA are
retried; any other DNS error is surfaced so drizzle-kit gets a clear
message instead of being starved by retries.

Wire it as a pnpm `pre-migrate` (and `pre-seed` / `pre-reset`) hook
in @groombook/db so pnpm auto-runs it before any of the data-plane
commands. Mirrors the belt-and-braces pattern used in GRO-1985
(disable Corepack download fallback): do not try to outsmart CoreDNS,
just do not ask drizzle-kit to perform the very first DNS lookup of a
freshly-scheduled pod.

Defaults are env-tunable (WAIT_FOR_DB_MAX_ATTEMPTS, _BASE_DELAY_MS,
_MAX_DELAY_MS, _SKIP) so a future uat-debug pod can sidestep the
wait if needed.

Refs: GRO-2163, GRO-1985.
2026-06-08 00:31:38 +00:00
Flea Flicker f67b96ddfe Merge pull request 'fix(GRO-2123): serialize seed.ts with Postgres advisory lock' (#155) from flea-flicker/gro-2123-seed-advisory-lock into dev
CI / Test (push) Successful in 11s
CI / Lint & Typecheck (push) Successful in 16s
CI / Build & Push Docker Images (push) Successful in 25s
CI / Test (pull_request) Successful in 10s
CI / Lint & Typecheck (pull_request) Successful in 16s
CI / Build & Push Docker Images (pull_request) Successful in 28s
2026-06-04 11:23:41 +00:00
Flea Flicker d1a68d93de fix(GRO-2123): serialize seed.ts with Postgres advisory lock
CI / Test (pull_request) Successful in 13s
CI / Lint & Typecheck (pull_request) Successful in 15s
CI / Build & Push Docker Images (pull_request) Successful in 58s
The reset-demo-data CronJob in groombook-uat intermittently failed with
FK 23503 on invoice_tip_splits because two pods could run the seed
concurrently: the new pod's TRUNCATE deleted rows the old pod was still
inserting.

Acquire a session-level advisory lock for the full duration of the seed.
CRITICAL: with postgres-js connection pooling, a pg_advisory_lock
acquired on one pooled connection and released on a different one is a
no-op (the lock is bound to the pg-backend that took it). We therefore
reserve a dedicated connection for the lock, take pg_advisory_lock(KEY)
on it, run the seed on the pooled connections, and release the lock +
reserved connection in a try/finally so a thrown seed error cannot leak
the lock or the connection.

Defence-in-depth with the infra PR that switches
concurrencyPolicy: Replace → Forbid on the reset-demo-data CronJob.

- Adds withSeedAdvisoryLock helper and runSeedBody extracted function
- Wraps seed() body in the helper; client.end() runs after the lock
  releases so a reserved connection is not returned to a closed pool
- SEED_ADVISORY_LOCK_KEY = 0x47524f4f ("GROO" in ASCII) — arbitrary
  stable 32-bit key, referenced in runbooks
- UAT_PLAYBOOK.md §3.29 documents the regression check

cc @cpfarhood

Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-06-04 11:12:17 +00:00
Flea Flicker e9f94a2bd7 fix(seed): GRO-2100 run uat-groomer linkage AFTER services seed (regression in #151) (#153)
CI / Test (push) Successful in 12s
CI / Test (pull_request) Successful in 12s
CI / Lint & Typecheck (pull_request) Successful in 15s
CI / Build & Push Docker Images (pull_request) Successful in 29s
CI / Lint & Typecheck (push) Failing after 12m57s
CI / Build & Push Docker Images (push) Has been skipped
fix(seed): GRO-2100 run uat-groomer linkage after services seed (#153)

Co-authored-by: Flea Flicker <flea@groombook.dev>
Co-committed-by: Flea Flicker <flea@groombook.dev>
2026-06-02 20:11:45 +00:00
4 changed files with 224 additions and 7 deletions
+1
View File
@@ -166,6 +166,7 @@ Expected: one row, `role = 'groomer'`. If zero rows return, the request hit the
| TC-API-3.26 | Verify 25-35% medicalAlerts distribution | GET /api/pets (first 30 pets), count how many have non-empty medicalAlerts | Ratio is 25-35% (seed uses rand() < 0.3 for ~30% distribution) |
| TC-API-3.27 | Verify coat_type enum has all seed values | After UAT seed completes, inspect the coat_type enum on the UAT DB — it must contain: short, medium, long, double, wire, silky, curly, hairless | UAT seed jobs (`reset-demo-data`, `seed-test-data`) complete 1/1 with no `enum_in` error; coat_type includes all 8 values used by seed.ts `coatTypePool` |
| TC-API-3.28 | Verify pet_size_category enum has all seed values | After UAT seed completes, inspect the pet_size_category enum on the UAT DB — it must contain: small, medium, large, extra_large | UAT seed jobs (`reset-demo-data`, `seed-test-data`) complete 1/1 with no `enum_in` error; pet_size_category includes all 4 values used by seed.ts `petSizeCategoryPool` (regression for GRO-1999, mirrors TC-API-3.27) |
| TC-API-3.29 | Verify `reset-demo-data` CronJob does not fail with FK 23503 on `invoice_tip_splits` (GRO-2123) | Trigger the CronJob manually: `kubectl create job --from=cronjob/reset-demo-data verify-gro2123 -n groombook-uat`. Wait for pod to terminate. Inspect logs: `kubectl logs -n groombook-uat -l job-name=verify-gro2123` | Pod reaches `Completed` state; logs show `✓ Acquired seed advisory lock` and `✓ Released seed advisory lock` from `seed.ts`; no `PostgresError: … violates foreign key constraint "invoice_tip_splits_invoice_id_invoices_id_fk"` (code 23503); final counts unchanged (500 clients, ~4000 invoices) |
### 4.4 Appointment Scheduling
+3
View File
@@ -18,7 +18,10 @@
"scripts": {
"build": "tsc --project .",
"generate": "drizzle-kit generate",
"pre-migrate": "node ./scripts/wait-for-db.mjs",
"migrate": "drizzle-kit migrate",
"pre-seed": "node ./scripts/wait-for-db.mjs",
"pre-reset": "node ./scripts/wait-for-db.mjs",
"seed": "tsx src/seed.ts",
"reset": "tsx src/reset.ts && drizzle-kit migrate && tsx src/seed.ts",
"studio": "drizzle-kit studio",
+104
View File
@@ -0,0 +1,104 @@
#!/usr/bin/env node
// wait-for-db.mjs
//
// GRO-2163: wait for / retry DNS resolution of the database hostname derived
// from DATABASE_URL before invoking `drizzle-kit migrate`. The first attempt
// of a fresh migrate-schema pod occasionally hits a transient CoreDNS miss
// (EAI_AGAIN) on `groombook-postgres-rw.<ns>.svc`; with backoffLimit: 2 the
// retry pod usually wins, but three unlucky attempts in a row trips
// BackoffLimitExceeded. Resolving once here, with backoff, removes the dice
// roll at the source so the first attempt reliably succeeds.
//
// Mirrors the belt-and-braces pattern used in GRO-1985 (no Corepack
// download fallback): we don't try to outsmart CoreDNS, we just don't ask
// drizzle-kit to do the very first DNS lookup of a freshly-scheduled pod.
//
// Configuration (env):
// WAIT_FOR_DB_MAX_ATTEMPTS default 12 (~30s of total wait at default backoff)
// WAIT_FOR_DB_BASE_DELAY_MS default 500
// WAIT_FOR_DB_MAX_DELAY_MS default 5000
// WAIT_FOR_DB_SKIP default unset; set to "1" to skip (debug only)
//
// On success: exit 0. On exhaustion: exit 1 so the Job's backoff is
// preserved (we don't want to silently mask a real outage by giving up
// after 30s and letting drizzle-kit fail with a less-actionable error).
import { setTimeout as delay } from "node:timers/promises";
import dns from "node:dns/promises";
const MAX_ATTEMPTS = Number(process.env.WAIT_FOR_DB_MAX_ATTEMPTS ?? 12);
const BASE_DELAY_MS = Number(process.env.WAIT_FOR_DB_BASE_DELAY_MS ?? 500);
const MAX_DELAY_MS = Number(process.env.WAIT_FOR_DB_MAX_DELAY_MS ?? 5000);
function parseHost(databaseUrl) {
try {
return new URL(databaseUrl).hostname || null;
} catch {
return null;
}
}
async function resolveOnce(host) {
const start = Date.now();
const result = await dns.lookup(host);
return { address: result.address, ms: Date.now() - start };
}
async function main() {
if (process.env.WAIT_FOR_DB_SKIP === "1") {
console.log("[wait-for-db] WAIT_FOR_DB_SKIP=1, skipping");
return;
}
const databaseUrl = process.env.DATABASE_URL;
if (!databaseUrl) {
// Don't gate the migrate on a misconfigured env — let drizzle-kit fail
// loudly with its own clear error.
console.warn("[wait-for-db] DATABASE_URL not set; skipping");
return;
}
const host = parseHost(databaseUrl);
if (!host) {
console.warn(`[wait-for-db] could not parse hostname from DATABASE_URL; skipping`);
return;
}
console.log(
`[wait-for-db] host=${host} max_attempts=${MAX_ATTEMPTS} ` +
`base_delay_ms=${BASE_DELAY_MS} max_delay_ms=${MAX_DELAY_MS}`,
);
for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
try {
const { address, ms } = await resolveOnce(host);
console.log(`[wait-for-db] ok attempt=${attempt} host=${host} -> ${address} (${ms}ms)`);
return;
} catch (err) {
const code = err?.code ?? "UNKNOWN";
const transient = code === "EAI_AGAIN" || code === "ENOTFOUND" || code === "EAI_NODATA";
if (!transient) {
// Hard error (e.g. invalid hostname): surface and let drizzle-kit fail
// with a real error rather than spinning.
console.error(`[wait-for-db] non-transient DNS error attempt=${attempt} code=${code}: ${err.message}`);
process.exit(1);
}
if (attempt === MAX_ATTEMPTS) {
console.error(
`[wait-for-db] exhausted attempts=${MAX_ATTEMPTS} host=${host} last_code=${code}; exiting 1`,
);
process.exit(1);
}
const backoff = Math.min(
MAX_DELAY_MS,
BASE_DELAY_MS * 2 ** (attempt - 1) + Math.floor(Math.random() * BASE_DELAY_MS),
);
console.log(
`[wait-for-db] transient attempt=${attempt} code=${code} retry_in_ms=${backoff}`,
);
await delay(backoff);
}
}
}
main().catch((err) => {
console.error(`[wait-for-db] fatal: ${err?.message ?? err}`);
process.exit(1);
});
+116 -7
View File
@@ -401,7 +401,9 @@ const servicesDef = [
*
* In seedKnownUsers() this replaces the inline UAT-staff block.
*/
async function seedUatStaffAccounts(db: ReturnType<typeof drizzle>) {
async function seedUatStaffAccounts(
db: ReturnType<typeof drizzle>,
): Promise<string | null> {
// ── Staff: UAT Super User (oidcSub from SEED_UAT_SUPER_OIDC_SUB env var) ──
const uatSuperOidcSub = process.env.SEED_UAT_SUPER_OIDC_SUB;
if (uatSuperOidcSub) {
@@ -677,7 +679,12 @@ async function seedUatStaffAccounts(db: ReturnType<typeof drizzle>) {
// We deterministically link the UAT groomer to the UAT customer's first pet
// ("UAT Pup Alpha") and leave the second pet ("UAT Pup Beta") UNLINKED so
// TC-UAT-2 (200) and TC-UAT-3 (403) can both hardcode the stable petIds.
await seedUatGroomerLinkage(db, uatCustomerClientId);
//
// The linkage call itself is performed by the caller AFTER the `services`
// catalogue has been seeded (this helper runs before services exist,
// which previously caused the linkage to be silently skipped on every
// reset). GRO-2100 follow-up.
return uatCustomerClientId;
}
/**
@@ -692,12 +699,18 @@ async function seedUatStaffAccounts(db: ReturnType<typeof drizzle>) {
*/
async function seedUatGroomerLinkage(
db: ReturnType<typeof drizzle>,
customerClientId: string,
customerClientId: string | null,
): Promise<void> {
const uatGroomerEmail = "uat-groomer@groombook.dev";
const LINKED_PET_ID = "c0000001-0000-0000-0000-000000000002"; // UAT Pup Alpha
const APPT_ID = "a0000001-0000-0000-0000-000000000001";
// Skip silently if the UAT Customer client wasn't created (non-UAT seed
// profile, e.g. seedKnownUsers() in an env without the UAT personas).
if (!customerClientId) {
return;
}
// Only run if the UAT groomer staff record actually exists — dev/test seeds
// that don't set SEED_UAT_STAFF_OIDC_SUB should not crash.
const [uatGroomerStaff] = await db
@@ -720,6 +733,19 @@ async function seedUatGroomerLinkage(
return;
}
// Skip if the linked pet hasn't been seeded yet (defensive: caller should
// ensure pets exist; if the helper is re-ordered later we don't want to
// crash here).
const [linkedPet] = await db
.select({ id: schema.pets.id })
.from(schema.pets)
.where(eq(schema.pets.id, LINKED_PET_ID))
.limit(1);
if (!linkedPet) {
console.warn(`⚠ GRO-2100: UAT Pup Alpha (${LINKED_PET_ID}) not found — skipping uat-groomer linkage`);
return;
}
// The "Bath & Brush" service id is stable across the reset; falls back to
// any active service if it has not been seeded yet (e.g. seedKnownUsers
// runs in isolation).
@@ -847,7 +873,7 @@ async function seedKnownUsers() {
// ── UAT staff accounts + Better Auth credentials (shared impl) ──────────────
// Extracted into seedUatStaffAccounts() so it runs in both seedKnownUsers()
// and the full seed() UAT branch.
await seedUatStaffAccounts(db);
const uatCustomerClientId = await seedUatStaffAccounts(db);
// ── Services: idempotent upsert keyed on `id` ─────────────────────────────
// GRO-2064: previously keyed on `services.name` while writing a
@@ -875,6 +901,12 @@ async function seedKnownUsers() {
}
console.log(`✓ Seeded ${demoSvcs.length} services`);
// GRO-2100: deterministic uat-groomer ↔ UAT Pup Alpha linkage. Must run
// AFTER services are seeded (this helper looks up an active service id
// to attach to the appointment; on a fresh reset there are none yet at
// the time seedUatStaffAccounts() returns).
await seedUatGroomerLinkage(db, uatCustomerClientId);
// ── Client: Demo Client ──
const [existingClient] = await db
.select()
@@ -944,6 +976,63 @@ async function seedKnownUsers() {
// ── Main seed ────────────────────────────────────────────────────────────────
// ── GRO-2123: serialize reset+seed with a Postgres advisory lock ────────
// The reset-demo-data CronJob runs on an hourly schedule. With
// concurrencyPolicy=Replace, a new pod can start while the previous one
// is still mid-seed; the new pod's TRUNCATE then deletes rows the old pod
// is still inserting, producing FK 23503 errors non-deterministically
// (see GRO-2123: invoice_tip_splits → invoices).
//
// We hold a session-level advisory lock for the full duration of the
// seed so that overlapping invocations block then proceed in order —
// not skip. The key is a stable 32-bit constant so it can be referenced
// from runbooks without ambiguity and binds to the single-argument
// `pg_advisory_lock(int)` form, which postgres-js serializes as a plain
// number (no bigint type plumbing required).
const SEED_ADVISORY_LOCK_KEY = 0x47524f4f; // "GROO" in ASCII — arbitrary, stable
/**
* Reserve a dedicated connection from `pool`, take the seed advisory lock
* on it, run `fn`, and release the lock + connection in a try/finally.
*
* CRITICAL: with postgres-js connection pooling, a session-level
* `pg_advisory_lock(KEY)` acquired on one pooled connection and released
* on a *different* one is a no-op (the lock is bound to the session /
* pg-backend that took it). We therefore reserve a dedicated connection
* for the lock and release it from the same reserved connection. The
* seed work itself still runs on the pooled connections.
*/
async function withSeedAdvisoryLock<T>(
pool: ReturnType<typeof postgres>,
fn: () => Promise<T>,
): Promise<T> {
const lockConnection = await pool.reserve();
let lockHeld = false;
try {
await lockConnection`SELECT pg_advisory_lock(${SEED_ADVISORY_LOCK_KEY})`;
lockHeld = true;
console.log(`✓ Acquired seed advisory lock (key=${SEED_ADVISORY_LOCK_KEY})`);
const result = await fn();
await lockConnection`SELECT pg_advisory_unlock(${SEED_ADVISORY_LOCK_KEY})`;
lockHeld = false;
console.log(`✓ Released seed advisory lock`);
return result;
} finally {
if (lockHeld) {
try {
await lockConnection`SELECT pg_advisory_unlock(${SEED_ADVISORY_LOCK_KEY})`;
} catch (err) {
console.error("Failed to release seed advisory lock during cleanup:", err);
}
}
try {
lockConnection.release();
} catch (err) {
console.error("Failed to release reserved lock connection:", err);
}
}
}
async function seed() {
const url = process.env.DATABASE_URL;
if (!url) {
@@ -961,6 +1050,22 @@ async function seed() {
const client = postgres(url, { max: 5 });
const db = drizzle(client, { schema });
// GRO-2123: hold the seed advisory lock for the full body of runSeedBody.
// See the withSeedAdvisoryLock comment for why a reserved connection is
// required (postgres-js pooling would silently drop the lock otherwise).
await withSeedAdvisoryLock(client, async () => {
return await runSeedBody(client, db, profile, cfg);
});
await client.end();
}
async function runSeedBody(
client: ReturnType<typeof postgres>,
db: ReturnType<typeof drizzle>,
profile: SeedProfile,
cfg: ProfileConfig,
): Promise<void> {
console.log(`Seeding Groom Book database (profile: ${profile})...\n`);
// ── Staff ──
@@ -1031,7 +1136,7 @@ async function seed() {
// ── UAT staff accounts + Better Auth credentials (shared impl) ──────────────
// Seeds deterministic UAT staff with numeric OIDC subs and Better Auth credentials.
// Must run AFTER random staff are created so upserts land correctly.
await seedUatStaffAccounts(db);
const uatCustomerClientId = await seedUatStaffAccounts(db);
// ── Services ──
// GRO-2064: key the upsert on `services.id` (not `name`) so deterministic
@@ -1058,6 +1163,12 @@ async function seed() {
}
console.log(`✓ Created ${servicesDef.length} services`);
// GRO-2100: deterministic uat-groomer ↔ UAT Pup Alpha linkage. Must run
// AFTER services are seeded (this helper looks up an active service id
// to attach to the appointment; on a fresh reset there are none yet at
// the time seedUatStaffAccounts() returns).
await seedUatGroomerLinkage(db, uatCustomerClientId);
// ── Clients & Pets ──
const now = new Date();
const appointmentsBackDate = new Date(now);
@@ -1576,8 +1687,6 @@ async function seed() {
}
console.log(`✓ Created ${visitLogCount} grooming visit logs`);
console.log("\nSeed complete!");
await client.end();
}
seed().catch((err) => {