fix(db): wait for/retry DB DNS resolution before drizzle-kit migrate (GRO-2163) #161

Merged
Flea Flicker merged 4 commits from fix/gro-2163-migrate-pre-dns-wait into dev 2026-06-08 13:37:30 +00:00

Summary

  • Add packages/db/scripts/wait-for-db.mjs — a no-deps Node 22 script that resolves the database hostname derived from DATABASE_URL via node:dns.promises with exponential backoff (12 attempts, ~30s total) and only exits 0 once a real IP is returned.
  • Wire it as a pnpm pre-migrate (and pre-seed / pre-reset) hook in @groombook/db so pnpm auto-runs it before any data-plane command. pnpm migrate stays drizzle-kit migrate — the wait is added transparently.
  • The companion infra PR (groombook/infra flea/GRO-2149-stagger-dns-config) carries the K8s-side defense-in-depth: backoffLimit: 4, plus ndots:2 added to dnsConfig.options alongside the single-request-reopen option from GRO-2149.
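
To make the "12 attempts, ~30s total" figure concrete, here is a minimal sketch of a capped exponential backoff schedule. The base delay (250 ms) and cap (4000 ms) are assumptions chosen so the total lands near 30 s; the PR does not state the values the script actually uses, and `backoffDelay` is a hypothetical helper name.

```javascript
// Assumed values: the PR only states "12 attempts, ~30s total".
const MAX_ATTEMPTS = 12;
const BASE_DELAY_MS = 250; // assumption, not taken from the PR
const MAX_DELAY_MS = 4000; // assumption, not taken from the PR

// Delay before retry number `retry` (0-based): doubles each time, then caps.
function backoffDelay(retry, baseMs = BASE_DELAY_MS, maxMs = MAX_DELAY_MS) {
  return Math.min(maxMs, baseMs * 2 ** retry);
}

// 12 attempts means at most 11 sleeps between them.
const delays = Array.from({ length: MAX_ATTEMPTS - 1 }, (_, i) => backoffDelay(i));
const totalMs = delays.reduce((sum, d) => sum + d, 0);

console.log(delays.join(", "));
console.log(`worst-case wait: ${totalMs} ms`); // 31750 ms with the values above
```

With these assumed values the delays grow 250, 500, 1000, 2000 and then stay at 4000 ms, so the worst case is bounded at roughly half a minute rather than growing without limit.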

Why

The first attempt of a fresh migrate-schema pod occasionally hits a transient CoreDNS miss (EAI_AGAIN) on groombook-postgres-rw.<ns>.svc. With backoffLimit: 2 the retry pod usually wins, but three unlucky attempts in a row trip BackoffLimitExceeded. Resolving the hostname once here, with backoff, removes the dice roll at the source so the first attempt reliably succeeds.

Mirrors the belt-and-braces pattern used in GRO-1985 (disable Corepack download fallback): do not try to outsmart CoreDNS, just do not ask drizzle-kit to do the very first DNS lookup of a freshly-scheduled pod.

Test plan

  • node packages/db/scripts/wait-for-db.mjs with DATABASE_URL set to the live paperclip-pg host resolves on attempt 1 in 7ms (sandbox).
  • CI: pnpm --filter @groombook/db typecheck (no TS change).
  • Verify on the next UAT deploy: tail migrate-schema-b5943fb pod logs for a clean [wait-for-db] ok attempt=1 ... line before drizzle-kit migrate output.

Out of scope

  • The "stop Flux from re-creating the Job every reconcile" half of GRO-2163 — that needs a Kustomization-level refactor (run-once gate or per-tag name suffix). Filed as a follow-up on GRO-2163 to keep this PR small and shippable.

Refs

  • GRO-2163
  • GRO-1985 (the pattern this mirrors)
  • GRO-2149 (the K8s-side single-request-reopen fix; carried in the companion infra PR)
Flea Flicker added 1 commit 2026-06-08 00:38:57 +00:00
fix(db): wait for/retry DB DNS resolution before drizzle-kit migrate (GRO-2163)
CI / Test (pull_request) Successful in 10s
CI / Lint & Typecheck (pull_request) Successful in 16s
CI / Build & Push Docker Images (pull_request) Failing after 30m28s
323f6d6bcb
A fresh migrate-schema pod occasionally hits a transient CoreDNS miss
(EAI_AGAIN) on groombook-postgres-rw.<ns>.svc on its first attempt.
With backoffLimit: 2 the retry pod usually wins, but three unlucky
attempts in a row trip BackoffLimitExceeded, and the Job is recreated
on every Flux reconcile (3+ Completed events observed in 8 min in uat).

Add packages/db/scripts/wait-for-db.mjs: a tiny no-deps Node 22 script
that parses DATABASE_URL, resolves the hostname via node:dns.promises
with exponential backoff (12 attempts, ~30s total) and only exits 0
once a real IP is returned. EAI_AGAIN / ENOTFOUND / EAI_NODATA are
retried; any other DNS error is surfaced so drizzle-kit gets a clear
message instead of being starved by retries.
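
The retry behaviour described above can be sketched roughly as follows. This is not the contents of wait-for-db.mjs: the function names, the default delays, and the choice of `dns.promises.lookup` (rather than another resolver call) are assumptions for illustration. Only the retried error codes and the 12-attempt limit come from the commit message.

```javascript
// Codes the commit message says are retried; anything else is rethrown.
const RETRYABLE = new Set(["EAI_AGAIN", "ENOTFOUND", "EAI_NODATA"]);

// Extract the hostname from a postgres:// style connection URL.
function hostFromDatabaseUrl(databaseUrl) {
  return new URL(databaseUrl).hostname;
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: resolve `host`, retrying transient DNS failures.
async function waitForHost(host, opts = {}) {
  // Delay defaults are assumptions; the PR does not state them.
  const { maxAttempts = 12, baseDelayMs = 250, maxDelayMs = 4000 } = opts;
  // Injectable for testing; defaults to dns.promises.lookup (assumed call).
  const lookup = opts.lookup ?? (await import("node:dns/promises")).lookup;

  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    try {
      const { address } = await lookup(host);
      return { address, attempt };
    } catch (err) {
      // Surface non-transient errors immediately, and give up after the last attempt.
      if (!RETRYABLE.has(err.code) || attempt === maxAttempts) throw err;
      await sleep(Math.min(maxDelayMs, baseDelayMs * 2 ** (attempt - 1)));
    }
  }
  throw new Error("maxAttempts must be at least 1");
}
```

A caller would derive the host with `hostFromDatabaseUrl(process.env.DATABASE_URL)`, await `waitForHost(host)`, and exit non-zero if it throws, so the migrate step only starts once a lookup has succeeded.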

Wire it as a pnpm `pre-migrate` (and `pre-seed` / `pre-reset`) hook
in @groombook/db so pnpm auto-runs it before any of the data-plane
commands. Mirrors the belt-and-braces pattern used in GRO-1985
(disable Corepack download fallback): do not try to outsmart CoreDNS,
just do not ask drizzle-kit to perform the very first DNS lookup of a
freshly-scheduled pod.

Defaults are env-tunable (WAIT_FOR_DB_MAX_ATTEMPTS, _BASE_DELAY_MS,
_MAX_DELAY_MS, _SKIP) so a future uat-debug pod can sidestep the
wait if needed.
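
As an illustration of how such env-tunable defaults might be read, here is a small sketch. The full variable names assume the `_BASE_DELAY_MS`, `_MAX_DELAY_MS`, and `_SKIP` shorthand expands with the `WAIT_FOR_DB` prefix; the delay defaults and the accepted values for the skip flag are assumptions, and `readWaitConfig` is a hypothetical helper.

```javascript
// Parse a positive integer from an env map, falling back when unset or invalid.
function intFromEnv(env, name, fallback) {
  const parsed = Number.parseInt(env[name] ?? "", 10);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}

// Hypothetical helper: collect the wait settings from the environment.
function readWaitConfig(env = process.env) {
  return {
    maxAttempts: intFromEnv(env, "WAIT_FOR_DB_MAX_ATTEMPTS", 12),
    baseDelayMs: intFromEnv(env, "WAIT_FOR_DB_BASE_DELAY_MS", 250), // assumed default
    maxDelayMs: intFromEnv(env, "WAIT_FOR_DB_MAX_DELAY_MS", 4000), // assumed default
    // Assumption: "1" or "true" (any case) means skip the wait entirely.
    skip: ["1", "true"].includes((env.WAIT_FOR_DB_SKIP ?? "").toLowerCase()),
  };
}
```

Falling back to the default on an unparsable value keeps a typo in a debug pod's environment from turning into zero attempts.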

Refs: GRO-2163, GRO-1985.
Flea Flicker added 2 commits 2026-06-08 05:40:29 +00:00
fix(db): run wait-for-db inline in migrate/seed/reset (pnpm skips pre-* hooks)
CI / Test (pull_request) Failing after 10m16s
CI / Lint & Typecheck (pull_request) Failing after 10m23s
CI / Build & Push Docker Images (pull_request) Has been skipped
680cfa2bf5
pnpm 9 does not auto-run npm pre-* lifecycle scripts (enable-pre-post-scripts
defaults to false), so the pre-migrate/pre-seed/pre-reset hooks added in the
prior commit never executed under the Dockerfile entrypoint
`pnpm --filter @groombook/db migrate`. Chain wait-for-db.mjs directly into the
migrate/seed/reset scripts so the DNS pre-resolve actually runs on the real
invocation path. Verified locally that `pnpm --filter @groombook/db migrate`
now runs wait-for-db before drizzle-kit. (GRO-2163)

Co-Authored-By: Paperclip <noreply@paperclip.ing>
Flea Flicker added 1 commit 2026-06-08 13:35:08 +00:00
Merge remote-tracking branch 'origin/dev' into fix/gro-2163-migrate-pre-dns-wait
CI / Test (pull_request) Successful in 27s
CI / Lint & Typecheck (pull_request) Successful in 31s
CI / Build & Push Docker Images (pull_request) Successful in 1m18s
840675e89e
Flea Flicker merged commit b9fc688769 into dev 2026-06-08 13:37:30 +00:00