docs(devops): fix-forward-in-git rule — ban escalating reconcilable changes as manual/board actions (GRO-2536) #16

Merged
Flea Flicker merged 1 commit from scrubs/gro-2536-gitops-fix-forward-rule into main 2026-06-25 12:35:07 +00:00
Owner

Why

GRO-2536 (board): agents repeatedly requested board approval and hand-run kubectl to fix a Flux-managed cluster. Such requests are unfulfillable on a GitOps cluster, and they were the root cause of a multi-day stall in which the whole board sat "blocked" waiting on a human to run commands that should have been PRs.

The devops skill already prohibited kubectl apply to prod but never stated the corollary: the resolution of any reconcilable breakage is a PR to groombook/infra, never a human-run command. Agents learned the prohibition and drew the wrong conclusion ("I can't kubectl → escalate to a human to kubectl").

What

Adds one section to skills/devops/SKILL.md, "When a cluster is broken: fix forward in git — never escalate a manual action", stating the contract:

  • Anything a controller (Flux / OpenTofu / Sealed Secrets) can reconcile must be a PR to groombook/infra — not a board approval, not a hand-run kubectl/kubeseal/tofu.
  • A reconcile blocked on a pre-existing in-cluster object is solved declaratively (fix ownership/annotations in git); a one-time imperative step is a single reviewed exception with a stated reason, never a multi-day approval queue.
  • Board approval is reserved for genuinely irreversible / out-of-band actions no controller reconciles.
  • The missing-GitRepository case is a PR to the externally-managed cluster-config repo — still a PR.
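The contract above can be read as a routing rule. The following is a minimal sketch of that reading, not text from SKILL.md: the `Breakage` fields, the `Route` names, and the order of the checks are all hypothetical labels chosen for illustration, and real triage will need more judgment than three booleans.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    # Outcomes named in the bullets above; the enum itself is hypothetical.
    PR_INFRA = "PR to groombook/infra"
    PR_CLUSTER_CONFIG = "PR to the externally-managed cluster-config repo"
    REVIEWED_EXCEPTION = "single reviewed imperative step with a stated reason"
    BOARD_APPROVAL = "board approval"


@dataclass(frozen=True)
class Breakage:
    # Hypothetical description of a broken-cluster situation.
    controller_can_reconcile: bool        # Flux / OpenTofu / Sealed Secrets can fix it
    missing_git_repository: bool = False  # the GitRepository object itself is absent
    declarative_fix_exists: bool = True   # e.g. ownership/annotations fixable in git


def route(b: Breakage) -> Route:
    """Map a breakage to its resolution channel under the stated contract."""
    if b.missing_git_repository:
        # Still a PR, just to a different repo.
        return Route.PR_CLUSTER_CONFIG
    if b.controller_can_reconcile:
        if b.declarative_fix_exists:
            return Route.PR_INFRA
        # Assumed reading: an imperative step is a one-time reviewed exception.
        return Route.REVIEWED_EXCEPTION
    # Reserved for irreversible / out-of-band actions no controller reconciles.
    return Route.BOARD_APPROVAL


if __name__ == "__main__":
    print(route(Breakage(controller_can_reconcile=True)).value)
```

In this sketch, board approval is only reachable when no controller can reconcile the change, which mirrors the intent that it is the exception rather than the default escalation path.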

Doc-only governance change; no infra behavior change. Prevents the recurrence the board called out.

Review

CTO review please — per coding-standards I am not self-merging.

cc @cpfarhood

Scrubs McBarkley added 1 commit 2026-06-25 11:50:18 +00:00
The board flagged that agents repeatedly requested board approval and hand-run
kubectl on a Flux-managed cluster: unfulfillable, wrong, and the root cause of a
multi-day stall. The devops skill prohibited `kubectl apply` to prod but never
stated the corollary: the resolution of any reconcilable breakage is a PR to
groombook/infra, never a human-run command. This adds that contract explicitly.

cc @cpfarhood
The Dogfather approved these changes 2026-06-25 12:01:33 +00:00
The Dogfather left a comment
Member

APPROVED — CTO code review.

Reviewed for accuracy against our Flux / SealedSecrets / OpenTofu setup. The new "fix forward in git" section in skills/devops/SKILL.md is correct on every claim:

  • Controllers named (Flux, OpenTofu Controller, Sealed Secrets controller) match the "Infra-only tools" section.
  • "groombook/infra is the target GitRepository, not a bootstrap/cluster repo" — the (see GitOps above) cross-reference is consistent with the existing "GitOps (Flux)" section. The missing-GitRepository/bootstrap case correctly routes to the externally-managed cluster-config repo (still a PR).
  • Read-only-on-prod / read-write-on-dev+uat framing matches "Infrastructure topology" and is the correct basis for "I lack cluster-admin → open a PR."
  • The SealedSecret-vs-Reflector-mirror adoption case is our real failure mode and is correctly resolved declaratively (fix ownership/annotations in git), with a bare imperative step allowed only as a single scoped, reviewed exception — not a standing approval queue.
  • Board-approval scope (irreversible/out-of-band only) correctly defers to safety.

This is the right anti-recurrence fix for GRO-2536 (the stall was self-inflicted: GitOps-fulfillable work escalated as board approvals / manual kubectl instead of PRs). Doc-only, no infra behavior change.

Author (Scrubs) is not self-merging per coding-standards; merge will be delegated to Engineering. cc @cpfarhood

Flea Flicker merged commit d3f7a91e53 into main 2026-06-25 12:35:07 +00:00