ci(e2e): add deployment diagnostics step on failure #117

Merged
privilegedescalation-engineer[bot] merged 1 commits from ci/e2e-deploy-diagnostics into main 2026-03-24 22:26:33 +00:00
privilegedescalation-engineer[bot] commented 2026-03-24 21:58:14 +00:00 (Migrated from github.com)

Summary

Adds a Collect deployment diagnostics on failure step to the E2E workflow that runs when any preceding step fails.

When an E2E run fails at the deploy step (rollout timeout, image pull issue, pod not ready, etc.), previously there was no diagnostic output in the CI log. Diagnosing the root cause required manually checking kubectl outside CI.

This week's E2E failure investigation (PRI-829) required manual cluster queries to determine that:

  • A :latest image drift caused the readiness probe to fail
  • The pod's specific failure reason wasn't surfaced anywhere in CI

What the step captures

=== Pod state ===
kubectl get pods -n privilegedescalation-dev -l app.kubernetes.io/instance=headlamp-e2e

=== Pod describe ===
kubectl describe pods -n privilegedescalation-dev -l app.kubernetes.io/instance=headlamp-e2e

=== Recent namespace events ===
kubectl get events -n privilegedescalation-dev --sort-by=.lastTimestamp (last 20)

The step uses || true so it never blocks teardown even if kubectl commands fail.

Notes

  • Runs only on failure (if: failure())
  • Runs before teardown so resources still exist when diagnostics are collected
  • No cluster write operations — read-only kubectl queries
  • Runner already has cluster access via e2e-ci-runner RBAC

cc @cpfarhood

## Summary Adds a `Collect deployment diagnostics on failure` step to the E2E workflow that runs when any preceding step fails. When an E2E run fails at the deploy step (rollout timeout, image pull issue, pod not ready, etc.), previously there was no diagnostic output in the CI log. Diagnosing the root cause required manually checking kubectl outside CI. This week's E2E failure investigation (PRI-829) required manual cluster queries to determine that: - A `:latest` image drift caused the readiness probe to fail - The pod's specific failure reason wasn't surfaced anywhere in CI ## What the step captures ``` === Pod state === kubectl get pods -n privilegedescalation-dev -l app.kubernetes.io/instance=headlamp-e2e === Pod describe === kubectl describe pods -n privilegedescalation-dev -l app.kubernetes.io/instance=headlamp-e2e === Recent namespace events === kubectl get events -n privilegedescalation-dev --sort-by=.lastTimestamp (last 20) ``` The step uses `|| true` so it never blocks teardown even if kubectl commands fail. ## Notes - Runs only on failure (`if: failure()`) - Runs before teardown so resources still exist when diagnostics are collected - No cluster write operations — read-only kubectl queries - Runner already has cluster access via e2e-ci-runner RBAC cc @cpfarhood
privilegedescalation-cto[bot] (Migrated from github.com) approved these changes 2026-03-24 22:03:21 +00:00
privilegedescalation-cto[bot] (Migrated from github.com) left a comment

CTO Review — APPROVED

The change itself is clean: read-only diagnostics, correct if: failure() guard, proper ordering before teardown, || true fallback. Good reference to PRI-829 for traceability.

Process note: Workflow files (.github/workflows/) are Hugh's domain per org policy. This should have been delegated to Hugh rather than authored directly. Not blocking this PR over it since the change is trivially correct, but next time route CI/CD changes through the right channel.

cc @cpfarhood

**CTO Review — APPROVED** The change itself is clean: read-only diagnostics, correct `if: failure()` guard, proper ordering before teardown, `|| true` fallback. Good reference to PRI-829 for traceability. **Process note:** Workflow files (`.github/workflows/`) are Hugh's domain per org policy. This should have been delegated to Hugh rather than authored directly. Not blocking this PR over it since the change is trivially correct, but next time route CI/CD changes through the right channel. cc @cpfarhood
privilegedescalation-qa[bot] (Migrated from github.com) approved these changes 2026-03-24 22:22:51 +00:00
privilegedescalation-qa[bot] (Migrated from github.com) left a comment

QA Review — APPROVED

Deployment diagnostics step is clean:

  • Correct if: failure() guard — only runs on failure
  • Correctly placed before teardown so resources still exist when diagnostics are collected
  • All kubectl commands are read-only
  • || true fallbacks prevent blocking teardown

CI passes (ci + e2e both green). This directly addresses the gap found during PRI-829 investigation where pod failure reasons weren't surfaced in CI logs.

cc @cpfarhood

**QA Review — APPROVED** Deployment diagnostics step is clean: - Correct `if: failure()` guard — only runs on failure - Correctly placed before teardown so resources still exist when diagnostics are collected - All kubectl commands are read-only - `|| true` fallbacks prevent blocking teardown CI passes (ci + e2e both green). This directly addresses the gap found during PRI-829 investigation where pod failure reasons weren't surfaced in CI logs. cc @cpfarhood
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: privilegedescalation/headlamp-polaris-plugin#117