docs: add architecture decision records

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 13:49:59 +00:00
parent 076fa29995
commit d39a48a7d0
6 changed files with 325 additions and 0 deletions
@@ -0,0 +1,59 @@
+# ADR 001: React Context for Shared CSI Driver State
+
+**Status**: Accepted
+
+**Date**: 2026-03-05
+
+**Deciders**: Development Team
+
+---
+
+## Context
+
+The TNS CSI plugin needs to share data across multiple views: Overview, StorageClasses, Volumes, Snapshots, Metrics, and Benchmark pages, plus detail view sections for PVC, PV, and Pod. Data comes from three tracks:
+
+1. **Headlamp `useList()` hooks** — StorageClass, PersistentVolume, PersistentVolumeClaim
+2. **`ApiProxy.request()`** — CSIDriver resource, controller/node pods, VolumeSnapshotClasses, and VolumeSnapshots
+3. **TrueNAS WebSocket API** — Pool capacity stats (optional, when API key is configured in settings)
+
+The context exposes: `csiDriver`, `driverInstalled`, `storageClasses`, `persistentVolumes`, `persistentVolumeClaims`, `controllerPods`, `nodePods`, `volumeSnapshots`, `volumeSnapshotClasses`, `snapshotCrdAvailable`, `poolStats`, `poolStatsError`, `loading`, `error`, `refresh`.
+
+---
+
+## Decision
+
+Use a single React Context provider, `TnsCsiDataProvider`, that wraps all routes. Data is fetched on three tracks:
+
+1. `useList()` for standard Kubernetes resources (StorageClass, PV, PVC)
+2. `ApiProxy.request()` in `useEffect` for CSI-specific resources and snapshots
+3. TrueNAS WebSocket client for pool capacity stats (only when API key is configured in settings)
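
As a rough illustration of the decision above, the three tracks can be merged into one shared value by a pure function. This is a minimal sketch, not the plugin's actual provider: `TrackResults`, `SharedState`, and `buildSharedState` are hypothetical names, the object shapes are trimmed to the fields used here, and the provisioner string is passed in by the caller because the ADR does not state it.

```typescript
// Minimal stand-ins for the Kubernetes objects; real objects have many more fields.
interface StorageClass {
  metadata: { name: string };
  provisioner: string;
}
interface Pvc {
  metadata: { name: string; namespace: string };
  spec: { storageClassName?: string };
}

// Raw results of the three tracks (hypothetical shape).
interface TrackResults {
  storageClasses: StorageClass[] | null; // track 1: null while loading
  persistentVolumeClaims: Pvc[] | null; // track 1: null while loading
  csiDriver: object | null | undefined; // track 2: undefined = loading, null = not found
  poolStats: object | null; // track 3: optional, stays null without an API key
}

// Subset of the context fields listed in the Context section.
interface SharedState {
  driverInstalled: boolean;
  storageClasses: StorageClass[];
  persistentVolumeClaims: Pvc[];
  poolStats: object | null;
  loading: boolean;
}

function buildSharedState(tracks: TrackResults, provisioner: string): SharedState {
  // Keep only the StorageClasses served by this CSI driver.
  const storageClasses = (tracks.storageClasses ?? []).filter(
    sc => sc.provisioner === provisioner
  );
  const classNames = storageClasses.map(sc => sc.metadata.name);
  // Cross-reference: keep PVCs whose StorageClass belongs to the driver.
  const persistentVolumeClaims = (tracks.persistentVolumeClaims ?? []).filter(pvc => {
    const className = pvc.spec.storageClassName;
    return className !== undefined && classNames.indexOf(className) !== -1;
  });
  return {
    driverInstalled: tracks.csiDriver !== null && tracks.csiDriver !== undefined,
    storageClasses,
    persistentVolumeClaims,
    poolStats: tracks.poolStats,
    // Track 3 is optional, so it never blocks the loading flag.
    loading:
      tracks.storageClasses === null ||
      tracks.persistentVolumeClaims === null ||
      tracks.csiDriver === undefined,
  };
}
```

The PVC filter depends on the StorageClass filter, which is the kind of cross-reference that would need coordination if the data lived in separate contexts.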
+
+---
+
+## Consequences
+
+- ✅ Single fetch point eliminates duplicate API calls
+- ✅ All views share consistent data — no stale data across pages
+- ✅ Three-track strategy handles different API requirements cleanly
+- ✅ TrueNAS integration is opt-in — plugin works without it
+- ⚠️ Large context with many fields increases cognitive overhead
+- ⚠️ TrueNAS WebSocket adds complexity to the data layer
+- ⚠️ All consumers re-render on any data change; mitigated because updates arrive only once per polling interval
+
+---
+
+## Alternatives Considered
+
+1. **Separate contexts per data domain** — Rejected. Data is cross-referenced (PVCs filter by StorageClass provisioner), so splitting contexts would require cross-context coordination.
+
+2. **Custom hooks without context** — Rejected. Would duplicate fetches across 6 pages, leading to redundant API calls and inconsistent data.
+
+3. **Redux/Zustand** — Rejected. Not available in the Headlamp plugin environment.
+
+---
+
+## Changelog
+
+| Date | Change |
+|------|--------|
+| 2026-03-05 | Initial decision |
@@ -0,0 +1,60 @@
+# ADR 002: Read-Only Plugin with Benchmark Exception
+
+**Status**: Accepted
+
+**Date**: 2026-03-05
+
+**Deciders**: Development Team
+
+---
+
+## Context
+
+The plugin is primarily a read-only observability tool for TNS CSI storage. However, it includes a Benchmark feature that runs kbench (FIO-based storage benchmarks) against storage classes. Running benchmarks requires creating temporary Kubernetes resources: a PVC for the test volume and a Job running the kbench container.
+
+These resources are tagged with `app.kubernetes.io/managed-by=headlamp-tns-csi-plugin` for lifecycle tracking. The benchmark workflow includes:
+
+1. `buildPvcManifest()` — Create PVC spec for test volume
+2. `createPvc()` — Create the PVC in the cluster
+3. `buildJobManifest()` — Create Job spec for kbench container
+4. `createJob()` — Create the Job in the cluster
+5. Poll for Job completion
+6. `fetchKbenchLogs()` — Retrieve benchmark output from pod logs
+7. `parseKbenchLog()` — Parse FIO results from kbench output
+8. `deleteJob()` — Clean up the benchmark Job
+9. `deletePvc()` — Clean up the test PVC
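
The nine steps above could be orchestrated roughly as follows. This is a sketch under assumptions: the ADR names the helper functions but not their signatures, so the `BenchmarkClient` interface, `BenchmarkOptions`, `waitForJob`, the mount path, and the manifest details are all hypothetical. Only the `managed-by` label comes from the ADR. The `try/finally` is one way to make steps 8 and 9 run even when an earlier step throws; the ADR does not say whether the plugin does this.

```typescript
// Label taken from the ADR; everything else here is an illustrative assumption.
const MANAGED_BY = { 'app.kubernetes.io/managed-by': 'headlamp-tns-csi-plugin' };

// Hypothetical client: the ADR names these helpers but not their signatures.
interface BenchmarkClient {
  createPvc(manifest: object): Promise<void>;
  createJob(manifest: object): Promise<void>;
  waitForJob(namespace: string, name: string): Promise<void>;
  fetchKbenchLogs(namespace: string, jobName: string): Promise<string>;
  deleteJob(namespace: string, name: string): Promise<void>;
  deletePvc(namespace: string, name: string): Promise<void>;
}

interface BenchmarkOptions {
  namespace: string;
  name: string; // used for both the PVC and the Job
  storageClassName: string;
  size: string; // e.g. '10Gi'
  image: string; // kbench container image, supplied by the caller
}

function buildPvcManifest(opts: BenchmarkOptions) {
  return {
    apiVersion: 'v1',
    kind: 'PersistentVolumeClaim',
    metadata: { name: opts.name, namespace: opts.namespace, labels: { ...MANAGED_BY } },
    spec: {
      storageClassName: opts.storageClassName,
      accessModes: ['ReadWriteOnce'],
      resources: { requests: { storage: opts.size } },
    },
  };
}

function buildJobManifest(opts: BenchmarkOptions) {
  return {
    apiVersion: 'batch/v1',
    kind: 'Job',
    metadata: { name: opts.name, namespace: opts.namespace, labels: { ...MANAGED_BY } },
    spec: {
      backoffLimit: 0,
      template: {
        metadata: { labels: { ...MANAGED_BY } },
        spec: {
          restartPolicy: 'Never',
          containers: [
            { name: 'kbench', image: opts.image, volumeMounts: [{ name: 'vol', mountPath: '/volume' }] },
          ],
          volumes: [{ name: 'vol', persistentVolumeClaim: { claimName: opts.name } }],
        },
      },
    },
  };
}

// Runs the workflow and returns the raw kbench log (parsing is left out).
async function runBenchmark(client: BenchmarkClient, opts: BenchmarkOptions): Promise<string> {
  let pvcCreated = false;
  let jobCreated = false;
  try {
    await client.createPvc(buildPvcManifest(opts));
    pvcCreated = true;
    await client.createJob(buildJobManifest(opts));
    jobCreated = true;
    await client.waitForJob(opts.namespace, opts.name);
    return await client.fetchKbenchLogs(opts.namespace, opts.name);
  } finally {
    // Delete the Job first so its pod releases the PVC.
    if (jobCreated) await client.deleteJob(opts.namespace, opts.name);
    if (pvcCreated) await client.deletePvc(opts.namespace, opts.name);
  }
}
```

If `deleteJob` itself throws, `deletePvc` is skipped in this sketch and the PVC is left behind, which is the orphan case that a label-based lookup is meant to catch.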
+
+---
+
+## Decision
+
+The plugin is read-only for all storage observability features. The sole exception is the Benchmark feature, which creates and deletes temporary PVC + Job resources. All created resources are labeled for identification and cleaned up after benchmark completion. The benchmark is triggered explicitly by user action (button on StorageClass detail page via `registerDetailsViewHeaderAction`).
+
+---
+
+## Consequences
+
+- ✅ Minimal RBAC requirements for normal operation (read-only)
+- ✅ Benchmark is opt-in and requires explicit user action
+- ✅ Resources are auto-cleaned after benchmark completion
+- ✅ `managed-by` label enables easy identification of plugin-created resources
+- ⚠️ Requires additional RBAC permissions (create/delete Jobs and PVCs) for benchmark feature
+- ⚠️ Failed cleanup leaves orphaned resources — mitigated by `listKbenchJobs()` which finds orphaned resources by label for manual cleanup
+
+---
+
+## Alternatives Considered
+
+1. **No benchmark feature (fully read-only)** — Rejected. Storage performance testing is a key use case for storage administrators evaluating CSI drivers.
+
+2. **External benchmark tool with results import** — Rejected. Poor user experience requiring context-switching between tools.
+
+3. **Benchmark as a separate plugin** — Rejected. Benchmark results are tied to storage class context and benefit from shared data in the plugin.
+
+---
+
+## Changelog
+
+| Date | Change |
+|------|--------|
+| 2026-03-05 | Initial decision |
@@ -0,0 +1,55 @@
+# ADR 003: Graceful Degradation for Optional CRDs
+
+**Status**: Accepted
+
+**Date**: 2026-03-05
+
+**Deciders**: Development Team
+
+---
+
+## Context
+
+The plugin uses VolumeSnapshot and VolumeSnapshotClass CRDs from `snapshot.storage.k8s.io/v1`. These CRDs are part of the Kubernetes Volume Snapshot feature, which is optional — not all clusters have the snapshot controller installed.
+
+The plugin should work on clusters without snapshot support: storage classes, volumes, metrics, and benchmarks must remain usable even when snapshot data is unavailable. The CRD fetch is wrapped in `try/catch`; if it fails, the `snapshotCrdAvailable` flag is set to `false`.
+
+---
+
+## Decision
+
+Implement graceful degradation for optional CRDs. The snapshot API calls are wrapped in `try/catch` within the data context. When the snapshot CRDs are not installed:
+
+- `snapshotCrdAvailable` is set to `false`
+- Snapshot-related data arrays are empty
+- The Snapshots page shows an informational message rather than an error
+- All other plugin features remain fully functional
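
The degradation path described above can be sketched as a single function. This is an illustration under assumptions: `RequestFn` is a hypothetical stand-in for `ApiProxy.request()`, and `fetchSnapshotData` and `SnapshotData` are invented names. The two list paths are the standard Kubernetes REST paths for the `snapshot.storage.k8s.io/v1` group.

```typescript
// Hypothetical request function standing in for ApiProxy.request().
type RequestFn = (path: string) => Promise<{ items?: object[] }>;

interface SnapshotData {
  snapshotCrdAvailable: boolean;
  volumeSnapshots: object[];
  volumeSnapshotClasses: object[];
}

const SNAPSHOT_API = '/apis/snapshot.storage.k8s.io/v1';

async function fetchSnapshotData(request: RequestFn): Promise<SnapshotData> {
  try {
    const [snapshots, classes] = await Promise.all([
      request(`${SNAPSHOT_API}/volumesnapshots`),
      request(`${SNAPSHOT_API}/volumesnapshotclasses`),
    ]);
    return {
      snapshotCrdAvailable: true,
      volumeSnapshots: snapshots.items ?? [],
      volumeSnapshotClasses: classes.items ?? [],
    };
  } catch {
    // Any failure (missing CRD, RBAC denial, network error) degrades to "unavailable".
    return { snapshotCrdAvailable: false, volumeSnapshots: [], volumeSnapshotClasses: [] };
  }
}
```

Because the `catch` does not inspect the error, this sketch cannot tell a missing CRD from a permission problem; the caller only sees `snapshotCrdAvailable: false`.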
+
+---
+
+## Consequences
+
+- ✅ Plugin works on clusters without snapshot CRDs installed
+- ✅ No error state for missing optional features — clean informational messaging
+- ✅ Clear user feedback about what features are available
+- ✅ Core features (volumes, storage classes, metrics, benchmarks) always work
+- ⚠️ Two code paths (with/without snapshots) to maintain and test
+- ⚠️ Any fetch failure sets `snapshotCrdAvailable` to `false`, so failures with other causes (e.g., RBAC denial) are indistinguishable from missing CRDs
+
+---
+
+## Alternatives Considered
+
+1. **Require snapshot CRDs (hard dependency)** — Rejected. Too restrictive; many clusters do not have the snapshot controller installed.
+
+2. **Feature detection via API discovery before fetching** — Rejected. `try/catch` on the actual fetch is simpler and also covers failures that discovery alone would not reveal, such as RBAC restrictions.
+
+3. **Disable snapshots page entirely when CRDs missing** — Rejected. Showing an informational message explaining how to enable snapshots is better UX than silently hiding the page.
+
+---
+
+## Changelog
+
+| Date | Change |
+|------|--------|
+| 2026-03-05 | Initial decision |
@@ -0,0 +1,54 @@
+# ADR 004: URL Hash-Based Detail Panel State
+
+**Status**: Accepted
+
+**Date**: 2026-03-05
+
+**Deciders**: Development Team
+
+---
+
+## Context
+
+Several pages need to show detail panels for selected resources (e.g., clicking a PVC row shows PVC details). The detail panel state (which resource is selected) needs to be shareable via URL and survive page refresh. Options include:
+
+- **React state** — Lost on refresh, not shareable
+- **URL query parameters** — May cause full page reload, potential conflicts with Headlamp routing
+- **URL hash fragments** — Client-side only, no reload, compatible with SPA routing
+
+---
+
+## Decision
+
+Use URL hash fragments to encode detail panel state. When a user selects a resource, the hash is updated (e.g., `#pvc/namespace/name`). On page load, the hash is parsed to restore the selected resource. This enables deep-linking to specific resource details and browser back/forward navigation.
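
A minimal sketch of the hash encoding and parsing described above, assuming the three-segment `#kind/namespace/name` form from the example. The names `DetailSelection`, `formatDetailHash`, and `parseDetailHash` are hypothetical, and how cluster-scoped resources such as PVs are encoded is not covered by the ADR, so it is left out.

```typescript
interface DetailSelection {
  kind: string; // e.g. 'pvc'
  namespace: string;
  name: string;
}

// Builds a hash such as '#pvc/default/data-0'.
function formatDetailHash(sel: DetailSelection): string {
  const parts = [sel.kind, sel.namespace, sel.name].map(part => encodeURIComponent(part));
  return '#' + parts.join('/');
}

// Returns null for an empty or malformed hash so callers can close the panel.
function parseDetailHash(hash: string): DetailSelection | null {
  const raw = hash.charAt(0) === '#' ? hash.slice(1) : hash;
  const parts = raw.split('/');
  if (parts.length !== 3) return null;
  try {
    const [kind, namespace, name] = parts.map(part => decodeURIComponent(part));
    if (!kind || !namespace || !name) return null;
    return { kind, namespace, name };
  } catch {
    return null; // decodeURIComponent throws URIError on malformed escapes
  }
}
```

In a browser, a component could call `parseDetailHash(window.location.hash)` on mount and again from a `hashchange` listener registered with `window.addEventListener`, which keeps React state in sync with back/forward navigation.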
+
+---
+
+## Consequences
+
+- ✅ Deep-linkable resource details — users can share URLs pointing to specific resources
+- ✅ Survives page refresh without losing selected resource
+- ✅ Browser back/forward navigation works naturally
+- ✅ No server round-trip — hash changes are purely client-side
+- ✅ Compatible with Headlamp's client-side routing
+- ⚠️ Hash-based state is not a standard React pattern — requires team familiarity
+- ⚠️ Requires manual hash parsing and updating logic
+- ⚠️ Hash changes don't trigger React re-renders by default — requires `hashchange` event listener
+
+---
+
+## Alternatives Considered
+
+1. **React state only** — Rejected. State is lost on refresh and cannot be shared via URL.
+
+2. **URL query parameters** — Rejected. May conflict with Headlamp's routing and could trigger unintended navigation behavior.
+
+3. **Separate detail routes** — Rejected. Too heavyweight for inline detail panels; would require full page transitions for what should be a panel toggle.
+
+---
+
+## Changelog
+
+| Date | Change |
+|------|--------|
+| 2026-03-05 | Initial decision |
@@ -0,0 +1,53 @@
+# ADR 005: Prometheus Metrics via Pod Proxy
+
+**Status**: Accepted
+
+**Date**: 2026-03-05
+
+**Deciders**: Development Team
+
+---
+
+## Context
+
+The plugin displays CSI driver metrics (operation latencies, error rates, volume stats). The CSI driver pods expose a Prometheus metrics endpoint on port 8080 in the standard text exposition format. The plugin needs to fetch and parse these metrics. Options:
+
+- **Query a Prometheus server** — Requires Prometheus to be installed in the cluster
+- **Scrape the pod directly via Kubernetes pod proxy** — No additional dependencies
+- **Use a metrics aggregation service** — Requires additional infrastructure
+
+---
+
+## Decision
+
+Fetch metrics directly from the CSI driver pod's `/metrics` endpoint via Kubernetes pod proxy (`ApiProxy.request` to `/api/v1/namespaces/{ns}/pods/{pod}:8080/proxy/metrics`). Parse the Prometheus text exposition format in-browser using a custom parser in `metrics.ts`. No dependency on a Prometheus server installation.
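
To illustrate the in-browser parsing step, here is a small sketch of a parser for the Prometheus text exposition format. It is not the code in `metrics.ts`: `MetricSample`, `metricsProxyPath`, and `parsePrometheusText` are hypothetical names, and the sketch ignores `# HELP` / `# TYPE` metadata and silently skips lines it cannot match.

```typescript
interface MetricSample {
  name: string;
  labels: { [key: string]: string };
  value: number;
}

// Pod proxy path in the form given in the ADR.
function metricsProxyPath(namespace: string, pod: string, port: number = 8080): string {
  return `/api/v1/namespaces/${namespace}/pods/${pod}:${port}/proxy/metrics`;
}

function parseValue(raw: string): number {
  if (raw === '+Inf') return Infinity;
  if (raw === '-Inf') return -Infinity;
  return Number(raw); // 'NaN' parses to NaN
}

function parseLabels(raw: string): { [key: string]: string } {
  const labels: { [key: string]: string } = {};
  const pair = /([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"/g;
  let m: RegExpExecArray | null;
  while ((m = pair.exec(raw)) !== null) {
    // Undo the three escapes the format defines: \\, \", and \n.
    labels[m[1]] = m[2].replace(/\\(["\\n])/g, (_all: string, c: string) =>
      c === 'n' ? '\n' : c
    );
  }
  return labels;
}

function parsePrometheusText(text: string): MetricSample[] {
  const samples: MetricSample[] = [];
  // name, optional {labels}, value, optional integer timestamp
  const line = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$/;
  for (const rawLine of text.split('\n')) {
    const trimmed = rawLine.trim();
    if (trimmed === '' || trimmed.charAt(0) === '#') continue; // blank, HELP, or TYPE
    const m = line.exec(trimmed);
    if (m === null) continue; // skip lines this sketch does not understand
    samples.push({ name: m[1], labels: parseLabels(m[2] ?? ''), value: parseValue(m[3]) });
  }
  return samples;
}
```

The response body fetched from `metricsProxyPath(ns, pod)` would be passed to `parsePrometheusText` as a string; each call reflects only that one pod at that moment.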
+
+---
+
+## Consequences
+
+- ✅ Works without Prometheus server installed — no additional infrastructure dependency
+- ✅ Read directly from the source with no scrape or aggregation delay, so values reflect the pod's state at fetch time
+- ✅ Leverages existing Kubernetes API authentication and authorization
+- ✅ No additional service dependencies to configure or maintain
+- ⚠️ Custom Prometheus text format parser to maintain — mitigated by the parser being well-tested
+- ⚠️ Only gets metrics from one pod at a time (no aggregation across replicas) — acceptable since CSI controller typically runs one replica
+- ⚠️ No historical data (point-in-time only) — users needing historical trends should use a full Prometheus setup
+
+---
+
+## Alternatives Considered
+
+1. **Query Prometheus server via service proxy** (like the intel-gpu plugin) — Rejected. Would require Prometheus to be installed, adding a hard infrastructure dependency.
+
+2. **Use a metrics library (prom-client) for parsing** — Rejected. Adds a runtime dependency for a relatively simple parsing task.
+
+3. **JSON metrics endpoint instead of Prometheus format** — Rejected. The CSI driver only exposes Prometheus text format; a JSON endpoint would require changes to the driver itself.
+
+---
+
+## Changelog
+
+| Date | Change |
+|------|--------|
+| 2026-03-05 | Initial decision |
@@ -0,0 +1,44 @@
+# Architecture Decision Records
+
+## What is an ADR?
+
+An Architecture Decision Record (ADR) captures an important architectural decision made along with its context and consequences. ADRs are a lightweight way to document the "why" behind technical choices, ensuring that future contributors understand the reasoning behind the current architecture.
+
+## Format
+
+This project uses the [Nygard-style ADR format](https://cognitect.com/blog/2011/11/15/documenting-architecture-decisions), extended with Date, Deciders, Alternatives Considered, and Changelog fields:
+
+- **Title**: Short noun phrase describing the decision
+- **Status**: Proposed | Accepted | Deprecated | Superseded
+- **Date**: When the decision was made
+- **Deciders**: Who made the decision
+- **Context**: What is the issue that we're seeing that motivates this decision?
+- **Decision**: What is the change that we're proposing and/or doing?
+- **Consequences**: What becomes easier or more difficult to do because of this change?
+- **Alternatives Considered**: What other options were evaluated?
+- **Changelog**: Dated history of changes to the record
+
+## Index
+
+| ADR | Title | Status | Date |
+|-----|-------|--------|------|
+| [001](001-react-context-state.md) | React Context for Shared CSI Driver State | Accepted | 2026-03-05 |
+| [002](002-read-only-benchmark-exception.md) | Read-Only Plugin with Benchmark Exception | Accepted | 2026-03-05 |
+| [003](003-optional-crd-degradation.md) | Graceful Degradation for Optional CRDs | Accepted | 2026-03-05 |
+| [004](004-url-hash-detail-panels.md) | URL Hash-Based Detail Panel State | Accepted | 2026-03-05 |
+| [005](005-prometheus-pod-proxy.md) | Prometheus Metrics via Pod Proxy | Accepted | 2026-03-05 |
+
+## Creating New ADRs
+
+1. Copy an existing ADR as a template
+2. Assign the next sequential number (e.g., `006-your-title.md`)
+3. Fill in all sections: Status, Date, Deciders, Context, Decision, Consequences, Alternatives Considered, Changelog
+4. Set the status to `Proposed` until reviewed
+5. Update this README index table
+6. Submit as part of a pull request for review
+
+ADRs should not be deleted. If a decision is reversed, create a new ADR that supersedes the old one and update the old ADR's status to `Superseded by [ADR NNN](NNN-title.md)`.
+
+## References
+
+- [Michael Nygard - Documenting Architecture Decisions](https://cognitect.com/blog/2011/11/15/documenting-architecture-decisions)
+- [ADR GitHub Organization](https://adr.github.io/)
+- [Joel Parker Henderson - Architecture Decision Record](https://github.com/joelparkerhenderson/architecture-decision-record)