docs: add architecture decision records

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
DevContainer User
2026-03-05 13:49:59 +00:00
parent 076fa29995
commit d39a48a7d0
6 changed files with 325 additions and 0 deletions
@@ -0,0 +1,59 @@
# ADR 001: React Context for Shared CSI Driver State
**Status**: Accepted
**Date**: 2026-03-05
**Deciders**: Development Team
---
## Context
The TNS CSI plugin needs to share data across multiple views: Overview, StorageClasses, Volumes, Snapshots, Metrics, and Benchmark pages, plus detail view sections for PVC, PV, and Pod. Data comes from three tracks:
1. **Headlamp `useList()` hooks** — StorageClass, PersistentVolume, PersistentVolumeClaim
2. **`ApiProxy.request()`** — CSIDriver resource, controller/node pods, VolumeSnapshotClasses, and VolumeSnapshots
3. **TrueNAS WebSocket API** — Pool capacity stats (optional, when API key is configured in settings)
The context exposes: `csiDriver`, `driverInstalled`, `storageClasses`, `persistentVolumes`, `persistentVolumeClaims`, `controllerPods`, `nodePods`, `volumeSnapshots`, `volumeSnapshotClasses`, `snapshotCrdAvailable`, `poolStats`, `poolStatsError`, `loading`, `error`, `refresh`.
---
## Decision
Use a single `TnsCsiDataProvider` React Context that wraps all routes and fetches data on three tracks:
1. `useList()` for standard Kubernetes resources (StorageClass, PV, PVC)
2. `ApiProxy.request()` in `useEffect` for CSI-specific resources and snapshots
3. TrueNAS WebSocket client for pool capacity stats (only when API key is configured in settings)
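As a rough illustration of how the three tracks might be folded into one shared value, the sketch below uses plain TypeScript without React. The `TrackResult` type and the `combineTracks` helper are hypothetical names, only a subset of the context fields is shown, and the real `TnsCsiDataProvider` may be structured differently.

```typescript
// Hypothetical sketch: merging three fetch tracks into one shared value.
// TrackResult and combineTracks are illustrative names, not the plugin's API.

interface TrackResult<T> {
  data: T | null;       // null while the track has not produced data
  error: string | null; // set when the track failed
}

interface SharedState {
  storageClasses: string[];
  controllerPods: string[];
  poolStats: Record<string, number> | null;
  poolStatsError: string | null;
  loading: boolean;
  error: string | null;
}

function combineTracks(
  k8sLists: TrackResult<string[]>,                      // track 1: useList() results
  csiResources: TrackResult<string[]>,                  // track 2: ApiProxy.request() results
  truenas: TrackResult<Record<string, number>> | null   // track 3: null when no API key is set
): SharedState {
  // Assumption: only the two Kubernetes tracks gate the loading flag,
  // because the TrueNAS track is opt-in.
  const loading =
    (k8sLists.data === null && k8sLists.error === null) ||
    (csiResources.data === null && csiResources.error === null);

  return {
    storageClasses: k8sLists.data ?? [],
    controllerPods: csiResources.data ?? [],
    poolStats: truenas?.data ?? null,
    // Assumption: a TrueNAS failure is kept apart from the main error field.
    poolStatsError: truenas?.error ?? null,
    loading,
    error: k8sLists.error ?? csiResources.error,
  };
}
```

Keeping `poolStatsError` apart from `error` in this sketch mirrors the separate fields the context exposes, so a TrueNAS outage would not have to block the Kubernetes-backed views.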
---
## Consequences
- ✅ Single fetch point eliminates duplicate API calls
- ✅ All views share consistent data — no stale data across pages
- ✅ Three-track strategy handles different API requirements cleanly
- ✅ TrueNAS integration is opt-in — plugin works without it
- ⚠️ Large context with many fields increases cognitive overhead
- ⚠️ TrueNAS WebSocket adds complexity to the data layer
- ⚠️ All consumers re-render on any data change — mitigated by infrequent updates (polling interval)
---
## Alternatives Considered
1. **Separate contexts per data domain** — Rejected. Data is cross-referenced (PVCs filter by StorageClass provisioner), so splitting contexts would require cross-context coordination.
2. **Custom hooks without context** — Rejected. Would duplicate fetches across 6 pages, leading to redundant API calls and inconsistent data.
3. **Redux/Zustand** — Rejected. Not available in the Headlamp plugin environment.
---
## Changelog
| Date | Change |
|------|--------|
| 2026-03-05 | Initial decision |
@@ -0,0 +1,60 @@
# ADR 002: Read-Only Plugin with Benchmark Exception
**Status**: Accepted
**Date**: 2026-03-05
**Deciders**: Development Team
---
## Context
The plugin is primarily a read-only observability tool for TNS CSI storage. However, it includes a Benchmark feature that runs kbench (FIO-based storage benchmarks) against storage classes. Running benchmarks requires creating temporary Kubernetes resources: a PVC for the test volume and a Job running the kbench container.
These resources are tagged with `app.kubernetes.io/managed-by=headlamp-tns-csi-plugin` for lifecycle tracking. The benchmark workflow includes:
1. `buildPvcManifest()` — Create PVC spec for test volume
2. `createPvc()` — Create the PVC in the cluster
3. `buildJobManifest()` — Create Job spec for kbench container
4. `createJob()` — Create the Job in the cluster
5. Poll for Job completion
6. `fetchKbenchLogs()` — Retrieve benchmark output from pod logs
7. `parseKbenchLog()` — Parse FIO results from kbench output
8. `deleteJob()` — Clean up the benchmark Job
9. `deletePvc()` — Clean up the test PVC
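The workflow above can be sketched as a single async function with cleanup in a `finally` block. The function names come from the list, but every signature below is an assumption, as are the `BenchmarkOps` interface and the `waitForJob` placeholder that stands in for the polling step.

```typescript
// Hypothetical sketch of the benchmark lifecycle. Function names follow the
// workflow steps; all signatures and the BenchmarkOps interface are assumed.

interface BenchmarkOps {
  buildPvcManifest(storageClass: string): object;
  createPvc(manifest: object): Promise<void>;
  buildJobManifest(pvcManifest: object): object;
  createJob(manifest: object): Promise<void>;
  waitForJob(): Promise<void>; // placeholder for "poll for Job completion"
  fetchKbenchLogs(): Promise<string>;
  parseKbenchLog(log: string): Record<string, number>;
  deleteJob(): Promise<void>;
  deletePvc(): Promise<void>;
}

async function runBenchmark(
  ops: BenchmarkOps,
  storageClass: string
): Promise<Record<string, number>> {
  const pvc = ops.buildPvcManifest(storageClass);
  await ops.createPvc(pvc);
  let jobCreated = false;
  try {
    await ops.createJob(ops.buildJobManifest(pvc));
    jobCreated = true;
    await ops.waitForJob();
    return ops.parseKbenchLog(await ops.fetchKbenchLogs());
  } finally {
    // Clean up in reverse order of creation, even when a step above throws.
    if (jobCreated) await ops.deleteJob();
    await ops.deletePvc();
  }
}
```

In this sketch a failure inside `deleteJob()` would skip `deletePvc()`, which is one way the orphaned-resource case could arise.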
---
## Decision
The plugin is read-only for all storage observability features. The sole exception is the Benchmark feature, which creates and deletes temporary PVC + Job resources. All created resources are labeled for identification and cleaned up after benchmark completion. The benchmark is triggered explicitly by user action (button on StorageClass detail page via `registerDetailsViewHeaderAction`).
---
## Consequences
- ✅ Minimal RBAC requirements for normal operation (read-only)
- ✅ Benchmark is opt-in and requires explicit user action
- ✅ Resources are auto-cleaned after benchmark completion
- ✅ `managed-by` label enables easy identification of plugin-created resources
- ⚠️ Requires additional RBAC permissions (create/delete Jobs and PVCs) for benchmark feature
- ⚠️ Failed cleanup leaves orphaned resources — mitigated by `listKbenchJobs()` which finds orphaned resources by label for manual cleanup
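For illustration, finding plugin-created Jobs by label amounts to a list request with a `labelSelector` query parameter. The sketch below builds such a path for the `batch/v1` Jobs API. The label key and value are the ones named in this ADR; the `managedJobsPath` helper is hypothetical, and how `listKbenchJobs()` actually issues its query is not documented here.

```typescript
// Label key and value are taken from the ADR. The helper name is illustrative.
const MANAGED_BY_LABEL = "app.kubernetes.io/managed-by";
const MANAGED_BY_VALUE = "headlamp-tns-csi-plugin";

// Builds a Kubernetes API path that lists Jobs carrying the managed-by label.
function managedJobsPath(namespace: string): string {
  const selector = encodeURIComponent(`${MANAGED_BY_LABEL}=${MANAGED_BY_VALUE}`);
  return `/apis/batch/v1/namespaces/${encodeURIComponent(namespace)}/jobs?labelSelector=${selector}`;
}
```

The same selector would apply to PVCs under `/api/v1/namespaces/{ns}/persistentvolumeclaims`.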
---
## Alternatives Considered
1. **No benchmark feature (fully read-only)** — Rejected. Storage performance testing is a key use case for storage administrators evaluating CSI drivers.
2. **External benchmark tool with results import** — Rejected. Poor user experience requiring context-switching between tools.
3. **Benchmark as a separate plugin** — Rejected. Benchmark results are tied to storage class context and benefit from shared data in the plugin.
---
## Changelog
| Date | Change |
|------|--------|
| 2026-03-05 | Initial decision |
@@ -0,0 +1,55 @@
# ADR 003: Graceful Degradation for Optional CRDs
**Status**: Accepted
**Date**: 2026-03-05
**Deciders**: Development Team
---
## Context
The plugin uses VolumeSnapshot and VolumeSnapshotClass CRDs from `snapshot.storage.k8s.io/v1`. These CRDs are part of the Kubernetes Volume Snapshot feature, which is optional — not all clusters have the snapshot controller installed.
The plugin should work on clusters without snapshot support, showing storage classes, volumes, metrics, and benchmarks without the snapshots page. The CRD fetch is wrapped in `try/catch`; if it fails, the `snapshotCrdAvailable` flag is set to `false`.
---
## Decision
Implement graceful degradation for optional CRDs. The snapshot API calls are wrapped in `try/catch` within the data context. When the snapshot CRDs are not installed:
- `snapshotCrdAvailable` is set to `false`
- Snapshot-related data arrays are empty
- The Snapshots page shows an informational message rather than an error
- All other plugin features remain fully functional
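A minimal sketch of this pattern follows, assuming the snapshot fetch is passed in as a function so the example runs without a cluster. The `loadSnapshots` helper and the `SnapshotState` type are hypothetical; only the `snapshotCrdAvailable` flag and the `try/catch` approach come from this ADR.

```typescript
// Hypothetical sketch: the fetch function is injected so that the example
// does not depend on Headlamp or a live cluster.
interface SnapshotState {
  snapshotCrdAvailable: boolean;
  volumeSnapshots: unknown[];
}

async function loadSnapshots(
  fetchSnapshots: () => Promise<{ items: unknown[] }>
): Promise<SnapshotState> {
  try {
    const list = await fetchSnapshots();
    return { snapshotCrdAvailable: true, volumeSnapshots: list.items };
  } catch {
    // Any failure is treated as "snapshots unavailable"; the caller shows an
    // informational message instead of an error.
    return { snapshotCrdAvailable: false, volumeSnapshots: [] };
  }
}
```

Because the `catch` does not inspect the error, this sketch cannot tell a missing CRD apart from a permission failure.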
---
## Consequences
- ✅ Plugin works on clusters without snapshot CRDs installed
- ✅ No error state for missing optional features — clean informational messaging
- ✅ Clear user feedback about what features are available
- ✅ Core features (volumes, storage classes, metrics, benchmarks) always work
- ⚠️ Two code paths (with/without snapshots) to maintain and test
- ⚠️ Failures other than missing CRDs (e.g., RBAC denials) are also swallowed, so snapshots can appear unavailable without any error being surfaced
---
## Alternatives Considered
1. **Require snapshot CRDs (hard dependency)** — Rejected. Too restrictive; many clusters do not have the snapshot controller installed.
2. **Feature detection via API discovery before fetching** — Considered, but `try/catch` on the actual fetch is simpler and catches all failure modes including RBAC restrictions.
3. **Disable snapshots page entirely when CRDs missing** — Rejected. Showing an informational message explaining how to enable snapshots is better UX than silently hiding the page.
---
## Changelog
| Date | Change |
|------|--------|
| 2026-03-05 | Initial decision |
@@ -0,0 +1,54 @@
# ADR 004: URL Hash-Based Detail Panel State
**Status**: Accepted
**Date**: 2026-03-05
**Deciders**: Development Team
---
## Context
Several pages need to show detail panels for selected resources (e.g., clicking a PVC row shows PVC details). The detail panel state (which resource is selected) needs to be shareable via URL and survive page refresh. Options include:
- **React state** — Lost on refresh, not shareable
- **URL query parameters** — May cause full page reload, potential conflicts with Headlamp routing
- **URL hash fragments** — Client-side only, no reload, compatible with SPA routing
---
## Decision
Use URL hash fragments to encode detail panel state. When a user selects a resource, the hash is updated (e.g., `#pvc/namespace/name`). On page load, the hash is parsed to restore the selected resource. This enables deep-linking to specific resource details and browser back/forward navigation.
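The encoding and parsing could look roughly like the sketch below, which handles only the three-segment `#kind/namespace/name` form from the example. The `PanelSelection` type and both helper names are hypothetical, and the format for cluster-scoped resources such as PVs is not specified in this ADR.

```typescript
// Hypothetical helpers for the "#kind/namespace/name" hash format.
interface PanelSelection {
  kind: string;
  namespace: string;
  name: string;
}

// Encodes a selection into a hash string such as "#pvc/default/data-0".
function formatPanelHash(sel: PanelSelection): string {
  const parts = [sel.kind, sel.namespace, sel.name];
  return "#" + parts.map(p => encodeURIComponent(p)).join("/");
}

// Parses a hash string back into a selection, or returns null when the
// hash is empty or does not have exactly three non-empty segments.
function parsePanelHash(hash: string): PanelSelection | null {
  const parts = hash.replace(/^#/, "").split("/");
  if (parts.length !== 3 || parts.some(p => p === "")) return null;
  try {
    const [kind, namespace, name] = parts.map(p => decodeURIComponent(p));
    return { kind, namespace, name };
  } catch {
    return null; // malformed percent-encoding
  }
}
```

A component would typically call `parsePanelHash(window.location.hash)` on mount and again from a `hashchange` listener.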
---
## Consequences
- ✅ Deep-linkable resource details — users can share URLs pointing to specific resources
- ✅ Survives page refresh without losing selected resource
- ✅ Browser back/forward navigation works naturally
- ✅ No server round-trip — hash changes are purely client-side
- ✅ Compatible with Headlamp's client-side routing
- ⚠️ Hash-based state is not a standard React pattern — requires team familiarity
- ⚠️ Requires manual hash parsing and updating logic
- ⚠️ Hash changes don't trigger React re-renders by default — requires `hashchange` event listener
---
## Alternatives Considered
1. **React state only** — Rejected. State is lost on refresh and cannot be shared via URL.
2. **URL query parameters** — Rejected. May conflict with Headlamp's routing and could trigger unintended navigation behavior.
3. **Separate detail routes** — Rejected. Too heavyweight for inline detail panels; would require full page transitions for what should be a panel toggle.
---
## Changelog
| Date | Change |
|------|--------|
| 2026-03-05 | Initial decision |
@@ -0,0 +1,53 @@
# ADR 005: Prometheus Metrics via Pod Proxy
**Status**: Accepted
**Date**: 2026-03-05
**Deciders**: Development Team
---
## Context
The plugin displays CSI driver metrics (operation latencies, error rates, volume stats). The CSI driver pods expose a Prometheus metrics endpoint on port 8080 in the standard text exposition format. The plugin needs to fetch and parse these metrics. Options:
- **Query a Prometheus server** — Requires Prometheus to be installed in the cluster
- **Scrape the pod directly via Kubernetes pod proxy** — No additional dependencies
- **Use a metrics aggregation service** — Requires additional infrastructure
---
## Decision
Fetch metrics directly from the CSI driver pod's `/metrics` endpoint via Kubernetes pod proxy (`ApiProxy.request` to `/api/v1/namespaces/{ns}/pods/{pod}:8080/proxy/metrics`). Parse the Prometheus text exposition format in-browser using a custom parser in `metrics.ts`. No dependency on a Prometheus server installation.
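To make the approach concrete, here is a small sketch of a proxy path builder and a text-format parser. It is not the parser in `metrics.ts`: the `Sample` type and all function names are hypothetical, and the sketch skips `# HELP` and `# TYPE` lines, ignores timestamps, and leaves escape sequences in label values as written.

```typescript
// Hypothetical sketch, not the plugin's metrics.ts.
interface Sample {
  name: string;
  labels: Record<string, string>;
  value: number;
}

// Builds the pod proxy path from the ADR; port 8080 is the documented default.
function metricsProxyPath(namespace: string, pod: string, port = 8080): string {
  return `/api/v1/namespaces/${namespace}/pods/${pod}:${port}/proxy/metrics`;
}

// Prometheus writes infinities as "+Inf" and "-Inf", which Number() rejects.
function parseValue(raw: string): number {
  if (raw === "+Inf" || raw === "Inf") return Infinity;
  if (raw === "-Inf") return -Infinity;
  return Number(raw); // "NaN" and malformed input both yield NaN
}

function parsePrometheusText(text: string): Sample[] {
  const samples: Sample[] = [];
  const lineRe = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)/;
  for (const raw of text.split("\n")) {
    const line = raw.trim();
    if (line === "" || line.startsWith("#")) continue; // blanks and comments
    const m = lineRe.exec(line);
    if (m === null) continue; // skip lines that are not samples
    const labels: Record<string, string> = {};
    // A fresh regex per line keeps lastIndex from leaking between lines.
    const labelRe = /([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"/g;
    let lm: RegExpExecArray | null;
    while ((lm = labelRe.exec(m[2] ?? "")) !== null) {
      labels[lm[1]] = lm[2];
    }
    samples.push({ name: m[1], labels, value: parseValue(m[3]) });
  }
  return samples;
}
```

The text returned by a request to `metricsProxyPath(ns, pod)` would be passed to `parsePrometheusText`. The metric names in any test input are placeholders, since this ADR does not list the driver's actual metric names.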
---
## Consequences
- ✅ Works without Prometheus server installed — no additional infrastructure dependency
- ✅ Direct from source with no aggregation delay — metrics are always current
- ✅ Leverages existing Kubernetes API authentication and authorization
- ✅ No additional service dependencies to configure or maintain
- ⚠️ Custom Prometheus text format parser to maintain — mitigated by the parser being well-tested
- ⚠️ Only gets metrics from one pod at a time (no aggregation across replicas) — acceptable since CSI controller typically runs one replica
- ⚠️ No historical data (point-in-time only) — users needing historical trends should use a full Prometheus setup
---
## Alternatives Considered
1. **Query Prometheus server via service proxy** (like the intel-gpu plugin) — Rejected. Would require Prometheus to be installed, adding a hard infrastructure dependency.
2. **Use a metrics library (prom-client) for parsing** — Rejected. Adds a runtime dependency for a relatively simple parsing task.
3. **JSON metrics endpoint instead of Prometheus format** — Rejected. The CSI driver only exposes Prometheus text format; a JSON endpoint would require changes to the driver itself.
---
## Changelog
| Date | Change |
|------|--------|
| 2026-03-05 | Initial decision |
@@ -0,0 +1,44 @@
# Architecture Decision Records
## What is an ADR?
An Architecture Decision Record (ADR) captures an important architectural decision made along with its context and consequences. ADRs are a lightweight way to document the "why" behind technical choices, ensuring that future contributors understand the reasoning behind the current architecture.
## Format
This project uses the [Nygard-style ADR format](https://cognitect.com/blog/2011/11/15/documenting-architecture-decisions):
- **Title**: Short noun phrase describing the decision
- **Status**: Proposed | Accepted | Deprecated | Superseded
- **Date**: When the decision was made
- **Context**: What is the issue that we're seeing that motivates this decision?
- **Decision**: What is the change that we're proposing and/or doing?
- **Consequences**: What becomes easier or more difficult to do because of this change?
- **Alternatives Considered**: What other options were evaluated?
## Index
| ADR | Title | Status | Date |
|-----|-------|--------|------|
| [001](001-react-context-state.md) | React Context for Shared CSI Driver State | Accepted | 2026-03-05 |
| [002](002-read-only-benchmark-exception.md) | Read-Only Plugin with Benchmark Exception | Accepted | 2026-03-05 |
| [003](003-optional-crd-degradation.md) | Graceful Degradation for Optional CRDs | Accepted | 2026-03-05 |
| [004](004-url-hash-detail-panels.md) | URL Hash-Based Detail Panel State | Accepted | 2026-03-05 |
| [005](005-prometheus-pod-proxy.md) | Prometheus Metrics via Pod Proxy | Accepted | 2026-03-05 |
## Creating New ADRs
1. Copy an existing ADR as a template
2. Assign the next sequential number (e.g., `006-your-title.md`)
3. Fill in all sections: Status, Date, Context, Decision, Consequences, Alternatives
4. Set the status to `Proposed` until reviewed
5. Update this README index table
6. Submit as part of a pull request for review
ADRs should not be deleted. If a decision is reversed, create a new ADR that supersedes the old one and update the old ADR's status to `Superseded by [ADR NNN](NNN-title.md)`.
## References
- [Michael Nygard - Documenting Architecture Decisions](https://cognitect.com/blog/2011/11/15/documenting-architecture-decisions)
- [ADR GitHub Organization](https://adr.github.io/)
- [Joel Parker Henderson - Architecture Decision Record](https://github.com/joelparkerhenderson/architecture-decision-record)