docs: add architecture decision records
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -0,0 +1,59 @@
# ADR 001: React Context for Shared CSI Driver State

**Status**: Accepted

**Date**: 2026-03-05

**Deciders**: Development Team

---

## Context

The TNS CSI plugin needs to share data across multiple views: Overview, StorageClasses, Volumes, Snapshots, Metrics, and Benchmark pages, plus detail view sections for PVC, PV, and Pod. Data comes from three tracks:

1. **Headlamp `useList()` hooks** — StorageClass, PersistentVolume, PersistentVolumeClaim
2. **`ApiProxy.request()`** — CSIDriver resource, controller/node pods, VolumeSnapshotClasses, and VolumeSnapshots
3. **TrueNAS WebSocket API** — Pool capacity stats (optional, when API key is configured in settings)

The context exposes: `csiDriver`, `driverInstalled`, `storageClasses`, `persistentVolumes`, `persistentVolumeClaims`, `controllerPods`, `nodePods`, `volumeSnapshots`, `volumeSnapshotClasses`, `snapshotCrdAvailable`, `poolStats`, `poolStatsError`, `loading`, `error`, `refresh`.

---

## Decision

Use a single `TnsCsiDataProvider` React Context wrapping all routes. Three-track data fetching:

1. `useList()` for standard Kubernetes resources (StorageClass, PV, PVC)
2. `ApiProxy.request()` in `useEffect` for CSI-specific resources and snapshots
3. TrueNAS WebSocket client for pool capacity stats (only when API key is configured in settings)
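
The way the three tracks could be folded into one context value can be sketched in plain TypeScript, without React. This is an illustration only, not the plugin's code: `TrackResult`, `mergeTracks`, and the reduced `TnsCsiContextValue` shape are hypothetical, and only a few of the context fields are shown. It assumes the optional TrueNAS track never blocks `loading` and never surfaces in `error`, which fits the separate `poolStatsError` field but is not confirmed by the source.

```typescript
// Hypothetical result of one fetch track; `items === null` means "not loaded yet".
interface TrackResult<T> {
  items: T[] | null;
  error: string | null;
}

// Reduced, illustrative subset of the context fields listed in the Context section.
interface TnsCsiContextValue {
  storageClasses: string[];
  controllerPods: string[];
  poolStats: number[] | null;
  poolStatsError: string | null;
  loading: boolean;
  error: string | null;
}

// A track is pending while it has neither data nor an error.
function isPending<T>(track: TrackResult<T>): boolean {
  return track.items === null && track.error === null;
}

function mergeTracks(
  useListTrack: TrackResult<string>,
  apiProxyTrack: TrackResult<string>,
  trueNasTrack: TrackResult<number> | null // null when no API key is configured
): TnsCsiContextValue {
  return {
    storageClasses: useListTrack.items ?? [],
    controllerPods: apiProxyTrack.items ?? [],
    poolStats: trueNasTrack ? trueNasTrack.items : null,
    poolStatsError: trueNasTrack ? trueNasTrack.error : null,
    // Assumption: only the two required tracks gate the loading flag.
    loading: isPending(useListTrack) || isPending(apiProxyTrack),
    // Report the first error from a required track; TrueNAS errors stay separate.
    error: useListTrack.error ?? apiProxyTrack.error,
  };
}
```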

---

## Consequences

- ✅ Single fetch point eliminates duplicate API calls
- ✅ All views share consistent data — no stale data across pages
- ✅ Three-track strategy handles different API requirements cleanly
- ✅ TrueNAS integration is opt-in — plugin works without it
- ⚠️ Large context with many fields increases cognitive overhead
- ⚠️ TrueNAS WebSocket adds complexity to the data layer
- ⚠️ All consumers re-render on any data change — mitigated by infrequent updates (polling interval)

---

## Alternatives Considered

1. **Separate contexts per data domain** — Rejected. Data is cross-referenced (PVCs filter by StorageClass provisioner), so splitting contexts would require cross-context coordination.

2. **Custom hooks without context** — Rejected. Would duplicate fetches across 6 pages, leading to redundant API calls and inconsistent data.

3. **Redux/Zustand** — Rejected. Not available in the Headlamp plugin environment.

---

## Changelog

| Date | Change |
|------|--------|
| 2026-03-05 | Initial decision |
@@ -0,0 +1,60 @@
# ADR 002: Read-Only Plugin with Benchmark Exception

**Status**: Accepted

**Date**: 2026-03-05

**Deciders**: Development Team

---

## Context

The plugin is primarily a read-only observability tool for TNS CSI storage. However, it includes a Benchmark feature that runs kbench (FIO-based storage benchmarks) against storage classes. Running benchmarks requires creating temporary Kubernetes resources: a PVC for the test volume and a Job running the kbench container.

These resources are tagged with `app.kubernetes.io/managed-by=headlamp-tns-csi-plugin` for lifecycle tracking. The benchmark workflow includes:

1. `buildPvcManifest()` — Create PVC spec for test volume
2. `createPvc()` — Create the PVC in the cluster
3. `buildJobManifest()` — Create Job spec for kbench container
4. `createJob()` — Create the Job in the cluster
5. Poll for Job completion
6. `fetchKbenchLogs()` — Retrieve benchmark output from pod logs
7. `parseKbenchLog()` — Parse FIO results from kbench output
8. `deleteJob()` — Clean up the benchmark Job
9. `deletePvc()` — Clean up the test PVC
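
The first step of the workflow attaches the `managed-by` label that later makes cleanup and orphan detection possible. Below is a minimal sketch of what a labeled PVC manifest could look like. The source names `buildPvcManifest()` but does not show its parameters, so the signature here, the `ReadWriteOnce` access mode, and the `managedBySelector()` helper are assumptions made for illustration.

```typescript
const MANAGED_BY_LABEL = 'app.kubernetes.io/managed-by';
const MANAGED_BY_VALUE = 'headlamp-tns-csi-plugin';

// Minimal PVC shape, restricted to the fields this sketch sets.
interface PvcManifest {
  apiVersion: 'v1';
  kind: 'PersistentVolumeClaim';
  metadata: { name: string; namespace: string; labels: { [key: string]: string } };
  spec: {
    accessModes: string[];
    storageClassName: string;
    resources: { requests: { storage: string } };
  };
}

// Hypothetical signature: the real buildPvcManifest() may take different arguments.
function buildPvcManifest(
  name: string,
  namespace: string,
  storageClassName: string,
  size: string
): PvcManifest {
  return {
    apiVersion: 'v1',
    kind: 'PersistentVolumeClaim',
    metadata: {
      name,
      namespace,
      // The label from the Context section, used for lifecycle tracking.
      labels: { [MANAGED_BY_LABEL]: MANAGED_BY_VALUE },
    },
    spec: {
      accessModes: ['ReadWriteOnce'], // assumption, not stated in the source
      storageClassName,
      resources: { requests: { storage: size } },
    },
  };
}

// Label selector string that a list call could use to find plugin-created resources.
function managedBySelector(): string {
  return MANAGED_BY_LABEL + '=' + MANAGED_BY_VALUE;
}
```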

---

## Decision

The plugin is read-only for all storage observability features. The sole exception is the Benchmark feature, which creates and deletes temporary PVC + Job resources. All created resources are labeled for identification and cleaned up after benchmark completion. The benchmark is triggered explicitly by user action (button on StorageClass detail page via `registerDetailsViewHeaderAction`).

---

## Consequences

- ✅ Minimal RBAC requirements for normal operation (read-only)
- ✅ Benchmark is opt-in and requires explicit user action
- ✅ Resources are auto-cleaned after benchmark completion
- ✅ `managed-by` label enables easy identification of plugin-created resources
- ⚠️ Requires additional RBAC permissions (create/delete Jobs and PVCs) for benchmark feature
- ⚠️ Failed cleanup leaves orphaned resources — mitigated by `listKbenchJobs()` which finds orphaned resources by label for manual cleanup

---

## Alternatives Considered

1. **No benchmark feature (fully read-only)** — Rejected. Storage performance testing is a key use case for storage administrators evaluating CSI drivers.

2. **External benchmark tool with results import** — Rejected. Poor user experience requiring context-switching between tools.

3. **Benchmark as a separate plugin** — Rejected. Benchmark results are tied to storage class context and benefit from shared data in the plugin.

---

## Changelog

| Date | Change |
|------|--------|
| 2026-03-05 | Initial decision |
@@ -0,0 +1,55 @@
# ADR 003: Graceful Degradation for Optional CRDs

**Status**: Accepted

**Date**: 2026-03-05

**Deciders**: Development Team

---

## Context

The plugin uses VolumeSnapshot and VolumeSnapshotClass CRDs from `snapshot.storage.k8s.io/v1`. These CRDs are part of the Kubernetes Volume Snapshot feature, which is optional — not all clusters have the snapshot controller installed.

The plugin should work on clusters without snapshot support, showing storage classes, volumes, metrics, and benchmarks without the snapshots page. The CRD fetch is wrapped in `try/catch`; if it fails, the `snapshotCrdAvailable` flag is set to `false`.

---

## Decision

Implement graceful degradation for optional CRDs. The snapshot API calls are wrapped in `try/catch` within the data context. When the snapshot CRDs are not installed:

- `snapshotCrdAvailable` is set to `false`
- Snapshot-related data arrays are empty
- The Snapshots page shows an informational message rather than an error
- All other plugin features remain fully functional
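
The degradation rule above can be sketched as a small wrapper around the snapshot fetch. This is a hedged illustration, not the plugin's implementation: `loadSnapshots` and `SnapshotState` are hypothetical names, and the fetch function is injected as a parameter (standing in for the `ApiProxy.request()` call) so the sketch runs without a cluster.

```typescript
// Hypothetical slice of the context state that relates to snapshots.
interface SnapshotState<T> {
  snapshotCrdAvailable: boolean;
  volumeSnapshots: T[];
}

// `fetchSnapshots` stands in for the real API call; any rejection is treated
// as "snapshot support unavailable" instead of being surfaced as an error.
async function loadSnapshots<T>(
  fetchSnapshots: () => Promise<T[]>
): Promise<SnapshotState<T>> {
  try {
    const items = await fetchSnapshots();
    return { snapshotCrdAvailable: true, volumeSnapshots: items };
  } catch (err) {
    // Missing CRDs, RBAC denials, and network errors all land here,
    // so the caller cannot tell them apart from this result alone.
    return { snapshotCrdAvailable: false, volumeSnapshots: [] };
  }
}
```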

---

## Consequences

- ✅ Plugin works on clusters without snapshot CRDs installed
- ✅ No error state for missing optional features — clean informational messaging
- ✅ Clear user feedback about what features are available
- ✅ Core features (volumes, storage classes, metrics, benchmarks) always work
- ⚠️ Two code paths (with/without snapshots) to maintain and test
- ⚠️ Snapshot data might silently fail for reasons other than missing CRDs (e.g., RBAC issues)

---

## Alternatives Considered

1. **Require snapshot CRDs (hard dependency)** — Rejected. Too restrictive; many clusters do not have the snapshot controller installed.

2. **Feature detection via API discovery before fetching** — Considered, but `try/catch` on the actual fetch is simpler and catches all failure modes including RBAC restrictions.

3. **Disable snapshots page entirely when CRDs missing** — Rejected. Showing an informational message explaining how to enable snapshots is better UX than silently hiding the page.

---

## Changelog

| Date | Change |
|------|--------|
| 2026-03-05 | Initial decision |
@@ -0,0 +1,54 @@
# ADR 004: URL Hash-Based Detail Panel State

**Status**: Accepted

**Date**: 2026-03-05

**Deciders**: Development Team

---

## Context

Several pages need to show detail panels for selected resources (e.g., clicking a PVC row shows PVC details). The detail panel state (which resource is selected) needs to be shareable via URL and survive page refresh. Options include:

- **React state** — Lost on refresh, not shareable
- **URL query parameters** — May cause full page reload, potential conflicts with Headlamp routing
- **URL hash fragments** — Client-side only, no reload, compatible with SPA routing

---

## Decision

Use URL hash fragments to encode detail panel state. When a user selects a resource, the hash is updated (e.g., `#pvc/namespace/name`). On page load, the hash is parsed to restore the selected resource. This enables deep-linking to specific resource details and browser back/forward navigation.
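
The hash encoding and parsing described here can be sketched as two pure functions. The names `buildDetailHash`, `parseDetailHash`, and `DetailSelection` are hypothetical, and the sketch only covers the three-segment `#kind/namespace/name` form given as the example; how cluster-scoped resources such as PVs are encoded is not stated in the source and is left out. Percent-encoding each segment is an added assumption to keep `/` inside a segment from breaking the split.

```typescript
// Hypothetical shape of the selected resource.
interface DetailSelection {
  kind: string;
  namespace: string;
  name: string;
}

// Produces a fragment such as "#pvc/default/data-0".
function buildDetailHash(sel: DetailSelection): string {
  const parts = [sel.kind, sel.namespace, sel.name];
  return '#' + parts.map(part => encodeURIComponent(part)).join('/');
}

// Returns null for anything that is not exactly three non-empty segments.
function parseDetailHash(hash: string): DetailSelection | null {
  const raw = hash.charAt(0) === '#' ? hash.slice(1) : hash;
  const parts = raw.split('/');
  if (parts.length !== 3 || parts.some(part => part === '')) {
    return null;
  }
  try {
    const decoded = parts.map(part => decodeURIComponent(part));
    return { kind: decoded[0], namespace: decoded[1], name: decoded[2] };
  } catch (err) {
    // decodeURIComponent throws URIError on malformed percent-encoding.
    return null;
  }
}
```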

---

## Consequences

- ✅ Deep-linkable resource details — users can share URLs pointing to specific resources
- ✅ Survives page refresh without losing selected resource
- ✅ Browser back/forward navigation works naturally
- ✅ No server round-trip — hash changes are purely client-side
- ✅ Compatible with Headlamp's client-side routing
- ⚠️ Hash-based state is not a standard React pattern — requires team familiarity
- ⚠️ Requires manual hash parsing and updating logic
- ⚠️ Hash changes don't trigger React re-renders by default — requires `hashchange` event listener

---

## Alternatives Considered

1. **React state only** — Rejected. State is lost on refresh and cannot be shared via URL.

2. **URL query parameters** — Rejected. May conflict with Headlamp's routing and could trigger unintended navigation behavior.

3. **Separate detail routes** — Rejected. Too heavyweight for inline detail panels; would require full page transitions for what should be a panel toggle.

---

## Changelog

| Date | Change |
|------|--------|
| 2026-03-05 | Initial decision |
@@ -0,0 +1,53 @@
# ADR 005: Prometheus Metrics via Pod Proxy

**Status**: Accepted

**Date**: 2026-03-05

**Deciders**: Development Team

---

## Context

The plugin displays CSI driver metrics (operation latencies, error rates, volume stats). The CSI driver pods expose a Prometheus metrics endpoint on port 8080 in the standard text exposition format. The plugin needs to fetch and parse these metrics. Options:

- **Query a Prometheus server** — Requires Prometheus to be installed in the cluster
- **Scrape the pod directly via Kubernetes pod proxy** — No additional dependencies
- **Use a metrics aggregation service** — Requires additional infrastructure

---

## Decision

Fetch metrics directly from the CSI driver pod's `/metrics` endpoint via Kubernetes pod proxy (`ApiProxy.request` to `/api/v1/namespaces/{ns}/pods/{pod}:8080/proxy/metrics`). Parse the Prometheus text exposition format in-browser using a custom parser in `metrics.ts`. No dependency on a Prometheus server installation.
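
A rough sketch of the two pieces involved, building the pod proxy path and parsing the text exposition format, is shown below. It is not the parser in `metrics.ts`, whose code is not part of this record: `metricsProxyPath`, `parseMetricsText`, and `MetricSample` are hypothetical names. The sketch skips `# HELP` and `# TYPE` lines, ignores optional timestamps, and does not unescape backslash sequences inside label values.

```typescript
// Hypothetical parsed sample: one line of the exposition format.
interface MetricSample {
  name: string;
  labels: { [key: string]: string };
  value: number;
}

// Builds the pod proxy path quoted in the decision; port defaults to 8080.
function metricsProxyPath(namespace: string, pod: string, port: number = 8080): string {
  return '/api/v1/namespaces/' + namespace + '/pods/' + pod + ':' + port + '/proxy/metrics';
}

// The format spells infinities as +Inf / -Inf, which Number() does not accept.
function parseSampleValue(raw: string): number {
  if (raw === '+Inf' || raw === 'Inf') return Infinity;
  if (raw === '-Inf') return -Infinity;
  return Number(raw); // "NaN" parses to NaN
}

function parseMetricsText(text: string): MetricSample[] {
  // metric name, optional {label set}, then the value; a trailing timestamp is ignored.
  const lineRe = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)/;
  const samples: MetricSample[] = [];
  for (const rawLine of text.split('\n')) {
    const line = rawLine.trim();
    // Skip blank lines and comment lines (# HELP, # TYPE).
    if (line === '' || line.charAt(0) === '#') continue;
    const match = lineRe.exec(line);
    if (match === null) continue;
    const labels: { [key: string]: string } = {};
    if (match[2]) {
      // A fresh regex per line keeps the global lastIndex state local.
      const labelRe = /([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"/g;
      let labelMatch = labelRe.exec(match[2]);
      while (labelMatch !== null) {
        labels[labelMatch[1]] = labelMatch[2];
        labelMatch = labelRe.exec(match[2]);
      }
    }
    samples.push({ name: match[1], labels, value: parseSampleValue(match[3]) });
  }
  return samples;
}
```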

---

## Consequences

- ✅ Works without Prometheus server installed — no additional infrastructure dependency
- ✅ Direct from source with no aggregation delay — metrics are always current
- ✅ Leverages existing Kubernetes API authentication and authorization
- ✅ No additional service dependencies to configure or maintain
- ⚠️ Custom Prometheus text format parser to maintain — mitigated by the parser being well-tested
- ⚠️ Only gets metrics from one pod at a time (no aggregation across replicas) — acceptable since the CSI controller typically runs one replica
- ⚠️ No historical data (point-in-time only) — users needing historical trends should use a full Prometheus setup

---

## Alternatives Considered

1. **Query Prometheus server via service proxy** (like the intel-gpu plugin) — Rejected. Would require Prometheus to be installed, adding a hard infrastructure dependency.

2. **Use a metrics library (prom-client) for parsing** — Rejected. Adds a runtime dependency for a relatively simple parsing task.

3. **JSON metrics endpoint instead of Prometheus format** — Rejected. The CSI driver only exposes Prometheus text format; a JSON endpoint would require changes to the driver itself.

---

## Changelog

| Date | Change |
|------|--------|
| 2026-03-05 | Initial decision |
@@ -0,0 +1,44 @@
# Architecture Decision Records

## What is an ADR?

An Architecture Decision Record (ADR) captures an important architectural decision made along with its context and consequences. ADRs are a lightweight way to document the "why" behind technical choices, ensuring that future contributors understand the reasoning behind the current architecture.

## Format

This project uses the [Nygard-style ADR format](https://cognitect.com/blog/2011/11/15/documenting-architecture-decisions):

- **Title**: Short noun phrase describing the decision
- **Status**: Proposed | Accepted | Deprecated | Superseded
- **Date**: When the decision was made
- **Context**: What is the issue that we're seeing that motivates this decision?
- **Decision**: What is the change that we're proposing and/or doing?
- **Consequences**: What becomes easier or more difficult to do because of this change?
- **Alternatives Considered**: What other options were evaluated?

## Index

| ADR | Title | Status | Date |
|-----|-------|--------|------|
| [001](001-react-context-state.md) | React Context for Shared CSI Driver State | Accepted | 2026-03-05 |
| [002](002-read-only-benchmark-exception.md) | Read-Only Plugin with Benchmark Exception | Accepted | 2026-03-05 |
| [003](003-optional-crd-degradation.md) | Graceful Degradation for Optional CRDs | Accepted | 2026-03-05 |
| [004](004-url-hash-detail-panels.md) | URL Hash-Based Detail Panel State | Accepted | 2026-03-05 |
| [005](005-prometheus-pod-proxy.md) | Prometheus Metrics via Pod Proxy | Accepted | 2026-03-05 |

## Creating New ADRs

1. Copy an existing ADR as a template
2. Assign the next sequential number (e.g., `006-your-title.md`)
3. Fill in all sections: Status, Date, Context, Decision, Consequences, Alternatives
4. Set the status to `Proposed` until reviewed
5. Update this README index table
6. Submit as part of a pull request for review

ADRs should not be deleted. If a decision is reversed, create a new ADR that supersedes the old one and update the old ADR's status to `Superseded by [ADR NNN](NNN-title.md)`.

## References

- [Michael Nygard - Documenting Architecture Decisions](https://cognitect.com/blog/2011/11/15/documenting-architecture-decisions)
- [ADR GitHub Organization](https://adr.github.io/)
- [Joel Parker Henderson - Architecture Decision Record](https://github.com/joelparkerhenderson/architecture-decision-record)