diff --git a/docs/architecture/adr/001-react-context-state.md b/docs/architecture/adr/001-react-context-state.md
new file mode 100644
index 0000000..bb80674
--- /dev/null
+++ b/docs/architecture/adr/001-react-context-state.md
@@ -0,0 +1,59 @@
+# ADR 001: React Context for Shared CSI Driver State
+
+**Status**: Accepted
+
+**Date**: 2026-03-05
+
+**Deciders**: Development Team
+
+---
+
+## Context
+
+The TNS CSI plugin needs to share data across multiple views: Overview, StorageClasses, Volumes, Snapshots, Metrics, and Benchmark pages, plus detail view sections for PVC, PV, and Pod. Data comes from three tracks:
+
+1. **Headlamp `useList()` hooks** — StorageClass, PersistentVolume, PersistentVolumeClaim
+2. **`ApiProxy.request()`** — CSIDriver resource, controller/node pods, VolumeSnapshotClasses, and VolumeSnapshots
+3. **TrueNAS WebSocket API** — Pool capacity stats (optional, when API key is configured in settings)
+
+The context exposes: `csiDriver`, `driverInstalled`, `storageClasses`, `persistentVolumes`, `persistentVolumeClaims`, `controllerPods`, `nodePods`, `volumeSnapshots`, `volumeSnapshotClasses`, `snapshotCrdAvailable`, `poolStats`, `poolStatsError`, `loading`, `error`, `refresh`.
+
+---
+
+## Decision
+
+Use a single `TnsCsiDataProvider` React Context wrapping all routes. Three-track data fetching:
+
+1. `useList()` for standard Kubernetes resources (StorageClass, PV, PVC)
+2. `ApiProxy.request()` in `useEffect` for CSI-specific resources and snapshots
+3. TrueNAS WebSocket client for pool capacity stats (only when API key is configured in settings)
+
+---
+
+## Consequences
+
+- ✅ Single fetch point eliminates duplicate API calls
+- ✅ All views share consistent data — no stale data across pages
+- ✅ Three-track strategy handles different API requirements cleanly
+- ✅ TrueNAS integration is opt-in — plugin works without it
+- ⚠️ Large context with many fields increases cognitive overhead
+- ⚠️ TrueNAS WebSocket adds complexity to the data layer
+- ⚠️ All consumers re-render on any data change — mitigated by infrequent updates (polling interval)
+
+---
+
+## Alternatives Considered
+
+1. **Separate contexts per data domain** — Rejected. Data is cross-referenced (PVCs filter by StorageClass provisioner), so splitting contexts would require cross-context coordination.
+
+2. **Custom hooks without context** — Rejected. Would duplicate fetches across 6 pages, leading to redundant API calls and inconsistent data.
+
+3. **Redux/Zustand** — Rejected. Not available in the Headlamp plugin environment.
+
+---
+
+## Changelog
+
+| Date | Change |
+|------|--------|
+| 2026-03-05 | Initial decision |
diff --git a/docs/architecture/adr/002-read-only-benchmark-exception.md b/docs/architecture/adr/002-read-only-benchmark-exception.md
new file mode 100644
index 0000000..e7d0784
--- /dev/null
+++ b/docs/architecture/adr/002-read-only-benchmark-exception.md
@@ -0,0 +1,60 @@
+# ADR 002: Read-Only Plugin with Benchmark Exception
+
+**Status**: Accepted
+
+**Date**: 2026-03-05
+
+**Deciders**: Development Team
+
+---
+
+## Context
+
+The plugin is primarily a read-only observability tool for TNS CSI storage. However, it includes a Benchmark feature that runs kbench (FIO-based storage benchmarks) against storage classes. Running benchmarks requires creating temporary Kubernetes resources: a PVC for the test volume and a Job running the kbench container.
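As an illustration of the two temporary resources, here is a minimal TypeScript sketch of manifest builders. The function names echo `buildPvcManifest()` and `buildJobManifest()` from this ADR's workflow list, but the signatures, the `BenchmarkOptions` shape, the `kbench-` name prefix, the access mode, and the `/volume` mount path are all assumptions for illustration, not the plugin's actual code. Only the `managed-by` label value comes from the ADR.

```typescript
// Label the ADR says is applied to every resource the benchmark creates.
const MANAGED_BY_LABELS: Record<string, string> = {
  "app.kubernetes.io/managed-by": "headlamp-tns-csi-plugin",
};

// Hypothetical option bag; the real plugin's parameters may differ.
interface BenchmarkOptions {
  namespace: string;
  storageClassName: string;
  runId: string; // unique suffix so separate runs do not collide
  size: string; // requested capacity, e.g. "10Gi"
  image: string; // kbench container image, supplied by the caller
}

// Build the PVC for the test volume (assumed shape).
function buildPvcManifest(opts: BenchmarkOptions) {
  return {
    apiVersion: "v1",
    kind: "PersistentVolumeClaim",
    metadata: {
      name: `kbench-${opts.runId}`,
      namespace: opts.namespace,
      labels: { ...MANAGED_BY_LABELS },
    },
    spec: {
      accessModes: ["ReadWriteOnce"],
      storageClassName: opts.storageClassName,
      resources: { requests: { storage: opts.size } },
    },
  };
}

// Build the Job that mounts the PVC and runs the benchmark container (assumed shape).
function buildJobManifest(opts: BenchmarkOptions, pvcName: string) {
  return {
    apiVersion: "batch/v1",
    kind: "Job",
    metadata: {
      name: `kbench-${opts.runId}`,
      namespace: opts.namespace,
      labels: { ...MANAGED_BY_LABELS },
    },
    spec: {
      backoffLimit: 0, // do not retry a failed benchmark run
      template: {
        metadata: { labels: { ...MANAGED_BY_LABELS } },
        spec: {
          restartPolicy: "Never",
          containers: [
            {
              name: "kbench",
              image: opts.image,
              volumeMounts: [{ name: "test-volume", mountPath: "/volume" }],
            },
          ],
          volumes: [
            { name: "test-volume", persistentVolumeClaim: { claimName: pvcName } },
          ],
        },
      },
    },
  };
}
```

Keeping the builders as pure functions that return plain objects makes the write path easy to unit-test without a cluster; the create and delete calls can then stay thin wrappers around the API client.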
+
+These resources are tagged with `app.kubernetes.io/managed-by=headlamp-tns-csi-plugin` for lifecycle tracking. The benchmark workflow includes:
+
+1. `buildPvcManifest()` — Create PVC spec for test volume
+2. `createPvc()` — Create the PVC in the cluster
+3. `buildJobManifest()` — Create Job spec for kbench container
+4. `createJob()` — Create the Job in the cluster
+5. Poll for Job completion
+6. `fetchKbenchLogs()` — Retrieve benchmark output from pod logs
+7. `parseKbenchLog()` — Parse FIO results from kbench output
+8. `deleteJob()` — Clean up the benchmark Job
+9. `deletePvc()` — Clean up the test PVC
+
+---
+
+## Decision
+
+The plugin is read-only for all storage observability features. The sole exception is the Benchmark feature, which creates and deletes temporary PVC + Job resources. All created resources are labeled for identification and cleaned up after benchmark completion. The benchmark is triggered explicitly by user action (button on StorageClass detail page via `registerDetailsViewHeaderAction`).
+
+---
+
+## Consequences
+
+- ✅ Minimal RBAC requirements for normal operation (read-only)
+- ✅ Benchmark is opt-in and requires explicit user action
+- ✅ Resources are auto-cleaned after benchmark completion
+- ✅ `managed-by` label enables easy identification of plugin-created resources
+- ⚠️ Requires additional RBAC permissions (create/delete Jobs and PVCs) for benchmark feature
+- ⚠️ Failed cleanup leaves orphaned resources — mitigated by `listKbenchJobs()` which finds orphaned resources by label for manual cleanup
+
+---
+
+## Alternatives Considered
+
+1. **No benchmark feature (fully read-only)** — Rejected. Storage performance testing is a key use case for storage administrators evaluating CSI drivers.
+
+2. **External benchmark tool with results import** — Rejected. Poor user experience requiring context-switching between tools.
+
+3. **Benchmark as a separate plugin** — Rejected. Benchmark results are tied to storage class context and benefit from shared data in the plugin.
+
+---
+
+## Changelog
+
+| Date | Change |
+|------|--------|
+| 2026-03-05 | Initial decision |
diff --git a/docs/architecture/adr/003-optional-crd-degradation.md b/docs/architecture/adr/003-optional-crd-degradation.md
new file mode 100644
index 0000000..264b8e4
--- /dev/null
+++ b/docs/architecture/adr/003-optional-crd-degradation.md
@@ -0,0 +1,55 @@
+# ADR 003: Graceful Degradation for Optional CRDs
+
+**Status**: Accepted
+
+**Date**: 2026-03-05
+
+**Deciders**: Development Team
+
+---
+
+## Context
+
+The plugin uses VolumeSnapshot and VolumeSnapshotClass CRDs from `snapshot.storage.k8s.io/v1`. These CRDs are part of the Kubernetes Volume Snapshot feature, which is optional — not all clusters have the snapshot controller installed.
+
+The plugin should work on clusters without snapshot support, showing storage classes, volumes, metrics, and benchmarks without the snapshots page. The CRD fetch is wrapped in `try/catch`; if it fails, the `snapshotCrdAvailable` flag is set to `false`.
+
+---
+
+## Decision
+
+Implement graceful degradation for optional CRDs. The snapshot API calls are wrapped in `try/catch` within the data context. When the snapshot CRDs are not installed:
+
+- `snapshotCrdAvailable` is set to `false`
+- Snapshot-related data arrays are empty
+- The Snapshots page shows an informational message rather than an error
+- All other plugin features remain fully functional
+
+---
+
+## Consequences
+
+- ✅ Plugin works on clusters without snapshot CRDs installed
+- ✅ No error state for missing optional features — clean informational messaging
+- ✅ Clear user feedback about what features are available
+- ✅ Core features (volumes, storage classes, metrics, benchmarks) always work
+- ⚠️ Two code paths (with/without snapshots) to maintain and test
+- ⚠️ Snapshot data might silently fail for reasons other than missing CRDs (e.g., RBAC issues)
+
+---
+
+## Alternatives Considered
+
+1. **Require snapshot CRDs (hard dependency)** — Rejected. Too restrictive; many clusters do not have the snapshot controller installed.
+
+2. **Feature detection via API discovery before fetching** — Considered, but `try/catch` on the actual fetch is simpler and catches all failure modes including RBAC restrictions.
+
+3. **Disable snapshots page entirely when CRDs missing** — Rejected. Showing an informational message explaining how to enable snapshots is better UX than silently hiding the page.
+
+---
+
+## Changelog
+
+| Date | Change |
+|------|--------|
+| 2026-03-05 | Initial decision |
diff --git a/docs/architecture/adr/004-url-hash-detail-panels.md b/docs/architecture/adr/004-url-hash-detail-panels.md
new file mode 100644
index 0000000..06154e0
--- /dev/null
+++ b/docs/architecture/adr/004-url-hash-detail-panels.md
@@ -0,0 +1,54 @@
+# ADR 004: URL Hash-Based Detail Panel State
+
+**Status**: Accepted
+
+**Date**: 2026-03-05
+
+**Deciders**: Development Team
+
+---
+
+## Context
+
+Several pages need to show detail panels for selected resources (e.g., clicking a PVC row shows PVC details). The detail panel state (which resource is selected) needs to be shareable via URL and survive page refresh. Options include:
+
+- **React state** — Lost on refresh, not shareable
+- **URL query parameters** — May cause full page reload, potential conflicts with Headlamp routing
+- **URL hash fragments** — Client-side only, no reload, compatible with SPA routing
+
+---
+
+## Decision
+
+Use URL hash fragments to encode detail panel state. When a user selects a resource, the hash is updated (e.g., `#pvc/namespace/name`). On page load, the hash is parsed to restore the selected resource. This enables deep-linking to specific resource details and browser back/forward navigation.
+
+---
+
+## Consequences
+
+- ✅ Deep-linkable resource details — users can share URLs pointing to specific resources
+- ✅ Survives page refresh without losing selected resource
+- ✅ Browser back/forward navigation works naturally
+- ✅ No server round-trip — hash changes are purely client-side
+- ✅ Compatible with Headlamp's client-side routing
+- ⚠️ Hash-based state is not a standard React pattern — requires team familiarity
+- ⚠️ Requires manual hash parsing and updating logic
+- ⚠️ Hash changes don't trigger React re-renders by default — requires `hashchange` event listener
+
+---
+
+## Alternatives Considered
+
+1. **React state only** — Rejected. State is lost on refresh and cannot be shared via URL.
+
+2. **URL query parameters** — Rejected. May conflict with Headlamp's routing and could trigger unintended navigation behavior.
+
+3. **Separate detail routes** — Rejected. Too heavyweight for inline detail panels; would require full page transitions for what should be a panel toggle.
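The manual hash parsing and updating called for in ADR 004 could look roughly like the sketch below. It follows the `#pvc/namespace/name` form given in the Decision section; the names `DetailSelection`, `formatDetailHash`, and `parseDetailHash` are hypothetical, and percent-encoding each segment is an added assumption so that unusual characters cannot be mistaken for the `/` separator.

```typescript
// Hypothetical shape for the selected resource encoded in the URL hash.
interface DetailSelection {
  kind: string; // e.g. "pvc"
  namespace: string;
  name: string;
}

// Encode a selection as "#kind/namespace/name".
function formatDetailHash(selection: DetailSelection): string {
  const segments = [selection.kind, selection.namespace, selection.name];
  return "#" + segments.map((segment) => encodeURIComponent(segment)).join("/");
}

// Parse a hash back into a selection; returns null for anything malformed.
function parseDetailHash(hash: string): DetailSelection | null {
  const raw = hash.startsWith("#") ? hash.slice(1) : hash;
  const parts = raw.split("/");
  if (parts.length !== 3 || parts.some((part) => part === "")) {
    return null;
  }
  try {
    const decoded = parts.map((part) => decodeURIComponent(part));
    return { kind: decoded[0], namespace: decoded[1], name: decoded[2] };
  } catch {
    // decodeURIComponent throws on invalid percent-escapes.
    return null;
  }
}
```

In a component, these helpers would typically be paired with a `hashchange` listener on `window` that calls `parseDetailHash(window.location.hash)` and stores the result in React state, which addresses the re-render caveat listed under Consequences.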
+
+---
+
+## Changelog
+
+| Date | Change |
+|------|--------|
+| 2026-03-05 | Initial decision |
diff --git a/docs/architecture/adr/005-prometheus-pod-proxy.md b/docs/architecture/adr/005-prometheus-pod-proxy.md
new file mode 100644
index 0000000..76ba722
--- /dev/null
+++ b/docs/architecture/adr/005-prometheus-pod-proxy.md
@@ -0,0 +1,53 @@
+# ADR 005: Prometheus Metrics via Pod Proxy
+
+**Status**: Accepted
+
+**Date**: 2026-03-05
+
+**Deciders**: Development Team
+
+---
+
+## Context
+
+The plugin displays CSI driver metrics (operation latencies, error rates, volume stats). The CSI driver pods expose a Prometheus metrics endpoint on port 8080 in the standard text exposition format. The plugin needs to fetch and parse these metrics. Options:
+
+- **Query a Prometheus server** — Requires Prometheus to be installed in the cluster
+- **Scrape the pod directly via Kubernetes pod proxy** — No additional dependencies
+- **Use a metrics aggregation service** — Requires additional infrastructure
+
+---
+
+## Decision
+
+Fetch metrics directly from the CSI driver pod's `/metrics` endpoint via Kubernetes pod proxy (`ApiProxy.request` to `/api/v1/namespaces/{ns}/pods/{pod}:8080/proxy/metrics`). Parse the Prometheus text exposition format in-browser using a custom parser in `metrics.ts`. No dependency on a Prometheus server installation.
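To make the shape of this decision concrete, the sketch below builds the pod proxy path quoted above and parses a small subset of the Prometheus text exposition format. It is not the parser in `metrics.ts`: the names `MetricSample`, `buildMetricsProxyPath`, and `parsePrometheusText` are hypothetical, and the parser only extracts sample lines (name, labels, value), skipping `# HELP` and `# TYPE` comments and ignoring optional timestamps.

```typescript
// Hypothetical sample shape; the real parser in metrics.ts may differ.
interface MetricSample {
  name: string;
  labels: Record<string, string>;
  value: number;
}

// Build the pod proxy path from the ADR; port 8080 is the documented default.
function buildMetricsProxyPath(namespace: string, pod: string, port: number = 8080): string {
  const ns = encodeURIComponent(namespace);
  const podName = encodeURIComponent(pod);
  return `/api/v1/namespaces/${ns}/pods/${podName}:${port}/proxy/metrics`;
}

// Prometheus writes infinities as "+Inf" / "-Inf", which Number() does not accept.
function parseSampleValue(text: string): number {
  if (text === "+Inf" || text === "Inf") return Infinity;
  if (text === "-Inf") return -Infinity;
  return Number(text); // "NaN" and malformed input both yield NaN
}

function parsePrometheusText(text: string): MetricSample[] {
  // name, optional {labels}, value, optional integer timestamp
  const sampleLine = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$/;
  const labelPair = /([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"/g;
  const samples: MetricSample[] = [];

  for (const rawLine of text.split("\n")) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue; // blank line or HELP/TYPE comment

    const match = sampleLine.exec(line);
    if (match === null) continue; // skip lines this sketch does not understand

    const labels: Record<string, string> = {};
    const labelText = match[2] ?? "";
    labelPair.lastIndex = 0; // reset the global regex before reusing it
    let pair: RegExpExecArray | null;
    while ((pair = labelPair.exec(labelText)) !== null) {
      // Unescape \" \\ and \n inside label values.
      labels[pair[1]] = pair[2].replace(/\\(["\\n])/g, (_match: string, ch: string) =>
        ch === "n" ? "\n" : ch
      );
    }

    samples.push({ name: match[1], labels, value: parseSampleValue(match[3]) });
  }
  return samples;
}
```

The text returned from the proxy path would be passed straight to `parsePrometheusText`. A production parser would likely also keep the `# TYPE` metadata so histograms and counters can be rendered differently; that is omitted here to keep the sketch short.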
+
+---
+
+## Consequences
+
+- ✅ Works without Prometheus server installed — no additional infrastructure dependency
+- ✅ Direct from source with no aggregation delay — metrics are always current
+- ✅ Leverages existing Kubernetes API authentication and authorization
+- ✅ No additional service dependencies to configure or maintain
+- ⚠️ Custom Prometheus text format parser to maintain — mitigated by the parser being well-tested
+- ⚠️ Only gets metrics from one pod at a time (no aggregation across replicas) — acceptable since CSI controller typically runs one replica
+- ⚠️ No historical data (point-in-time only) — users needing historical trends should use a full Prometheus setup
+
+---
+
+## Alternatives Considered
+
+1. **Query Prometheus server via service proxy** (like the intel-gpu plugin) — Rejected. Would require Prometheus to be installed, adding a hard infrastructure dependency.
+
+2. **Use a metrics library (prom-client) for parsing** — Rejected. Adds a runtime dependency for a relatively simple parsing task.
+
+3. **JSON metrics endpoint instead of Prometheus format** — Rejected. The CSI driver only exposes Prometheus text format; a JSON endpoint would require changes to the driver itself.
+
+---
+
+## Changelog
+
+| Date | Change |
+|------|--------|
+| 2026-03-05 | Initial decision |
diff --git a/docs/architecture/adr/README.md b/docs/architecture/adr/README.md
new file mode 100644
index 0000000..6f28b8c
--- /dev/null
+++ b/docs/architecture/adr/README.md
@@ -0,0 +1,44 @@
+# Architecture Decision Records
+
+## What is an ADR?
+
+An Architecture Decision Record (ADR) captures an important architectural decision made along with its context and consequences. ADRs are a lightweight way to document the "why" behind technical choices, ensuring that future contributors understand the reasoning behind the current architecture.
+
+## Format
+
+This project uses the [Nygard-style ADR format](https://cognitect.com/blog/2011/11/15/documenting-architecture-decisions):
+
+- **Title**: Short noun phrase describing the decision
+- **Status**: Proposed | Accepted | Deprecated | Superseded
+- **Date**: When the decision was made
+- **Context**: What is the issue that we're seeing that motivates this decision?
+- **Decision**: What is the change that we're proposing and/or doing?
+- **Consequences**: What becomes easier or more difficult to do because of this change?
+- **Alternatives Considered**: What other options were evaluated?
+
+## Index
+
+| ADR | Title | Status | Date |
+|-----|-------|--------|------|
+| [001](001-react-context-state.md) | React Context for Shared CSI Driver State | Accepted | 2026-03-05 |
+| [002](002-read-only-benchmark-exception.md) | Read-Only Plugin with Benchmark Exception | Accepted | 2026-03-05 |
+| [003](003-optional-crd-degradation.md) | Graceful Degradation for Optional CRDs | Accepted | 2026-03-05 |
+| [004](004-url-hash-detail-panels.md) | URL Hash-Based Detail Panel State | Accepted | 2026-03-05 |
+| [005](005-prometheus-pod-proxy.md) | Prometheus Metrics via Pod Proxy | Accepted | 2026-03-05 |
+
+## Creating New ADRs
+
+1. Copy an existing ADR as a template
+2. Assign the next sequential number (e.g., `006-your-title.md`)
+3. Fill in all sections: Status, Date, Context, Decision, Consequences, Alternatives
+4. Set the status to `Proposed` until reviewed
+5. Update this README index table
+6. Submit as part of a pull request for review
+
+ADRs should not be deleted. If a decision is reversed, create a new ADR that supersedes the old one and update the old ADR's status to `Superseded by [ADR NNN](NNN-title.md)`.
+
+## References
+
+- [Michael Nygard - Documenting Architecture Decisions](https://cognitect.com/blog/2011/11/15/documenting-architecture-decisions)
+- [ADR GitHub Organization](https://adr.github.io/)
+- [Joel Parker Henderson - Architecture Decision Record](https://github.com/joelparkerhenderson/architecture-decision-record)