diff --git a/docs/architecture/adr/001-react-context-state.md b/docs/architecture/adr/001-react-context-state.md
new file mode 100644
index 0000000..f753a58
--- /dev/null
+++ b/docs/architecture/adr/001-react-context-state.md
@@ -0,0 +1,52 @@
+# ADR 001: React Context for Centralized GPU State
+
+**Status**: Accepted
+
+**Date**: 2026-03-05
+
+**Deciders**: Development Team
+
+---
+
+## Context
+
+The Intel GPU plugin needs to share GPU-related data across 5 page views (Overview, DevicePlugins, Nodes, Pods, Metrics) and 2 detail view sections (Node, Pod). Data includes GPU nodes (identified by node labels and capacity fields), GPU pods, GpuDevicePlugin CRD instances, and plugin DaemonSet pods.
+
+The `IntelGpuDataProvider` context holds all derived GPU state. Child components access data via `useIntelGpuContext()`. The context collects errors from three streams (node hook error, pod hook error, async CRD fetch error) into a `string[]`, which is joined with `';'` into a single error string.
+
+---
+
+## Decision
+
+Use a single `IntelGpuDataProvider` React Context that wraps every route and every `registerDetailsViewSection` call in `index.tsx`. All GPU-derived state is computed in the provider and exposed via context.
+
+---
+
+## Consequences
+
+- ✅ Single source of truth for all GPU data
+- ✅ All views share consistent state
+- ✅ Error aggregation from multiple sources into a unified error string
+- ✅ Refresh mechanism updates everything atomically
+- ⚠️ All consumers re-render on any data change
+- ⚠️ Monolithic provider couples all GPU state together
+
+The negative consequences are mitigated by the fact that GPU data updates infrequently in practice, so unnecessary re-renders are rare.
+
+---
+
+## Alternatives Considered
+
+1. **Per-page data fetching**: Rejected. Would duplicate complex GPU node/pod filtering logic across each of the 5 pages and 2 detail sections.
+
+2. **Multiple contexts (NodesContext, PodsContext, CRDContext)**: Rejected. GPU data is highly cross-referenced (e.g., GPU pods reference GPU nodes, CRD instances relate to DaemonSet pods). Splitting contexts would require complex cross-context coordination.
+
+3. **External state library (Redux, Zustand, etc.)**: Rejected. External state libraries are not available in the Headlamp plugin runtime environment.
+
+---
+
+## Changelog
+
+| Date | Change |
+|------|--------|
+| 2026-03-05 | Initial decision accepted |
diff --git a/docs/architecture/adr/002-dual-data-fetching.md b/docs/architecture/adr/002-dual-data-fetching.md
new file mode 100644
index 0000000..7d6c8ab
--- /dev/null
+++ b/docs/architecture/adr/002-dual-data-fetching.md
@@ -0,0 +1,59 @@
+# ADR 002: Dual Data Fetching Strategy (Hooks + ApiProxy)
+
+**Status**: Accepted
+
+**Date**: 2026-03-05
+
+**Deciders**: Development Team
+
+---
+
+## Context
+
+The plugin needs data from two categories of Kubernetes resources:
+
+- **Standard resources**: Nodes and Pods, for which Headlamp provides reactive `useList()` hooks via built-in resource classes.
+- **Custom resources**: GpuDevicePlugin CRD (under `deviceplugin.intel.com/v1`) and DaemonSet pods with specific labels, for which Headlamp does not have built-in support.
+
+The plugin uses three possible label selectors for DaemonSet pod discovery to handle different deployment configurations.
+
+---
+
+## Decision
+
+Implement a two-track data fetching strategy within the context provider:
+
+1. **Track 1 (Reactive)**: Use `K8s.ResourceClasses.Node.useList()` and `K8s.ResourceClasses.Pod.useList({namespace:''})` for standard resources. These are reactive to cluster changes and automatically update when resources are created, modified, or deleted.
+
+2. **Track 2 (Imperative)**: Use `ApiProxy.request()` inside a `useEffect` keyed on `refreshKey` for GpuDevicePlugin CRDs and DaemonSet pods. The `refreshKey` is incremented by the `refresh()` function exposed through the context.
+
+---
+
+## Consequences
+
+- ✅ Leverages Headlamp's reactive hooks for standard resources with automatic updates
+- ✅ Flexible `ApiProxy` for custom CRDs without needing to register custom resource classes
+- ✅ Refresh mechanism provides manual control over imperative fetches
+- ✅ Clean separation of reactive vs imperative data sources
+- ⚠️ Two different update mechanisms (hooks auto-update vs manual refresh for CRDs)
+- ⚠️ CRD data may lag behind hook data between refreshes
+
+The negative consequences are mitigated by providing a manual refresh button in the UI, allowing users to force an update of imperative data when needed.
+
+---
+
+## Alternatives Considered
+
+1. **All ApiProxy (no hooks)**: Rejected. Loses reactivity for standard resources, meaning Node and Pod changes would not be reflected until a manual refresh.
+
+2. **All hooks (register CRD as custom resource class)**: Rejected. Headlamp's `KubeObject` registration is complex for read-only CRD access and would add unnecessary coupling to Headlamp internals.
+
+3. **Single useEffect for everything**: Rejected. Loses the reactivity benefit for Nodes and Pods, and would require manual refresh for all data instead of just CRDs.
+
+---
+
+## Changelog
+
+| Date | Change |
+|------|--------|
+| 2026-03-05 | Initial decision accepted |
diff --git a/docs/architecture/adr/003-graceful-crd-degradation.md b/docs/architecture/adr/003-graceful-crd-degradation.md
new file mode 100644
index 0000000..11fd11d
--- /dev/null
+++ b/docs/architecture/adr/003-graceful-crd-degradation.md
@@ -0,0 +1,53 @@
+# ADR 003: Graceful CRD Degradation
+
+**Status**: Accepted
+
+**Date**: 2026-03-05
+
+**Deciders**: Development Team
+
+---
+
+## Context
+
+The GpuDevicePlugin CRD (`deviceplugin.intel.com/v1`) is only present when the Intel GPU device plugin operator is installed. However, Intel GPUs can be present in a cluster without the operator, because the device plugin can also be deployed as a plain DaemonSet.
+
+The plugin should still detect and display GPU resources even without the CRD. GPU nodes are identifiable by node labels (e.g., `intel.feature.node.kubernetes.io/gpu`) and capacity fields (e.g., `gpu.intel.com/i915`). GPU pods are identifiable by resource requests/limits for Intel GPU resources.
+
+---
+
+## Decision
+
+Wrap the GpuDevicePlugin CRD fetch in its own `try/catch`. If the fetch fails (for example, because the CRD is not installed), set `crdAvailable` to `false` and continue. GPU nodes and pods are still discovered via node labels, capacity fields, and pod resource requests, independent of the CRD.
+
+The CRD data enriches the view when available but is not required for core functionality.
+
+---
+
+## Consequences
+
+- ✅ Plugin works on any cluster with Intel GPUs regardless of operator installation
+- ✅ Progressive enhancement when CRD is available
+- ✅ No error displayed to the user for a missing CRD
+- ⚠️ Two code paths (with/without CRD data) increase testing surface
+- ⚠️ DevicePlugins page is empty without the CRD
+
+The negative consequences are mitigated by clear messaging on the DevicePlugins page when the CRD is unavailable, informing users that the operator is not installed.
+
+---
+
+## Alternatives Considered
+
+1. **Require CRD (hard dependency)**: Rejected. Too restrictive; many clusters run the device plugin as a plain DaemonSet without the operator and its CRD.
+
+2. **API discovery check before fetch**: Considered, but `try/catch` is simpler and handles all failure modes (CRD not installed, API server errors, permission issues) uniformly.
+
+3. **Disable plugin entirely without CRD**: Rejected. Core GPU monitoring (node detection, pod resource tracking) works without the CRD and provides significant value on its own.
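
The degradation described in the Decision of ADR 003 can be sketched in TypeScript as follows. This is a minimal illustration under stated assumptions, not the plugin's actual code: `loadDevicePlugins`, `isGpuNode`, the injected `fetchList` callback, and the `NodeLike` and `CrdState` shapes are hypothetical names introduced here, and matching the node label against the value `'true'` is an assumption. Only the label key, the capacity key, and the `crdAvailable` flag come from the ADR text.

```typescript
// Hypothetical, trimmed-down shapes; real Kubernetes objects carry many more fields.
interface NodeLike {
  metadata: { name: string; labels?: { [key: string]: string | undefined } };
  status?: { capacity?: { [key: string]: string | undefined } };
}

interface GpuDevicePluginLike {
  metadata: { name: string };
}

interface CrdState {
  crdAvailable: boolean;
  devicePlugins: GpuDevicePluginLike[];
}

// `fetchList` stands in for the real CRD request so the sketch runs without a cluster.
async function loadDevicePlugins(
  fetchList: () => Promise<{ items?: GpuDevicePluginLike[] }>
): Promise<CrdState> {
  try {
    const response = await fetchList();
    return { crdAvailable: true, devicePlugins: response.items ?? [] };
  } catch {
    // Every failure mode (CRD not installed, API server error, permission issue)
    // is handled the same way: record that the CRD is unavailable and continue.
    return { crdAvailable: false, devicePlugins: [] };
  }
}

// Node detection depends only on labels and capacity, never on the CRD result.
function isGpuNode(node: NodeLike): boolean {
  const labels = node.metadata.labels ?? {};
  const capacity = node.status?.capacity ?? {};
  // Assumption: the label is set to the string 'true' on GPU nodes.
  const hasLabel = labels['intel.feature.node.kubernetes.io/gpu'] === 'true';
  const hasCapacity = Number(capacity['gpu.intel.com/i915'] ?? '0') > 0;
  return hasLabel || hasCapacity;
}
```

Because `isGpuNode` never reads `CrdState`, a rejected `fetchList` leaves node detection untouched, which is the property the ADR relies on.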
+
+---
+
+## Changelog
+
+| Date | Change |
+|------|--------|
+| 2026-03-05 | Initial decision accepted |
diff --git a/docs/architecture/adr/004-native-view-integration.md b/docs/architecture/adr/004-native-view-integration.md
new file mode 100644
index 0000000..1f540c5
--- /dev/null
+++ b/docs/architecture/adr/004-native-view-integration.md
@@ -0,0 +1,61 @@
+# ADR 004: Headlamp View Integration via Detail Sections and Column Processors
+
+**Status**: Accepted
+
+**Date**: 2026-03-05
+
+**Deciders**: Development Team
+
+---
+
+## Context
+
+The plugin provides its own pages (Overview, Nodes, Pods, etc.) but also needs to enhance Headlamp's native views. Users browsing the standard Nodes list should see GPU information without navigating to the plugin.
+
+Headlamp offers two integration mechanisms:
+
+- `registerDetailsViewSection` for injecting sections into resource detail pages.
+- `registerResourceTableColumnsProcessor` for adding columns to resource list tables.
+
+---
+
+## Decision
+
+Use both integration mechanisms:
+
+1. **Detail sections**: `registerDetailsViewSection` injects GPU information into Node and Pod detail pages. Resource-kind guards ensure sections only render for the correct resource type.
+
+2. **Column processors**: `registerResourceTableColumnsProcessor` appends "GPU Type" and "GPU Devices" columns to the native `headlamp-nodes` table.
+
+Both integration points consume data from the shared `IntelGpuDataProvider` context, so they benefit from the same cached data as the plugin's own pages.
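
The column-processor half of the ADR 004 Decision can be sketched as a pure TypeScript function. This is a hedged illustration, not the plugin's code: `NodeLike`, `Column`, `gpuDeviceCount`, and `addGpuColumns` are hypothetical names, the `{ id, columns }` argument shape is an assumption about what a processor receives, and deriving "GPU Type" from the `gpu.intel.com/i915` capacity key is a simplification. In the plugin, a function like `addGpuColumns` would presumably be passed to `registerResourceTableColumnsProcessor`; here it is kept standalone so it runs without Headlamp.

```typescript
// Hypothetical, trimmed-down stand-ins for the node object and a table column.
interface NodeLike {
  metadata: { name: string };
  status?: { capacity?: { [key: string]: string | undefined } };
}

interface Column {
  label: string;
  getter: (node: NodeLike) => string;
}

// Table id named in the ADR; other tables must pass through unchanged.
const NODES_TABLE_ID = 'headlamp-nodes';
// Capacity key taken from ADR 003; other Intel GPU resource names are not modeled.
const I915_RESOURCE = 'gpu.intel.com/i915';

// Number of GPU devices advertised in the node's capacity, or 0 when absent.
function gpuDeviceCount(node: NodeLike): number {
  const raw = node.status?.capacity?.[I915_RESOURCE] ?? '0';
  const parsed = Number(raw);
  return Number.isFinite(parsed) ? parsed : 0;
}

// Appends the two GPU columns only for the native nodes table.
function addGpuColumns(args: { id: string; columns: Column[] }): Column[] {
  if (args.id !== NODES_TABLE_ID) {
    return args.columns;
  }
  return [
    ...args.columns,
    { label: 'GPU Type', getter: node => (gpuDeviceCount(node) > 0 ? 'i915' : '-') },
    { label: 'GPU Devices', getter: node => String(gpuDeviceCount(node)) },
  ];
}
```

Returning a new array rather than mutating `columns` keeps the function easy to test. Rendering `-` for nodes without GPUs is one way to soften the consequence that the columns appear even when no GPU nodes exist.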
+
+---
+
+## Consequences
+
+- ✅ GPU data visible in native Headlamp views without navigation
+- ✅ Seamless user experience for users already familiar with Headlamp
+- ✅ Uses Headlamp's official extension APIs for forward compatibility
+- ✅ Shared context means no duplicate data fetches
+- ⚠️ Detail sections render for all Nodes/Pods (guard needed to check GPU relevance)
+- ⚠️ Column processors add columns even when no GPU nodes exist in the cluster
+
+The negative consequences are mitigated by resource-kind guards and conditional rendering that hide GPU sections when a resource has no GPU relevance.
+
+---
+
+## Alternatives Considered
+
+1. **Plugin pages only (no native view integration)**: Rejected. Users would miss GPU info when browsing standard Headlamp views, reducing discoverability.
+
+2. **Override native views entirely**: Rejected. Not supported by Headlamp's plugin API and would conflict with other plugins.
+
+3. **App bar notification only**: Rejected. Insufficient detail for node-level and pod-level GPU information; only suitable for cluster-wide summaries.
+
+---
+
+## Changelog
+
+| Date | Change |
+|------|--------|
+| 2026-03-05 | Initial decision accepted |
diff --git a/docs/architecture/adr/README.md b/docs/architecture/adr/README.md
new file mode 100644
index 0000000..04c1d01
--- /dev/null
+++ b/docs/architecture/adr/README.md
@@ -0,0 +1,42 @@
+# Architecture Decision Records
+
+## What is an ADR?
+
+An Architecture Decision Record (ADR) captures an important architectural decision along with its context and consequences. ADRs document the reasoning behind significant technical choices so that future contributors can understand why the system is built the way it is.
+
+## Format
+
+This project follows the [Nygard-style ADR format](https://cognitect.com/blog/2011/11/15/documenting-architecture-decisions):
+
+- **Title**: Short noun phrase describing the decision
+- **Status**: Proposed, Accepted, Deprecated, or Superseded
+- **Date**: When the decision was made
+- **Deciders**: Who was involved in making the decision
+- **Context**: What is the issue that motivated the decision
+- **Decision**: What is the change that was decided
+- **Consequences**: What becomes easier or more difficult as a result
+- **Alternatives Considered**: What other options were evaluated
+
+## Index
+
+| ADR | Title | Status | Date |
+|-----|-------|--------|------|
+| [001](001-react-context-state.md) | React Context for Centralized GPU State | Accepted | 2026-03-05 |
+| [002](002-dual-data-fetching.md) | Dual Data Fetching Strategy (Hooks + ApiProxy) | Accepted | 2026-03-05 |
+| [003](003-graceful-crd-degradation.md) | Graceful CRD Degradation | Accepted | 2026-03-05 |
+| [004](004-native-view-integration.md) | Headlamp View Integration via Detail Sections and Column Processors | Accepted | 2026-03-05 |
+
+## Creating New ADRs
+
+1. Copy an existing ADR as a template.
+2. Assign the next sequential number (e.g., `005`).
+3. Fill in all sections: Status, Date, Deciders, Context, Decision, Consequences, and Alternatives Considered.
+4. Set the status to `Proposed` until the team reviews and accepts the decision.
+5. Update this README index table with the new entry.
+6. Submit as part of a pull request for team review.
+
+## References
+
+- [Michael Nygard - Documenting Architecture Decisions](https://cognitect.com/blog/2011/11/15/documenting-architecture-decisions)
+- [ADR GitHub Organization](https://adr.github.io/)
+- [Headlamp Plugin Development](https://headlamp.dev/docs/latest/development/plugins/)