[GH-ISSUE #211] [RFC] AMD GPU Support #119

Open
opened 2026-05-05 03:26:11 -06:00 by gitea-mirror · 1 comment

Originally created by @jamesETsmith on GitHub (Apr 29, 2026).
Original GitHub issue: https://github.com/XuehaiPan/nvitop/issues/211

RFC: AMD GPU Support for nvitop

Branch (Option A — unified Device): feature/amd_impl
Branch (Option B — explicit API, recommended): feature/amd_impl_option_b
Author: James Smith


Problem statement

First off, thanks for making nvitop. I think it's a great tool! I work a lot with AMD GPUs and would love to be able to use nvitop rather than another GPU monitor. The AMD/ROCm ecosystem is moving toward a pip install style for its tools with TheRock, and I think nvitop would fit nicely into that stack.

This RFC proposes adding AMD GPU support via two alternative approaches, both already implemented in the branches listed above. Let me know what you think!


Constraints

  • The implementation must not break or regress the existing NVIDIA path.
  • amdsmi must be an optional dependency — never required on systems without AMD hardware.
  • The two new files (libamdsmi.py, amd_device.py) are shared between both options. The question is only how they connect to the rest of nvitop.
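The optional-dependency constraint can be satisfied with a guarded import at the top of libamdsmi.py. A minimal sketch (the exact symbol names here are illustrative, not necessarily what the branches use):

```python
# Guarded import: amdsmi is optional and may be absent on NVIDIA-only systems.
try:
    import amdsmi  # provided by ROCm / TheRock wheels
    _AMDSMI_AVAILABLE = True
except ImportError:
    amdsmi = None
    _AMDSMI_AVAILABLE = False


def amd_is_available() -> bool:
    """Report whether the AMD SMI bindings could be imported."""
    return _AMDSMI_AVAILABLE
```

Every AMD code path then checks amd_is_available() first, so an NVIDIA-only system never touches amdsmi at all.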

The design question

The CLI and TUI today call Device.count(), Device.all(), and Device.driver_version() to enumerate devices. How should they gain awareness of AMD devices?


Option A: Unified Device class

Branch: feature/amd_impl
Diff vs upstream: 10 files, +1743 / -37 lines; device.py has 8 hunks (+195 / -28)

Device.count(), Device.from_indices(), Device.driver_version(), and Device.is_available() are modified to be AMD-aware. Existing call sites require no changes.

devices = Device.all()             # returns PhysicalDevice + AmdPhysicalDevice
count   = Device.count()           # NVIDIA count + AMD count
version = Device.driver_version()  # NVIDIA driver version, falls back to amdgpu version
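To illustrate the unified-index behavior noted in the weaknesses below, here is a toy sketch of the Option A dispatch; nvml_device_count and amd_device_count are hypothetical stand-ins for the real NVML/amdsmi probes, not functions from the branch:

```python
def nvml_device_count() -> int:
    """Stand-in for the NVML probe; would return 0 when NVML is unavailable."""
    return 2  # pretend two NVIDIA GPUs were found


def amd_device_count() -> int:
    """Stand-in for the amdsmi probe; would return 0 when amdsmi is unavailable."""
    return 1  # pretend one AMD GPU was found


def unified_count() -> int:
    """AMD-aware count: NVIDIA devices first, AMD devices appended after them."""
    return nvml_device_count() + amd_device_count()


# AMD devices occupy indices N..N+M-1 (here index 2), so their indices
# shift whenever the NVIDIA count changes.
print(unified_count())  # → 3
```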

Strengths

  • Zero changes required in downstream user code — Device.all() just works.
  • The CLI and TUI changes are minimal; they already call the right entry points.

Weaknesses

  • device.py is the most sensitive file in the codebase. Reviewers must reason about whether the NVIDIA path is unchanged.
  • _safe_nvml_query suppresses logging during probing; this may hide legitimate NVML errors on mixed systems.
  • Device.all() now returns a heterogeneous list (PhysicalDevice + AmdPhysicalDevice) — a silent API contract change that may break downstream code assuming homogeneous types.
  • driver_version() falling back from "NVIDIA display driver version" to "amdgpu kernel module version" is not a semantically equivalent substitution.
  • The unified index N..N+M-1 for AMD devices means AMD GPU indices shift if NVIDIA GPUs are added or removed.

Option B: Explicit AMD-aware entry points (recommended)

Branch: feature/amd_impl_option_b
Diff vs upstream: 9 files, +1607 / -9 lines; device.py has zero diff

device.py is left completely untouched. Four new functions are added to nvitop/api/__init__.py. The CLI and TUI are updated to call these.

from nvitop.api import get_all_devices, get_device_count, get_driver_version, get_amd_runtime_version

devices = get_all_devices()           # PhysicalDevice + AmdPhysicalDevice
count   = get_device_count()          # NVIDIA + AMD
version = get_driver_version()        # NVIDIA driver, falls back to amdgpu driver
rocm    = get_amd_runtime_version()   # ROCm version (AMD-specific)

# Existing API — unchanged, NVIDIA only
nvidia_only = Device.all()
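A minimal sketch of what get_all_devices() could look like internally, assuming the helper names nvidia_devices/amd_devices as placeholders for the real enumeration calls; the point is that each vendor's failures are caught independently:

```python
def nvidia_devices() -> list:
    """Stand-in for NVML enumeration; raises or returns [] when NVML is absent."""
    return ['nvidia:0', 'nvidia:1']


def amd_devices() -> list:
    """Stand-in for amdsmi enumeration."""
    return ['amd:0']


def get_all_devices() -> list:
    """Combine both vendors, isolating each backend's errors from the other."""
    devices = []
    try:
        devices.extend(nvidia_devices())
    except Exception:  # NVML missing or broken; the AMD path is unaffected
        pass
    try:
        devices.extend(amd_devices())
    except Exception:  # amdsmi missing or broken; the NVIDIA path is unaffected
        pass
    return devices


print(get_all_devices())  # → ['nvidia:0', 'nvidia:1', 'amd:0']
```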

Strengths

  • device.py has zero diff — the core NVIDIA API contract is completely untouched.
  • All NVML error handling is unchanged; AMD errors are caught independently.
  • The API surface added is small and self-explanatory.
  • Incrementally mergeable: libamdsmi.py and amd_device.py could land first as a no-op, with the integration wired in a follow-up.

Weaknesses

  • Device.all() and get_all_devices() coexist — the split is intentional but may confuse new users.
  • The TUI still overrides Device.from_indices() to show a unified device list, so the TUI layer and API layer have different semantics for "all devices."
  • Downstream users who want AMD support must update call sites.

Open questions for the maintainer

  1. Which integration option is preferred? Option A (transparent unification via Device) or Option B (explicit get_all_devices() API)? Or is there a third approach you'd prefer?

  2. AmdPhysicalDevice initialization. The current implementation bypasses Device.__init__ because that call invokes NVML. Is the maintainer comfortable with this for a v1 PR, or is refactoring Device.__init__ to separate NVML initialization from attribute setup a prerequisite for merge?

  3. Long-term backend abstraction. The architecturally cleanest design would be a proper backend abstraction (NvidiaBackend, AmdBackend) with Device as a vendor-agnostic base class. This is a significant refactor we would prefer to coordinate on before investing effort. Is this a direction the project wants to pursue?

  4. amdsmi version range. The implementation targets amdsmi >= 5.5.0 (ROCm 5.5+). The API surface changed between ROCm 5.x and 6.x. Should we add explicit runtime version detection and raise a clear error for unsupported versions?

  5. Platform guard. Should libamdsmi.py include an explicit sys.platform != 'linux' guard at module load, or is the current approach (init fails silently on non-Linux) sufficient?

  6. __getattr__ permissiveness. AmdPhysicalDevice.__getattr__ returns a callable yielding NA for any unrecognized attribute. This prevents crashes but may mask programming errors. Would the maintainer prefer a stricter allowlist?
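For context on question 6, the permissive NA-fallback pattern can be sketched as follows (the class name and NA sentinel value here are illustrative, not the actual nvitop identifiers):

```python
NA = 'N/A'  # stand-in for nvitop's NA sentinel


class PermissiveDevice:
    """Illustrates the permissive __getattr__ discussed in question 6."""

    def fan_speed(self):
        return 42  # a genuinely implemented query

    def __getattr__(self, name):
        # __getattr__ only fires for attributes normal lookup cannot find,
        # so any unknown name resolves to a callable returning NA. A typo
        # like dev.fan_sped() silently yields NA instead of raising.
        def _unsupported(*args, **kwargs):
            return NA
        return _unsupported


dev = PermissiveDevice()
print(dev.fan_speed())  # → 42
print(dev.fan_sped())   # typo: no AttributeError; prints N/A
```

The upside is that no query ever crashes the TUI; the downside is exactly the masking concern raised above.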


Testing plan

  • Unit tests using unittest.mock.patch to mock amdsmi module calls, covering enumeration, the ImportError path, process listing, and oneshot() caching.
  • Integration smoke test: get_all_devices() on an NVIDIA-only system (no amdsmi) returns only NVIDIA devices with no errors.
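As a sketch of the mock-based approach, a fake amdsmi module can be injected into sys.modules so no AMD hardware or library is needed; the function under test here is a toy stand-in, and amdsmi_get_processor_handles is the real amdsmi enumeration call being mocked:

```python
import sys
import unittest
from unittest import mock


def amd_device_count():
    """Toy version of the enumeration under test: imports amdsmi lazily."""
    try:
        import amdsmi
    except ImportError:
        return 0
    return len(amdsmi.amdsmi_get_processor_handles())


class AmdEnumerationTest(unittest.TestCase):
    def test_count_with_mocked_amdsmi(self):
        # Inject a fake amdsmi module exposing two processor handles.
        fake = mock.MagicMock()
        fake.amdsmi_get_processor_handles.return_value = [object(), object()]
        with mock.patch.dict(sys.modules, {'amdsmi': fake}):
            self.assertEqual(amd_device_count(), 2)

    def test_import_error_path(self):
        # A None entry in sys.modules makes `import amdsmi` raise ImportError,
        # simulating a system without the library installed.
        with mock.patch.dict(sys.modules, {'amdsmi': None}):
            self.assertEqual(amd_device_count(), 0)
```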

Question for the maintainer: Is mock-based unit testing (no real AMD hardware required) acceptable?


@jamesETsmith commented on GitHub (May 4, 2026):

@XuehaiPan just a nudge on this

Reference: github-starred/nvitop#119