[GH-ISSUE #211] [RFC] AMD GPU Support #119

Open
opened 2026-05-05 03:26:11 -06:00 by gitea-mirror · 1 comment

Originally created by @jamesETsmith on GitHub (Apr 29, 2026).
Original GitHub issue: https://github.com/XuehaiPan/nvitop/issues/211

RFC: AMD GPU Support for nvitop

Branch (Option A — unified Device): feature/amd_impl
Branch (Option B — explicit API, recommended): feature/amd_impl_option_b
Author: James Smith


Problem statement

First off, thanks for making nvitop. I think it's a great tool! I work a lot with AMD GPUs and would love to be able to use nvitop rather than another GPU monitor. The AMD/ROCm ecosystem is moving toward a pip install style for its tools with TheRock, and I think nvitop would fit nicely into that stack.

This RFC proposes adding AMD GPU support via two alternative approaches, both already implemented in the branches listed above. Let me know what you think!


Constraints

  • The implementation must not break or regress the existing NVIDIA path.
  • amdsmi must be an optional dependency — never required on systems without AMD hardware.
  • The two new files (libamdsmi.py, amd_device.py) are shared between both options. The question is only how they connect to the rest of nvitop.
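The optional-dependency constraint can be satisfied with a guarded import at the top of libamdsmi.py. A minimal sketch (the exact symbol names here are illustrative, not necessarily what the branches use):

```python
# Guarded import: amdsmi is optional and may be absent on NVIDIA-only systems.
try:
    import amdsmi  # provided by ROCm / TheRock wheels
    _AMDSMI_AVAILABLE = True
except ImportError:
    amdsmi = None
    _AMDSMI_AVAILABLE = False


def amd_is_available() -> bool:
    """Report whether the AMD SMI bindings could be imported."""
    return _AMDSMI_AVAILABLE
```

Every AMD code path then checks amd_is_available() first, so an NVIDIA-only system never touches amdsmi at all.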

The design question

The CLI and TUI today call Device.count(), Device.all(), and Device.driver_version() to enumerate devices. How should they gain awareness of AMD devices?


Option A: Unified Device class

Branch: feature/amd_impl
Diff vs upstream: 10 files, +1743 / -37 lines; device.py has 8 hunks (+195 / -28)

Device.count(), Device.from_indices(), Device.driver_version(), and Device.is_available() are modified to be AMD-aware. Existing call sites require no changes.

devices = Device.all()             # returns PhysicalDevice + AmdPhysicalDevice
count   = Device.count()           # NVIDIA count + AMD count
version = Device.driver_version()  # NVIDIA driver version, falls back to amdgpu version
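To illustrate the unified-index behavior noted in the weaknesses below, here is a toy sketch of the Option A dispatch; nvml_device_count and amd_device_count are hypothetical stand-ins for the real NVML/amdsmi probes, not functions from the branch:

```python
def nvml_device_count() -> int:
    """Stand-in for the NVML probe; would return 0 when NVML is unavailable."""
    return 2  # pretend two NVIDIA GPUs were found


def amd_device_count() -> int:
    """Stand-in for the amdsmi probe; would return 0 when amdsmi is unavailable."""
    return 1  # pretend one AMD GPU was found


def unified_count() -> int:
    """AMD-aware count: NVIDIA devices first, AMD devices appended after them."""
    return nvml_device_count() + amd_device_count()


# AMD devices occupy indices N..N+M-1 (here index 2), so their indices
# shift whenever the NVIDIA count changes.
print(unified_count())  # → 3
```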

Strengths

  • Zero changes required in downstream user code — Device.all() just works.
  • The CLI and TUI changes are minimal; they already call the right entry points.

Weaknesses

  • device.py is the most sensitive file in the codebase. Reviewers must reason about whether the NVIDIA path is unchanged.
  • _safe_nvml_query suppresses logging during probing; this may hide legitimate NVML errors on mixed systems.
  • Device.all() now returns a heterogeneous list (PhysicalDevice + AmdPhysicalDevice) — a silent API contract change that may break downstream code assuming homogeneous types.
  • driver_version() falling back from "NVIDIA display driver version" to "amdgpu kernel module version" is not a semantically equivalent substitution.
  • The unified index N..N+M-1 for AMD devices means AMD GPU indices shift if NVIDIA GPUs are added or removed.

Option B: Explicit AMD-aware entry points (recommended)

Branch: feature/amd_impl_option_b
Diff vs upstream: 9 files, +1607 / -9 lines; device.py has zero diff

device.py is left completely untouched. Four new functions are added to nvitop/api/__init__.py. The CLI and TUI are updated to call these.

from nvitop.api import get_all_devices, get_device_count, get_driver_version, get_amd_runtime_version

devices = get_all_devices()           # PhysicalDevice + AmdPhysicalDevice
count   = get_device_count()          # NVIDIA + AMD
version = get_driver_version()        # NVIDIA driver, falls back to amdgpu driver
rocm    = get_amd_runtime_version()   # ROCm version (AMD-specific)

# Existing API — unchanged, NVIDIA only
nvidia_only = Device.all()
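A minimal sketch of what get_all_devices() could look like internally, assuming the helper names nvidia_devices/amd_devices as placeholders for the real enumeration calls; the point is that each vendor's failures are caught independently:

```python
def nvidia_devices() -> list:
    """Stand-in for NVML enumeration; raises or returns [] when NVML is absent."""
    return ['nvidia:0', 'nvidia:1']


def amd_devices() -> list:
    """Stand-in for amdsmi enumeration."""
    return ['amd:0']


def get_all_devices() -> list:
    """Combine both vendors, isolating each backend's errors from the other."""
    devices = []
    try:
        devices.extend(nvidia_devices())
    except Exception:  # NVML missing or broken; the AMD path is unaffected
        pass
    try:
        devices.extend(amd_devices())
    except Exception:  # amdsmi missing or broken; the NVIDIA path is unaffected
        pass
    return devices


print(get_all_devices())  # → ['nvidia:0', 'nvidia:1', 'amd:0']
```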

Strengths

  • device.py has zero diff — the core NVIDIA API contract is completely untouched.
  • All NVML error handling is unchanged; AMD errors are caught independently.
  • The API surface added is small and self-explanatory.
  • Incrementally mergeable: libamdsmi.py and amd_device.py could land first as a no-op, with the integration wired in a follow-up.

Weaknesses

  • Device.all() and get_all_devices() coexist — the split is intentional but may confuse new users.
  • The TUI still overrides Device.from_indices() to show a unified device list, so the TUI layer and API layer have different semantics for "all devices."
  • Downstream users who want AMD support must update call sites.

Open questions for the maintainer

  1. Which integration option is preferred? Option A (transparent unification via Device) or Option B (explicit get_all_devices() API)? Or is there a third approach you'd prefer?

  2. AmdPhysicalDevice initialization. The current implementation bypasses Device.__init__ because that call invokes NVML. Is the maintainer comfortable with this for a v1 PR, or is refactoring Device.__init__ to separate NVML initialization from attribute setup a prerequisite for merge?

  3. Long-term backend abstraction. The architecturally cleanest design would be a proper backend abstraction (NvidiaBackend, AmdBackend) with Device as a vendor-agnostic base class. This is a significant refactor we would prefer to coordinate on before investing effort. Is this a direction the project wants to pursue?

  4. amdsmi version range. The implementation targets amdsmi >= 5.5.0 (ROCm 5.5+). The API surface changed between ROCm 5.x and 6.x. Should we add explicit runtime version detection and raise a clear error for unsupported versions?

  5. Platform guard. Should libamdsmi.py include an explicit sys.platform != 'linux' guard at module load, or is the current approach (init fails silently on non-Linux) sufficient?

  6. __getattr__ permissiveness. AmdPhysicalDevice.__getattr__ returns a callable yielding NA for any unrecognized attribute. This prevents crashes but may mask programming errors. Would the maintainer prefer a stricter allowlist?
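For context on question 6, the permissive NA-fallback pattern can be sketched as follows (the class name and NA sentinel value here are illustrative, not the actual nvitop identifiers):

```python
NA = 'N/A'  # stand-in for nvitop's NA sentinel


class PermissiveDevice:
    """Illustrates the permissive __getattr__ discussed in question 6."""

    def fan_speed(self):
        return 42  # a genuinely implemented query

    def __getattr__(self, name):
        # __getattr__ only fires for attributes normal lookup cannot find,
        # so any unknown name resolves to a callable returning NA. A typo
        # like dev.fan_sped() silently yields NA instead of raising.
        def _unsupported(*args, **kwargs):
            return NA
        return _unsupported


dev = PermissiveDevice()
print(dev.fan_speed())  # → 42
print(dev.fan_sped())   # typo: no AttributeError; prints N/A
```

The upside is that no query ever crashes the TUI; the downside is exactly the masking concern raised above.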


Testing plan

  • Unit tests using unittest.mock.patch to mock amdsmi module calls, covering enumeration, the ImportError path, process listing, and oneshot() caching.
  • Integration smoke test: get_all_devices() on an NVIDIA-only system (no amdsmi) returns only NVIDIA devices with no errors.
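As a sketch of the mock-based approach, a fake amdsmi module can be injected into sys.modules so no AMD hardware or library is needed; the function under test here is a toy stand-in, and amdsmi_get_processor_handles is the real amdsmi enumeration call being mocked:

```python
import sys
import unittest
from unittest import mock


def amd_device_count():
    """Toy version of the enumeration under test: imports amdsmi lazily."""
    try:
        import amdsmi
    except ImportError:
        return 0
    return len(amdsmi.amdsmi_get_processor_handles())


class AmdEnumerationTest(unittest.TestCase):
    def test_count_with_mocked_amdsmi(self):
        # Inject a fake amdsmi module exposing two processor handles.
        fake = mock.MagicMock()
        fake.amdsmi_get_processor_handles.return_value = [object(), object()]
        with mock.patch.dict(sys.modules, {'amdsmi': fake}):
            self.assertEqual(amd_device_count(), 2)

    def test_import_error_path(self):
        # A None entry in sys.modules makes `import amdsmi` raise ImportError,
        # simulating a system without the library installed.
        with mock.patch.dict(sys.modules, {'amdsmi': None}):
            self.assertEqual(amd_device_count(), 0)
```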

Question for the maintainer: Is mock-based unit testing (no real AMD hardware required) acceptable?


@jamesETsmith commented on GitHub (May 4, 2026):

@XuehaiPan just a nudge on this

Reference: github-starred/nvitop#119