mirror of
https://github.com/XuehaiPan/nvitop.git
synced 2026-05-15 06:06:12 -06:00
[GH-ISSUE #211] [RFC] AMD GPU Support #119
Labels
No labels
api
bug
bug
cli / tui
dependencies
documentation
documentation
documentation
duplicate
enhancement
exporter
invalid
pull-request
pynvml
question
question
upstream
wontfix
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference: github-starred/nvitop#119
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Originally created by @jamesETsmith on GitHub (Apr 29, 2026).
Original GitHub issue: https://github.com/XuehaiPan/nvitop/issues/211
RFC: AMD GPU Support for nvitop
Branch (Option A — unified Device):
feature/amd_implBranch (Option B — explicit API, recommended):
feature/amd_impl_option_bAuthor: James Smith
Problem statement
First off, thanks for making
nvitop. I think it's a great tool! I work a lot with AMD GPUs and would love to be able to usenvitoprather than another GPU monitor. The AMD/ROCm ecosystem is moving toward apip installstyle for its tools with theRock and I thinknvitopwould fit nicely into that stack.This RFC proposes adding AMD GPU support and I propose two different approaches to add support, which are already implemented in the branches listed above. Let me know what you think!
Constraints
amdsmimust be an optional dependency — never required on systems without AMD hardware.libamdsmi.py,amd_device.py) are shared between both options. The question is only how they connect to the rest of nvitop.The design question
The CLI and TUI today call
Device.count(),Device.all(), andDevice.driver_version()to enumerate devices. How should they gain awareness of AMD devices?Option A: Unified
DeviceclassBranch:
feature/amd_implDiff vs upstream: 10 files, +1743 / -37 lines;
device.pyhas 8 hunks (+195 / -28)Device.count(),Device.from_indices(),Device.driver_version(), andDevice.is_available()are modified to be AMD-aware. Existing call sites require no changes.Strengths
Device.all()just works.Weaknesses
device.pyis the most sensitive file in the codebase. Reviewers must reason about whether the NVIDIA path is unchanged._safe_nvml_querysuppresses logging during probing; this may hide legitimate NVML errors on mixed systems.Device.all()now returns a heterogeneous list (PhysicalDevice+AmdPhysicalDevice) — a silent API contract change that may break downstream code assuming homogeneous types.driver_version()falling back from "NVIDIA display driver version" to "amdgpu kernel module version" is not a semantically equivalent substitution.N..N+M-1for AMD devices means AMD GPU indices shift if NVIDIA GPUs are added or removed.Option B: Explicit AMD-aware entry points (recommended)
Branch:
feature/amd_impl_option_bDiff vs upstream: 9 files, +1607 / -9 lines;
device.pyhas zero diffdevice.pyis left completely untouched. Four new functions are added tonvitop/api/__init__.py. The CLI and TUI are updated to call these.Strengths
device.pyhas zero diff — the core NVIDIA API contract is completely untouched.libamdsmi.pyandamd_device.pycould land first as a no-op, with the integration wired in a follow-up.Weaknesses
Device.all()andget_all_devices()coexist — the split is intentional but will confuse new users.Device.from_indices()to show a unified device list, so the TUI layer and API layer have different semantics for "all devices."Open questions for the maintainer
Which integration option is preferred? Option A (transparent unification via
Device) or Option B (explicitget_all_devices()API)? Or is there a third approach you'd prefer?AmdPhysicalDeviceinitialization. The current implementation bypassesDevice.__init__because that call invokes NVML. Is the maintainer comfortable with this for a v1 PR, or is refactoringDevice.__init__to separate NVML initialization from attribute setup a prerequisite for merge?Long-term backend abstraction. The architecturally cleanest design would be a proper backend abstraction (
NvidiaBackend,AmdBackend) withDeviceas a vendor-agnostic base class. This is a significant refactor we would prefer to coordinate on before investing effort. Is this a direction the project wants to pursue?amdsmiversion range. The implementation targetsamdsmi >= 5.5.0(ROCm 5.5+). The API surface changed between ROCm 5.x and 6.x. Should we add explicit runtime version detection and raise a clear error for unsupported versions?Platform guard. Should
libamdsmi.pyinclude an explicitsys.platform != 'linux'guard at module load, or is the current approach (init fails silently on non-Linux) sufficient?__getattr__permissiveness.AmdPhysicalDevice.__getattr__returns a callable yieldingNAfor any unrecognized attribute. This prevents crashes but may mask programming errors. Would the maintainer prefer a stricter allowlist?Testing plan
unittest.mock.patchto mockamdsmimodule calls, covering enumeration, theImportErrorpath, process listing, andoneshot()caching.get_all_devices()on an NVIDIA-only system (noamdsmi) returns only NVIDIA devices with no errors.Question for the maintainer: Is mock-based unit testing (no real AMD hardware required) acceptable?
@jamesETsmith commented on GitHub (May 4, 2026):
@XuehaiPan just a nudge on this