[GH-ISSUE #193] [BUG] NVML cannot get device memory info for NVIDIA DGX Spark due to unified memory #117
Originally created by @FlorinAndrei on GitHub (Nov 20, 2025).
Original GitHub issue: https://github.com/XuehaiPan/nvitop/issues/193
Originally assigned to: @XuehaiPan on GitHub.
Required prerequisites

- nvitop version: 1.6.0
- Operating system and version: Ubuntu 24.04 LTS
- NVIDIA driver version: 580.95.05
- Python environment: 3.12.3 (main, Aug 14 2025, 17:47:21) [GCC 13.3.0] linux; nvidia-ml-py==13.580.82; nvitop==1.6.0
Problem description
On the NVIDIA DGX Spark, the app seems to work fine except for memory usage. The Spark has unified memory, which is a little unusual; `nvidia-smi` itself has issues with reporting memory.

I just wanted to raise awareness of this issue. Feel free to convert it to a discussion instead.
Steps to Reproduce
The Python snippets (if any):
Command lines:
Traceback
Logs
Expected behavior
No response
Additional context
No response
@FlorinAndrei commented on GitHub (Nov 20, 2025):
I should add: system memory does seem to be reported correctly. And that's all the memory this system has, because this is unified memory, like on a Mac. It's just the GPU MEM that's shown as N/A.
I'm not sure what's the best strategy here.
One idea would be to copy system memory metrics to the GPU memory graphs, just duplicate that info.
Another idea is something I heard `nvtop` may have in the code now, though not yet in a release: they take the total system memory, subtract the memory used by CPU-only processes, and that difference becomes the "total memory" available to GPU processes. So on the Spark, the total GPU memory becomes variable. They then sum the memory usage of the GPU processes and display it relative to that new "total GPU memory". A rough sketch of the idea follows.

The unified memory paradigm changed everything for these tools. Assumptions made in the past are no longer always valid.
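A rough sketch of that idea in Python (my illustration, not nvtop's actual C implementation; it assumes psutil is available and that `gpu_pids` holds the PIDs NVML reports as GPU processes):

```python
# Rough sketch of the nvtop-style dynamic total described above.
# Assumptions: psutil is installed; `gpu_pids` holds the PIDs that NVML
# reports as GPU processes. Illustration only, not nvtop's code.
import psutil


def dynamic_gpu_memory_total(gpu_pids: set[int]) -> int:
    """Total system memory minus the RSS of CPU-only processes."""
    total = psutil.virtual_memory().total
    cpu_only_rss = 0
    for proc in psutil.process_iter(['pid', 'memory_info']):
        mem = proc.info['memory_info']
        if mem is None or proc.info['pid'] in gpu_pids:
            continue  # skip GPU processes and processes we cannot inspect
        cpu_only_rss += mem.rss
    return total - cpu_only_rss
```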
@MaxwellDPS commented on GitHub (Nov 29, 2025):
This was fixed in nvtop like so, if it helps: https://github.com/Syllo/nvtop/pull/411/files
@thewh1teagle commented on GitHub (Nov 30, 2025):
I experience the same issue with the DGX Spark. How can I install the new version with the patch? Thanks.
@MaxwellDPS commented on GitHub (Dec 1, 2025):
@XuehaiPan I've got some changes on this fork that fix it by calculating based on the total memory, though it's still kind of messy:
https://github.com/MaxwellDPS/nvitop
The exporter image is built and working: ghcr.io/cha0s-corp/nvitop-exporter:latest
@XuehaiPan commented on GitHub (Dec 1, 2025):
Hi everyone, thanks for the information! Before we patch in unified memory support, I want to confirm some details.
- `device.memory_total()`: Instead of deriving a dynamic total GPU memory by subtracting the memory used by CPU processes, I'd like to show the fixed total system memory, i.e., `host.virtual_memory().total`.
- `device.memory_used()`: There can be two approaches; the second is `host.virtual_memory().used`. I prefer the second approach because we can get the available memory with a simple subtraction: `available = total - used`.
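A minimal sketch of that proposal (illustrative only; it calls psutil directly, which nvitop's `host` module wraps):

```python
# Minimal sketch of the proposed fixed-total fallback for unified memory.
# Illustrative only; nvitop's `host.virtual_memory()` wraps psutil.
import psutil

vm = psutil.virtual_memory()
memory_total = vm.total      # fixed total: the whole system memory
memory_used = vm.used        # host-side used memory
memory_available = memory_total - memory_used
print(memory_total, memory_used, memory_available)
```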
I need some more information to distinguish whether the GPU is lost or whether it is a Spark device that uses unified memory. I'm wondering if someone could help run the following snippet in a Python REPL:
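(The original snippet was not preserved here; a minimal probe along these lines would exercise the call in question, assuming the `pynvml` bindings from nvidia-ml-py:)

```python
# Probe how NVML reports device memory on this machine.
# Assumption: this approximates the snippet requested above; it is not
# necessarily the exact one from the original comment.
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    memory_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print('total:', memory_info.total,
          'used:', memory_info.used,
          'free:', memory_info.free)
except pynvml.NVMLError as ex:
    print('NVML error:', ex)
finally:
    pynvml.nvmlShutdown()
```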
If I understand correctly, for unified memory devices, `nvmlDeviceGetMemoryInfo(handle)` will return `NVML_SUCCESS` with `memory_info.total == 0`. Is that right?

I implemented the above patch in PR #195 based on my current understanding of unified memory. You can try it via:
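(The exact command was not preserved here; given the uvx usage mentioned later in the thread, it was presumably of this form, where the branch name `unified-memory` is my guess:)

```bash
# Hypothetical reconstruction; the branch name is a guess.
uvx --from 'git+https://github.com/XuehaiPan/nvitop.git@unified-memory' nvitop
```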
I haven't tested it because I do not have the resources. You are welcome to share any console output or screenshot for the patch. Thanks in advance!
@FlorinAndrei commented on GitHub (Dec 1, 2025):
I've tried `Device(1)`, etc., but I get errors.

@FlorinAndrei commented on GitHub (Dec 1, 2025):
This is with your unified memory branch:
This is with the latest nvitop release:
This is htop:
This is nvidia-smi:
Let me know what else I should run.
@XuehaiPan commented on GitHub (Dec 2, 2025):
Based on the output in https://github.com/XuehaiPan/nvitop/issues/193#issuecomment-3597911469
@FlorinAndrei Thanks for the information! It seems this assumption does not always hold for unified memory devices (where `NVML_ERROR_NOT_SUPPORTED` is returned instead). I have updated the implementation in PR #195. Hope that works.
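A sketch of the two-case detection this implies (my reading of the thread, not the exact code in PR #195):

```python
# Detect unified-memory devices from the two NVML behaviors reported above.
# Sketch based on this thread, not the exact code in PR #195.
import pynvml


def is_unified_memory_device(handle) -> bool:
    try:
        memory_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    except pynvml.NVMLError_NotSupported:
        # Some unified-memory devices raise NVML_ERROR_NOT_SUPPORTED.
        return True
    # Others return NVML_SUCCESS with a zero total.
    return memory_info.total == 0
```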
@FlorinAndrei commented on GitHub (Dec 2, 2025):
@XuehaiPan commented on GitHub (Dec 2, 2025):
@FlorinAndrei Updated.
@FlorinAndrei commented on GitHub (Dec 2, 2025):
nvitop new branch:
htop:
nvidia-smi:
@XuehaiPan commented on GitHub (Dec 2, 2025):
Thanks @FlorinAndrei. Looks good so far, except that the memory bandwidth utilization might always return `0`. I would like to hold the PR for several days to gather more feedback.

@FlorinAndrei commented on GitHub (Dec 2, 2025):
@XuehaiPan thanks a lot for your excellent work!
Regarding the one field that shows N/A: it seems it's not memory bandwidth but rather the memory clock. Deriving the actual bandwidth from the clock is a bit tricky and seems to depend on the GPU (bus width, various multipliers, etc.).

I think that field should be renamed in the app to show that it's actually a clock frequency: the frequency of the memory clock. It's measured in MHz, which is a unit of frequency. Call it MCK or something.
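For reference, querying that clock through NVML looks roughly like this (a sketch, assuming the pynvml bindings; on devices that don't report it, the call fails):

```python
# Query the memory clock (in MHz) via NVML; sketch, assuming pynvml.
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    mem_clock_mhz = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_MEM)
    print(f'memory clock: {mem_clock_mhz} MHz')
except pynvml.NVMLError_NotSupported:
    # On devices like the Spark, the unified memory runs at a fixed
    # clock that NVML does not report, so tools show N/A here.
    print('memory clock: N/A')
finally:
    pynvml.nvmlShutdown()
```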
Regardless, on the Spark the unified memory always operates at a fixed clock, so `nvidia-smi` does not report it. Here are some outputs from various tools on the Spark:

@FlorinAndrei commented on GitHub (Dec 3, 2025):
Let me reiterate: the branch via uvx is super useful. I just started another round of evals, and it's very nice to have all the metrics on a single screen.
I've added nvitop to my toolbox. It's great!
@XuehaiPan commented on GitHub (Dec 8, 2025):
The patch is released in the latest version. You can try it with:
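(The command itself was not preserved here; upgrading to the latest release from PyPI would be, for example:)

```bash
pip3 install --upgrade nvitop
```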