[PR #208] Fix incorrect memory reporting on coherent UMA platforms (GB10 / DGX … #212

Open

opened 2026-05-05 03:27:54 -06:00 by gitea-mirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/XuehaiPan/nvitop/pull/208
Author: @parallelArchitect
Created: 4/16/2026
Status: 🔄 Open

Base: main ← Head: fix/gb10-coherent-uma-memory-reporting

📝 Commits (2)

de09aeb Fix incorrect memory reporting on coherent UMA platforms (GB10 / DGX Spark)
2ca5797 fix: replace UMA acronym in comment to pass spell check

📊 Changes

1 file changed (+15 additions, -7 deletions)

📝 nvitop/api/device.py (+15 -7)

📄 Description

Fix incorrect memory reporting on coherent UMA platforms (GB10 / DGX Spark)

On GB10 / DGX Spark, nvmlDeviceGetMemoryInfo returns NVML_SUCCESS with total equal to system MemTotal (~121 GB). As a result, nvitop displays the full system RAM as GPU memory instead of the memory that is actually allocatable.

The existing NVMLError_NotSupported path correctly handles some UMA platforms, but GB10 returns NVML_SUCCESS rather than NOT_SUPPORTED, so it falls through to the discrete GPU path and displays the wrong values.
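
The fall-through described above can be illustrated with a small, hardware-free sketch. It is not nvitop's code: `NotSupported`, `select_path`, and the three query functions are hypothetical stand-ins, and the 24 GiB discrete value is made up for illustration. It only shows why a check keyed on the NOT_SUPPORTED error cannot catch a platform that reports success.

```python
GIB = 1024**3


class NotSupported(Exception):
    """Hypothetical stand-in for NVMLError_NotSupported."""


def select_path(query):
    """Pick a handling path the way the pre-fix flow is described.

    Only a NOT_SUPPORTED error triggers the unified-memory path; any
    successful return is treated as a discrete GPU reading.
    """
    try:
        total = query()
    except NotSupported:
        return "uma-fallback", None
    return "discrete", total


def older_uma_query():
    # A UMA platform that reports NOT_SUPPORTED (handled correctly).
    raise NotSupported()


def gb10_query():
    # GB10 as described: success, with total roughly equal to MemTotal.
    return 121 * GIB


def discrete_query():
    # An ordinary discrete GPU (illustrative size).
    return 24 * GIB
```

With these stand-ins, `select_path(gb10_query)` lands on the discrete path carrying the full system total, which is the misreport this PR addresses.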

Issue Type

Bug fix

Description

Detect coherent UMA by comparing the NVML-reported total against the system virtual memory total. If the NVML total is >= 90% of system RAM, classify the device as unified memory and use system virtual memory (MemAvailable) for display instead.

Preserves existing behavior for discrete GPUs.
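
A minimal sketch of the heuristic described above, written as a pure function so it can be checked without GB10 hardware. This is not the diff from the PR: `MemoryReading`, `classify_memory`, and the parameter names are hypothetical. The mapping of host figures onto total / used / free in the unified case is an assumption, since the description only states that MemAvailable is used for display.

```python
from typing import NamedTuple

UMA_THRESHOLD_PERCENT = 90  # threshold stated in the PR description


class MemoryReading(NamedTuple):
    total: int
    used: int
    free: int
    unified: bool


def classify_memory(nvml_total, nvml_used, host_total, host_available,
                    threshold_percent=UMA_THRESHOLD_PERCENT):
    """Classify a device as unified or discrete memory (all values in bytes).

    If the NVML total is at least `threshold_percent` of host RAM, treat the
    device as coherent UMA and report host-based figures instead.
    """
    # Integer comparison avoids floating-point rounding at the boundary.
    if host_total > 0 and nvml_total * 100 >= threshold_percent * host_total:
        # Assumption: free = MemAvailable, used = host total - MemAvailable.
        used = max(host_total - host_available, 0)
        return MemoryReading(host_total, used, host_available, True)
    # Discrete GPU: keep the NVML-reported values unchanged.
    return MemoryReading(nvml_total, nvml_used, nvml_total - nvml_used, False)
```

In a real integration the host figures would presumably come from a system memory query such as psutil's `virtual_memory()`; a threshold-based check like this trades precision for simplicity, which is one reason validation on actual GB10 hardware matters.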

Motivation and Context

Same root cause documented and fixed in:

nvtop PR: https://github.com/Syllo/nvtop/pull/463
btop PR: https://github.com/aristocratos/btop/pull/1611
NVML shim workaround: https://github.com/parallelArchitect/nvml-unified-shim

Note

Requires validation on GB10 / DGX Spark hardware. The fix has not been independently validated on a coherent UMA system.

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
