Fix incorrect memory reporting on coherent UMA platforms (GB10 / DGX Spark)

On GB10 / DGX Spark, nvmlDeviceGetMemoryInfo returns NVML_SUCCESS with total == system MemTotal (~121GB). As a result, nvitop displays the full system RAM as GPU memory instead of the memory that is actually allocatable.

Fix: detect UMA by comparing the NVML total against the system virtual memory total. If the NVML total is >= 90% of system RAM, treat the device as unified memory and use system virtual memory (MemAvailable) for display instead. Existing behavior for discrete GPUs is preserved.

Note: this fix has not been independently validated on a coherent UMA system and still requires validation on GB10 / DGX Spark hardware.
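The detection rule described in this message can be sketched as a standalone predicate. This is a minimal illustration, not the code in nvitop/api/device.py: the function name `looks_like_unified_memory` and its plain-integer arguments are hypothetical, standing in for `memory_info.total` and `host.virtual_memory().total` in the patch.

```python
GiB = 1024**3


def looks_like_unified_memory(nvml_total: int, system_total: int) -> bool:
    """Hypothetical helper mirroring the 90% heuristic from the patch.

    nvml_total:   total bytes reported by NVML for the device
    system_total: total bytes of system RAM
    """
    if nvml_total <= 0:
        # NVML reported no usable total: the pre-existing unified-memory path.
        return True
    if system_total <= 0:
        # System total unknown: stay on the discrete-GPU path.
        return False
    # Integer arithmetic, evaluated as (system_total * 9) // 10, as in the patch.
    return nvml_total >= system_total * 9 // 10


# NVML total equal to system RAM (the GB10 symptom) is classified as unified.
print(looks_like_unified_memory(121 * GiB, 121 * GiB))
# A 24 GiB discrete GPU on a 128 GiB host keeps the NVML-reported values.
print(looks_like_unified_memory(24 * GiB, 128 * GiB))
```

One consequence of a pure ratio test is that a discrete GPU whose memory is at least 90% of host RAM (for example, a large-memory card in a host with little RAM) would also be classified as unified, so the threshold is a trade-off rather than a definitive platform check.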
2026-05-15 14:15:55 -06:00 · 2026-04-16 04:10:25 -04:00 · de09aeb9f0
commit de09aeb9f0
parent a6761eb5c4
1 changed file with 15 additions and 7 deletions
--- a/nvitop/api/device.py
+++ b/nvitop/api/device.py
@@ -985,13 +985,21 @@ class Device:  # pylint: disable=too-many-instance-attributes,too-many-public-me
                memory_info = NA
            if libnvml.nvmlCheckReturn(memory_info):
                if memory_info.total > 0:
-                    return MemoryInfo(
-                        total=memory_info.total,
-                        free=memory_info.free,
-                        used=memory_info.used,
-                        reserved=getattr(memory_info, 'reserved', NA),
-                    )
-                has_unified_memory = True
+                    # Detect coherent UMA platforms (e.g. GB10 Grace Blackwell):
+                    # nvmlDeviceGetMemoryInfo returns NVML_SUCCESS with total == system MemTotal (~121GB).
+                    # If total >= 90% of system RAM, treat as unified memory and use MemAvailable instead.
+                    vm = host.virtual_memory()
+                    if vm.total > 0 and memory_info.total >= vm.total * 9 // 10:
+                        has_unified_memory = True
+                    else:
+                        return MemoryInfo(
+                            total=memory_info.total,
+                            free=memory_info.free,
+                            used=memory_info.used,
+                            reserved=getattr(memory_info, 'reserved', NA),
+                        )
+                else:
+                    has_unified_memory = True
            if has_unified_memory:
                # Device with unified memory
                # Use system virtual memory as these devices share host memory