[GH-ISSUE #29] [Enhancement] Backward compatible NVML Python bindings #21

Closed
opened 2026-05-05 03:22:13 -06:00 by gitea-mirror · 1 comment

Originally created by @XuehaiPan on GitHub (Jul 23, 2022).
Original GitHub issue: https://github.com/XuehaiPan/nvitop/issues/29

Originally assigned to: @XuehaiPan on GitHub.

Runtime Environment

  • Operating system and version: Ubuntu 20.04 LTS
  • Terminal emulator and version: GNOME Terminal 3.36.2
  • Python version: 3.9.13
  • NVML version (driver version): 470.129.06
  • nvitop version or commit: v0.7.1
  • nvidia-ml-py version: 11.450.51
  • Locale: en_US.UTF-8

Context

The official NVML Python bindings (PyPI package nvidia-ml-py, https://pypi.org/project/nvidia-ml-py/) do not guarantee backward compatibility across NVIDIA driver versions. For example, NVML added nvmlDeviceGetComputeRunningProcesses_v2 and nvmlDeviceGetGraphicsRunningProcesses_v2 in CUDA 11.x drivers (R450+), but the package nvidia-ml-py unconditionally calls the latest versioned function from the unversioned one:

def nvmlDeviceGetComputeRunningProcesses_v2(handle):
    # first call to get the size
    c_count = c_uint(0)
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v2")
    ret = fn(handle, byref(c_count), None)

    ...

def nvmlDeviceGetComputeRunningProcesses(handle):
    return nvmlDeviceGetComputeRunningProcesses_v2(handle)

This raises an NVMLError_FunctionNotFound error on CUDA 10.x drivers (e.g. R430).
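
Since the failure happens deep inside the bindings, a caller can only catch it after the fact. A minimal sketch against the public pynvml API:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    processes = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
except pynvml.NVMLError_FunctionNotFound:
    # Reached on CUDA 10.x drivers (e.g. R430): the wrapper requested the
    # 'nvmlDeviceGetComputeRunningProcesses_v2' symbol, which that
    # libnvidia-ml.so does not export.
    processes = []
pynvml.nvmlShutdown()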

Now v3 versions of the nvmlDeviceGet{Compute,Graphics,MPSCompute}RunningProcesses functions ship with the R510+ drivers. E.g., in nvidia-ml-py==11.515.48:

def nvmlDeviceGetComputeRunningProcesses_v3(handle):
    # first call to get the size
    c_count = c_uint(0)
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v3")
    ret = fn(handle, byref(c_count), None)

    ...

def nvmlDeviceGetComputeRunningProcesses(handle):
    return nvmlDeviceGetComputeRunningProcesses_v3(handle)
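
Instead of hard-coding the newest suffix, the bindings could resolve the newest symbol actually exported by the driver at runtime. A sketch built on pynvml's internal _nvmlGetFunctionPointer (the helper name _resolve_versioned is hypothetical):

from pynvml import NVMLError_FunctionNotFound, _nvmlGetFunctionPointer

def _resolve_versioned(name, suffixes=('_v3', '_v2', '')):
    # Try the newest versioned symbol first and fall back to older ones.
    # NOTE: assumes nvmlInit() has already been called.
    last_error = None
    for suffix in suffixes:
        try:
            return _nvmlGetFunctionPointer(name + suffix)
        except NVMLError_FunctionNotFound as ex:
            last_error = ex
    raise last_error

fn = _resolve_versioned('nvmlDeviceGetComputeRunningProcesses')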

A v2 variant of the memory-info structure (c_nvmlMemory_v2_t) is also appearing on the horizon (not found in the R510 driver yet); the same kind of incompatibility is what caused issue #13.

class c_nvmlMemory_t(_PrintableStructure):
    _fields_ = [
        ('total', c_ulonglong),
        ('free', c_ulonglong),
        ('used', c_ulonglong),
    ]
    _fmt_ = {'<default>': "%d B"}

class c_nvmlMemory_v2_t(_PrintableStructure):
    _fields_ = [
        ('version', c_uint),
        ('total', c_ulonglong),
        ('reserved', c_ulonglong),
        ('free', c_ulonglong),
        ('used', c_ulonglong),
    ]
    _fmt_ = {'<default>': "%d B"}

nvmlMemory_v2 = 0x02000028
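
The new constant appears to follow NVML's NVML_STRUCT_VERSION convention, packing the structure version into the high byte and the structure size into the low bytes:

import ctypes

# 2 << 24 == 0x02000000, and ctypes.sizeof(c_nvmlMemory_v2_t) == 40 == 0x28
# (the 4-byte 'version' field is padded to 8, plus four 8-byte counters).
assert nvmlMemory_v2 == (2 << 24) | ctypes.sizeof(c_nvmlMemory_v2_t)
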
def nvmlDeviceGetMemoryInfo(handle, version=None):
    if not version:
        c_memory = c_nvmlMemory_t()
        fn = _nvmlGetFunctionPointer("nvmlDeviceGetMemoryInfo")
    else:
        c_memory = c_nvmlMemory_v2_t()
        c_memory.version = version
        fn = _nvmlGetFunctionPointer("nvmlDeviceGetMemoryInfo_v2")
    ret = fn(handle, byref(c_memory))
    _nvmlCheckReturn(ret)
    return c_memory
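
With such an API, a caller has to opt into the v2 structure explicitly and fall back on older drivers. A sketch assuming the binding quoted above:

try:
    memory = nvmlDeviceGetMemoryInfo(handle, version=nvmlMemory_v2)
except NVMLError_FunctionNotFound:
    # The driver predates nvmlDeviceGetMemoryInfo_v2: fall back to the
    # v1 structure, which lacks the 'reserved' field.
    memory = nvmlDeviceGetMemoryInfo(handle)
print(memory.total, memory.free, memory.used)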

Possible Solutions

  1. Determine the best dependency version of nvidia-ml-py during installation.

    This requires the user to install the NVIDIA driver before installing nvitop, which may not hold on a freshly installed system. Besides, it is hard to express such a driver dependency in the package metadata.

  2. Wait for the PyPI package nvidia-ml-py (https://pypi.org/project/nvidia-ml-py/) to become backward compatible.

    The package NVIDIA/go-nvml (https://github.com/NVIDIA/go-nvml) offers backward compatible APIs:

    The API is designed to be backwards compatible, so the latest bindings should work with any version of libnvidia-ml.so installed on your system.

    I posted this to the NVIDIA developer forums ([PyPI/nvidia-ml-py] Issue Reports for nvidia-ml-py: https://forums.developer.nvidia.com/t/pypi-nvidia-ml-py-issue-reports-for-nvidia-ml-py/196506) but have not received any official response yet.

  3. Vendor nvidia-ml-py in nvitop. (Note: nvidia-ml-py is released under the BSD License.)

    This requires bumping the vendored version and making a minor release of nvitop each time a new version of nvidia-ml-py comes out.

  4. Automatically patch the pynvml module when the first call to a versioned API fails. This can be achieved by manipulating the module's __dict__ attribute or the module.__class__ attribute; a minimal sketch follows this list.

    The goal of this solution is not to produce fully backward-compatible Python bindings; that may be out of the scope of nvitop (e.g. the ExcludedDeviceInfo -> BlacklistDeviceInfo rename). Also, note that this solution may cause performance issues due to a much deeper call stack.
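
A minimal sketch of solution 4, assuming a nvidia-ml-py release that defines the _v2 name, rebinding through the module __dict__ (the wrapper name is hypothetical):

import pynvml

# Candidate implementations, newest first. The '_v3' name only exists in
# newer nvidia-ml-py releases, hence the hasattr() guard.
_CANDIDATES = [
    getattr(pynvml, name)
    for name in ('nvmlDeviceGetComputeRunningProcesses_v3',
                 'nvmlDeviceGetComputeRunningProcesses_v2')
    if hasattr(pynvml, name)
]

def _patched_compute_running_processes(handle):
    # On NVMLError_FunctionNotFound, rebind the module-level name to the
    # next older implementation so that subsequent calls skip the failed
    # lookup. A last-resort implementation against the unversioned C
    # symbol would still be needed for pre-R450 drivers.
    for candidate in _CANDIDATES:
        try:
            result = candidate(handle)
        except pynvml.NVMLError_FunctionNotFound:
            continue
        pynvml.__dict__['nvmlDeviceGetComputeRunningProcesses'] = candidate
        return result
    raise pynvml.NVMLError(pynvml.NVML_ERROR_FUNCTION_NOT_FOUND)

pynvml.__dict__['nvmlDeviceGetComputeRunningProcesses'] = _patched_compute_running_processes

Because the name is rebound after the first successful call, the fast path carries no extra stack frames afterwards, which addresses the call-stack concern above.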

gitea-mirror 2026-05-05 03:22:13 -06:00

@wookayin commented on GitHub (Oct 17, 2022):

This is a great job. gpustat will have a conflicting dependency on nvidia-ml-py, as it is still pinned to older versions, so I will also have to catch up to make them compatible.
