[PR #79] [MERGED] fix(api/libnvml): fix process info support for NVIDIA R535 driver (CUDA 12.2+) #150

Closed
opened 2026-05-05 03:26:49 -06:00 by gitea-mirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/XuehaiPan/nvitop/pull/79
Author: @XuehaiPan
Created: 7/14/2023
Status: Merged
Merged: 7/16/2023
Merged by: @XuehaiPan

Base: main ← Head: fix-r535-driver


📝 Commits (9)

  • ecb23a6 fix(api/libnvml): fix process info support for NVIDIA R535 driver
  • 788a1fb docs(api/libnvml): add comments for type struct fields
  • 29b047c feat(api/process): set used_gpu_cc_protected_memor for GpuProcess
  • 090cd6b docs(api/libnvml): update docstrings
  • ce46b3a chore(cli): remove unreachable warnings
  • 3486b45 style(api/libnvml): update private function name
  • 727a432 style(api/process): update method name
  • db9fb6c chore(api/process): add usedGpuCcProtectedMemory to process snapshot
  • 9c7545f deps(nvidia-ml-py): add nvidia-ml-py 12.535.77 to support list

📊 Changes

10 files changed (+425 additions, -295 deletions)

View changed files

📝 .pre-commit-config.yaml (+3 -3)
📝 docs/source/spelling_wordlist.txt (+5 -0)
📝 nvitop/api/device.py (+16 -7)
📝 nvitop/api/libnvml.py (+351 -245)
📝 nvitop/api/process.py (+38 -7)
📝 nvitop/api/utils.py (+8 -0)
📝 nvitop/cli.py (+1 -31)
📝 nvitop/version.py (+1 -0)
📝 pyproject.toml (+1 -1)
📝 requirements.txt (+1 -1)

📄 Description

Issue Type

  • Bug fix

Description

Starting with the NVIDIA R510 driver, new version 3 APIs were added for nvmlDeviceGet{Compute,Graphics,MPSCompute}RunningProcesses. However, the version 3 functions still take the version 2 type struct as their argument type:

class c_nvmlProcessInfo_v2_t(pynvml._PrintableStructure):
    _fields_ = [
        ('pid', ctypes.c_uint),
        ('usedGpuMemory', ctypes.c_ulonglong),
        ('gpuInstanceId', ctypes.c_uint),
        ('computeInstanceId', ctypes.c_uint),
    ]
    _fmt_ = {
        'usedGpuMemory': '%d B',
    }

Recently, the NVIDIA R535 driver was released. Its version 3 APIs now use the new version 3 type struct without a version bump, so callers that still pass the version 2 struct hit invalid memory access and get wrong results.

class c_nvmlProcessInfo_v3_t(pynvml._PrintableStructure):
    _fields_ = [
        ('pid', ctypes.c_uint),
        ('usedGpuMemory', ctypes.c_ulonglong),
        ('gpuInstanceId', ctypes.c_uint),
        ('computeInstanceId', ctypes.c_uint),
        ('usedGpuCcProtectedMemory', ctypes.c_ulonglong),
    ]
    _fmt_ = {
        'usedGpuMemory': '%d B',
        'usedGpuCcProtectedMemory': '%d B',
    }

The two type structs have different sizes:

>>> ctypes.sizeof(libnvml.c_nvmlProcessInfo_v2_t)
24
>>> ctypes.sizeof(libnvml.c_nvmlProcessInfo_v3_t)
32
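
To illustrate why the size difference matters, here is a minimal sketch that mirrors the two layouts on plain `ctypes.Structure` classes. The names `ProcessInfoV2` and `ProcessInfoV3` are hypothetical stand-ins for the real types, and the 24/32 byte sizes assume a typical 64-bit platform where `c_ulonglong` is 8-byte aligned. It reinterprets a buffer of version 3 records as version 2 records to show how the second entry gets misread:

```python
import ctypes


class ProcessInfoV2(ctypes.Structure):
    # Hypothetical stand-in with the same fields as c_nvmlProcessInfo_v2_t.
    _fields_ = [
        ('pid', ctypes.c_uint),
        ('usedGpuMemory', ctypes.c_ulonglong),
        ('gpuInstanceId', ctypes.c_uint),
        ('computeInstanceId', ctypes.c_uint),
    ]


class ProcessInfoV3(ctypes.Structure):
    # Hypothetical stand-in with the same fields as c_nvmlProcessInfo_v3_t.
    _fields_ = [
        ('pid', ctypes.c_uint),
        ('usedGpuMemory', ctypes.c_ulonglong),
        ('gpuInstanceId', ctypes.c_uint),
        ('computeInstanceId', ctypes.c_uint),
        ('usedGpuCcProtectedMemory', ctypes.c_ulonglong),
    ]


# Pretend the driver filled the caller's buffer with two version 3 records.
written = (ProcessInfoV3 * 2)(
    ProcessInfoV3(pid=1111, usedGpuMemory=100, usedGpuCcProtectedMemory=0),
    ProcessInfoV3(pid=2222, usedGpuMemory=200, usedGpuCcProtectedMemory=0),
)

# Read the same bytes back with the version 2 stride (24 bytes instead of 32).
misread = (ProcessInfoV2 * 2).from_buffer(written)

print(ctypes.sizeof(ProcessInfoV2), ctypes.sizeof(ProcessInfoV3))
print(misread[0].pid)  # the first record still lines up
print(misread[1].pid)  # the second record starts 8 bytes too early
```

The first record decodes correctly because both layouts share a prefix, but every later record is read at the wrong offset. In the opposite direction, a driver writing 32-byte records into a buffer sized for 24-byte records would write past the end of the allocation.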

This PR adds a helper function that determines the API version and type struct version of nvmlDeviceGet{Compute,Graphics,MPSCompute}RunningProcesses on the first API call.
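
The general shape of such a lazily evaluated helper might look like the sketch below. This is not the code from this PR: the function name, the injected `has_symbol` probe, and the use of the driver major version as the deciding signal are all assumptions made for illustration, and the actual helper in `nvitop/api/libnvml.py` may detect the struct version differently.

```python
import threading
from typing import Callable, Optional, Tuple

_lock = threading.Lock()
_cached: Optional[Tuple[str, str]] = None  # (function suffix, struct type name)


def determine_running_processes_version(
    has_symbol: Callable[[str], bool],  # hypothetical probe for an NVML symbol
    driver_major: int,                  # hypothetical signal, e.g. 535
) -> Tuple[str, str]:
    """Return (function suffix, struct type name), probing only once."""
    global _cached
    if _cached is None:
        with _lock:
            if _cached is None:  # re-check after acquiring the lock
                if has_symbol('nvmlDeviceGetComputeRunningProcesses_v3'):
                    # Assumption: R535+ pairs the v3 functions with the v3 struct.
                    if driver_major >= 535:
                        _cached = ('_v3', 'c_nvmlProcessInfo_v3_t')
                    else:
                        _cached = ('_v3', 'c_nvmlProcessInfo_v2_t')
                else:
                    # Assumption: fall back to the v2 functions and v2 struct.
                    _cached = ('_v2', 'c_nvmlProcessInfo_v2_t')
    return _cached


# Usage with a fake probe that records how often it is consulted.
calls = []


def fake_has_symbol(name: str) -> bool:
    calls.append(name)
    return True


result = determine_running_processes_version(fake_has_symbol, driver_major=535)
again = determine_running_processes_version(fake_has_symbol, driver_major=535)
print(result, len(calls))
```

Caching the decision behind a lock keeps the probe off the hot path, since process lists are typically queried repeatedly for every device.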

Motivation and Context

Fixes #75
Fixes #76


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

Reference: github-starred/nvitop#150