mirror of
https://github.com/XuehaiPan/nvitop.git
synced 2026-05-15 06:06:12 -06:00
[GH-ISSUE #181] [BUG] PID out of range due to API change of NVIDIA R525 driver #110
Originally created by @xieshuaix on GitHub (Aug 22, 2025).
Original GitHub issue: https://github.com/XuehaiPan/nvitop/issues/181
Originally assigned to: @XuehaiPan on GitHub.
Required prerequisites
What version of nvitop are you using?
1.5.3
Operating system and version
Ubuntu 20.04.5 LTS (Focal Fossa)
NVIDIA driver version
525.125.06
NVIDIA-SMI
Python environment
3.10.0 | packaged by conda-forge | (default, Nov 20 2021, 02:24:10) [GCC 9.4.0] linux
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py @ file:///home/conda/feedstock_root/build_artifacts/nvidia-ml-py_1746576379096/work
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.9.86
nvidia-nvtx-cu12==12.1.105
nvitop @ file:///home/conda/feedstock_root/build_artifacts/nvitop_1755346934447/work
onnxruntime-gpu==1.19.0
Problem description
In my case, this bug occurs when I use supervisord to launch a system-level service that runs some GPU code and then run nvitop; killing the processes launched with supervisord resolves the problem.
Steps to Reproduce
In my environment, this can be stably reproduced by using supervisord to launch a script that runs GPU code.
I am not sure whether this can be reproduced on other platforms.
Traceback
Logs
Expected behavior
The exception should be handled gracefully so that nvitop keeps running, ignoring the processes that cause the exception.
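For reference, the graceful degradation requested here can be sketched as follows. This is a hypothetical wrapper, not nvitop's actual implementation; the exception class and the failing query function below are stand-ins for pynvml.NVMLError_FunctionNotFound and a real NVML call:

```python
# Hypothetical sketch: swallow "function not found" failures from the driver
# and fall back to a default, instead of letting the exception crash the TUI.

class FunctionNotFound(Exception):
    """Stand-in for pynvml.NVMLError_FunctionNotFound."""

def safe_query(func, *args, default=None):
    """Run an NVML-style query; return `default` when the driver lacks the call."""
    try:
        return func(*args)
    except FunctionNotFound:
        return default

def unsupported_query():
    """Simulates a query the driver library does not export."""
    raise FunctionNotFound('Function Not Found')

# The UI would keep running with an empty process list instead of crashing.
processes = safe_query(unsupported_query, default=[])
print(processes)  # -> []
```

With a wrapper like this, a missing driver symbol degrades to an empty process panel rather than a traceback over the screen.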
Additional context
I am using nvitop in JupyterLab in a Docker container.
@XuehaiPan commented on GitHub (Aug 22, 2025):
Hi @xieshuaix, thanks for the report. Could you paste all the content of the log? Then we can investigate. The log seems to be missing the process info patching part.
@XuehaiPan commented on GitHub (Aug 22, 2025):
Also, the nvidia-smi output shows that it also failed to gather the process information (an empty process panel, while it also does not show "No running processes"). I suspect upgrading your NVIDIA driver will resolve the issue.
@xieshuaix commented on GitHub (Aug 22, 2025):
This is all the log output I got before nvitop partially shows up and the exception stack trace is printed, which messes up everything on screen.

If there is another way to write the debug log to a separate file, I can try that.
@XuehaiPan commented on GitHub (Aug 22, 2025):
@xieshuaix You can find an nvitop.log file in your cwd when you run nvitop:
@xieshuaix commented on GitHub (Aug 22, 2025):
(Log attached in a collapsed "Details" block.)
@XuehaiPan commented on GitHub (Aug 22, 2025):
@xieshuaix Could you try to change the default value of __get_running_processes_version_suffix from None to '_v3' in "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/api/libnvml.py:611"?
5d434b8987/nvitop/api/libnvml.py (L611-L612)
I highly suspect it is a driver issue.
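The version-suffix mechanism under discussion can be illustrated with a small sketch. The suffix list and the lookup helper below are illustrative stand-ins, not nvitop's real internals: the idea is to try the newest versioned symbol name first and fall back to older variants when the driver library does not export it.

```python
# Hypothetical sketch of versioned-symbol fallback: try the newest suffixed
# name ('..._v3') first, then older suffixes, then the bare name.

class DummyDriver:
    """Simulates a driver library that only exports the _v2 variant."""
    def nvmlDeviceGetComputeRunningProcesses_v2(self):
        return ['pid 1234']

def resolve_versioned(lib, base_name, suffixes=('_v3', '_v2', '')):
    """Return the first available variant of `base_name` on `lib`."""
    for suffix in suffixes:
        fn = getattr(lib, base_name + suffix, None)
        if fn is not None:
            return fn
    raise AttributeError(f'no variant of {base_name} found')

fn = resolve_versioned(DummyDriver(), 'nvmlDeviceGetComputeRunningProcesses')
print(fn())  # -> ['pid 1234']
```

Forcing the suffix to '_v3', as suggested above, skips this probing and pins the lookup to one symbol, which is why it works only when the driver library actually exports the _v3 variant.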
@xieshuaix commented on GitHub (Aug 22, 2025):
Got these logs:
[ERROR] 2025-08-22 13:28:31,599 nvitop.api.libnvml::nvmlQuery: ERROR: A FunctionNotFound error occurred while calling nvmlQuery(<function nvmlDeviceGetComputeRunningProcesses at 0x7f0e1b0b6170>, *args, **kwargs).
Please verify whether the nvidia-ml-py package is compatible with your NVIDIA driver version.
Traceback (most recent call last):
File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/pynvml.py", line 1076, in _nvmlGetFunctionPointer
_nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/ctypes/__init__.py", line 387, in __getattr__
func = self.__getitem__(name)
File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/ctypes/__init__.py", line 392, in __getitem__
func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /home/opt/gpuproxy/lib64/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetComputeRunningProcessesv3
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/api/libnvml.py", line 446, in nvmlQuery
retval = func(*args, **kwargs) # type: ignore[operator]
File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/api/libnvml.py", line 728, in nvmlDeviceGetComputeRunningProcesses
return __nvml_device_get_running_processes('nvmlDeviceGetComputeRunningProcesses', handle)
File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/api/libnvml.py", line 674, in __nvml_device_get_running_processes
fn = _nvmlGetFunctionPointer(f'{func}{version_suffix}')
File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/pynvml.py", line 1079, in _nvmlGetFunctionPointer
raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
pynvml.NVMLError_FunctionNotFound: Function Not Found
[ERROR] 2025-08-22 13:28:31,600 nvitop.api.libnvml::nvmlQuery: ERROR: A FunctionNotFound error occurred while calling nvmlQuery(<function nvmlDeviceGetGraphicsRunningProcesses at 0x7f0e1b0b6320>, *args, **kwargs).
Please verify whether the nvidia-ml-py package is compatible with your NVIDIA driver version.
Traceback (most recent call last):
File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/pynvml.py", line 1076, in _nvmlGetFunctionPointer
_nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/ctypes/__init__.py", line 387, in __getattr__
func = self.__getitem__(name)
File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/ctypes/__init__.py", line 392, in __getitem__
func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /home/opt/gpuproxy/lib64/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetGraphicsRunningProcessesv3
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/api/libnvml.py", line 446, in nvmlQuery
retval = func(*args, **kwargs) # type: ignore[operator]
File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/api/libnvml.py", line 752, in nvmlDeviceGetGraphicsRunningProcesses
return __nvml_device_get_running_processes('nvmlDeviceGetGraphicsRunningProcesses', handle)
File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/api/libnvml.py", line 674, in __nvml_device_get_running_processes
fn = _nvmlGetFunctionPointer(f'{func}{version_suffix}')
File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/pynvml.py", line 1079, in _nvmlGetFunctionPointer
raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
pynvml.NVMLError_FunctionNotFound: Function Not Found
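The "undefined symbol" errors in the tracebacks above can be checked directly with ctypes, independently of nvitop. The sketch below assumes a Linux system; it demonstrates the check against the C library (which every process links in), and the commented lines show how the same helper would be pointed at the driver library path from the traceback:

```python
import ctypes

def has_symbol(lib, name):
    """Return True if the loaded shared library exports `name`.

    ctypes raises AttributeError on the first access to a missing symbol,
    which is exactly the failure mode seen in the tracebacks above.
    """
    try:
        getattr(lib, name)
        return True
    except AttributeError:
        return False

# Demonstration against the C library on Linux.
libc = ctypes.CDLL(None)
print(has_symbol(libc, 'printf'))                                   # -> True
print(has_symbol(libc, 'nvmlDeviceGetComputeRunningProcesses_v3'))  # -> False

# For the actual driver check, load the NVML library instead, e.g.:
#   nvml = ctypes.CDLL('libnvidia-ml.so.1')
#   has_symbol(nvml, 'nvmlDeviceGetComputeRunningProcesses_v3')
```

A False result for the _v3 symbol on the library at /home/opt/gpuproxy/lib64/libnvidia-ml.so.1 would confirm that the loaded libnvidia-ml.so.1 is too old (or is a proxy shim) for the v3 API that nvidia-ml-py expects.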
@XuehaiPan commented on GitHub (Aug 22, 2025):
@xieshuaix Sorry, my fault. It should be
'_v3'instead of'v3'. Could you try it again? Thanks!@xieshuaix commented on GitHub (Aug 22, 2025):
That works.
@XuehaiPan commented on GitHub (Aug 22, 2025):
@xieshuaix Thanks for the information. I will try to find a fix for this. In the meantime, the simplest fix is to upgrade your NVIDIA driver.