[GH-ISSUE #181] [BUG] PID out of range due to API change of NVIDIA R525 driver #110

Open
opened 2026-05-05 03:25:46 -06:00 by gitea-mirror · 10 comments

Originally created by @xieshuaix on GitHub (Aug 22, 2025).
Original GitHub issue: https://github.com/XuehaiPan/nvitop/issues/181

Originally assigned to: @XuehaiPan on GitHub.

Required prerequisites

  • I have read the documentation https://nvitop.readthedocs.io.
  • I have searched the Issue Tracker that this hasn't already been reported. (comment there if it has.)
  • I have tried the latest version of nvitop in a new isolated virtual environment.

What version of nvitop are you using?

1.5.3

Operating system and version

Ubuntu 20.04.5 LTS (Focal Fossa)

NVIDIA driver version

525.125.06

NVIDIA-SMI

Fri Aug 22 12:46:15 2025       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A800-SXM...  On   | 00000000:AD:00.0 Off |                    0 |
| N/A   60C    P0   265W / 400W |  78988MiB / 81920MiB |     87%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A800-SXM...  On   | 00000000:B1:00.0 Off |                    0 |
| N/A   38C    P0   170W / 400W |  41012MiB / 81920MiB |     88%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A800-SXM...  On   | 00000000:D0:00.0 Off |                    0 |
| N/A   40C    P0   179W / 400W |  42182MiB / 81920MiB |     86%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A800-SXM...  On   | 00000000:D3:00.0 Off |                    0 |
| N/A   45C    P0   171W / 400W |  41012MiB / 81920MiB |     88%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Python environment

3.10.0 | packaged by conda-forge | (default, Nov 20 2021, 02:24:10) [GCC 9.4.0] linux
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py @ file:///home/conda/feedstock_root/build_artifacts/nvidia-ml-py_1746576379096/work
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.9.86
nvidia-nvtx-cu12==12.1.105
nvitop @ file:///home/conda/feedstock_root/build_artifacts/nvitop_1755346934447/work
onnxruntime-gpu==1.19.0

Problem description

In my case, this bug occurs when I use supervisord to launch a system-level service that runs some GPU code and then run nvitop. Killing the processes launched by supervisord resolves the problem.

Steps to Reproduce

In my environment, this can be reliably reproduced by using supervisord to launch a script that runs GPU code.
I am not sure whether this can be reproduced on other platforms.

Traceback

Traceback (most recent call last):
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/psutil/__init__.py", line 327, in _init
    _psplatform.cext.check_pid_range(pid)
OverflowError: signed integer is greater than maximum

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniforge3/envs/xs_stepfun/bin/nvitop", line 10, in <module>
    sys.exit(main())
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/cli.py", line 382, in main
    tui.print()
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/tui/tui.py", line 235, in print
    self.main_screen.print()
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/tui/screens/main/__init__.py", line 191, in print
    print_width = min(panel.print_width() for panel in self.container)
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/tui/screens/main/__init__.py", line 191, in <genexpr>
    print_width = min(panel.print_width() for panel in self.container)
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/tui/screens/main/panels/process.py", line 658, in print_width
    self.ensure_snapshots()
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/tui/screens/main/panels/process.py", line 356, in ensure_snapshots
    self.snapshots = self.take_snapshots()
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/api/caching.py", line 220, in wrapped
    result = func(*args, **kwargs)
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/tui/screens/main/panels/process.py", line 360, in take_snapshots
    snapshots = GpuProcess.take_snapshots(self.processes, failsafe=True)
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/tui/screens/main/panels/process.py", line 409, in processes
    return list(
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/tui/screens/main/panels/process.py", line 410, in <genexpr>
    itertools.chain.from_iterable(device.processes().values() for device in self.devices),  # type: ignore[misc]
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/api/device.py", line 2271, in processes
    proc = processes[p.pid] = self.GPU_PROCESS_CLASS(
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/tui/library/process.py", line 29, in __new__
    instance = super().__new__(cls, *args, **kwargs)
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/api/process.py", line 486, in __new__
    instance._host = HostProcess(pid)
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/api/process.py", line 213, in __new__
    host.Process._init(instance, pid, True)
  File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/psutil/__init__.py", line 330, in _init
    raise NoSuchProcess(pid, msg=msg) from err
psutil.NoSuchProcess: process PID out of range (pid=2529165312)
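
The PID reported in the last line can be checked with a few lines of arithmetic. The sketch below uses only the value from the traceback. The reading suggested in the comments (that a memory-size field may have been interpreted as a PID) is an assumption and is not confirmed by the traceback itself.

```python
# Value copied from the last line of the traceback.
bogus_pid = 2529165312

# Largest value that fits in a signed 32-bit C int; the OverflowError
# message suggests the PID was being converted to such a type.
INT32_MAX = 2**31 - 1

exceeds_int32 = bogus_pid > INT32_MAX   # too large for a signed 32-bit int
fits_uint32 = bogus_pid < 2**32         # still fits in an unsigned 32-bit int

# The value is an exact multiple of 1 MiB, which is unusual for a PID
# and would be more typical of a byte count (assumption, see above).
mib, remainder = divmod(bogus_pid, 2**20)

print(exceeds_int32, fits_uint32, mib, remainder)
```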

Logs

[DEBUG] 2025-08-22 12:50:10,605 nvitop.api.libnvml::_nvmlLookupFunctionPointer: Found symbol `nvmlDeviceGetMemoryInfo_v2`.
[DEBUG] 2025-08-22 12:50:10,605 nvitop.api.libnvml::__determine_get_memory_info_version_suffix: NVML get memory info version 2 is available.
[DEBUG] 2025-08-22 12:50:10,611 nvitop.api.libnvml::_nvmlLookupFunctionPointer: Failed to found symbol `nvmlDeviceGetTemperatureV`.
[DEBUG] 2025-08-22 12:50:10,611 nvitop.api.libnvml::__determine_get_temperature_version_suffix: NVML get temperature version 1 API is not available due to incompatible NVIDIA driver. Fallback to use NVML get temperature API without version.

Expected behavior

The exception is handled gracefully and nvitop keeps running, ignoring the processes that cause the exception.
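
The requested behavior could look roughly like the sketch below: processes whose PID is rejected are skipped and recorded instead of aborting the whole listing. This is a minimal illustration and not nvitop's implementation. `NoSuchProcess`, `fake_host_process`, and `collect_processes` are hypothetical stand-ins defined here so the example runs without any third-party package.

```python
INT32_MAX = 2**31 - 1


class NoSuchProcess(Exception):
    """Hypothetical stand-in for the error raised for an invalid PID."""


def fake_host_process(pid):
    """Hypothetical stand-in for a host-process constructor."""
    if not 0 <= pid <= INT32_MAX:
        raise NoSuchProcess(f"process PID out of range (pid={pid})")
    return f"<process {pid}>"


def collect_processes(pids, factory=fake_host_process):
    """Build a {pid: handle} table, skipping PIDs the factory rejects."""
    processes = {}
    skipped = []
    for pid in pids:
        try:
            processes[pid] = factory(pid)
        except NoSuchProcess:
            # Remember the offending PID and keep going with the rest.
            skipped.append(pid)
    return processes, skipped


# The middle PID is the out-of-range value from the traceback.
processes, skipped = collect_processes([4321, 2529165312, 8765])
print(sorted(processes), skipped)
```

Returning the skipped PIDs alongside the table keeps the failure visible (for example, for a debug log) while the remaining processes are still displayed.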

Additional context

I am using nvitop in JupyterLab inside a Docker container.

gitea-mirror added the pynvml, api, bug, upstream labels 2026-05-05 03:25:46 -06:00

@XuehaiPan commented on GitHub (Aug 22, 2025):

Hi @xieshuaix, thanks for the report. Could you paste all the content of the log? Then we can investigate. The log seems to be missing the process info patching part.

> Logs
>
> [DEBUG] 2025-08-22 12:50:10,605 nvitop.api.libnvml::_nvmlLookupFunctionPointer: Found symbol `nvmlDeviceGetMemoryInfo_v2`.
> [DEBUG] 2025-08-22 12:50:10,605 nvitop.api.libnvml::__determine_get_memory_info_version_suffix: NVML get memory info version 2 is available.
> [DEBUG] 2025-08-22 12:50:10,611 nvitop.api.libnvml::_nvmlLookupFunctionPointer: Failed to found symbol `nvmlDeviceGetTemperatureV`.
> [DEBUG] 2025-08-22 12:50:10,611 nvitop.api.libnvml::__determine_get_temperature_version_suffix: NVML get temperature version 1 API is not available due to incompatible NVIDIA driver. Fallback to use NVML get temperature API without version.

@XuehaiPan commented on GitHub (Aug 22, 2025):

Also, the nvidia-smi output shows that it also failed to gather the process information: the process panel is empty, yet it does not show "No running processes" either. I suspect upgrading your NVIDIA driver will resolve the issue.


@xieshuaix commented on GitHub (Aug 22, 2025):

> Hi @xieshuaix, thanks for the report. Could you paste all the content of the log? Then we can investigate. The log seems to be missing the process info patching part.
>
> > Logs
> >
> > [DEBUG] 2025-08-22 12:50:10,605 nvitop.api.libnvml::_nvmlLookupFunctionPointer: Found symbol `nvmlDeviceGetMemoryInfo_v2`.
> > [DEBUG] 2025-08-22 12:50:10,605 nvitop.api.libnvml::__determine_get_memory_info_version_suffix: NVML get memory info version 2 is available.
> > [DEBUG] 2025-08-22 12:50:10,611 nvitop.api.libnvml::_nvmlLookupFunctionPointer: Failed to found symbol `nvmlDeviceGetTemperatureV`.
> > [DEBUG] 2025-08-22 12:50:10,611 nvitop.api.libnvml::__determine_get_temperature_version_suffix: NVML get temperature version 1 API is not available due to incompatible NVIDIA driver. Fallback to use NVML get temperature API without version.

These are all the logs I got before nvitop partially shows up and the exception stack trace is printed, which garbles everything on screen.
If there is another way to write the debug log to a separate file, I can try it.
Image


@XuehaiPan commented on GitHub (Aug 22, 2025):

> if there is another way to write debug log to a separate file I can try.

@xieshuaix You can find a nvitop.log file in your cwd when you run nvitop:

PYTHONFAULTHANDLER=1 LOGLEVEL=DEBUG nvitop -1
cat nvitop.log

@xieshuaix commented on GitHub (Aug 22, 2025):

> cat nvitop.log

Details
[DEBUG] 2025-08-22 12:48:43,371 nvitop.api.libnvml::_nvmlLookupFunctionPointer: Found symbol `nvmlDeviceGetMemoryInfo_v2`.
[DEBUG] 2025-08-22 12:48:43,371 nvitop.api.libnvml::__determine_get_memory_info_version_suffix: NVML get memory info version 2 is available.
[DEBUG] 2025-08-22 12:48:44,276 nvitop.api.libnvml::_nvmlLookupFunctionPointer: Failed to found symbol `nvmlDeviceGetTemperatureV`.
[DEBUG] 2025-08-22 12:48:44,276 nvitop.api.libnvml::__determine_get_temperature_version_suffix: NVML get temperature version 1 API is not available due to incompatible NVIDIA driver. Fallback to use NVML get temperature API without version.
[DEBUG] 2025-08-22 12:48:44,366 nvitop.api.libnvml::_nvmlLookupFunctionPointer: Found symbol `nvmlDeviceGetComputeRunningProcesses_v3`.
[DEBUG] 2025-08-22 12:48:44,367 nvitop.api.libnvml::_nvmlLookupFunctionPointer: Failed to found symbol `nvmlDeviceGetConfComputeMemSizeInfo`.
[DEBUG] 2025-08-22 12:48:44,367 nvitop.api.libnvml::__determine_get_running_processes_version_suffix: NVML get running process version 3 API with v3 type struct is not available due to incompatible NVIDIA driver. Fallback to use get running process version 3 API with v2 type struct.
[DEBUG] 2025-08-22 12:49:03,983 nvitop.api.libnvml::_nvmlLookupFunctionPointer: Found symbol `nvmlDeviceGetMemoryInfo_v2`.
[DEBUG] 2025-08-22 12:49:03,983 nvitop.api.libnvml::__determine_get_memory_info_version_suffix: NVML get memory info version 2 is available.
[DEBUG] 2025-08-22 12:49:03,988 nvitop.api.libnvml::_nvmlLookupFunctionPointer: Failed to found symbol `nvmlDeviceGetTemperatureV`.
[DEBUG] 2025-08-22 12:49:03,988 nvitop.api.libnvml::__determine_get_temperature_version_suffix: NVML get temperature version 1 API is not available due to incompatible NVIDIA driver. Fallback to use NVML get temperature API without version.
[DEBUG] 2025-08-22 12:49:04,590 nvitop.api.libnvml::_nvmlLookupFunctionPointer: Found symbol `nvmlDeviceGetComputeRunningProcesses_v3`.
[DEBUG] 2025-08-22 12:49:04,591 nvitop.api.libnvml::_nvmlLookupFunctionPointer: Failed to found symbol `nvmlDeviceGetConfComputeMemSizeInfo`.
[DEBUG] 2025-08-22 12:49:04,591 nvitop.api.libnvml::__determine_get_running_processes_version_suffix: NVML get running process version 3 API with v3 type struct is not available due to incompatible NVIDIA driver. Fallback to use get running process version 3 API with v2 type struct.
[DEBUG] 2025-08-22 12:49:07,250 nvitop.api.libnvml::_nvmlLookupFunctionPointer: Found symbol `nvmlDeviceGetMemoryInfo_v2`.
[DEBUG] 2025-08-22 12:49:07,250 nvitop.api.libnvml::__determine_get_memory_info_version_suffix: NVML get memory info version 2 is available.
[DEBUG] 2025-08-22 12:49:07,256 nvitop.api.libnvml::_nvmlLookupFunctionPointer: Failed to found symbol `nvmlDeviceGetTemperatureV`.
[DEBUG] 2025-08-22 12:49:07,256 nvitop.api.libnvml::__determine_get_temperature_version_suffix: NVML get temperature version 1 API is not available due to incompatible NVIDIA driver. Fallback to use NVML get temperature API without version.
[DEBUG] 2025-08-22 12:49:07,775 nvitop.api.libnvml::_nvmlLookupFunctionPointer: Found symbol `nvmlDeviceGetComputeRunningProcesses_v3`.
[DEBUG] 2025-08-22 12:49:07,776 nvitop.api.libnvml::_nvmlLookupFunctionPointer: Failed to found symbol `nvmlDeviceGetConfComputeMemSizeInfo`.
[DEBUG] 2025-08-22 12:49:07,776 nvitop.api.libnvml::__determine_get_running_processes_version_suffix: NVML get running process version 3 API with v3 type struct is not available due to incompatible NVIDIA driver. Fallback to use get running process version 3 API with v2 type struct.
[DEBUG] 2025-08-22 12:49:58,821 nvitop.api.libnvml::_nvmlLookupFunctionPointer: Found symbol `nvmlDeviceGetMemoryInfo_v2`.
[DEBUG] 2025-08-22 12:49:58,821 nvitop.api.libnvml::__determine_get_memory_info_version_suffix: NVML get memory info version 2 is available.
[DEBUG] 2025-08-22 12:49:58,832 nvitop.api.libnvml::_nvmlLookupFunctionPointer: Failed to found symbol `nvmlDeviceGetTemperatureV`.
[DEBUG] 2025-08-22 12:49:58,832 nvitop.api.libnvml::__determine_get_temperature_version_suffix: NVML get temperature version 1 API is not available due to incompatible NVIDIA driver. Fallback to use NVML get temperature API without version.
[DEBUG] 2025-08-22 12:49:59,379 nvitop.api.libnvml::_nvmlLookupFunctionPointer: Found symbol `nvmlDeviceGetComputeRunningProcesses_v3`.
[DEBUG] 2025-08-22 12:49:59,380 nvitop.api.libnvml::_nvmlLookupFunctionPointer: Failed to found symbol `nvmlDeviceGetConfComputeMemSizeInfo`.
[DEBUG] 2025-08-22 12:49:59,380 nvitop.api.libnvml::__determine_get_running_processes_version_suffix: NVML get running process version 3 API with v3 type struct is not available due to incompatible NVIDIA driver. Fallback to use get running process version 3 API with v2 type struct.
[DEBUG] 2025-08-22 12:50:06,926 nvitop.api.libnvml::_nvmlLookupFunctionPointer: Found symbol `nvmlDeviceGetMemoryInfo_v2`.
[DEBUG] 2025-08-22 12:50:06,927 nvitop.api.libnvml::__determine_get_memory_info_version_suffix: NVML get memory info version 2 is available.
[DEBUG] 2025-08-22 12:50:06,931 nvitop.api.libnvml::_nvmlLookupFunctionPointer: Failed to found symbol `nvmlDeviceGetTemperatureV`.
[DEBUG] 2025-08-22 12:50:06,931 nvitop.api.libnvml::__determine_get_temperature_version_suffix: NVML get temperature version 1 API is not available due to incompatible NVIDIA driver. Fallback to use NVML get temperature API without version.
[DEBUG] 2025-08-22 12:50:06,999 nvitop.api.libnvml::_nvmlLookupFunctionPointer: Found symbol `nvmlDeviceGetComputeRunningProcesses_v3`.
[DEBUG] 2025-08-22 12:50:07,001 nvitop.api.libnvml::_nvmlLookupFunctionPointer: Failed to found symbol `nvmlDeviceGetConfComputeMemSizeInfo`.
[DEBUG] 2025-08-22 12:50:07,001 nvitop.api.libnvml::__determine_get_running_processes_version_suffix: NVML get running process version 3 API with v3 type struct is not available due to incompatible NVIDIA driver. Fallback to use get running process version 3 API with v2 type struct.
[DEBUG] 2025-08-22 12:50:10,605 nvitop.api.libnvml::_nvmlLookupFunctionPointer: Found symbol `nvmlDeviceGetMemoryInfo_v2`.
[DEBUG] 2025-08-22 12:50:10,605 nvitop.api.libnvml::__determine_get_memory_info_version_suffix: NVML get memory info version 2 is available.
[DEBUG] 2025-08-22 12:50:10,611 nvitop.api.libnvml::_nvmlLookupFunctionPointer: Failed to found symbol `nvmlDeviceGetTemperatureV`.
[DEBUG] 2025-08-22 12:50:10,611 nvitop.api.libnvml::__determine_get_temperature_version_suffix: NVML get temperature version 1 API is not available due to incompatible NVIDIA driver. Fallback to use NVML get temperature API without version.
[DEBUG] 2025-08-22 12:50:11,155 nvitop.api.libnvml::_nvmlLookupFunctionPointer: Found symbol `nvmlDeviceGetComputeRunningProcesses_v3`.
[DEBUG] 2025-08-22 12:50:11,157 nvitop.api.libnvml::_nvmlLookupFunctionPointer: Failed to found symbol `nvmlDeviceGetConfComputeMemSizeInfo`.
[DEBUG] 2025-08-22 12:50:11,157 nvitop.api.libnvml::__determine_get_running_processes_version_suffix: NVML get running process version 3 API with v3 type struct is not available due to incompatible NVIDIA driver. Fallback to use get running process version 3 API with v2 type struct.
[DEBUG] 2025-08-22 13:13:08,205 nvitop.api.libnvml::_nvmlLookupFunctionPointer: Found symbol `nvmlDeviceGetMemoryInfo_v2`.
[DEBUG] 2025-08-22 13:13:08,205 nvitop.api.libnvml::__determine_get_memory_info_version_suffix: NVML get memory info version 2 is available.
[DEBUG] 2025-08-22 13:13:08,209 nvitop.api.libnvml::_nvmlLookupFunctionPointer: Failed to found symbol `nvmlDeviceGetTemperatureV`.
[DEBUG] 2025-08-22 13:13:08,209 nvitop.api.libnvml::__determine_get_temperature_version_suffix: NVML get temperature version 1 API is not available due to incompatible NVIDIA driver. Fallback to use NVML get temperature API without version.
[DEBUG] 2025-08-22 13:13:08,783 nvitop.api.libnvml::_nvmlLookupFunctionPointer: Found symbol `nvmlDeviceGetComputeRunningProcesses_v3`.
[DEBUG] 2025-08-22 13:13:08,783 nvitop.api.libnvml::_nvmlLookupFunctionPointer: Failed to found symbol `nvmlDeviceGetConfComputeMemSizeInfo`.
[DEBUG] 2025-08-22 13:13:08,783 nvitop.api.libnvml::__determine_get_running_processes_version_suffix: NVML get running process version 3 API with v3 type struct is not available due to incompatible NVIDIA driver. Fallback to use get running process version 3 API with v2 type struct.
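
The log mentions a "v2 type struct" and a "v3 type struct" for the running-process API. As a generic illustration of why a mismatch between two such record layouts can yield an implausible PID, the sketch below writes records in a longer layout and reads them back with a shorter one. `RecordV2` and `RecordV3` are hypothetical layouts invented for this example, not NVML's actual definitions, and the log does not confirm that this is what happens on the reporter's machine. The value 2412 MiB is chosen because 2412 * 2**20 equals 2529165312, the PID shown in the traceback.

```python
import ctypes


class RecordV2(ctypes.LittleEndianStructure):
    """Hypothetical older record layout (24 bytes)."""
    _fields_ = [
        ("pid", ctypes.c_uint32),
        ("_pad", ctypes.c_uint32),  # explicit padding keeps the layout fixed
        ("used_memory", ctypes.c_uint64),
        ("gpu_instance_id", ctypes.c_uint32),
        ("compute_instance_id", ctypes.c_uint32),
    ]


class RecordV3(ctypes.LittleEndianStructure):
    """Hypothetical newer layout: same fields plus one trailing field (32 bytes)."""
    _fields_ = [
        ("pid", ctypes.c_uint32),
        ("_pad", ctypes.c_uint32),
        ("used_memory", ctypes.c_uint64),
        ("gpu_instance_id", ctypes.c_uint32),
        ("compute_instance_id", ctypes.c_uint32),
        ("extra_memory", ctypes.c_uint64),
    ]


# The "writer" fills two records using the longer layout.
written = (RecordV3 * 2)()
written[0].pid = 1111
written[0].extra_memory = 2412 * 2**20  # a byte count, not a PID
written[1].pid = 2222

# The "reader" interprets the same bytes with the shorter layout, so its
# second record starts 24 bytes in, inside the writer's first record.
misread = (RecordV2 * 2).from_buffer_copy(bytes(written))

print(misread[0].pid, misread[1].pid)
```

The first record is read correctly because both layouts agree on their leading fields. Every later record is shifted, so the reader's second `pid` comes from the writer's trailing memory field.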
[DEBUG] 2025-08-22 13:13:08,209 nvitop.api.libnvml::__determine_get_temperature_version_suffix: NVML get temperature version 1 API is not available due to incompatible NVIDIA driver. Fallback to use NVML get temperature API without version. [DEBUG] 2025-08-22 13:13:08,783 nvitop.api.libnvml::_nvmlLookupFunctionPointer: Found symbol `nvmlDeviceGetComputeRunningProcesses_v3`. [DEBUG] 2025-08-22 13:13:08,783 nvitop.api.libnvml::_nvmlLookupFunctionPointer: Failed to found symbol `nvmlDeviceGetConfComputeMemSizeInfo`. [DEBUG] 2025-08-22 13:13:08,783 nvitop.api.libnvml::__determine_get_running_processes_version_suffix: NVML get running process version 3 API with v3 type struct is not available due to incompatible NVIDIA driver. Fallback to use get running process version 3 API with v2 type struct. ``` </details>

@XuehaiPan commented on GitHub (Aug 22, 2025):

@xieshuaix Could you try changing the default value of `__get_running_processes_version_suffix` from `None` to `'_v3'` in "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/api/libnvml.py:611"?

https://github.com/XuehaiPan/nvitop/blob/5d434b89872780196b543f4f7554ac37753192eb/nvitop/api/libnvml.py#L611-L612

I highly suspect it is a driver issue.


@xieshuaix commented on GitHub (Aug 22, 2025):

> @xieshuaix Could you try to change the default value of `__get_running_processes_version_suffix` from `None` to `'v3'` in "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/api/libnvml.py:611"?
>
> nvitop/nvitop/api/libnvml.py
>
> Lines 611 to 612 in 5d434b8
>
>     __get_running_processes_version_suffix: str | None = None
>     c_nvmlProcessInfo_t = c_nvmlProcessInfo_v3_t
>
> I highly suspect it is a driver issue.

Got these logs:

[ERROR] 2025-08-22 13:28:31,599 nvitop.api.libnvml::nvmlQuery: ERROR: A FunctionNotFound error occurred while calling nvmlQuery(<function nvmlDeviceGetComputeRunningProcesses at 0x7f0e1b0b6170>, *args, **kwargs).
Please verify whether the nvidia-ml-py package is compatible with your NVIDIA driver version.
Traceback (most recent call last):
File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/pynvml.py", line 1076, in _nvmlGetFunctionPointer
_nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/ctypes/__init__.py", line 387, in __getattr__
func = self.__getitem__(name)
File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/ctypes/__init__.py", line 392, in __getitem__
func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /home/opt/gpuproxy/lib64/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetComputeRunningProcessesv3

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/api/libnvml.py", line 446, in nvmlQuery
retval = func(*args, **kwargs) # type: ignore[operator]
File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/api/libnvml.py", line 728, in nvmlDeviceGetComputeRunningProcesses
return __nvml_device_get_running_processes('nvmlDeviceGetComputeRunningProcesses', handle)
File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/api/libnvml.py", line 674, in __nvml_device_get_running_processes
fn = _nvmlGetFunctionPointer(f'{func}{version_suffix}')
File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/pynvml.py", line 1079, in _nvmlGetFunctionPointer
raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
pynvml.NVMLError_FunctionNotFound: Function Not Found
[ERROR] 2025-08-22 13:28:31,600 nvitop.api.libnvml::nvmlQuery: ERROR: A FunctionNotFound error occurred while calling nvmlQuery(<function nvmlDeviceGetGraphicsRunningProcesses at 0x7f0e1b0b6320>, *args, **kwargs).
Please verify whether the nvidia-ml-py package is compatible with your NVIDIA driver version.
Traceback (most recent call last):
File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/pynvml.py", line 1076, in _nvmlGetFunctionPointer
_nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/ctypes/__init__.py", line 387, in __getattr__
func = self.__getitem__(name)
File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/ctypes/__init__.py", line 392, in __getitem__
func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /home/opt/gpuproxy/lib64/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetGraphicsRunningProcessesv3

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/api/libnvml.py", line 446, in nvmlQuery
retval = func(*args, **kwargs) # type: ignore[operator]
File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/api/libnvml.py", line 752, in nvmlDeviceGetGraphicsRunningProcesses
return __nvml_device_get_running_processes('nvmlDeviceGetGraphicsRunningProcesses', handle)
File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/nvitop/api/libnvml.py", line 674, in __nvml_device_get_running_processes
fn = _nvmlGetFunctionPointer(f'{func}{version_suffix}')
File "/root/miniforge3/envs/xs_stepfun/lib/python3.10/site-packages/pynvml.py", line 1079, in _nvmlGetFunctionPointer
raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
pynvml.NVMLError_FunctionNotFound: Function Not Found


@XuehaiPan commented on GitHub (Aug 22, 2025):

@xieshuaix Sorry, my fault. It should be '_v3' instead of 'v3'. Could you try it again? Thanks!
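
For anyone following along, here is a minimal sketch of why the missing underscore produced the `undefined symbol` errors above. The traceback shows the symbol name being built as `f'{func}{version_suffix}'`, so the suffix has to include the underscore itself. `FakeNvmlLib` and `lookup_symbol` are hypothetical stand-ins written for illustration; they are not part of nvitop or pynvml, and no real driver library is loaded.

```python
class FakeNvmlLib:
    """Hypothetical stand-in for the loaded libnvidia-ml library.

    It exposes only the symbol that the debug log reports as found.
    """

    def nvmlDeviceGetComputeRunningProcesses_v3(self):
        return 'v3 entry point'


def lookup_symbol(lib, base_name, version_suffix):
    """Return the symbol named base_name + version_suffix, or None if undefined.

    An undefined symbol surfaces as AttributeError on attribute access;
    getattr with a default turns that into None.
    """
    return getattr(lib, f'{base_name}{version_suffix}', None)


lib = FakeNvmlLib()
base = 'nvmlDeviceGetComputeRunningProcesses'

# 'v3' builds "nvmlDeviceGetComputeRunningProcessesv3", which does not exist.
missing = lookup_symbol(lib, base, 'v3')

# '_v3' builds "nvmlDeviceGetComputeRunningProcesses_v3", which does.
found = lookup_symbol(lib, base, '_v3')
```

In this sketch only the `'_v3'` lookup resolves, which is consistent with the `FunctionNotFound` errors reported for the `'v3'` attempt.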


@xieshuaix commented on GitHub (Aug 22, 2025):

> @xieshuaix Sorry, my fault. It should be `'_v3'` instead of `'v3'`.

that works


@XuehaiPan commented on GitHub (Aug 22, 2025):

@xieshuaix Thanks for the information. I will try to find a fix for this. In the meantime, the simplest fix is to upgrade your NVIDIA driver.
