[GH-ISSUE #88] [Bug] Processes information cannot be obtained normally on 535.98 driver #54

Closed
opened 2026-05-05 03:23:44 -06:00 by gitea-mirror · 7 comments

Originally created by @GeekRaw on GitHub (Aug 15, 2023).
Original GitHub issue: https://github.com/XuehaiPan/nvitop/issues/88

Originally assigned to: @XuehaiPan on GitHub.

### Required prerequisites

- [x] I have read the documentation <https://nvitop.readthedocs.io>.
- [x] I have searched the [Issue Tracker](https://github.com/XuehaiPan/nvitop/issues) that this hasn't already been reported. (comment there if it has.)
- [x] I have tried the latest version of nvitop in a new isolated virtual environment.

### Questions

Hello, when I use nvitop on the server, I can't get the process information normally. Thank you for your answer.

![image](https://github.com/XuehaiPan/nvitop/assets/51330563/647bcde0-496b-4210-9749-c63db6c1da98)


@XuehaiPan commented on GitHub (Aug 15, 2023):

Hi @GeekRaw, could you provide some relevant information, such as the `nvidia-smi` output and the package version list of your Python environment? It would also be helpful to know whether you are running `nvitop` natively or in a container-like environment. Then we can investigate this issue further.
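
For reference, here is a small sketch (standard library only) that collects both pieces of requested information in one go; it simply shells out to the same `nvidia-smi` and `pip list` commands named above:

```python
# Sketch: gather the diagnostics requested above in one shot.
# Assumes `nvidia-smi` is on PATH and pip is available in this environment.
import subprocess
import sys

for cmd in (
    ["nvidia-smi"],                         # driver/GPU state
    [sys.executable, "-m", "pip", "list"],  # package versions in this env
):
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=False)
```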


@cfroehli commented on GitHub (Aug 16, 2023):

Hello,

In case it helps: we noticed the same behavior recently, too, after upgrading our driver version (currently on 535.86.10, Ubuntu 20.04, CUDA 12.2). The card model does not seem relevant. The load chart at the top matches the `nvidia-smi` output, but the process list is broken; `nvidia-smi` itself is able to show the actual processes. The install is a basic python3 venv on the actual server, no container involved.

Depending on the TTY refresh/timing, it is possible to see `ERROR: A FunctionNotFound error occurred while calling nvmlQuery(<function nvmlDeviceGetGraphicsRunningProcesses at 0x7f08ff962940>, *args, **kwargs). Please verify whether the nvidia-ml-py package is compatible with your NVIDIA driver version` getting printed (it often gets overwritten, so it is easy to miss). Presumably some API changed in a recent nvidia-ml-py version.

```
Package       Version
------------- ---------
cachetools    5.3.1
nvidia-ml-py  12.535.77
nvitop        1.2.0
pip           23.2.1
pkg_resources 0.0.0
psutil        5.9.5
setuptools    68.0.0
termcolor     2.3.0
wheel         0.41.0
```

Downgrading it to some of the latest 11.* versions didn't help.

```
$ python
Python 3.8.10 (default, May 26 2023, 14:05:08)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pynvml import *
>>> nvmlInit()
>>> nvmlSystemGetDriverVersion()
'535.86.10'
>>> handle = nvmlDeviceGetHandleByIndex(0)
>>> nvmlDeviceGetComputeRunningProcesses(handle)
Traceback (most recent call last):
  File "/opt/gpu-tools/lib/python3.8/site-packages/pynvml.py", line 913, in _nvmlGetFunctionPointer
    _nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
  File "/usr/lib/python3.8/ctypes/__init__.py", line 386, in __getattr__
    func = self.__getitem__(name)
  File "/usr/lib/python3.8/ctypes/__init__.py", line 391, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /lib/x86_64-linux-gnu/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetComputeRunningProcesses_v3

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/gpu-tools/lib/python3.8/site-packages/pynvml.py", line 2775, in nvmlDeviceGetComputeRunningProcesses
    return nvmlDeviceGetComputeRunningProcesses_v3(handle);
  File "/opt/gpu-tools/lib/python3.8/site-packages/pynvml.py", line 2741, in nvmlDeviceGetComputeRunningProcesses_v3
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v3")
  File "/opt/gpu-tools/lib/python3.8/site-packages/pynvml.py", line 916, in _nvmlGetFunctionPointer
    raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
pynvml.NVMLError_FunctionNotFound: Function Not Found

$ nm -gD /lib/x86_64-linux-gnu/libnvidia-ml.so.1 | grep Running
000000000006cfc0 T nvmlDeviceGetComputeRunningProcesses
000000000006d1b0 T nvmlDeviceGetComputeRunningProcesses_v2
000000000006d3a0 T nvmlDeviceGetGraphicsRunningProcesses
000000000006d590 T nvmlDeviceGetGraphicsRunningProcesses_v2
000000000006d780 T nvmlDeviceGetMPSComputeRunningProcesses
000000000006d970 T nvmlDeviceGetMPSComputeRunningProcesses_v2
```

It seems the `_v3` functions are not there anymore, but the Python bindings keep using them?
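
For anyone hitting the same thing, here is a minimal ctypes sketch that probes which generations of the process-query symbols your installed `libnvidia-ml.so.1` actually exports. It mirrors the `nm -gD` check above and uses the same mechanism that fails in the traceback (`getattr(nvmlLib, name)` raising `AttributeError` for a missing export):

```python
# Sketch: probe libnvidia-ml.so.1 for the versioned process-query symbols.
# getattr() on a ctypes CDLL raises AttributeError for missing exports --
# the same failure shown in the traceback above.
import ctypes

lib = ctypes.CDLL("libnvidia-ml.so.1")
for base in (
    "nvmlDeviceGetComputeRunningProcesses",
    "nvmlDeviceGetGraphicsRunningProcesses",
    "nvmlDeviceGetMPSComputeRunningProcesses",
):
    for suffix in ("", "_v2", "_v3"):
        name = base + suffix
        try:
            getattr(lib, name)
        except AttributeError:
            print(f"{name}: MISSING")
        else:
            print(f"{name}: exported")
```

On a driver build like the one above, this should print `MISSING` for all three `_v3` symbols, matching the `nm` output.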


@XuehaiPan commented on GitHub (Aug 16, 2023):

@cfroehli Thanks for the feedback! This is due to poor version management for the NVML library.

The v3 APIs were introduced in the 510.39.01 driver: https://github.com/NVIDIA/nvidia-settings/commit/b2f0e7f437c42d92ed58120ec8d880f5f4b90d60

but they were removed in the 535.98 driver: https://github.com/NVIDIA/nvidia-settings/commit/0cb3beffa0cb8a1f8cb405291b11a1e2eb7a4786

------

Version change:

495.46 -> 510.39.01: https://github.com/NVIDIA/nvidia-settings/commit/b2f0e7f437c42d92ed58120ec8d880f5f4b90d60

- Add process info v3 APIs but use the v2 `nvmlProcessInfo_st` struct type
- default:
  - `nvmlDeviceGetComputeRunningProcesses` -> `nvmlDeviceGetComputeRunningProcesses_v3`
  - `nvmlProcessInfo_st` -> `nvmlProcessInfo_v2_st`

530.41.03 -> 535.43.02: https://github.com/NVIDIA/nvidia-settings/commit/39c3e28e84f3ffb034abaf1ae92dbb570c207d05

- Process info v3 APIs use the v3 `nvmlProcessInfo_st` struct type without a version bump
- default:
  - `nvmlDeviceGetComputeRunningProcesses` -> `nvmlDeviceGetComputeRunningProcesses_v3`
  - `nvmlProcessInfo_st` -> `nvmlProcessInfo_v3_st`

535.86.05 -> 535.98: https://github.com/NVIDIA/nvidia-settings/commit/0cb3beffa0cb8a1f8cb405291b11a1e2eb7a4786

- Remove process info v3 APIs and the v3 `nvmlProcessInfo_st` struct type
- default:
  - `nvmlDeviceGetComputeRunningProcesses` -> `nvmlDeviceGetComputeRunningProcesses_v2`
  - `nvmlProcessInfo_st` -> `nvmlProcessInfo_v2_st`

UPDATE:

535.98 -> 535.104.05: https://github.com/NVIDIA/nvidia-settings/commit/74cae7fa6a3da595a1bd87918ef0a67bb4326925

- Re-add process info v3 APIs but use the v2 `nvmlProcessInfo_st` struct type
- default:
  - `nvmlDeviceGetComputeRunningProcesses` -> `nvmlDeviceGetComputeRunningProcesses_v3`
  - `nvmlProcessInfo_st` -> `nvmlProcessInfo_v2_st`
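
To make the matrix above concrete, here is a hypothetical helper that encodes which API suffix and struct generation each release range pairs with. The boundaries are taken from the nvidia-settings commits linked above; this is an illustration of the table only, not nvitop's actual dispatch logic, and specific driver branches (e.g. the 535.86.10 datacenter build reported above) may not follow these exact boundaries:

```python
# Hypothetical illustration of the compatibility matrix above.
# Boundaries come from the linked nvidia-settings commits; real driver
# branches (especially datacenter builds) may deviate.
def process_info_generation(driver):
    """Return (API suffix, nvmlProcessInfo_st struct generation) for a driver version string."""
    version = tuple(int(part) for part in driver.split("."))
    if version < (510, 39, 1):
        return ("", "v1")      # unversioned API, original struct
    if version < (535, 43, 2):
        return ("_v3", "v2")   # v3 API introduced, v2 struct
    if version < (535, 98):
        return ("_v3", "v3")   # struct changed without an API version bump
    if version < (535, 104, 5):
        return ("_v2", "v2")   # v3 API removed in 535.98
    return ("_v3", "v2")       # v3 API re-added, back to the v2 struct


assert process_info_generation("535.98") == ("_v2", "v2")
assert process_info_generation("535.104.05") == ("_v3", "v2")
```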

@XuehaiPan commented on GitHub (Aug 16, 2023):

Hi @cfroehli @GeekRaw, I created a new PR to resolve this. You could try:

```bash
pip3 install git+https://github.com/XuehaiPan/nvitop.git@fix-process-api
```

Let me know if this works for you.
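
As a quick end-to-end check after installing the branch, a sketch using nvitop's Python API (`Device.all()` and `Device.processes()` are part of nvitop's documented API):

```python
# Sketch: verify the patched process query end-to-end. If the fix works,
# this prints a sorted PID list per device instead of the FunctionNotFound
# error shown earlier in this thread.
from nvitop import Device

for device in Device.all():
    # processes() returns a dict keyed by PID.
    print(device.index, sorted(device.processes()))
```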


@cfroehli commented on GitHub (Aug 17, 2023):

That fixes the process listing in my case. (Thanks for the quick fix and the nice tool, btw.)

- Enabled the venv and started nvitop => GPU chart shows load, the process list is empty; the error can be spotted if you restart enough times
- Installed the @fix-process-api version
- Started nvitop a few times => no error; the process list is displayed as usual and also updates as processes start and terminate

@XuehaiPan commented on GitHub (Aug 17, 2023):

@cfroehli Thanks for the feedback. A new version with the fix will be released soon.


@XuehaiPan commented on GitHub (Aug 24, 2023):

Hi, the NVIDIA driver upstream re-added the v3 APIs in the latest driver release:

535.98 -> 535.104.05: https://github.com/NVIDIA/nvidia-settings/commit/74cae7fa6a3da595a1bd87918ef0a67bb4326925

- Re-add process info v3 APIs but use the v2 `nvmlProcessInfo_st` struct type
- default:
  - `nvmlDeviceGetComputeRunningProcesses` -> `nvmlDeviceGetComputeRunningProcesses_v3`
  - `nvmlProcessInfo_st` -> `nvmlProcessInfo_v2_st`

~~nvitop 1.2.0 will work fine if you upgrade your NVIDIA driver to 535.104.05.~~
