[GH-ISSUE #75] [BUG] PIDs are scrambled and No Such Process is printed since update to NVIDIA drivers
Originally created by @marcreichman-pfi on GitHub (Jun 20, 2023).
Original GitHub issue: https://github.com/XuehaiPan/nvitop/issues/75
Originally assigned to: @XuehaiPan on GitHub.
Required prerequisites
What version of nvitop are you using?
git hash 4093334972a334e9057f5acf7661a2c1a96bd021

Operating system and version
Docker image (under a CentOS 7 host)
NVIDIA driver version
535.54.03
NVIDIA-SMI
Python environment
This is the Docker image built from the latest git head (June 20, 2023).
Problem description
The output shows scrambled PIDs for the processes after the first process in each card's list, and then shows "No Such Process" for those wrong PIDs. This only started after the driver update, so I assume something changed in the NVIDIA drivers.

Steps to Reproduce
The Python snippets (if any):
Command lines:
Traceback
No response
Logs
Expected behavior
Prior to the driver update, the information was present for the same PIDs included in nvidia-smi, but with the full command lines and the per-process resource statistics (e.g. GPU PID USER GPU-MEM %SM %CPU %MEM TIME). Now it seems to be having an issue parsing proper PIDs from the NVIDIA libraries, and then failing downstream from there.

Additional context
I'm not much of a Python programmer, unfortunately, so I'm not clear where to dig in, but I'd assume the issue is somewhere in the area of receiving the process list for each card and deciphering the PIDs. My assumption is that something changed in the driver, or in some structure or class, such that the parsing code broke somewhere.
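For reference, the PIDs in question come from NVML's per-device process query. The following is a minimal diagnostic sketch, not part of nvitop, assuming the nvidia-ml-py bindings (importable as pynvml) and a Linux /proc filesystem are available; it dumps the raw PIDs NVML reports for each card and flags any that do not exist on the host, which is the condition nvitop ends up rendering as "No Such Process":

```python
# Diagnostic sketch (assumption: nvidia-ml-py installed, Linux host with /proc).
# Dumps the raw PIDs NVML reports per GPU and flags PIDs missing from /proc.
import os

import pynvml

pynvml.nvmlInit()
try:
    for index in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        print(f'GPU {index}:')
        for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
            # A PID that is absent from /proc is what the TUI reports as
            # "No Such Process".
            exists = os.path.exists(f'/proc/{proc.pid}')
            print(f'  pid={proc.pid} used_gpu_memory={proc.usedGpuMemory} '
                  f'exists={exists}')
finally:
    pynvml.nvmlShutdown()
```

If the binding/driver mismatch described in the next comment is in play, the PIDs printed by this probe should already be garbled, placing the problem below nvitop itself.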
@XuehaiPan commented on GitHub (Jun 21, 2023):
Hi @marcreichman-pfi, thanks for raising this. I have encountered the same issue before. I think this would be a bug in the upstream (nvidia-ml-py) with the incompatible NVIDIA driver: nvidia-ml-py returns invalid PIDs.

I haven't found a solution for this yet. This may be due to an internal API change in the NVML library. We may need to wait for the next nvidia-ml-py release.

As a temporary workaround, you could downgrade your NVIDIA driver version.
See also:
@marcreichman-pfi commented on GitHub (Jun 21, 2023):
Hi @XuehaiPan and thanks for your response and excellent tool!
We cannot downgrade because we need newer CUDA version support, so for now we'll just have to wait for an updated version with the NVML library fix.
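While waiting, a quick way to see the two sides of the suspected mismatch is to print the versions reported by the running driver and compare them with the installed nvidia-ml-py binding. A short sketch, again assuming the pynvml bindings are importable (not an nvitop command):

```python
# Version probe sketch: print the driver and NVML library versions reported by
# the running driver, for comparison against the installed nvidia-ml-py binding.
import pynvml

pynvml.nvmlInit()
try:
    # Older bindings return bytes here, newer ones return str; print as-is.
    print('Driver version:', pynvml.nvmlSystemGetDriverVersion())
    print('NVML version:  ', pynvml.nvmlSystemGetNVMLVersion())
finally:
    pynvml.nvmlShutdown()
```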
@XuehaiPan commented on GitHub (Jul 7, 2023):
Hi @marcreichman-pfi, a new release of nvidia-ml-py, version 12.535.77, came out several hours ago. You can upgrade your nvidia-ml-py package with pip. This would resolve the unrecognized PIDs with CUDA 12 drivers.
I will also make a new release of nvitop to resolve CUDA 12 driver support.

@marcreichman-pfi commented on GitHub (Jul 7, 2023):
Thanks @XuehaiPan - is there a way to do this in the docker version?
@XuehaiPan commented on GitHub (Jul 7, 2023):
@marcreichman-pfi You could upgrade nvidia-ml-py in your Docker container.

@marcreichman-pfi commented on GitHub (Jul 7, 2023):
Thanks, this did the trick! Here was what I did from your Dockerfile:

@ukejeb commented on GitHub (Aug 15, 2024):
nvitop 1.3.2 with nvidia-ml-py 12.535.161, CUDA 12.2, and driver version 535.129.03 also shows "No Such Process".
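When the symptom reappears like this, one thing worth ruling out is which distribution actually provides the pynvml module in the environment nvitop runs in, since both the nvidia-ml-py package and the older pynvml PyPI package install a module named pynvml, and a stale copy can shadow the fixed one. A small sketch, assuming Python 3.8+ for importlib.metadata:

```python
# Sketch (assumes Python 3.8+): report which distributions that provide the
# `pynvml` module are installed, and their versions, so they can be compared
# against the NVIDIA driver in use.
from importlib import metadata

for dist_name in ('nvidia-ml-py', 'pynvml'):
    try:
        print(f'{dist_name}: {metadata.version(dist_name)}')
    except metadata.PackageNotFoundError:
        print(f'{dist_name}: not installed')
```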