mirror of
https://github.com/XuehaiPan/nvitop.git
synced 2026-05-15 14:15:55 -06:00
[GH-ISSUE #99] [BUG] nvitop.Device.from_cuda_visible_devices() not detecting GPU #59
Originally created by @juan-barajas-p on GitHub (Oct 4, 2023).
Original GitHub issue: https://github.com/XuehaiPan/nvitop/issues/99
Originally assigned to: @XuehaiPan on GitHub.
Required prerequisites
What version of nvitop are you using?
1.3.0
Operating system and version
Pop!_OS 22.04 LTS
NVIDIA driver version
535.113.01
NVIDIA-SMI
Python environment
Virtualenv created with micromamba v1.5.1 with mm create --name testing python=3.11, then installed nvitop with pip install nvitop.
Command output:
Problem description
Calling nvitop.Device.from_cuda_visible_devices() results in an empty list, regardless of whether CUDA_VISIBLE_DEVICES is set or not.
Steps to Reproduce
Command lines:
Traceback
Logs
Expected behavior
I would expect to see the same number of devices given by nvitop.Device.all() when calling nvitop.Device.from_cuda_visible_devices() if CUDA_VISIBLE_DEVICES is not set or if CUDA_VISIBLE_DEVICES is set to all GPUs in the system.
Additional context
This has never happened before on any previous machines using the same nvitop version and OS, which at first led me to believe it was a problem with this particular machine's setup, but after some more testing I'm not so sure. I'm giving the following information to see if nvitop can be improved to deal with this situation accordingly.
I looked into it in more detail, and it turns out that visible_device_indices is empty on this machine, whereas on other machines it does find the correct GPU UUID.
Looking closer at _parse_cuda_visible_devices, the complete UUID is correctly detected by _get_all_physical_device_attrs():
But the subprocess that parses visible devices to UUIDs appears to be missing the last part of the UUID. This causes further logic to assume this UUID is for a MIG device (as it doesn't find it in physical_device_attrs), and among other things, the GPU ends up not showing up as a valid GPU detected by nvitop.
I kept on tracking the incorrect UUID to cuDeviceGetUuid, and it appears that this is the point where the UUID is incomplete.
As I understand, this is just a wrapper for using the CUDA driver API, directly using the function cuDeviceGetUuid_v2, so I tried to use NVIDIA's cuda-python to see if I could replicate it, but oddly enough this does return the full UUID of the GPU.
As using the Python wrappers of the API returns the expected value, I wonder if there's something nvitop's implementation could do to mitigate this issue.
@XuehaiPan commented on GitHub (Oct 4, 2023):
Hi @juan-barajas-p, thanks for raising this! I much appreciate the detailed context for the investigation.
The cause is that the UUID contains the null character \x00, which terminates the string buffer.
Your UUID:
Stripped UUID:
As we can see, there is a 00 after ..ee08, and the string buffer terminates early at the null character.
I will submit a quick fix for this.
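For readers following along, the truncation described in this comment can be reproduced with nothing but the standard library. This is a minimal sketch, not nvitop's actual code: it assumes the 16-byte UUID is read through a ctypes string buffer, and RAW_UUID is a made-up value (not the reporter's UUID) chosen so that a 00 byte follows ee08. The helper format_uuid is hypothetical as well.

```python
import ctypes

# Hypothetical 16-byte UUID with a null byte at index 6 (NOT the reporter's UUID).
RAW_UUID = bytes.fromhex("3f2a9c10ee0800b14d7e8a5566c1d2e3")


def format_uuid(raw: bytes) -> str:
    """Format raw UUID bytes as 8-4-4-4-12 hexadecimal groups."""
    h = raw.hex()
    return "-".join((h[:8], h[8:12], h[12:16], h[16:20], h[20:32]))


# Simulate a driver call that fills a caller-provided 16-byte buffer.
buf = ctypes.create_string_buffer(16)
ctypes.memmove(buf, RAW_UUID, 16)

# `.value` treats the buffer as a C string and stops at the first null byte.
truncated = buf.value
# `.raw` returns the whole buffer regardless of its contents.
complete = buf.raw

# An unsigned-byte array has no C-string semantics, so it is never truncated.
ubytes = (ctypes.c_ubyte * 16).from_buffer_copy(RAW_UUID)

print("value:", format_uuid(truncated))
print("raw:  ", format_uuid(complete))
print("ubyte:", format_uuid(bytes(ubytes)))
```

Reading the buffer with .raw, or using a c_ubyte array in the first place, keeps all 16 bytes, so the formatted UUID can be matched against the one reported for the physical device.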
@XuehaiPan commented on GitHub (Oct 4, 2023):
Hi @juan-barajas-p, I created a fix to resolve this issue:
You can try it via:
BTW, you can use Device.cuda.all() or CudaDevice.all() to get all CUDA visible devices.
@juan-barajas-p commented on GitHub (Oct 4, 2023):
Hi! Thank you for the very quick response. Good job with this library, as it's the easiest method of interacting with GPU metrics that I've used.
Ohh of course that's the problem haha. Also, thank you for the tip! I didn't know you could do it that way.
It almost works. I think you meant to apply the fix to api.libcuda.cuDeviceGetUuid instead of api.libcuda.cuDeviceGetUuid_v2, as it's the entrypoint used in api.device._cuda_visible_devices_parser? But if I use cuDeviceGetUuid_v2 it does solve the issue!
@XuehaiPan commented on GitHub (Oct 5, 2023):
Thanks for the notes. I have updated the fix accordingly.