[GH-ISSUE #99] [BUG] nvitop.Device.from_cuda_visible_devices() not detecting GPU #59

Closed
opened 2026-05-05 03:23:59 -06:00 by gitea-mirror · 4 comments

Originally created by @juan-barajas-p on GitHub (Oct 4, 2023).
Original GitHub issue: https://github.com/XuehaiPan/nvitop/issues/99

Originally assigned to: @XuehaiPan on GitHub.

Required prerequisites

  • I have read the documentation https://nvitop.readthedocs.io.
  • I have searched the Issue Tracker to confirm this hasn't already been reported. (comment there if it has.)
  • I have tried the latest version of nvitop in a new isolated virtual environment.

What version of nvitop are you using?

1.3.0

Operating system and version

Pop!_OS 22.04 LTS

NVIDIA driver version

535.113.01

NVIDIA-SMI

Wed Oct  4 08:57:35 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.113.01             Driver Version: 535.113.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3070 ...    Off | 00000000:01:00.0  On |                  N/A |
| N/A   51C    P8              15W / 125W |     59MiB /  8192MiB |     13%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      3081      G   /usr/lib/xorg/Xorg                           53MiB |
+---------------------------------------------------------------------------------------+

Python environment

Virtualenv created with micromamba v1.5.1 using mm create --name testing python=3.11, then nvitop installed with pip install nvitop.

Command output:

3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0] linux
nvidia-ml-py==12.535.108
nvitop==1.3.0

Problem description

Using the following code snippet results in an empty list:

import nvitop; nvitop.Device.from_cuda_visible_devices()

Regardless of whether CUDA_VISIBLE_DEVICES is set or not.

Steps to Reproduce

Command lines:

python -c "import nvitop; print(nvitop.Device.from_cuda_visible_devices())"

Traceback

N/A

Logs

N/A

Expected behavior

I would expect to see the same number of devices given by nvitop.Device.all() when calling nvitop.Device.from_cuda_visible_devices() if CUDA_VISIBLE_DEVICES is not set or if CUDA_VISIBLE_DEVICES is set to all GPUs in the system.

Additional context

This has never happened on any previous machine using the same nvitop version and OS, which at first led me to believe it was a problem with this particular machine's setup. After some more testing, though, I'm not so sure. I'm providing the following information in case nvitop can be improved to handle this situation.

I looked into it in more detail, and it turns out that visible_device_indices is empty on this machine, whereas on other machines it does find the correct GPU UUID.

# file: api.device.py, method: from_cuda_visible_devices

visible_device_indices = Device.parse_cuda_visible_devices()  # value: []

Looking closer at _parse_cuda_visible_devices, the complete UUID is correctly detected by _get_all_physical_device_attrs():

# file: api.device.py, function: _parse_cuda_visible_devices

physical_device_attrs = _get_all_physical_device_attrs()  # value: _PhysicalDeviceAttrs(index=0, name='NVIDIA GeForce RTX 3070 Ti Laptop GPU', uuid='GPU-13096139-7ada-8313-ee08-000dd8540fe1', support_mig_mode=False)

But the subprocess that resolves visible devices to UUIDs appears to be missing the last part of the UUID. This causes downstream logic to assume the UUID belongs to a MIG device (since it is not found in physical_device_attrs), and, among other things, the device ends up not being reported as a valid GPU by nvitop.

# file: api.device.py, function: _parse_cuda_visible_devices

raw_uuids = subprocess.check_output(...)  # value: ['13096139-7ada-8313-ee08-']
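To illustrate why the truncation matters, here is a minimal sketch (not nvitop's actual code) of the kind of lookup that goes wrong: the hex digits of the driver-reported UUID no longer match any NVML physical-device UUID, so the device falls through to the MIG path. The function name and matching rule below are illustrative assumptions.

```python
# Illustrative sketch (not nvitop's actual code): matching a driver-reported
# UUID against the NVML physical-device UUIDs. A truncated UUID matches
# nothing, so downstream logic wrongly treats it as a MIG device.
PHYSICAL_UUIDS = {'GPU-13096139-7ada-8313-ee08-000dd8540fe1': 0}

def match_physical_index(raw_uuid):
    """Return the physical index whose NVML UUID matches the raw driver UUID."""
    digits = raw_uuid.replace('-', '')
    for nvml_uuid, index in PHYSICAL_UUIDS.items():
        if nvml_uuid.removeprefix('GPU-').replace('-', '') == digits:
            return index
    return None  # not found -> assumed (incorrectly, here) to be a MIG device

print(match_physical_index('130961397ada8313ee08000dd8540fe1'))  # -> 0 (full UUID)
print(match_physical_index('13096139-7ada-8313-ee08-'))          # -> None (truncated)
```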

I kept tracking the incorrect UUID down to cuDeviceGetUuid, and it appears this is the point where the UUID becomes incomplete.

# file: api.libcuda.py, function: cuDeviceGetUuid

uuid = ''.join(map('{:02x}'.format, uuid.value))  # value: "130961397ada8313ee08"

As I understand it, this is just a wrapper around the CUDA driver API function cuDeviceGetUuid_v2, so I tried NVIDIA's cuda-python to see if I could replicate the problem, but oddly enough it does return the full UUID of the GPU.

micromamba create --name testing_2 python=3.11
micromamba activate testing_2
pip install cuda-python  # v12.2.0
python -c "from cuda import cuda; cuda.cuInit(0); print(cuda.cuDeviceGetUuid_v2(0)[1])"
# prints: bytes : 130961397ada8313ee08000dd8540fe1

Since the Python wrappers of the API return the expected value, I wonder if there's something nvitop's implementation could do to mitigate this issue.

gitea-mirror closed this issue and added the api and bug labels on 2026-05-05 03:23:59 -06:00.

@XuehaiPan commented on GitHub (Oct 4, 2023):

Hi @juan-barajas-p, thanks for raising this! I much appreciate the detailed context and investigation.

The cause is that the UUID contains the null byte \x00, which terminates the string buffer early.

Your UUID:

uuid = '130961397ada8313ee08000dd8540fe1'

Stripped UUID:

uuid = '130961397ada8313ee08'

As we can see, there is a 00 byte right after ...ee08, and the string buffer terminates early at that null character.

I will submit a quick fix for this.


@XuehaiPan commented on GitHub (Oct 4, 2023):

Hi @juan-barajas-p, I created a fix to resolve this issue: #100

You can try it via:

python3 -m pip install git+https://github.com/XuehaiPan/nvitop.git@fix-cuDeviceGetUuid

BTW, you can use Device.cuda.all() or CudaDevice.all() to get all CUDA-visible devices.

from nvitop import Device, CudaDevice

all_cuda_devices = Device.from_cuda_visible_devices()             # respects the `CUDA_VISIBLE_DEVICES` environment variable
other_cuda_devices = Device.from_cuda_visible_devices('4,3,0,1')  # explicit list; ignores the environment variable

# alternatives that only read `CUDA_VISIBLE_DEVICES` from the environment variable
all_cuda_devices = Device.cuda.all()  # works with only `from nvitop import Device`
all_cuda_devices = CudaDevice.all()

@juan-barajas-p commented on GitHub (Oct 4, 2023):

Hi! Thank you for the very quick response. Good job with this library; it's the easiest way of interacting with GPU metrics that I've used.

Ohh of course that's the problem haha. Also, thank you for the tip! I didn't know you could do it that way.

It almost works. I think you meant to apply the fix to api.libcuda.cuDeviceGetUuid instead of api.libcuda.cuDeviceGetUuid_v2, as that is the entry point used in api.device._cuda_visible_devices_parser? But if I use cuDeviceGetUuid_v2, it does solve the issue!


@XuehaiPan commented on GitHub (Oct 5, 2023):

> It almost works. I think you meant to apply the fix to api.libcuda.cuDeviceGetUuid instead of api.libcuda.cuDeviceGetUuid_v2, as it's the entrypoint used in api.device._cuda_visible_devices_parser?

Thanks for the notes. I have updated the fix accordingly.
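A hedged sketch of the kind of fix involved (the actual change lives in the PR above; field and variable names here are illustrative): declaring the UUID buffer as an array of unsigned bytes rather than `c_char`, so reads return all 16 bytes even when the UUID contains \x00.

```python
import ctypes

# Illustrative fix sketch: declare the UUID buffer as c_ubyte * 16 so that
# element-by-element reads are not NUL-terminated like a C string would be.
class CUuuid(ctypes.Structure):
    _fields_ = [('bytes', ctypes.c_ubyte * 16)]

uuid = CUuuid()
raw = bytes.fromhex('130961397ada8313ee08000dd8540fe1')
ctypes.memmove(uuid.bytes, raw, 16)

# c_ubyte arrays iterate as plain ints, with no NUL termination.
hex_uuid = ''.join(map('{:02x}'.format, uuid.bytes))
print(hex_uuid)  # -> '130961397ada8313ee08000dd8540fe1'
```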
