[GH-ISSUE #145] [BUG] GPU Unknown Error causes as_snapshot() call to trigger a segmentation fault #93

Closed
opened 2026-05-05 03:25:11 -06:00 by gitea-mirror · 3 comments
Owner

Originally created by @jue-jue-zi on GitHub (Jan 13, 2025).
Original GitHub issue: https://github.com/XuehaiPan/nvitop/issues/145

Originally assigned to: @XuehaiPan on GitHub.

Required prerequisites

  • I have read the documentation https://nvitop.readthedocs.io.
  • I have searched the Issue Tracker that this hasn't already been reported. (comment there if it has.)
  • I have tried the latest version of nvitop in a new isolated virtual environment.

What version of nvitop are you using?

1.4.0

Operating system and version

Ubuntu 24.04 LTS

NVIDIA driver version

565.57.01

NVIDIA-SMI

nvidia-smi
Unable to determine the device handle for GPU0000:1B:00.0: Unknown Error

nvidia-smi -i 0,1,2
Mon Jan 13 16:25:54 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1080 Ti     Off |   00000000:03:00.0 Off |                  N/A |
|  0%   33C    P8              9W /  250W |       3MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce GTX 1080 Ti     Off |   00000000:0B:00.0 Off |                  N/A |
|  0%   34C    P8             15W /  250W |       3MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce GTX 1080 Ti     Off |   00000000:0C:00.0 Off |                  N/A |
|  0%   29C    P8             10W /  250W |       3MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Python environment

3.12.3 (main, Nov 6 2024, 18:32:19) [GCC 13.2.0] linux
nvidia-ml-py==12.535.161
nvitop==1.4.0

Problem description

A GPU in the "Unknown Error" state causes a call to as_snapshot() to trigger a segmentation fault.

Steps to Reproduce

root@vm:/usr/local/lib/python3.12/dist-packages/nvitop# python
Python 3.12.3 (main, Nov  6 2024, 18:32:19) [GCC 13.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from nvitop.gui import Device
>>> devices = Device.from_indices([3,])
>>> devices[0]
Device(index=3, name='ERROR: Unknown', total_memory=N/A)
>>> devices[0].as_snapshot()
Segmentation fault

A simpler way to reproduce:

root@vm:/usr/local/lib/python3.12/dist-packages/nvitop# python
Python 3.12.3 (main, Nov  6 2024, 18:32:19) [GCC 13.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from nvitop.api import Device
>>> device = Device(0)
>>> device.as_snapshot()
PhysicalDeviceSnapshot(...)
>>> device._handle = None
>>> device.as_snapshot()
Segmentation fault
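
The second reproduction suggests that the crash happens once a missing handle reaches native code: ctypes passes `None` to a C function as a NULL pointer. Below is a minimal sketch of a guard that skips the native call when the handle is unusable. It is not nvitop's actual fix; `query_or_na`, `native_query`, and `NA` are hypothetical names used only for illustration.

```python
import ctypes

NA = "N/A"  # hypothetical placeholder, mirroring the "N/A" in the device repr above


def query_or_na(handle, native_query, default=NA):
    """Call native_query(handle) only when the handle is usable.

    `handle` stands in for an NVML device handle (a ctypes pointer) and
    `native_query` for any function that forwards it to C code.
    Both are assumptions for this sketch, not part of nvitop's API.
    """
    # None and NULL ctypes pointers are both falsy, so one check covers both.
    if handle is None or not handle:
        return default
    return native_query(handle)


# A missing handle never reaches the (stand-in) native call.
print(query_or_na(None, lambda h: h.value))
# A non-NULL pointer is forwarded as usual.
print(query_or_na(ctypes.c_void_p(1234), lambda h: h.value))
```

The design choice here is to return a placeholder rather than raise, so a single broken GPU would not prevent the remaining devices from being displayed.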

Traceback

No response
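
A segmentation fault kills the interpreter before Python can print a traceback, which is likely why none is available here. As a hedged sketch, the standard-library `faulthandler` module can dump the Python stack on a fatal signal; `enable_crash_traceback` and the log path are hypothetical names chosen for this example.

```python
import faulthandler


def enable_crash_traceback(log_path):
    """Write the Python traceback to log_path if the process hits a fatal signal."""
    # The file must stay open for as long as faulthandler is enabled,
    # so the caller should keep a reference to the returned object.
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling `enable_crash_traceback("crash.log")` before `device.as_snapshot()` should leave the Python-level stack in `crash.log` after the crash. Starting the interpreter with `python -X faulthandler` has a similar effect and writes to stderr.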

Logs

LOGLEVEL=DEBUG nvitop >nvitop.log 2>&1
Segmentation fault

cat nvitop.log
[DEBUG] 2025-01-13 16:28:33,371 nvitop.api.libnvml::__determine_get_memory_info_version_suffix: Found symbol `nvmlDeviceGetMemoryInfo_v2`.
[DEBUG] 2025-01-13 16:28:33,371 nvitop.api.libnvml::__determine_get_memory_info_version_suffix: NVML get memory info version 2 is available.
[DEBUG] 2025-01-13 16:28:33,371 nvitop.api.libnvml::__determine_get_memory_info_version_suffix: NVML get memory info version 2 is available.

Expected behavior

No response

Additional context

No response

gitea-mirror 2026-05-05 03:25:11 -06:00
Author
Owner

@XuehaiPan commented on GitHub (Jan 13, 2025):

Thanks for filing the issue. I will fix this bug as soon as possible.

Author
Owner

@XuehaiPan commented on GitHub (Jan 13, 2025):

You can try:

pipx run --spec git+https://github.com/XuehaiPan/nvitop.git@fix-invalid-device-handle nvitop
Author
Owner

@jue-jue-zi commented on GitHub (Jan 13, 2025):

You can try:

pipx run --spec git+https://github.com/XuehaiPan/nvitop.git@fix-invalid-device-handle nvitop

It works now 👍

image
Reference: github-starred/nvitop#93