[GH-ISSUE #139] [BUG] Segmentation fault when one GPU is lost from the PCIe bus #87

Closed
opened 2026-05-05 03:25:01 -06:00 by gitea-mirror · 1 comment

Originally created by @Junyi-99 on GitHub (Nov 19, 2024).
Original GitHub issue: https://github.com/XuehaiPan/nvitop/issues/139

Originally assigned to: @XuehaiPan on GitHub.

Required prerequisites

  • I have read the documentation https://nvitop.readthedocs.io.
  • I have searched the Issue Tracker to confirm this hasn't already been reported. (Comment there if it has.)
  • I have tried the latest version of nvitop in a new isolated virtual environment.

What version of nvitop are you using?

1.3.2

Operating system and version

Ubuntu 22.04

NVIDIA driver version

560.35.03

NVIDIA-SMI

$ nvidia-smi -i 0
Tue Nov 19 14:59:58 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:1D:00.0 Off |                  N/A |
| 30%   28C    P8             31W /  350W |       2MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Python environment

$ python3 -m pip freeze | python3 -c 'import sys; print(sys.version, sys.platform); print("".join(filter(lambda s: any(word in s.lower() for word in ("nvi", "cuda", "nvml", "gpu")), sys.stdin)))'
3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] linux
gpustat==1.1.1
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py==12.535.108
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.6.68
nvidia-nvtx-cu12==12.1.105
nvitop==1.3.2

Problem description

nvitop exits with a segmentation fault when one of the GPUs is lost from the PCIe bus.

To be clear, the crash itself is not a bug in nvitop: per the traceback below, the segfault occurs inside the NVIDIA driver library (libnvidia-ml.so).

I encountered this issue and would like to suggest that nvitop keep displaying the remaining GPUs when one GPU is faulty, instead of exiting with a segmentation fault. It would be nice if nvitop could skip the faulty GPU, as gpustat does.
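The suggested "skip instead of crash" behavior could be sketched as a small wrapper around each per-device query, so that a query against a lost device degrades to a placeholder value rather than aborting the whole display. This is a minimal illustration only; `query_or_na` and its arguments are hypothetical and not part of nvitop's actual API:

```python
def query_or_na(query, *args, errors=(Exception,), fallback="N/A"):
    """Run a per-device query; return a fallback instead of propagating
    a known device error (e.g. the NVML 'GPU is lost' error)."""
    try:
        return query(*args)
    except errors:
        # Device-level failure: report a placeholder so the rest of
        # the GPUs can still be rendered.
        return fallback
```

In nvitop's case, `errors` would be the relevant NVML exception types from nvidia-ml-py rather than bare `Exception`.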

Steps to Reproduce

  1. Unplug the GPU from the PCIe bus, or otherwise cause it to drop off the bus. (I don't know how to reproduce this deliberately.)
  2. Run nvitop.

Traceback

nvitop[2398212]: segfault at 0 ip 00007f2113c7128b sp 00007ffc6e223820 error 4 in libnvidia-ml.so.560.35.03[7f2113c00000+1d3000]

Logs

No response

Expected behavior

It would be nice if nvitop could skip the faulty GPU.

For example, gpustat can still show the faulty GPU:

[screenshot: gpustat output listing the lost GPU as an error row alongside the healthy GPUs]
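The gpustat-style behavior shown above amounts to an enumeration loop that catches the per-device error and emits a placeholder row instead of aborting. A minimal illustration with a stubbed query layer follows; `GpuIsLostError` and `get_name` are stand-ins for the real NVML bindings (e.g. pynvml's GPU-is-lost error), not actual nvitop or gpustat code:

```python
class GpuIsLostError(Exception):
    """Stand-in for the NVML 'GPU is lost' error raised by nvidia-ml-py."""

def get_name(index):
    # Stubbed per-device query: pretend GPU 1 has fallen off the bus.
    if index == 1:
        raise GpuIsLostError
    return f"NVIDIA GeForce RTX 3090 (GPU {index})"

def list_devices(count):
    """Enumerate devices, replacing lost GPUs with a placeholder row."""
    rows = []
    for i in range(count):
        try:
            rows.append((i, get_name(i)))
        except GpuIsLostError:
            # Keep going: show the faulty device as lost, render the rest.
            rows.append((i, "(GPU is lost)"))
    return rows
```

With three simulated devices, `list_devices(3)` still returns three rows, with the middle one marked lost rather than crashing the enumeration.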

Additional context

No response

gitea-mirror 2026-05-05 03:25:01 -06:00

@XuehaiPan commented on GitHub (Jan 13, 2025):

Sorry for the late response. You can try:

pipx run --spec git+https://github.com/XuehaiPan/nvitop.git@fix-invalid-device-handle nvitop