[GH-ISSUE #45] [Enhancement] Skip error gpus and show normal infos automatically #31

Closed
opened 2026-05-05 03:22:38 -06:00 by gitea-mirror · 6 comments
Originally created by @jue-jue-zi on GitHub (Oct 22, 2022).
Original GitHub issue: https://github.com/XuehaiPan/nvitop/issues/45

Originally assigned to: @XuehaiPan on GitHub.

Runtime Environment

  • Operating system and version: Ubuntu 20.04 LTS
  • Terminal emulator and version: SSH
  • Python version: 3.8.10
  • NVML version (driver version): 515.65.01
  • nvitop version or commit: 0.10.0
  • nvidia-ml-py version: 11.515.75

Current Behavior

There are four GPUs on our server. One of them overheated for some reason, and it can no longer be recognized. Running nvidia-smi without any arguments to query all the GPUs prints the error Unable to determine the device handle for GPU 0000:0C:00.0: Unknown Error and does not show information for the remaining healthy GPUs. If the command selects only the healthy GPUs (nvidia-smi -i 0,1,3), their information is shown normally.

image image

When I run nvitop to show the GPU information, nvidia-ml-py raises exceptions like the ones below:

image image

Expected Behavior

I would like nvitop to automatically skip any GPUs that report errors and show the information for the healthy GPUs. If possible, the failing GPUs could be listed as a note below the normal output, in red for emphasis.
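
The requested behavior can be sketched as follows. This is only an illustration of the idea, not nvitop's actual fix: `partition_devices` and `fake_get_handle` are hypothetical names, and the fake lookup stands in for the NVML call so that the sketch runs on a machine without GPUs.

```python
from typing import Callable, List, Tuple


def partition_devices(
    count: int,
    get_handle: Callable[[int], object],
    error_type: type = Exception,
) -> Tuple[List[Tuple[int, object]], List[Tuple[int, str]]]:
    """Split device indices into (healthy, failed) instead of aborting on the first error."""
    healthy: List[Tuple[int, object]] = []
    failed: List[Tuple[int, str]] = []
    for index in range(count):
        try:
            healthy.append((index, get_handle(index)))
        except error_type as exc:
            # Remember why the device failed, then keep going with the others.
            failed.append((index, str(exc)))
    return healthy, failed


def fake_get_handle(index: int) -> str:
    """Hypothetical stand-in for the NVML handle lookup; index 2 plays the overheated GPU."""
    if index == 2:
        raise RuntimeError("Unknown Error")
    return f"handle-{index}"


healthy, failed = partition_devices(4, fake_get_handle, RuntimeError)
for index, handle in healthy:
    print(f"GPU {index}: {handle}")
for index, reason in failed:
    print(f"GPU {index}: ERROR ({reason})")
```

With nvidia-ml-py, the same loop would presumably take `pynvml.nvmlDeviceGetCount()` as the count, `pynvml.nvmlDeviceGetHandleByIndex` as `get_handle`, and `pynvml.NVMLError` as `error_type`, after calling `pynvml.nvmlInit()`.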


@XuehaiPan commented on GitHub (Oct 22, 2022):

@jue-jue-zi Thanks for the feedback! I'll add a quick fix soon.


@XuehaiPan commented on GitHub (Oct 22, 2022):

@jue-jue-zi I pushed a new commit to handle this. You can reinstall nvitop from GitHub by:

pip3 install git+https://github.com/XuehaiPan/nvitop.git#egg=nvitop

@jue-jue-zi commented on GitHub (Oct 22, 2022):

> @jue-jue-zi I pushed a new commit to handle this. You can reinstall nvitop from GitHub by:
>
> pip3 install git+https://github.com/XuehaiPan/nvitop.git#egg=nvitop

Thanks for fixing it so soon, but it seems that there still exist some problems,

Traceback (most recent call last):
  File "/usr/local/bin/nvitop", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/nvitop/cli.py", line 336, in main
    ui = UI(
  File "/usr/local/lib/python3.8/dist-packages/nvitop/gui/ui.py", line 43, in __init__
    self.main_screen = MainScreen(
  File "/usr/local/lib/python3.8/dist-packages/nvitop/gui/screens/main/__init__.py", line 38, in __init__
    self.device_panel = DevicePanel(self.devices, compact, win=win, root=root)
  File "/usr/local/lib/python3.8/dist-packages/nvitop/gui/screens/main/device.py", line 61, in __init__
    self.snapshots = self.take_snapshots()
  File "/usr/local/lib/python3.8/dist-packages/cachetools/func.py", line 62, in wrapper
    v = func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/nvitop/gui/screens/main/device.py", line 129, in take_snapshots
    snapshots = [device.as_snapshot() for device in self.all_devices]
  File "/usr/local/lib/python3.8/dist-packages/nvitop/gui/screens/main/device.py", line 129, in <listcomp>
    snapshots = [device.as_snapshot() for device in self.all_devices]
  File "/usr/local/lib/python3.8/dist-packages/nvitop/gui/library/device.py", line 70, in as_snapshot
    self._snapshot = super().as_snapshot()
  File "/usr/local/lib/python3.8/dist-packages/nvitop/core/device.py", line 1667, in as_snapshot
    **{key: getattr(self, key)() for key in self.SNAPSHOT_KEYS},
  File "/usr/local/lib/python3.8/dist-packages/nvitop/core/device.py", line 1667, in <dictcomp>
    **{key: getattr(self, key)() for key in self.SNAPSHOT_KEYS},
  File "/usr/local/lib/python3.8/dist-packages/nvitop/core/device.py", line 878, in memory_used
    return self.memory_info().used
  File "/usr/local/lib/python3.8/dist-packages/nvitop/core/utils.py", line 702, in wrapped
    ret = self._cache[method]  # pylint: disable=protected-access
TypeError: 'function' object is not subscriptable
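
For readers unfamiliar with this error, the final frame fails because `self._cache` is a function at that point rather than a dict. The snippet below reproduces the same class of error in isolation. It is not nvitop's code: the `cached` decorator and both device classes are hypothetical and exist only to show how a subscript on a function-valued attribute produces this message.

```python
import functools


def cached(method):
    """Memoize a zero-argument method in a per-instance dict named `_cache`."""

    @functools.wraps(method)
    def wrapped(self):
        try:
            return self._cache[method]
        except KeyError:
            result = self._cache[method] = method(self)
            return result

    return wrapped


def make_cache():
    return {}


class GoodDevice:
    def __init__(self):
        self._cache = make_cache()  # a dict, as the decorator expects

    @cached
    def memory_used(self):
        return 1024


class BrokenDevice:
    def __init__(self):
        self._cache = make_cache  # hypothetical bug: the function itself, never called

    @cached
    def memory_used(self):
        return 1024


print(GoodDevice().memory_used())
try:
    BrokenDevice().memory_used()
except TypeError as exc:
    message = str(exc)
    print(message)
```

The only difference between the two classes is the missing call parentheses, which is enough to turn a dict lookup into a subscript on a function object.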

@XuehaiPan commented on GitHub (Oct 22, 2022):

> but it seems that there still exist some problems,

Fixed by the newest commit.


@jue-jue-zi commented on GitHub (Oct 22, 2022):

It works right now! Thanks, it is a really great project.

image

@jue-jue-zi commented on GitHub (Oct 22, 2022):

> It works right now! Thanks, it is a really great project.
>
> image

Maybe red fonts for errors would be better.
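
As a rough illustration of the suggestion, an error line can be wrapped in ANSI escape codes so that it renders in red on terminals that support them. This is a generic sketch, independent of how nvitop draws its interface; `highlight_error` is a hypothetical helper name.

```python
import sys

RED = "\033[31m"   # ANSI escape sequence: switch foreground color to red
RESET = "\033[0m"  # ANSI escape sequence: restore default attributes


def highlight_error(text: str, use_color: bool = True) -> str:
    """Return the text wrapped in red ANSI codes, or unchanged when color is disabled."""
    return f"{RED}{text}{RESET}" if use_color else text


# Only emit color codes when writing to an interactive terminal.
print(highlight_error("GPU 2: ERROR (Unknown Error)", use_color=sys.stdout.isatty()))
```

Checking `sys.stdout.isatty()` keeps the escape codes out of logs and pipes, where they would show up as stray characters.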

Reference: github-starred/nvitop#31