[GH-ISSUE #5] [Feature Request] MIG device support (e.g. A100 GPUs) #5

Closed
opened 2026-05-05 03:21:22 -06:00 by gitea-mirror · 14 comments
Owner

Originally created by @ki-arie on GitHub (Aug 10, 2021).
Original GitHub issue: https://github.com/XuehaiPan/nvitop/issues/5

Originally assigned to: @XuehaiPan on GitHub.

Hello!

Firstly, thanks for creating and maintaining such an excellent library.

Runtime Environment

  • Operating system and version: Ubuntu 20.04 LTS
  • Terminal emulator and version: GNOME Terminal 3.36.2
  • Python version: 3.7
  • NVML version (driver version): 450.0
  • nvitop version or commit: main@b669fa3
  • python-ml-py version: 11.450.51
  • Locale: en_US.UTF-8

Current Behavior

When running nvitop on a MIG-enabled A100 GPU, nvitop fails to detect the processes running on the GPU and their GPU memory consumption, which can otherwise be viewed by running the command nvidia-smi.

Expected Behavior

The A100 MiG GPU should be visible in the GUI.

Context

So far we can only view CPU usage metrics, which are really handy but it would also be nice to have GPU usage as designed.

Possible Solutions

I think that the MIG naming convention is different from the regular one, and looks something like this:
MIG 7g.80gb Device 0: rather than just Device 0: as is currently set up in the nvitop repo.
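If nvitop needs to recognize both naming styles, the distinction could be sketched with a small parser. This is a hypothetical helper and regex for illustration, not actual nvitop code:

```python
import re

# Matches MIG display names like "MIG 7g.80gb Device 0"; plain names like
# "Device 0" fall through. Pattern and helper name are illustrative only.
MIG_NAME = re.compile(r'MIG (?P<profile>\d+g\.\d+gb) Device (?P<index>\d+)')

def parse_device_name(name):
    match = MIG_NAME.match(name)
    if match:
        return {'mig': True,
                'profile': match.group('profile'),
                'index': int(match.group('index'))}
    return {'mig': False, 'name': name}

print(parse_device_name('MIG 7g.80gb Device 0'))
# → {'mig': True, 'profile': '7g.80gb', 'index': 0}
print(parse_device_name('Device 0'))
# → {'mig': False, 'name': 'Device 0'}
```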

Steps to reproduce

  • Run the A100 in MIG mode
  • start nvitop: watch -n 0.5 nvitop
gitea-mirror 2026-05-05 03:21:22 -06:00
Author
Owner

@XuehaiPan commented on GitHub (Aug 10, 2021):

Thanks for the feedback! I'm sorry that nvitop does not support MIG-enabled devices yet, but we are working on it. It would be very nice if you could help us make nvitop better.

Ref wookayin/gpustat#102


When running nvitop on a MIG-enabled A100 GPU, nvitop fails to detect the processes running on the GPU and their GPU memory consumption, which can otherwise be viewed by running the command nvidia-smi.

nvitop has not been tested on MIG-enabled devices. (I don't have any A100/A30 GPU available though.) Could you please run the following commands on your device? The output could be very helpful for identifying the error.

python3 -m venv venv  # create a virtual environment
source venv/bin/activate
pip3 install nvidia-ml-py==11.450.51   # the pinned version for nvitop
python3 test.py
pip3 install nvidia-ml-py==11.450.129  # the newer version
python3 test.py
deactivate
rm -rf venv

nvidia-smi

The content of test.py:

from pynvml import *

nvmlInit()
print('Driver version: {}'.format(nvmlSystemGetDriverVersion().decode()))

device = nvmlDeviceGetHandleByIndex(index=0)  # change the GPU index here

print('MIG mode: {}'.format(nvmlDeviceGetMigMode(device)))
print('MIG count: {}'.format(nvmlDeviceGetMaxMigDeviceCount(device)))

print('Memory info from GPU: {}'.format(nvmlDeviceGetMemoryInfo(device)))
print('Utilization rates from GPU: {}'.format(nvmlDeviceGetUtilizationRates(device)))

print('Processes from GPU: {}'.format(list(map(str, nvmlDeviceGetComputeRunningProcesses(device)))))

migDevice = nvmlDeviceGetMigDeviceHandleByIndex(device, index=0)  # change the MIG device index here

print('Memory info from MIG device: {}'.format(nvmlDeviceGetMemoryInfo(migDevice)))
print('Utilization rates from MIG device: {}'.format(nvmlDeviceGetUtilizationRates(migDevice)))

print('Processes from MIG device: {}'.format(list(map(str, nvmlDeviceGetComputeRunningProcesses(migDevice)))))

nvmlShutdown()

Possible Solutions
I think that the MIG naming convention is different from the regular one, and looks something like this:
MIG 7g.80gb Device 0: rather than just Device 0: as is currently set up in the nvitop repo.

Agreed. I think we should redesign the UI and add a new panel for MIG devices.


Steps to reproduce

  • Run the A100 in MIG mode
  • start nvitop: watch -n 0.5 nvitop

You can use the monitor mode of nvitop by:

nvitop -m

Type nvitop --help for more command line options.

Author
Owner

@ki-arie commented on GitHub (Aug 10, 2021):

Hi,
Thanks for the quick response! Here are the outputs of running the above commands:

  • For the pinned version (pip3 install nvidia-ml-py==11.450.51 followed by python3 test.py), the console output is:
Driver version: 450.142.00
MIG mode: [1, 1]
MIG count: 7
Traceback (most recent call last):
  File "test.py", line 11, in <module>
    print('Utilization rates from GPU: {}'.format(nvmlDeviceGetUtilizationRates(device)))
  File "/opt/conda/envs/nvitop/lib/python3.8/site-packages/pynvml.py", line 1993, in nvmlDeviceGetUtilizationRates
    _nvmlCheckReturn(ret)
  File "/opt/conda/envs/nvitop/lib/python3.8/site-packages/pynvml.py", line 697, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.NVMLError_NotSupported: Not Supported
  • For the newer install (pip3 install nvidia-ml-py==11.450.129), the console output is:
Driver version: 450.142.00
MIG mode: [1, 1]
MIG count: 7
Traceback (most recent call last):
  File "test.py", line 11, in <module>
    print('Utilization rates from GPU: {}'.format(nvmlDeviceGetUtilizationRates(device)))
  File "/opt/conda/envs/nvitop/lib/python3.8/site-packages/pynvml.py", line 2009, in nvmlDeviceGetUtilizationRates
    _nvmlCheckReturn(ret)
  File "/opt/conda/envs/nvitop/lib/python3.8/site-packages/pynvml.py", line 703, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.NVMLError_NotSupported: Not Supported

AFAIK, this is expected: to use MIG mode you've gotta stop the A100s from using Fabric Manager and NVLink.

  • Running nvidia-smi results in this screen:
    (screenshot of the nvidia-smi output)

Let me know how else I can help with this - this library's pretty cool :)
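The Not Supported errors above could be handled defensively, returning a placeholder instead of crashing. A minimal sketch of that pattern (hypothetical helper, not nvitop's actual code; the NVML call and exception are stubbed so the sketch is self-contained):

```python
# In real use, `query` would be e.g. pynvml.nvmlDeviceGetUtilizationRates and
# the caught exception pynvml.NVMLError; both are stubbed here.
NA = 'N/A'

def nvml_query(query, *args, default=NA):
    """Return query(*args), or `default` when the driver reports an error."""
    try:
        return query(*args)
    except Exception:  # stands in for pynvml.NVMLError
        return default

# Stub demonstrating the behavior on an unsupported query:
def unsupported(_handle):
    raise RuntimeError('Not Supported')

print(nvml_query(unsupported, 0))      # → N/A
print(nvml_query(lambda h: 42, 0))     # → 42
```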

Author
Owner

@XuehaiPan commented on GitHub (Aug 11, 2021):

Thanks to @ki-arie !

It seems that we cannot get the GPU-level info (fan speed, memory usage, GPU utilization) on MIG-enabled devices, either from the NVML Python bindings or from the nvidia-smi output.

I'm sorry for the poor exception handling in the example code. Could you try the Python code above again, but in a Python REPL (just type python3 on the command line)?

$ # create virtual environment and pip3 install ...
$ python3
Python 3.9.6 (default, Jun 28 2021, 08:57:49) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pynvml import *
>>> nvmlInit()
...

And it would be better if some processes were running on the MIG device while you test with the NVML bindings. You can try:

pip3 install cupy-cuda102  # replace the suffix here to your CUDA version (e.g. `cuda110` for CUDA 11.0)
python3 -c 'import time; import cupy as cp; x = cp.zeros((1, 1)); time.sleep(120)' &

If you have installed TensorFlow or PyTorch, you can try:

python3 -c 'import time; import torch; x = torch.zeros((1, 1), device="cuda:0"); time.sleep(120)' &

This command will use the GPU for 2 minutes in the background.

Author
Owner

@zabique commented on GitHub (Nov 11, 2021):

(screenshot: image_2021-11-11_203703)

works fine on 2x3090 NVLINK (MIG)

Author
Owner

@XuehaiPan commented on GitHub (Nov 12, 2021):

@zabique Thanks for the report. Glad to see people using nvitop on Windows!

According to your screenshot, you are using dual 3090s on Windows, which is not a MIG setup. BTW, you can change your terminal font for a better experience: some glyphs are missing (shown as ?s in boxes) in the graph views and at the ends of the bars.


NVIDIA Multi-Instance GPU User Guide: Introduction

Introduction

The new Multi-Instance GPU (MIG) feature allows GPUs based on the NVIDIA Ampere architecture (such as NVIDIA A100) to be securely partitioned into up to seven separate GPU Instances for CUDA applications, providing multiple users with separate GPU resources for optimal GPU utilization. This feature is particularly beneficial for workloads that do not fully saturate the GPU’s compute capacity and therefore users may want to run different workloads in parallel to maximize utilization.

The MIG feature is to split one physical GPU into multiple separate GPU instances.

For now, only A100-series and A30 GPUs support MIG mode, and MIG is only available on Linux (NVIDIA Multi-Instance GPU User Guide: Supported GPUs).
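The enumeration pattern implied by the NVML calls earlier in this thread (nvmlDeviceGetMaxMigDeviceCount plus nvmlDeviceGetMigDeviceHandleByIndex) can be sketched with stubs: unused MIG slots raise an error and must be skipped. The helper and stub names below are illustrative only, not nvitop code:

```python
def enumerate_mig_devices(device, get_max_count, get_child):
    """Yield (index, handle) for each populated MIG slot; empty slots raise
    an error (pynvml.NVMLError in real use) and are skipped."""
    for index in range(get_max_count(device)):
        try:
            yield index, get_child(device, index)
        except Exception:  # stands in for pynvml.NVMLError on unused slots
            continue

# Stub demo: a parent GPU with 7 slots, of which 3 are populated.
children = {0: 'mig0', 1: 'mig1', 2: 'mig2'}

def get_child(_device, index):
    if index not in children:
        raise KeyError(index)  # empty slot
    return children[index]

print(list(enumerate_mig_devices(None, lambda d: 7, get_child)))
# → [(0, 'mig0'), (1, 'mig1'), (2, 'mig2')]
```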

Author
Owner

@zabique commented on GitHub (Nov 12, 2021):

Thanks for the reply and the font hint, as I was too shy to ask about it :).

I thought MIG was enabled on my GPUs because nvidia-smi shows it in the top right corner.
I also compared performance on Ubuntu 20.04 and Windows: I can run the same model with pretty much the same performance, plus nvidia-smi on Windows allows a lot more hardware control.

Feel free to ask for any testing.
Your nvitop is great!

Author
Owner

@lixeon commented on GitHub (Nov 17, 2021):

I tested in the Python REPL, and it seems that only a small fix is needed to add MIG support to nvitop.
I hope the information below helps. And thanks for developing this awesome tool.
BTW, if I want to study GPU performance, can I start from these GPU info API calls? How is this different from nvvp or Nsight?

$ python testnvitop.py 
Driver version: 470.42.01
MIG mode: [1, 1]
MIG count: 7
Memory info from GPU: c_nvmlMemory_t(total: 42505273344 B, free: 27517911040 B, used: 14987362304 B)
Traceback (most recent call last):
  File "/home/testnvitop.py", line 12, in <module>
    print('Utilization rates from GPU: {}'.format(nvmlDeviceGetUtilizationRates(device)))
  File "/home/anaconda3/envs/testnvitop/lib/python3.9/site-packages/pynvml.py", line 1993, in nvmlDeviceGetUtilizationRates
    _nvmlCheckReturn(ret)
  File "/home/anaconda3/envs/testnvitop/lib/python3.9/site-packages/pynvml.py", line 697, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.NVMLError_NotSupported: Not Supported
$ python
Python 3.9.7 (default, Sep 16 2021, 13:09:58) 
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pynvml import *
>>> nvmlInit()
>>> print('Driver version: {}'.format(nvmlSystemGetDriverVersion().decode()))
Driver version: 470.42.01
>>> device = nvmlDeviceGetHandleByIndex(index=0) 
>>> print('MIG mode: {}'.format(nvmlDeviceGetMigMode(device)))
MIG mode: [1, 1]
>>> print('MIG count: {}'.format(nvmlDeviceGetMaxMigDeviceCount(device)))
MIG count: 7
>>> print('Memory info from GPU: {}'.format(nvmlDeviceGetMemoryInfo(device)))
Memory info from GPU: c_nvmlMemory_t(total: 42505273344 B, free: 27517911040 B, used: 14987362304 B)
>>> print('Utilization rates from GPU: {}'.format(nvmlDeviceGetUtilizationRates(device)))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/anaconda3/envs/testnvitop/lib/python3.9/site-packages/pynvml.py", line 1993, in nvmlDeviceGetUtilizationRates
    _nvmlCheckReturn(ret)
  File "/home/anaconda3/envs/testnvitop/lib/python3.9/site-packages/pynvml.py", line 697, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.NVMLError_NotSupported: Not Supported
>>> print('Processes from GPU: {}'.format(list(map(str, nvmlDeviceGetComputeRunningProcesses(device)))))
Processes from GPU: ["{'pid': 1562472, 'usedGpuMemory': 2104492032}", "{'pid': 1695752, 'usedGpuMemory': 2775580672}", "{'pid': 1698220, 'usedGpuMemory': 2832203776}", "{'pid': 1701131, 'usedGpuMemory': 2345664512}", "{'pid': 1702015, 'usedGpuMemory': 2588934144}", "{'pid': 1733844, 'usedGpuMemory': 2291138560}"]
>>> migDevice = nvmlDeviceGetMigDeviceHandleByIndex(device, index=0) 
>>> print('Memory info from MIG device: {}'.format(nvmlDeviceGetMemoryInfo(migDevice)))
Memory info from MIG device: c_nvmlMemory_t(total: 10468982784 B, free: 5387976704 B, used: 5081006080 B)
>>> print('Utilization rates from MIG device: {}'.format(nvmlDeviceGetUtilizationRates(migDevice)))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/anaconda3/envs/testnvitop/lib/python3.9/site-packages/pynvml.py", line 1993, in nvmlDeviceGetUtilizationRates
    _nvmlCheckReturn(ret)
  File "/home/anaconda3/envs/testnvitop/lib/python3.9/site-packages/pynvml.py", line 697, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.NVMLError_InvalidArgument: Invalid Argument
>>> print('Processes from MIG device: {}'.format(list(map(str, nvmlDeviceGetComputeRunningProcesses(migDevice)))))
Processes from MIG device: ["{'pid': 1695752, 'usedGpuMemory': 2775580672}", "{'pid': 1733844, 'usedGpuMemory': 2291138560}"]
>>> nvmlShutdown()
>>> exit()

This time the process running situation in GPU.

$ nvidia-smi
Wed Nov 17 11:58:34 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 470.42.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  Off  | 00000000:05:00.0 Off |                   On |
| N/A   84C    P0   154W / 250W |  14293MiB / 40536MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    3   0   0  |   4845MiB /  9984MiB | 28      0 |  2   0    1    0    0 |
|                  |      8MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    9   0   1  |   2014MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      8MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   10   0   2  |   2708MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      4MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   11   0   3  |   2244MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      8MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   12   0   4  |   2476MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      8MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   13   0   5  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      6MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0    3    0    1695752      C   python                           2647MiB |
|    0    3    0    1733844      C   python                           2185MiB |
|    0    9    0    1562472      C   python                           2007MiB |
|    0   10    0    1698220      C   python                           2701MiB |
|    0   11    0    1701131      C   python                           2237MiB |
|    0   12    0    1702015      C   python                           2469MiB |
+-----------------------------------------------------------------------------+
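As a sanity check, the byte counts returned by nvmlDeviceGetMemoryInfo above line up with the MiB figures in this nvidia-smi table (NVML reports bytes; nvidia-smi prints binary mebibytes):

```python
# Values copied from the NVML and nvidia-smi outputs above.
MIB = 1 << 20  # 1 MiB = 2**20 bytes

used_bytes = 14987362304   # nvmlDeviceGetMemoryInfo(device).used
total_bytes = 42505273344  # nvmlDeviceGetMemoryInfo(device).total

print(used_bytes // MIB, total_bytes // MIB)
# → 14293 40536, matching "14293MiB / 40536MiB" in the nvidia-smi table
```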
Author
Owner

@XuehaiPan commented on GitHub (Nov 17, 2021):

I tested in the Python REPL, and it seems that only a small fix is needed to add MIG support to nvitop.
I hope the information below helps.

@lixeon Thanks a lot for the informative results. I'll try to improve nvitop on MIG enabled devices.


BTW, if I want to study GPU performance, can I start from these GPU info API calls? How is this different from nvvp or Nsight?

From the [NVML API Reference](https://docs.nvidia.com/deploy/nvml-api/nvml-api-reference.html#nvml-api-reference):

> The NVIDIA Management Library (NVML) is a C-based programmatic interface for monitoring and managing various states within NVIDIA Tesla™ GPUs.

NVML and the applications built on it (`nvidia-smi`, `nvidia-ml-py`, `nvitop`, `nvtop`, `gpustat`, etc.) are designed to monitor GPU states from a global view. These tools can only capture a process's overall GPU SM and VRAM usage; they are not designed for code profiling.

Nsight is a profiling tool that captures more fine-grained GPU usage information ([nvvp is deprecated](https://developer.nvidia.com/blog/migrating-nvidia-nsight-tools-nvvp-nvprof)); it counts the running time of each API call.
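To make "global view" concrete: an NVML-based monitor is essentially a sampling loop like the sketch below. `make_bar` is a toy text renderer for illustration only, not `nvitop`'s actual drawing code.

```python
def make_bar(percent, width=20):
    """Render a toy text utilization bar, e.g. '|█████.....|  50%'.
    (Illustration only -- not nvitop's renderer.)"""
    filled = round(percent / 100 * width)
    return '|' + '█' * filled + '.' * (width - filled) + f'| {percent:3d}%'


def monitor_once():
    """One sampling pass over all GPUs -- the whole-device granularity
    that NVML exposes; there is no per-kernel or per-line breakdown."""
    import pynvml  # provided by the nvidia-ml-py package

    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            # util.gpu: % of time at least one kernel was running
            # util.memory: % of time device memory was being accessed
            print(f'GPU {i}: SM {make_bar(util.gpu)}  '
                  f'MEM {make_bar(util.memory)}')
    finally:
        pynvml.nvmlShutdown()
```

A profiler like Nsight instead hooks the CUDA API and timing hardware, which is why it can attribute time to individual kernels while NVML tools cannot.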


@XuehaiPan commented on GitHub (Jun 15, 2022):

Hi all! I've added MIG support to the GUI. To try it out:

```bash
git clone --branch=mig-support https://github.com/XuehaiPan/nvitop.git
cd nvitop

python3 -m venv --upgrade-deps venv
source venv/bin/activate

pip3 install -r requirements.txt
python3 nvitop.py -m
```

Any feedback is welcome.


@XuehaiPan commented on GitHub (Jun 26, 2022):

Closed as resolved by PR #8.


@XuehaiPan commented on GitHub (Jul 14, 2022):

I just got `sudo` access to an A100 GPU. I've tweaked the CLI's visual output and may release it soon.

![MIG](https://user-images.githubusercontent.com/16078332/178963038-a5cd4eb5-02a8-4456-966f-d5ff04eb44d8.png)


@ki-arie commented on GitHub (Jul 14, 2022):

Omg incredible work! 🤩


@ytaoeer commented on GitHub (Nov 3, 2023):

So `nvitop` can't get a MIG device's GPU utilization and SM usage?


@XuehaiPan commented on GitHub (Nov 3, 2023):

> so nvitop can't get the migdevice's gpu utilization and sm?

@ytaoeer `nvitop` is based on the NVML library. The API reference for [`nvmlDeviceGetUtilizationRates`](https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries_1g540824faa6cef45500e0d1dc2f50b321) notes:

> **Note:**
>
> - During driver initialization when ECC is enabled one can see high GPU and Memory Utilization readings. This is caused by ECC Memory Scrubbing mechanism that is performed during driver initialization.
> - On MIG-enabled GPUs, querying device utilization rates is not currently supported.

No NVML-based monitoring tool (including `nvidia-smi`) can track the GPU utilization of individual MIG instances. You can submit a feature request upstream to NVML to ask for this support.
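To make the failure mode concrete: on a MIG-enabled GPU the utilization query raises `NVML_ERROR_NOT_SUPPORTED`, so a monitor has to fall back to displaying `N/A`, which is what `nvidia-smi` shows in that column. A minimal sketch of that fallback follows; the stand-in exception class exists only so the snippet runs without the NVIDIA stack installed.

```python
try:
    import pynvml  # provided by the nvidia-ml-py package
    NVMLNotSupported = pynvml.NVMLError_NotSupported
except ImportError:  # stand-in so the sketch runs without a GPU stack
    class NVMLNotSupported(Exception):
        pass


def sm_utilization_or_na(query_fn, handle):
    """Return query_fn(handle).gpu, or 'N/A' when NVML reports the
    query as unsupported -- which nvmlDeviceGetUtilizationRates does
    on a MIG-enabled GPU."""
    try:
        return query_fn(handle).gpu
    except NVMLNotSupported:
        return 'N/A'
```

With a real device handle you would pass `pynvml.nvmlDeviceGetUtilizationRates` as `query_fn`; on a MIG-enabled GPU every such call comes back `'N/A'`.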
