[PR #198] support mthreads gpu monitoring #200

opened 2026-05-05 03:27:44 -06:00 by gitea-mirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/XuehaiPan/nvitop/pull/198
Author: @gingerXue
Created: 12/19/2025
Status: 🔄 Open

Base: main ← Head: feat/mtgpu-support

📝 Commits (1)

a1c3f09 support mthreads-ml-py

📊 Changes

3 files changed (+44 additions, -9 deletions)

📝 nvitop/api/libnvml.py (+33 -9)
📝 nvitop/api/utils.py (+10 -0)
📝 pyproject.toml (+1 -0)

📄 Description

Issue Type

Improvement/feature implementation

Runtime Environment

Operating system and version: Ubuntu 22.04.4 LTS
Terminal emulator and version: xterm-256color
Python version: 3.10.12
NVML version (driver version): N/A
MTML version: 2.2.0
nvitop version or commit: 1.6.2.dev4+g31792dd
mthreads-ml-py version: 2.2.0
Locale: C.UTF-8

Description

This PR adds Mthreads GPU (mtml) support to nvitop, enabling basic GPU monitoring on platforms where mtml is available. We developed a wrapper layer around mthreads-ml-py that exposes NVML-style methods, so nvitop can call it through the existing NVML code path with only minimal changes to this project.

The implementation is designed to be non-intrusive and fully backward compatible with existing NVML-based workflows.
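The wrapper idea can be illustrated with a minimal sketch. This is not the code from the PR: `FakeVendorBackend` and its methods are hypothetical stand-ins for a vendor library, and the adapter methods only borrow NVML-style names with simplified signatures (an index instead of a device handle).

```python
from collections import namedtuple

# NVML-style memory record, all values in bytes.
MemoryInfo = namedtuple("MemoryInfo", ["total", "free", "used"])


class FakeVendorBackend:
    """Hypothetical stand-in for a vendor library; NOT the real mtml API."""

    def __init__(self, devices):
        self._devices = devices  # list of (total_bytes, used_bytes) tuples

    def device_count(self):
        return len(self._devices)

    def memory_usage(self, index):
        return self._devices[index]


class NvmlStyleAdapter:
    """Expose NVML-like method names on top of the vendor backend."""

    def __init__(self, backend):
        self._backend = backend

    def nvmlDeviceGetCount(self):
        return self._backend.device_count()

    def nvmlDeviceGetMemoryInfo(self, index):
        # The backend reports only total and used; derive free from them.
        total, used = self._backend.memory_usage(index)
        return MemoryInfo(total=total, free=total - used, used=used)


GiB = 1024 ** 3
adapter = NvmlStyleAdapter(FakeVendorBackend([(80 * GiB, 20 * GiB)]))
info = adapter.nvmlDeviceGetMemoryInfo(0)
print(adapter.nvmlDeviceGetCount(), info.free // GiB)
```

Because callers see only the NVML-style names, code written against them does not need to know which vendor library sits underneath.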

Motivation and Context

nvitop currently relies on NVIDIA NVML, which makes it unusable on systems equipped with MTGPU devices.
In such environments, users lack a lightweight, top-like GPU monitoring tool.

This PR aims to:

Extend nvitop to support MTGPU-based platforms
Preserve existing behavior on NVIDIA GPUs
Minimize impact on the current code structure

Design & Implementation

Introduced a new backend based on mtml, parallel to the existing NVML backend
Runtime detection is used to select the appropriate backend:
- nvml → NVIDIA GPUs
- mtml → MTGPU devices
Implemented a compatibility layer to map MTGPU APIs to nvitop's internal data structures
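Runtime backend selection could be sketched as an import fallback, as below. This is an illustration under assumptions rather than the PR's actual logic: `select_backend` is a hypothetical helper, and a real implementation would likely also need to initialize the library and confirm that devices are present, since an importable binding does not guarantee a usable driver.

```python
import importlib


def select_backend(candidates):
    """Return (label, module) for the first importable candidate.

    `candidates` is an ordered list of (label, module_name) pairs.
    """
    for label, module_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # binding not installed; try the next candidate
        return label, module
    raise RuntimeError("no supported GPU management library found")


# Demonstration with standard-library modules so the sketch runs anywhere:
# the first candidate does not exist, so the second one is selected.
label, module = select_backend(
    [("missing", "no_such_gpu_library"), ("fallback", "json")]
)
print(label, module.__name__)
```

On a real system the candidate list might look like `[("nvml", "pynvml"), ("mtml", "pymtml")]`; the module name `pymtml` is an assumption here and should be checked against the mthreads-ml-py package.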

Currently Supported Features (MTGPU)

Driver Version
GPU device enumeration
Total / used memory reporting
Basic utilization metrics
Power usage

Not Yet Supported

MIG-related features
Process enumeration and utilization
CUDA driver version information
Persistence mode
Bus ID information
Advanced performance counters (not available in mtml)
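One way to keep the interface usable while these features are missing is to return a placeholder instead of raising. The sketch below is an assumption-laden illustration, not the PR's code: `query`, `bus_id`, and `power_usage` are hypothetical helpers, and the power value is made up.

```python
NA = "N/A"  # placeholder shown for values the backend cannot provide


def query(func, *args, default=NA):
    """Call a backend function, returning `default` if it is unsupported."""
    try:
        return func(*args)
    except NotImplementedError:
        return default


def bus_id(index):
    # Hypothetical: this backend does not expose the bus ID yet.
    raise NotImplementedError("bus ID is not exposed by this backend")


def power_usage(index):
    # Hypothetical supported query; made-up value in milliwatts.
    return 250_000


print(query(bus_id, 0))
print(query(power_usage, 0))
```

With this pattern, callers can render every field uniformly and unsupported ones simply show the placeholder.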

Testing

Tested on:

MTGPU platform with mtml

Manual test cases include:

nvitop startup and refresh
MTGPU information display
Memory usage display
Mixed error handling when NVML is not present

Basic API test

from nvitop import Device

count = Device.count()
print(f'There are {count} MUSA devices')
devices = Device.all()

for device in devices:
    processes = device.processes()
    sorted_pids = sorted(processes)
    
    print(device)
    print(f'  - Fan speed:       {device.fan_speed()}%')
    print(f'  - Temperature:     {device.temperature()}C')
    print(f'  - GPU utilization: {device.gpu_utilization()}%')
    print(f'  - Total memory:    {device.memory_total_human()}')
    print(f'  - Used memory:     {device.memory_used_human()}')
    print(f'  - Free memory:     {device.memory_free_human()}')
    print(f'  - Processes ({len(processes)}): {sorted_pids}')
    for pid in sorted_pids:
        print(f'    - {processes[pid]}')
    print('-' * 120)

There are 8 MUSA devices
PhysicalDevice(index=0, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     52C
  - GPU utilization: 0%
  - Total memory:    80.00GiB
  - Used memory:     78.88GiB
  - Free memory:     1148MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=1, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     67C
  - GPU utilization: 99%
  - Total memory:    80.00GiB
  - Used memory:     73.63GiB
  - Free memory:     6519MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=2, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     67C
  - GPU utilization: 99%
  - Total memory:    80.00GiB
  - Used memory:     71.03GiB
  - Free memory:     9187MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=3, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     59C
  - GPU utilization: 59%
  - Total memory:    80.00GiB
  - Used memory:     78.23GiB
  - Free memory:     1810MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=4, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     77C
  - GPU utilization: 99%
  - Total memory:    80.00GiB
  - Used memory:     73.39GiB
  - Free memory:     6765MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=5, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     69C
  - GPU utilization: 99%
  - Total memory:    80.00GiB
  - Used memory:     72.68GiB
  - Free memory:     7497MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=6, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     78C
  - GPU utilization: 99%
  - Total memory:    80.00GiB
  - Used memory:     75.62GiB
  - Free memory:     4480MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=7, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     63C
  - GPU utilization: 99%
  - Total memory:    80.00GiB
  - Used memory:     72.48GiB
  - Free memory:     7702MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------

Future Work

Extend MTGPU metrics as mtml evolves
Add automated tests for backend selection
Improve feature parity where possible

Images / Videos

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
