[PR #198] support mthreads gpu monitoring #200

opened 2026-05-05 03:27:44 -06:00 by gitea-mirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/XuehaiPan/nvitop/pull/198
Author: @gingerXue
Created: 12/19/2025
Status: 🔄 Open

Base: main ← Head: feat/mtgpu-support


📝 Commits (1)

  • a1c3f09 support mthreads-ml-py

📊 Changes

3 files changed (+44 additions, -9 deletions)

📝 nvitop/api/libnvml.py (+33 -9)
📝 nvitop/api/utils.py (+10 -0)
📝 pyproject.toml (+1 -0)

📄 Description

Issue Type

  • Improvement/feature implementation

Runtime Environment

  • Operating system and version: Ubuntu 22.04.4 LTS
  • Terminal emulator and version: xterm-256color
  • Python version: 3.10.12
  • NVML version (driver version): N/A
  • MTML version: 2.2.0
  • nvitop version or commit: 1.6.2.dev4+g31792dd
  • mthreads-ml-py version: 2.2.0
  • Locale: C.UTF-8

Description

This PR adds Mthreads GPU (mtml) support to nvitop, enabling basic GPU monitoring on platforms where mtml is available. We developed a wrapper layer for mthreads-ml-py that exposes NVML-style methods, so only minimal changes are needed in this project.

The implementation is designed to be non-intrusive and fully backward compatible with existing NVML-based workflows.
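
To illustrate the wrapper idea described above, here is a minimal sketch of an adapter that exposes NVML-style function names on top of a different backend. The NVML-style names (`nvmlDeviceGetCount`, `nvmlDeviceGetHandleByIndex`, `nvmlDeviceGetName`) follow the usual Python NVML bindings; `FakeMtmlLibrary`, `NvmlStyleAdapter`, and `list_device_names` are hypothetical names for illustration only and do not reflect the actual mthreads-ml-py API or the code in this PR.

```python
class FakeMtmlLibrary:
    """Hypothetical stand-in for a vendor library (NOT the real mthreads-ml-py API)."""

    def __init__(self, device_names):
        self._device_names = list(device_names)

    def count_devices(self):
        return len(self._device_names)

    def device_name(self, index):
        return self._device_names[index]


class NvmlStyleAdapter:
    """Expose NVML-style function names on top of the stand-in library."""

    def __init__(self, backend):
        self._backend = backend

    def nvmlDeviceGetCount(self):
        return self._backend.count_devices()

    def nvmlDeviceGetHandleByIndex(self, index):
        if not 0 <= index < self._backend.count_devices():
            raise IndexError(f'device index out of range: {index}')
        return index  # in this sketch the handle is simply the index

    def nvmlDeviceGetName(self, handle):
        return self._backend.device_name(handle)


def list_device_names(nvml):
    """Caller code written only against NVML-style names."""
    return [
        nvml.nvmlDeviceGetName(nvml.nvmlDeviceGetHandleByIndex(index))
        for index in range(nvml.nvmlDeviceGetCount())
    ]


adapter = NvmlStyleAdapter(FakeMtmlLibrary(['MTT S5000', 'MTT S5000']))
names = list_device_names(adapter)
print(names)
```

Because the caller only touches NVML-style names, the consuming project would not need to change when the backend underneath is swapped.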


Motivation and Context

nvitop currently relies on NVIDIA NVML, which makes it unusable on systems equipped with MTGPU devices.
In such environments, users lack a lightweight, top-like GPU monitoring tool.

This PR aims to:

  • Extend nvitop to support MTGPU-based platforms
  • Preserve existing behavior on NVIDIA GPUs
  • Minimize impact on the current code structure

Design & Implementation

  • Introduced a new backend based on mtml, parallel to the existing NVML backend
  • Runtime detection is used to select the appropriate backend:
    • nvml → NVIDIA GPUs
    • mtml → MTGPU devices
  • Implemented a compatibility layer to map MTGPU APIs to nvitop's internal data structures
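
The runtime detection listed above could be sketched roughly as follows. This is an illustration under assumptions, not the logic in this PR: it treats "the Python binding is importable" as a proxy for "the backend is usable", and the import name `pymtml` for mthreads-ml-py is a guess (`pynvml` is the module shipped by nvidia-ml-py). `CANDIDATES`, `module_available`, and `select_backend` are hypothetical names.

```python
import importlib.util

# (backend label, import name) pairs, checked in order.
# 'pymtml' is an assumed import name for mthreads-ml-py and may differ.
CANDIDATES = (('nvml', 'pynvml'), ('mtml', 'pymtml'))


def module_available(name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(name) is not None


def select_backend(candidates=CANDIDATES, is_available=module_available):
    """Return the label of the first available backend, or None if none is found."""
    for backend, module_name in candidates:
        if is_available(module_name):
            return backend
    return None


print(select_backend())
```

Checking NVML first keeps existing behavior on NVIDIA systems unchanged. The `is_available` parameter is injectable so the selection order can be tested without either library installed.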

Currently Supported Features (MTGPU)

  • Driver Version
  • GPU device enumeration
  • Total / used memory reporting
  • Basic utilization metrics
  • Power usage

Not Yet Supported

  • MIG-related features
  • Process enumeration and utilization
  • CUDA driver version information
  • Persistence Mode
  • Bus ID information
  • Advanced performance counters (not available in mtml)
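
One way a monitoring tool can cope with queries that a backend does not support is to fall back to a placeholder value instead of failing. The sketch below shows that pattern in isolation. All names (`UnsupportedQueryError`, `query_or_na`, the two query functions) and the power value are made up for illustration, and the mechanism actually used by nvitop or by this PR may differ.

```python
NOT_AVAILABLE = 'N/A'


class UnsupportedQueryError(Exception):
    """Raised by a backend when it does not implement a query (hypothetical)."""


def query_or_na(func, *args, default=NOT_AVAILABLE):
    """Call a backend query and fall back to a placeholder if it is unsupported."""
    try:
        return func(*args)
    except UnsupportedQueryError:
        return default


def get_power_usage_mw(index):
    """Stand-in for a supported query; the value is invented."""
    return 150_000


def get_bus_id(index):
    """Stand-in for a query the backend does not support yet."""
    raise UnsupportedQueryError('bus ID is not available on this backend')


print(query_or_na(get_power_usage_mw, 0))
print(query_or_na(get_bus_id, 0))
```

Only the dedicated "unsupported" error is swallowed here, so genuine failures such as a lost device would still surface to the caller.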

Testing

Tested on:

  • MTGPU platform with mtml

Manual test cases include:

  • nvitop startup and refresh
  • MTGPU information
  • Memory usage display
  • Mixed error handling when NVML is not present

Basic API test

```python
from nvitop import Device

count = Device.count()
print(f'There are {count} MUSA devices')
devices = Device.all()

for device in devices:
    # Mapping of PID -> process object for processes running on this device
    processes = device.processes()
    sorted_pids = sorted(processes)

    print(device)
    print(f'  - Fan speed:       {device.fan_speed()}%')
    print(f'  - Temperature:     {device.temperature()}C')
    print(f'  - GPU utilization: {device.gpu_utilization()}%')
    print(f'  - Total memory:    {device.memory_total_human()}')
    print(f'  - Used memory:     {device.memory_used_human()}')
    print(f'  - Free memory:     {device.memory_free_human()}')
    print(f'  - Processes ({len(processes)}): {sorted_pids}')
    for pid in sorted_pids:
        print(f'    - {processes[pid]}')
    print('-' * 120)
```

```
There are 8 MUSA devices
PhysicalDevice(index=0, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     52C
  - GPU utilization: 0%
  - Total memory:    80.00GiB
  - Used memory:     78.88GiB
  - Free memory:     1148MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=1, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     67C
  - GPU utilization: 99%
  - Total memory:    80.00GiB
  - Used memory:     73.63GiB
  - Free memory:     6519MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=2, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     67C
  - GPU utilization: 99%
  - Total memory:    80.00GiB
  - Used memory:     71.03GiB
  - Free memory:     9187MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=3, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     59C
  - GPU utilization: 59%
  - Total memory:    80.00GiB
  - Used memory:     78.23GiB
  - Free memory:     1810MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=4, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     77C
  - GPU utilization: 99%
  - Total memory:    80.00GiB
  - Used memory:     73.39GiB
  - Free memory:     6765MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=5, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     69C
  - GPU utilization: 99%
  - Total memory:    80.00GiB
  - Used memory:     72.68GiB
  - Free memory:     7497MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=6, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     78C
  - GPU utilization: 99%
  - Total memory:    80.00GiB
  - Used memory:     75.62GiB
  - Free memory:     4480MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=7, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     63C
  - GPU utilization: 99%
  - Total memory:    80.00GiB
  - Used memory:     72.48GiB
  - Free memory:     7702MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
```
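
As a quick sanity check on the memory figures above, the reported free memory should match total minus used once the display rounding is accounted for. The sketch below checks the first three devices. It assumes the human-readable values are rounded to their displayed precision (0.01 GiB and 1 MiB); `samples`, `free_mismatch_mib`, and `TOLERANCE_MIB` are names introduced here for illustration.

```python
GIB_IN_MIB = 1024

# (total GiB, used GiB, free MiB) copied from devices 0-2 in the output above.
samples = [
    (80.00, 78.88, 1148),
    (80.00, 73.63, 6519),
    (80.00, 71.03, 9187),
]

# Two values rounded to 0.01 GiB can each be off by 0.005 GiB (5.12 MiB),
# and the free value rounded to whole MiB can be off by 0.5 MiB.
TOLERANCE_MIB = 2 * 0.005 * GIB_IN_MIB + 0.5


def free_mismatch_mib(total_gib, used_gib, free_mib):
    """Return |total - used - free| expressed in MiB."""
    return abs((total_gib - used_gib) * GIB_IN_MIB - free_mib)


for total_gib, used_gib, free_mib in samples:
    mismatch = free_mismatch_mib(total_gib, used_gib, free_mib)
    status = 'consistent' if mismatch <= TOLERANCE_MIB else 'inconsistent'
    print(f'free={free_mib}MiB mismatch={mismatch:.2f}MiB -> {status}')
```

All three samples fall within the rounding tolerance, which suggests the total, used, and free values are derived consistently from the same underlying query.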

Future Work

  • Extend MTGPU metrics as mtml evolves
  • Add automated tests for backend selection
  • Improve feature parity where possible

Images / Videos

Image: https://github.com/user-attachments/assets/e195d148-eaa5-4394-9199-ba62b94951d6

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
