[PR #21] [MERGED] Async Status Collector for Logger Integration (e.g. CSV or TensorBoard) #123

opened 2026-05-05 03:26:17 -06:00 by gitea-mirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/XuehaiPan/nvitop/pull/21
Author: @XuehaiPan
Created: 6/22/2022
Status: Merged
Merged: 6/26/2022
Merged by: @XuehaiPan

Base: mig-support ← Head: collector


📝 Commits (10+)

  • b8c10f8 feat(core): add metric collector
  • 5fb0628 docs: update README.md for ResourceMetricCollector
  • 2e27014 chore(core/collector): add timestamp to results
  • b2990dc docs: update README.md for ResourceMetricCollector
  • ffe8394 feat(core/collector): add reset method
  • 72b7b89 refactor(core/collector): rename methods
  • aff3767 feat(core/collector): feat log metric min/max
  • dd651a6 chore(core/collector): use ASCII characters
  • cddad4a feat(core): pickle support for Device and Process objects
  • aec57f5 chore(core/collector): do not log stats for running time

📊 Changes

8 files changed (+867 additions, -137 deletions)

View changed files

📝 README.md (+150 -7)
➕ nvitop/callbacks/lightning.py (+6 -0)
➕ nvitop/callbacks/tensorboard.py (+10 -0)
📝 nvitop/callbacks/utils.py (+1 -1)
📝 nvitop/core/__init__.py (+2 -125)
➕ nvitop/core/collector.py (+683 -0)
📝 nvitop/core/device.py (+9 -3)
📝 nvitop/core/process.py (+6 -1)

📄 Description

Issue Type

  • Improvement/feature implementation

Runtime Environment

  • Operating system and version: e.g. Ubuntu 20.04 LTS
  • Terminal emulator and version: GNOME Terminal 3.36.2
  • Python version: 3.9.13
  • NVML version (driver version): 470.129
  • nvitop version or commit: N/A
  • python-ml-py version: 11.450.51
  • Locale: en_US.UTF-8

Description

Add a status collector for logger integration.

Core APIs:

collector = ResourceMetricCollector(devices, root_pids, interval=1.0)

collector.activate(tag='<tag>')  # alias: start
collector.deactivate()           # alias: stop
collector.reset(tag='<tag>')
collector.collect()              # -> Dict[str, float]
# Context manager
with collector.context(tag='...'):  # or `with collector(tag='...')`
    ...
    collector.collect()  # -> Dict[str, float]
# Nested context manager
with collector.context(tag='tagA'):
    ...
    collector.collect()      # key='tagA/type/key (unit)'
    
    # nested context
    with collector.context(tag='tagB'):
        ...
        collector.collect()  # key='tagA/tagB/type/key (unit)'

    ...
    collector.collect()      # key='tagA/type/key (unit)'

For example:

>>> import os
>>> os.environ['CUDA_VISIBLE_DEVICES'] = '3,2,1,0'

>>> from nvitop import ResourceMetricCollector, Device, CudaDevice

>>> collector = ResourceMetricCollector()                          # log all devices and children processes on the GPUs of the current process
>>> collector = ResourceMetricCollector(root_pids={1})             # log all devices and all GPU processes
>>> collector = ResourceMetricCollector(devices=CudaDevice.all())  # use the CUDA ordinal

>>> with collector(tag='<tag>'):
...     # do something
...     collector.collect()  # -> Dict[str, float]
# key -> '<tag>/<scope>/<metric (unit)>/<mean/min/max>'
{
    '<tag>/host/cpu_percent (%)/mean': 8.967849777683456,
    '<tag>/host/cpu_percent (%)/min': 6.1,
    '<tag>/host/cpu_percent (%)/max': 28.1,
    ...,
    '<tag>/host/memory_percent (%)/mean': 21.5,
    '<tag>/host/swap_percent (%)/mean': 0.3,
    '<tag>/host/memory_used (GiB)/mean': 91.0136418208109,
    '<tag>/host/load_average (%) (1 min)/mean': 10.251427386878328,
    '<tag>/host/load_average (%) (5 min)/mean': 10.072539414569503,
    '<tag>/host/load_average (%) (15 min)/mean': 11.91126970422139,
    ...,
    '<tag>/cuda:0 (gpu:3)/memory_used (MiB)/mean': 3.875,
    '<tag>/cuda:0 (gpu:3)/memory_free (MiB)/mean': 11015.562499999998,
    '<tag>/cuda:0 (gpu:3)/memory_total (MiB)/mean': 11019.437500000002,
    '<tag>/cuda:0 (gpu:3)/memory_percent (%)/mean': 0.0,
    '<tag>/cuda:0 (gpu:3)/gpu_utilization (%)/mean': 0.0,
    '<tag>/cuda:0 (gpu:3)/memory_utilization (%)/mean': 0.0,
    '<tag>/cuda:0 (gpu:3)/fan_speed (%)/mean': 22.0,
    '<tag>/cuda:0 (gpu:3)/temperature (C)/mean': 25.0,
    '<tag>/cuda:0 (gpu:3)/power_usage (W)/mean': 19.11166264116916,
    ...,
    '<tag>/cuda:1 (gpu:2)/memory_used (MiB)/mean': 8878.875,
    ...,
    '<tag>/cuda:2 (gpu:1)/memory_used (MiB)/mean': 8182.875,
    ...,
    '<tag>/cuda:3 (gpu:0)/memory_used (MiB)/mean': 9286.875,
    ...,
    '<tag>/pid:12345/host/cpu_percent (%)/mean': 151.34342772112265,
    '<tag>/pid:12345/host/host_memory (MiB)/mean': 44749.72373447514,
    '<tag>/pid:12345/host/host_memory_percent (%)/mean': 8.675082352111717,
    '<tag>/pid:12345/host/running_time (min)': 336.23803206741576,
    '<tag>/pid:12345/cuda:1 (gpu:4)/gpu_memory (MiB)/mean': 8861.0,
    '<tag>/pid:12345/cuda:1 (gpu:4)/gpu_memory_percent (%)/mean': 80.4,
    '<tag>/pid:12345/cuda:1 (gpu:4)/gpu_memory_utilization (%)/mean': 6.711118172407917,
    '<tag>/pid:12345/cuda:1 (gpu:4)/gpu_sm_utilization (%)/mean': 48.23283397736476,
    ...,
    '<tag>/duration (s)': 7.247399162035435,
    '<tag>/timestamp': 1655909466.9981883
}
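Because collect() returns a flat dict keyed as '<tag>/<scope>/<metric (unit)>/<mean|min|max>', downstream filtering is a one-line dict comprehension. A small illustrative sketch (the sample values below are made up; only the key layout matches the output shown above):

```python
metrics = {
    'train/host/cpu_percent (%)/mean': 8.97,
    'train/cuda:0 (gpu:3)/memory_used (MiB)/mean': 3.875,
    'train/duration (s)': 7.25,
}

# Keep only per-device means: the scope (second path component)
# starts with 'cuda:' and the statistic suffix is '/mean'.
gpu_means = {
    key: value
    for key, value in metrics.items()
    if key.split('/')[1].startswith('cuda:') and key.endswith('/mean')
}
```

The same pattern works for selecting a single process scope ('pid:12345') or a single statistic.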

The results can be easily logged to TensorBoard or to a CSV file.

  • To TensorBoard:
import os

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.tensorboard import SummaryWriter

from nvitop import CudaDevice, ResourceMetricCollector
from nvitop.callbacks.tensorboard import add_scalar_dict

# Build networks and prepare datasets
...

# Logger and status collector
writer = SummaryWriter()
collector = ResourceMetricCollector(devices=CudaDevice.all(),  # log all visible CUDA devices and use the CUDA ordinal
                                    root_pids={os.getpid()},   # only log the children processes of the current process
                                    interval=1.0)              # snapshot interval for background daemon thread

# Start training
global_step = 0
for epoch in range(num_epoch):
    with collector(tag='train'):
        for batch in train_dataset:
            with collector(tag='batch'):
                metrics = train(net, batch)
                global_step += 1
                add_scalar_dict(writer, 'train', metrics, global_step=global_step)
                add_scalar_dict(writer, 'resources',      # tag='resources/train/batch/...'
                                collector.collect(),
                                global_step=global_step)

        add_scalar_dict(writer, 'resources',              # tag='resources/train/...'
                        collector.collect(),
                        global_step=epoch)

    with collector(tag='validate'):
        metrics = validate(net, validation_dataset)
        add_scalar_dict(writer, 'validate', metrics, global_step=epoch)
        add_scalar_dict(writer, 'resources',              # tag='resources/validate/...'
                        collector.collect(),
                        global_step=epoch)
  • To CSV:
import datetime
import time

import pandas as pd

from nvitop import ResourceMetricCollector

collector = ResourceMetricCollector(root_pids={1}, interval=2.0)  # log all devices and all GPU processes
df = pd.DataFrame()

with collector(tag='resources'):
    for _ in range(12):
        # Do something
        time.sleep(5)

        metrics = collector.collect()
        df_metrics = pd.DataFrame.from_records(metrics, index=[len(df)])
        df = pd.concat([df, df_metrics], ignore_index=True)
        # Flush to CSV file ...

df.insert(0, 'time', df['resources/timestamp'].map(datetime.datetime.fromtimestamp))
df.to_csv('results.csv', index=False)

Motivation and Context

Resolves #18
Resolves #20

Testing

N/A


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
