[GH-ISSUE #109] [Question] How to log GPU performance to wandb #69

Closed
opened 2026-05-05 03:24:24 -06:00 by gitea-mirror · 2 comments
Owner

Originally created by @BitCalSaul on GitHub (Nov 29, 2023).
Original GitHub issue: https://github.com/XuehaiPan/nvitop/issues/109

Originally assigned to: @XuehaiPan on GitHub.

Required prerequisites

  • I have searched the Issue Tracker that this hasn't already been reported. (comment there if it has.)
  • I have tried the latest version of nvitop in a new isolated virtual environment.

Motivation

Hey, I am a super fan of nvitop. I usually use a second monitor to watch my GPU performance over time, but it's hard to keep a record of it. So I'd like to use nvitop together with wandb, but I don't know how to set it up. I'm wondering if you could provide an example for this. Thanks!

Solution

No response

Alternatives

No response

Additional context

No response

gitea-mirror 2026-05-05 03:24:24 -06:00
  • closed this issue
  • added the question label

@XuehaiPan commented on GitHub (Nov 29, 2023):

Hi @BitCalSaul, thanks for raising this. Logging metrics to wandb works much like logging to TensorBoard. You can read the example in the Resource Metric Collector section of the README.

import os

import torch
import torch.nn as nn
import torch.nn.functional as F
import wandb

from nvitop import CudaDevice, ResourceMetricCollector

# Build networks and prepare datasets
...

# Status collector (the wandb run below serves as the logger)
collector = ResourceMetricCollector(devices=CudaDevice.all(),  # log all visible CUDA devices and use the CUDA ordinal
                                    root_pids={os.getpid()},   # only log the descendant processes of the current process
                                    interval=1.0)              # snapshot interval for background daemon thread

# W&B Session
run = wandb.init()

# Start training
global_step = 0
for epoch in range(num_epoch):
    with collector(tag='train'):
        for batch in train_dataset:
            with collector(tag='batch'):
                algorithm_metrics = train(net, batch)

                # Collect batch level resource metrics
                resource_metrics = collector.collect()  # {'train/batch/<name>': value, ...}

                # Add a prefix if necessary
                algorithm_metrics = {
                    f'train/{key}': value for key, value in algorithm_metrics.items()
                }
                resource_metrics = {
                    f'resources/{key}': value for key, value in resource_metrics.items()
                }

                global_step += 1

                # Log metrics to W&B
                metrics = {**algorithm_metrics, **resource_metrics}
                run.log(metrics, step=global_step)

        # Collect epoch level resource metrics
        resource_metrics = collector.collect()  # {'train/<name>': value, ...}
        # Add a prefix if necessary
        resource_metrics = {f'resources/{key}': value for key, value in resource_metrics.items()}
        # W&B steps must be monotonically increasing, so reuse the batch counter
        run.log(resource_metrics, step=global_step)

    with collector(tag='validate'):
        algorithm_metrics = validate(net, validation_dataset)

        # Collect epoch level resource metrics
        resource_metrics = collector.collect()  # {'validate/<name>': value, ...}

        # Add a prefix if necessary
        algorithm_metrics = {f'validate/{key}': value for key, value in algorithm_metrics.items()}
        resource_metrics = {f'resources/{key}': value for key, value in resource_metrics.items()}
        
        # Log metrics to W&B
        metrics = {**algorithm_metrics, **resource_metrics}
        run.log(metrics, step=global_step)  # keep the step counter monotonic for W&B
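The prefixing and merging bookkeeping in the snippet above is plain dict manipulation, independent of nvitop or wandb. A minimal, self-contained sketch (the metric names here are made up for illustration):

```python
def add_prefix(metrics, prefix):
    """Prepend 'prefix/' to every metric key."""
    return {f'{prefix}/{key}': value for key, value in metrics.items()}


# Hypothetical dicts standing in for train() output and collector.collect() output
algorithm_metrics = {'loss': 0.25, 'accuracy': 0.91}
resource_metrics = {'train/batch/gpu:0/memory_used (MiB)/mean': 1024.0}

# Merge both into one flat dict, ready for a single run.log() call
merged = {
    **add_prefix(algorithm_metrics, 'train'),
    **add_prefix(resource_metrics, 'resources'),
}
# merged now holds 'train/loss', 'train/accuracy', and
# 'resources/train/batch/...' keys
```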

You can also run the collector in a background daemon thread. See the README for more details.
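For intuition, the background-daemon pattern can be sketched in plain Python: a daemon thread samples a `collect()` callable on an interval and forwards each snapshot to a logging callback, stopping when the callback returns False. This is a generic sketch, not nvitop's actual API; in the real setup `collect` would be `collector.collect` and the callback would forward to `run.log`.

```python
import threading
import time


def start_metrics_daemon(collect, on_collect, interval=1.0):
    """Call collect() every `interval` seconds and pass the result to
    on_collect() until on_collect() returns False. The thread is a
    daemon, so it never blocks interpreter shutdown."""
    def loop():
        while on_collect(collect()):
            time.sleep(interval)

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread


# Usage with stand-in callables (no GPU required)
samples = []

def fake_collect():
    return {'gpu/utilization': 42.0}

def fake_log(metrics):
    samples.append(metrics)
    return len(samples) < 3  # stop after three snapshots

thread = start_metrics_daemon(fake_collect, fake_log, interval=0.01)
thread.join(timeout=2.0)
```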


@BitCalSaul commented on GitHub (Nov 29, 2023):

Thanks for the example and your hard work :)

Reference: github-starred/nvitop#69