[GH-ISSUE #109] [Question] How to log GPU performance to wandb #69

Closed
opened 2026-05-05 03:24:24 -06:00 by gitea-mirror · 2 comments
Owner

Originally created by @BitCalSaul on GitHub (Nov 29, 2023).
Original GitHub issue: https://github.com/XuehaiPan/nvitop/issues/109

Originally assigned to: @XuehaiPan on GitHub.

Required prerequisites

  • I have searched the Issue Tracker that this hasn't already been reported. (comment there if it has.)
  • I have tried the latest version of nvitop in a new isolated virtual environment.

Motivation

Hey, I am a super fan of nvitop. I usually use a second monitor to watch my GPU performance over time, but it's hard to keep a record of it. So I'd like to use nvitop together with wandb, but I don't know how to set it up. I'm wondering if you could provide an example for this. Thanks!

Solution

No response

Alternatives

No response

Additional context

No response

gitea-mirror 2026-05-05 03:24:24 -06:00
  • closed this issue
  • added the question label

@XuehaiPan commented on GitHub (Nov 29, 2023):

Hi @BitCalSaul, thanks for raising this. Logging metrics to wandb works much like logging to TensorBoard. You can read the example in the Resource Metric Collector section of the README.

import os

import torch
import torch.nn as nn
import torch.nn.functional as F
import wandb

from nvitop import CudaDevice, ResourceMetricCollector

# Build networks and prepare datasets
...

# Status collector (the wandb run below serves as the logger)
collector = ResourceMetricCollector(devices=CudaDevice.all(),  # log all visible CUDA devices and use the CUDA ordinal
                                    root_pids={os.getpid()},   # only log the descendant processes of the current process
                                    interval=1.0)              # snapshot interval for background daemon thread

# W&B Session
run = wandb.init()

# Start training
global_step = 0
for epoch in range(num_epoch):
    with collector(tag='train'):
        for batch in train_dataset:
            with collector(tag='batch'):
                algorithm_metrics = train(net, batch)

                # Collect batch level resource metrics
                resource_metrics = collector.collect()  # {'train/batch/<name>': value, ...}

                # Add a prefix if necessary
                algorithm_metrics = {
                    f'train/{key}': value for key, value in algorithm_metrics.items()
                }
                resource_metrics = {
                    f'resources/{key}': value for key, value in resource_metrics.items()
                }

                global_step += 1

                # Log metrics to W&B
                metrics = {**algorithm_metrics, **resource_metrics}
                run.log(metrics, step=global_step)

        # Collect epoch level resource metrics
        resource_metrics = collector.collect()  # {'train/<name>': value, ...}
        # Add a prefix if necessary
        resource_metrics = {f'resources/{key}': value for key, value in resource_metrics.items()}
        # W&B steps must be monotonically increasing, so reuse the batch counter
        run.log(resource_metrics, step=global_step)

    with collector(tag='validate'):
        algorithm_metrics = validate(net, validation_dataset)

        # Collect epoch level resource metrics
        resource_metrics = collector.collect()  # {'validate/<name>': value, ...}

        # Add a prefix if necessary
        algorithm_metrics = {f'validate/{key}': value for key, value in algorithm_metrics.items()}
        resource_metrics = {f'resources/{key}': value for key, value in resource_metrics.items()}
        
        # Log metrics to W&B
        metrics = {**algorithm_metrics, **resource_metrics}
        run.log(metrics, step=global_step)  # keep the step counter monotonic for W&B
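The prefixing and merging bookkeeping in the snippet above is plain dict manipulation, independent of nvitop or wandb. A minimal, self-contained sketch (the metric names here are made up for illustration):

```python
def add_prefix(metrics, prefix):
    """Prepend 'prefix/' to every metric key."""
    return {f'{prefix}/{key}': value for key, value in metrics.items()}


# Hypothetical dicts standing in for train() output and collector.collect() output
algorithm_metrics = {'loss': 0.25, 'accuracy': 0.91}
resource_metrics = {'train/batch/gpu:0/memory_used (MiB)/mean': 1024.0}

# Merge both into one flat dict, ready for a single run.log() call
merged = {
    **add_prefix(algorithm_metrics, 'train'),
    **add_prefix(resource_metrics, 'resources'),
}
# merged now holds 'train/loss', 'train/accuracy', and
# 'resources/train/batch/...' keys
```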

You can also run the collector in a background daemon thread. See the README for more details.
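For intuition, the background-daemon pattern can be sketched in plain Python: a daemon thread samples a `collect()` callable on an interval and forwards each snapshot to a logging callback, stopping when the callback returns False. This is a generic sketch, not nvitop's actual API; in the real setup `collect` would be `collector.collect` and the callback would forward to `run.log`.

```python
import threading
import time


def start_metrics_daemon(collect, on_collect, interval=1.0):
    """Call collect() every `interval` seconds and pass the result to
    on_collect() until on_collect() returns False. The thread is a
    daemon, so it never blocks interpreter shutdown."""
    def loop():
        while on_collect(collect()):
            time.sleep(interval)

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread


# Usage with stand-in callables (no GPU required)
samples = []

def fake_collect():
    return {'gpu/utilization': 42.0}

def fake_log(metrics):
    samples.append(metrics)
    return len(samples) < 3  # stop after three snapshots

thread = start_metrics_daemon(fake_collect, fake_log, interval=0.01)
thread.join(timeout=2.0)
```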


@BitCalSaul commented on GitHub (Nov 29, 2023):

Thanks for the example and your hard work :)

Reference: github-starred/nvitop#69