[GH-ISSUE #77] [Question] live metrics collector #47

Closed
opened 2026-05-05 03:23:25 -06:00 by gitea-mirror · 5 comments
Owner

Originally created by @mehrazi on GitHub (Jun 28, 2023).
Original GitHub issue: https://github.com/XuehaiPan/nvitop/issues/77

Originally assigned to: @XuehaiPan on GitHub.

Required prerequisites

  • I have read the documentation https://nvitop.readthedocs.io.
  • I have searched the Issue Tracker that this hasn't already been reported. (comment there if it has.)
  • I have tried the latest version of nvitop in a new isolated virtual environment.

Questions

Hi
Thanks for your great repo.
I have a question about the values of the metrics that nvitop collects.
If I'm not mistaken, the API returns mean/max/min values for specified intervals. I need to collect the absolute values for each second or every few seconds.
Is there any way to handle this?


@XuehaiPan commented on GitHub (Jul 2, 2023):

> I need to collect the absolute values for each second or every few seconds.

Hi @mehrdadazizi72, sorry for the late reply. Have you tried managing the Device instances and collecting the values manually? That way, the timing is fully controlled by your own code. For example:

from nvitop import Device, take_snapshots

devices = Device.all()

device_snapshots, _ = take_snapshots(devices, gpu_processes=False)  # synchronous call; blocks the calling thread

# do something

device_snapshots, _ = take_snapshots(devices, gpu_processes=False)  # synchronous call; blocks the calling thread

> If I'm not mistaken, the API returns mean/max/min values for specified intervals

Yes, this is the current behavior. There is some delay in the Python-C API conversion and in system-driver-device communication. The current collector API collects metrics asynchronously in a separate thread to avoid blocking the main program, so it does not return the exact values. If you want the exact values, you need to use the synchronous implementation.
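A minimal sketch of such a synchronous polling loop is shown here. The helper names `read_gpu_and_cpu` and `poll` are hypothetical, the key names in the returned dictionary are made up for illustration, and reading the host CPU usage through `psutil` is an assumption rather than something suggested in this thread.

```python
import time


def read_gpu_and_cpu():
    """Return one instantaneous reading for the host CPU and every GPU.

    nvitop and psutil are imported lazily so that poll() below stays
    usable (and testable) on machines without them.
    """
    import psutil
    from nvitop import Device

    # With interval=None, psutil reports usage since the previous call,
    # so the very first reading is not meaningful.
    sample = {'host/cpu_percent': psutil.cpu_percent(interval=None)}
    for device in Device.all():
        prefix = f'gpu:{device.index}'  # hypothetical key layout
        sample[f'{prefix}/gpu_utilization'] = device.gpu_utilization()
        sample[f'{prefix}/memory_used'] = device.memory_used()
    return sample


def poll(read, count, interval=1.0, sleep=time.sleep, clock=time.monotonic):
    """Call read() `count` times in the calling thread, `interval` seconds apart.

    Returns a list of (elapsed_seconds, reading) pairs.
    """
    start = clock()
    rows = []
    for i in range(count):
        rows.append((clock() - start, read()))
        if i < count - 1:
            sleep(interval)
    return rows
```

Calling `poll(read_gpu_and_cpu, count=10, interval=1.0)` would block for roughly nine seconds and return ten readings, each taken at the moment of the call rather than aggregated over a window.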

In addition, the current implementation does not collect metrics at the exact interval. For example, with interval=5 and an API call that takes 0.1-0.2 seconds, the metrics will be logged at (0, 5.1, 10.3, 15.4, ...) rather than at exactly (0, 5, 10, 15, ...).
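The drift described above comes from sleeping for a full interval after each call. One common way to avoid it in your own loop is to compute each deadline from the start time instead. The following is only a sketch of that scheduling idea, not nvitop's implementation; `run_at_fixed_rate` is a hypothetical helper name.

```python
import time


def run_at_fixed_rate(task, interval, count, clock=time.monotonic, sleep=time.sleep):
    """Run task() `count` times, aiming call k at start + k * interval.

    Because each deadline is derived from the start time, the time spent
    inside task() does not accumulate into later calls (as long as task()
    finishes within one interval).
    """
    start = clock()
    started_at = []
    for k in range(count):
        delay = start + k * interval - clock()
        if delay > 0:
            sleep(delay)
        started_at.append(clock() - start)
        task()
    return started_at
```

With a task that takes a fraction of a second and `interval=5`, the calls start at approximately 0, 5, 10, 15 seconds instead of slipping later on every iteration.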


@Mousavi-Parisa commented on GitHub (Jul 4, 2023):

Hi @XuehaiPan!
I need to benchmark hardware and plot the exact values, so I tried collecting various GPU and CPU usage metrics with "collect_in_background" in the following piece of code:

import time

from nvitop import ResourceMetricCollector, collect_in_background
from nvitop.callbacks.tensorboard import add_scalar_dict
from torch.utils.tensorboard import SummaryWriter  # assumed; the original post omits its imports


class NvitopClass:
    def __init__(self, project_name, main_tag):
        self._logger = SummaryWriter('runs/{}'.format(project_name))
        self._continue_status = True
        self._main_tag = main_tag

    def on_collect(self, metrics):  # will be called periodically
        if not self._continue_status:  # closed manually by user
            return False
        current_now = int(time.monotonic() - start_time)  # whole seconds since on_start
        add_scalar_dict(writer=self._logger, main_tag=self._main_tag, tag_scalar_dict=metrics, global_step=current_now)
        print(current_now)
        return True

    def on_stop(self, collector):  # will be called only once at stop
        print('The End of GPU Process Management!')

    def on_start(self, collector):  # will be called only once at start
        global start_time
        start_time = time.monotonic()

    def start_process(self):
        self._continue_status = True

    def stop_process(self):
        self._continue_status = False

nvitop_class = NvitopClass(project_name='test', main_tag='gpu_test')
collector = ResourceMetricCollector(interval=1)
collector_hyperparameters = {'collector': collector,
                             'on_start': nvitop_class.on_start,
                             'on_collect': nvitop_class.on_collect,
                             'on_stop': nvitop_class.on_stop}
nvitop_class.start_process()
collect_in_background(**collector_hyperparameters)

I ran into a problem retrieving exact values that show the real-time changes. The interval inaccuracy doesn't matter to me, but CPU usage cannot be retrieved with the synchronous implementation, and exact values cannot be retrieved with the asynchronous one.

Is there a way to log both GPU and CPU metrics? Sync or async is fine, as long as the values are exact.
Thanks in advance.


@XuehaiPan commented on GitHub (Jul 5, 2023):

Hi, @pmi94, thanks for your comment and the code snippet. Do you mean you want to log the exact value on each snapshot? If the answer is yes, I think this can be done by adding a new field last (currently, we only have min/max/mean).
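To illustrate the idea of a `last` field next to min/max/mean, here is a small self-contained sketch of an accumulator that tracks all four. `RunningStat` is a hypothetical name, and the plain arithmetic mean used here is an assumption that may differ from how nvitop computes its own mean.

```python
class RunningStat:
    """Accumulate samples and expose mean/min/max plus the most recent value."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.min = None
        self.max = None
        self.last = None

    def add(self, value):
        self.count += 1
        self.total += value
        self.min = value if self.min is None else min(self.min, value)
        self.max = value if self.max is None else max(self.max, value)
        self.last = value  # the exact value of the latest snapshot

    def summary(self):
        mean = self.total / self.count if self.count else None
        return {'mean': mean, 'min': self.min, 'max': self.max, 'last': self.last}
```

Each snapshot would call `add()` once, and whoever reads `summary()` gets the aggregated statistics together with the latest raw sample.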


@Mousavi-Parisa commented on GitHub (Jul 5, 2023):

Actually, I need the exact values logged on each snapshot for the CPU as well as the GPU, but I think that's only possible for the GPU. Right?


@XuehaiPan commented on GitHub (Jul 6, 2023):

> Actually, I need the exact values logged on each snapshot for the CPU as well as the GPU, but I think that's only possible for the GPU. Right?

@pmi94 ResourceMetricCollector collects snapshots with both CPU and GPU metrics. However, the metrics are only logged when the collector.collect() method is called, not when a snapshot is taken. The background collect interval is not exactly the same as the background snapshot interval. I will look for a way to keep these two intervals synchronized.
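The effect of two unsynchronized intervals can be seen in a small simulation. The timings below are hypothetical and are not taken from nvitop's internals; the sketch only counts how many snapshots fall between consecutive collect calls.

```python
def samples_per_collect(snapshot_times, collect_times):
    """Count how many snapshots fall into each window ending at a collect time."""
    counts = []
    previous = float('-inf')
    for now in collect_times:
        counts.append(sum(1 for t in snapshot_times if previous < t <= now))
        previous = now
    return counts


# Hypothetical timings in milliseconds: a snapshot every 700 ms, a collect every 1000 ms.
snapshots = list(range(0, 4000, 700))     # 0, 700, 1400, 2100, 2800, 3500
collects = list(range(1000, 5000, 1000))  # 1000, 2000, 3000, 4000
counts = samples_per_collect(snapshots, collects)
print(counts)
```

Because the number of snapshots per window varies, a value logged at collect time does not correspond to exactly one snapshot unless the two intervals are locked together.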

Reference: github-starred/nvitop#47