[GH-ISSUE #47] [Feature Request] Collect metrics in a fixed interval for the lifespan of a training job #32

Closed
opened 2026-05-05 03:22:39 -06:00 by gitea-mirror · 8 comments
Owner

Originally created by @hosseinsarshar on GitHub (Nov 12, 2022).
Original GitHub issue: https://github.com/XuehaiPan/nvitop/issues/47

Hi @XuehaiPan,

In your examples of collecting metrics with ResourceMetricCollector inside a training loop, collector.collect() takes a snapshot at each epoch/batch iteration, which misses the entire period between the previous and the current iteration.
If an iteration takes 5 minutes, we only get metrics at 5-minute intervals.

I wonder if there is a way to run a process in the background that collects the metrics at a fixed interval, say 5 seconds, for the lifespan of a training job?

That way, if the entire job took 1 hour, a 5-second interval would yield 720 snapshots.

Thanks


@XuehaiPan commented on GitHub (Nov 12, 2022):

@classicboyir Hi, thanks for the feedback.

> I wonder if there is a way to run a process in background to collect the metrics at a certain interval let's say 5 seconds, during the lifespan of a training job?

I think this would be a good use case and I would like to add it to nvitop. It can be achieved by running the collector in a separate thread with a callback function, like:

import time
import threading

from nvitop import ResourceMetricCollector


def collect_in_background(
    on_collect,
    collector=None,
    interval=None,
    *,
    on_start=None,
    on_stop=None,
    tag='metrics-daemon',
    start=True,
):
    if collector is None:
        collector = ResourceMetricCollector()
    if interval is None:
        interval = collector.interval
    interval = min(interval, collector.interval)

    def target():
        if on_start is not None:
            on_start(collector)
        try:
            with collector(tag):
                try:
                    while on_collect(collector.collect()):
                        time.sleep(interval)
                except KeyboardInterrupt:
                    pass
        finally:
            if on_stop is not None:
                on_stop(collector)

    daemon = threading.Thread(target=target, daemon=True)
    if start:
        daemon.start()
    return daemon


def main():
    logger = ...

    def on_collect(metrics):
        if logger.is_closed():  # closed manually by user
            return False
        logger.log(metrics)
        return True

    def on_stop(collector):
        if not logger.is_closed():
            logger.close()  # cleanup

    background_collector = ResourceMetricCollector()
    collect_in_background(on_collect, background_collector, interval=5.0, on_stop=on_stop)

    # Use a separate collector for foreground
    # otherwise it will mess with the 'metrics-daemon' tag
    foreground_collector = ResourceMetricCollector()

    for epoch in range(100):
        with foreground_collector('epoch'):
            # Do something
            for batch in range(100):
                with foreground_collector('batch'):
                    # Do something
                    pass
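
The `logger = ...` placeholder in `main()` is left open in the example above. As a minimal sketch, assuming a logger only needs the three methods the example calls (`log`, `close`, `is_closed`), a hypothetical in-memory stand-in could look like this. `InMemoryLogger` is an illustrative name and not part of nvitop:

```python
import threading


class InMemoryLogger:
    """Hypothetical stand-in for the `logger = ...` placeholder (not an nvitop class)."""

    def __init__(self):
        # The background thread logs while the main thread may close,
        # so guard the shared state with a lock.
        self._lock = threading.Lock()
        self._closed = False
        self.records = []

    def log(self, metrics):
        with self._lock:
            if self._closed:
                raise ValueError('logger is closed')
            # Store a copy so later mutation by the caller does not alter the record.
            self.records.append(dict(metrics))

    def close(self):
        with self._lock:
            self._closed = True

    def is_closed(self):
        with self._lock:
            return self._closed
```

Calling `logger.close()` from the main thread would then make the next `on_collect` call return `False`, which ends the background loop.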

You can define any on_collect callback, for example one that logs the result to a logger or simply appends it to a list:

lst = [] 

def on_collect(metrics):
    lst.append(metrics)
    return True
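
Because the loop keeps running only while `on_collect` returns a truthy value, the same callback can also decide when to stop. A minimal sketch, assuming you want to keep a fixed number of snapshots; `make_bounded_on_collect` and `drive` are hypothetical helpers, and `drive` feeds fake metrics so the sketch runs without a GPU or nvitop:

```python
def make_bounded_on_collect(max_snapshots):
    """Return (on_collect, snapshots); collection stops after max_snapshots (>= 1)."""
    snapshots = []

    def on_collect(metrics):
        snapshots.append(metrics)
        # Keep going only while there is room for more snapshots.
        return len(snapshots) < max_snapshots

    return on_collect, snapshots


def drive(on_collect, fake_metrics):
    """Stand-in for the background loop: stop as soon as the callback returns False."""
    for metrics in fake_metrics:
        if not on_collect(metrics):
            break


on_collect, snapshots = make_bounded_on_collect(3)
drive(on_collect, ({'step': i} for i in range(10)))
```

After this runs, `snapshots` holds the first three fake entries and the remaining ones are never requested.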

@hosseinsarshar commented on GitHub (Nov 13, 2022):

Love it, thanks for the quick response and look forward to seeing it being natively supported.


@XuehaiPan commented on GitHub (Nov 17, 2022):

@classicboyir Hi, I created PR #48 to resolve this. Could you try:

pip3 install git+https://github.com/XuehaiPan/nvitop.git@collector-daemon

and share your experience with it? Then we can get it merged and released. Thanks!


@hosseinsarshar commented on GitHub (Nov 18, 2022):

Thanks for the update, @XuehaiPan.
I gave this a try. I love it, and it works as expected. I do have a suggestion on the design of the method.

I think it would be better to define collect_in_background as a method of the ResourceMetricCollector class (with a name like begin_collecting_in_background) and call it like this:

collector = ResourceMetricCollector(interval=5.0)
daemon = collector.begin_collecting_in_background(on_collect, on_stop=on_stop)

Instead of taking a ResourceMetricCollector object, it would use self as the collector, so begin_collecting_in_background might only need these parameters:

def begin_collecting_in_background(
        self,
        on_collect,
        on_start=None,
        on_stop=None,
        tag='') -> threading.Thread:
    ...

You also don't need the start parameter, since calling begin_collecting_in_background already signals the intent to start the background thread. Similarly, interval could be dropped, because the method can use the interval of the ResourceMetricCollector instance. Finally, it would return the daemon object so that the client can manage the thread and stop the job.


@XuehaiPan commented on GitHub (Nov 18, 2022):

@classicboyir Thanks for the advice. I added a new shortcut method, daemonize, to the ResourceMetricCollector class:

from nvitop import ResourceMetricCollector

collector = ResourceMetricCollector(...)
collector.daemonize(on_collect_fn, interval=interval, on_start=on_start, on_stop=on_stop)

It is equivalent to:

from nvitop import ResourceMetricCollector, collect_in_background

collector = ResourceMetricCollector(...)
collect_in_background(on_collect_fn, collector, interval=interval, on_start=on_start, on_stop=on_stop)

but has fewer imports.


> And you don't need the start parameter as when you call the begin_collecting_in_background function the intention is to start the background thread. Similarly, interval could be eliminated as it grabs the interval parameter of the ResourceMetricCollector class.

As for the on_start parameter, I think users may want to look up collector.devices or some other attributes at start-up. This method not only initializes the collector but also does some necessary work on start.

For the interval argument, if you omit it or pass interval=None, it will use collector.interval.
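
The interval rule can be written down as a small pure function. This is a sketch based only on the `collect_in_background` snippet posted earlier in this thread, which falls back to `collector.interval` and then takes the minimum of the two values; `resolve_interval` is a hypothetical name, and the released version may behave differently:

```python
def resolve_interval(requested, collector_interval):
    """Mirror the interval handling from the snippet earlier in this thread."""
    if requested is None:
        # Omitted or interval=None: fall back to the collector's own interval.
        return collector_interval
    # The snippet caps the sleep time at the collector's interval.
    return min(requested, collector_interval)


default_case = resolve_interval(None, 5.0)
shorter_case = resolve_interval(2.0, 5.0)
longer_case = resolve_interval(10.0, 5.0)
```

Under this reading, asking for a longer interval than the collector's own has no effect: `longer_case` ends up at 5.0, not 10.0.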


@XuehaiPan commented on GitHub (Nov 18, 2022):

This feature is included in nvitop 0.10.2.


@hosseinsarshar commented on GitHub (Nov 18, 2022):

Thanks @XuehaiPan for adding this feature promptly.
Would you also expose a function to stop the background thread when needed?


@XuehaiPan commented on GitHub (Nov 18, 2022):

> Would you also expose a function to stop the background thread when needed?

@classicboyir You can let the on_collect function return False to stop the thread. Also, the thread is a daemon thread, so you can kill it anyway without breaking the main thread.
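
One way to get an explicit stop switch out of the return value is to tie it to a `threading.Event`. This is a minimal sketch, not an nvitop API: `make_stoppable_on_collect` and `start_fake_sampler` are hypothetical helpers, and the sampler emits fake metrics so the example runs without a GPU:

```python
import threading
import time


def make_stoppable_on_collect(snapshots):
    """Return (on_collect, stop_event); setting the event ends collection."""
    stop_event = threading.Event()

    def on_collect(metrics):
        snapshots.append(metrics)
        # Returning False tells the sampling loop to exit.
        return not stop_event.is_set()

    return on_collect, stop_event


def start_fake_sampler(on_collect, interval):
    """Stand-in for the background loop from this thread, using fake metrics."""

    def target():
        tick = 0
        while on_collect({'tick': tick}):  # real code would pass collector.collect()
            tick += 1
            time.sleep(interval)

    thread = threading.Thread(target=target, daemon=True)
    thread.start()
    return thread
```

With nvitop, the same `on_collect` could presumably be passed to `collector.daemonize(on_collect, interval=5.0)`, and calling `stop_event.set()` would end the loop on the next sample, so the stop can lag by up to one interval.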
