mirror of
https://github.com/XuehaiPan/nvitop.git
synced 2026-05-15 14:15:55 -06:00
[GH-ISSUE #47] [Feature Request] Collect metrics in a fixed interval for the lifespan of a training job #32
Originally created by @hosseinsarshar on GitHub (Nov 12, 2022).
Original GitHub issue: https://github.com/XuehaiPan/nvitop/issues/47
Hi @XuehaiPan,
In your examples that collect metrics using ResourceMetricCollector inside a training loop, collector.collect() takes a snapshot once per epoch/batch iteration, which misses the entire period between the previous and the current iteration. If an iteration takes 5 minutes, we only get metrics at 5-minute intervals.
I wonder if there is a way to run a process in the background that collects the metrics at a fixed interval, let's say 5 seconds, for the lifespan of a training job?
That way, if the entire job takes 1 hour, a 5-second interval yields 720 snapshots.
Thanks
@XuehaiPan commented on GitHub (Nov 12, 2022):
@classicboyir Hi, thanks for the feedback.
I think this would be a good use case and I would like to add it to nvitop. It can be achieved by running the collector in a separate thread with a callback function.
You can define an on_collect callback that, for example, logs the result to a logger or just appends it to a list.
@hosseinsarshar commented on GitHub (Nov 13, 2022):
Love it, thanks for the quick response and look forward to seeing it being natively supported.
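The background-thread approach discussed above can be sketched with the Python standard library alone. This is only an illustration under stated assumptions, not the nvitop implementation: fake_collect is a hypothetical stand-in for collector.collect(), and start_background_collection and stop_event are names invented for this example.

```python
import threading
import time


def fake_collect():
    """Hypothetical stand-in for collector.collect(); returns one snapshot."""
    return {'timestamp': time.monotonic()}


def start_background_collection(collect, on_collect, interval, stop_event):
    """Call on_collect(collect()) every `interval` seconds until stop_event is set.

    Illustrative helper only; it is not part of nvitop.
    """

    def target():
        while not stop_event.is_set():
            on_collect(collect())
            # Event.wait doubles as a sleep that ends early on stop.
            stop_event.wait(interval)

    thread = threading.Thread(target=target, name='metrics-daemon', daemon=True)
    thread.start()
    return thread


snapshots = []  # the callback here simply appends each snapshot to a list
stop_event = threading.Event()
daemon = start_background_collection(fake_collect, snapshots.append, 0.01, stop_event)

time.sleep(0.1)  # stands in for the training job
stop_event.set()
daemon.join(timeout=5.0)
print(f'collected {len(snapshots)} snapshots')
```

The number of snapshots depends on timing, so the example prints the count rather than assuming a fixed value.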
@XuehaiPan commented on GitHub (Nov 17, 2022):
@classicboyir Hi, I created PR #48 to resolve this. Could you try it and share your experience? Then we can get it merged and released. Thanks!
@hosseinsarshar commented on GitHub (Nov 18, 2022):
thanks for the update, @XuehaiPan.
I gave this a try, I love it and it works as expected. I do have a suggestion on the design of the method.
I think it'd be better to define collect_in_background as a member of the ResourceMetricCollector class (using something like begin_collecting_in_background as the function name) and call it like this:
collector = ResourceMetricCollector(interval=5.0)
daemon = collector.begin_collecting_in_background(on_collect, on_stop=on_stop)
Instead of being passed a ResourceMetricCollector object, it would use self as the collector, so begin_collecting_in_background might need only a reduced set of parameters.
You also wouldn't need the start parameter, since calling begin_collecting_in_background already signals the intention to start the background thread. Similarly, interval could be eliminated, as the method can use the interval parameter of the ResourceMetricCollector class. Finally, it would return the daemon object so the client can manage and stop the thread.
@XuehaiPan commented on GitHub (Nov 18, 2022):
@classicboyir Thanks for the advice. I added a new shortcut method, daemonize, to the ResourceMetricCollector class. It is equivalent to calling collect_in_background with the collector passed explicitly, but needs fewer imports.
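The relationship between a free function and a shortcut method can be sketched as follows. This is a rough illustration rather than nvitop's code: SketchCollector and collect_in_background_sketch are hypothetical stand-ins, and the interval fallback and the "return a falsy value to stop" convention are modeled on what this thread describes.

```python
import threading
import time


def collect_in_background_sketch(on_collect, collector, interval=None):
    """Illustrative free function: poll `collector` in a daemon thread."""
    if interval is None:
        interval = collector.interval  # fall back to the collector's own interval

    def target():
        # Keep collecting while the callback returns a truthy value.
        while on_collect(collector.collect()):
            time.sleep(interval)

    thread = threading.Thread(target=target, daemon=True)
    thread.start()
    return thread


class SketchCollector:
    """Hypothetical stand-in for a metric collector; not the nvitop class."""

    def __init__(self, interval):
        self.interval = interval
        self.calls = 0

    def collect(self):
        self.calls += 1
        return {'call': self.calls}

    def daemonize(self, on_collect, interval=None):
        # Shortcut: same as the free function with `self` as the collector.
        return collect_in_background_sketch(on_collect, self, interval)


results = []


def on_collect(metrics):
    results.append(metrics)
    return len(results) < 3  # stop after three snapshots


collector = SketchCollector(interval=0.01)
daemon = collector.daemonize(on_collect)  # only the collector needs importing
daemon.join(timeout=5.0)
```

The method form saves the caller from importing the free function separately, which is the "fewer imports" point made above.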
As for the parameter on_start, I think the user may want to look up collector.devices or some other attributes at start-up. This method not only initializes the collector but also does some necessary work on start.
For the interval argument, if you omit it or pass interval=None, it will use collector.interval.
@XuehaiPan commented on GitHub (Nov 18, 2022):
This feature is included in nvitop 0.10.2.
@hosseinsarshar commented on GitHub (Nov 18, 2022):
Thanks @XuehaiPan for adding this feature promptly.
Would you also expose a function to stop the background thread when needed?
@XuehaiPan commented on GitHub (Nov 18, 2022):
@classicboyir You can let the on_collect function return False to stop the thread. Also, the thread is a daemon thread, so you can kill it anyway without breaking the main thread.
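One way to stop the thread on demand using the "return False" convention is to have the callback consult a flag that the main thread sets. The sketch below assumes a simplified polling loop (worker) in place of nvitop's background thread; stop_requested and worker are hypothetical names introduced for this example.

```python
import threading
import time

stop_requested = threading.Event()
snapshots = []


def on_collect(metrics):
    snapshots.append(metrics)
    # Returning False tells the polling loop to exit.
    return not stop_requested.is_set()


def worker(interval):
    """Simplified stand-in for the background collection loop."""
    count = 0
    while True:
        count += 1
        if not on_collect({'sample': count}):
            break
        time.sleep(interval)


# daemon=True means the thread never blocks interpreter shutdown.
thread = threading.Thread(target=worker, args=(0.01,), daemon=True)
thread.start()

time.sleep(0.05)      # stands in for the training job
stop_requested.set()  # ask the callback to return False on its next call
thread.join(timeout=5.0)
```

Because the flag is checked only when the callback runs, the thread stops within roughly one interval of the request rather than immediately.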