mirror of
https://github.com/XuehaiPan/nvitop.git
synced 2026-05-15 14:15:55 -06:00
[GH-ISSUE #65] [Feature Request] Refresh rate < 1 sec #38
Originally created by @BlueskyFR on GitHub (Apr 6, 2023).
Original GitHub issue: https://github.com/XuehaiPan/nvitop/issues/65
Originally assigned to: @XuehaiPan on GitHub.
Required prerequisites
Motivation
I see the current minimum refresh rate is 1 second.
Could it be something like 0.1 sec so that we can get a more accurate overview of what is happening on the GPU?
Solution
Alternatives
Additional context
@XuehaiPan commented on GitHub (Apr 6, 2023):
Duplicate of #32, #63.
Hi @BlueskyFR, the latency from the NVML API call is relatively high. I think it's meaningless to support small intervals like 0.1 second. If you want a fine-grained report of resource usage, maybe you should use a profiler instead.
<Enter> key. The metrics on the top row will refresh every 1/4 sec.
Watch metrics for a specific process (shortcut: Enter / Return).
nvitop.ResourceMetricCollector, see Resource Metric Collector for more information.
@BlueskyFR commented on GitHub (Apr 6, 2023):
Thanks for your reply.
Why are calls to NVML so slow?
nvidia-smi supports a resolution up to a 10ms refresh rate, for instance.
@XuehaiPan commented on GitHub (Apr 6, 2023):
@BlueskyFR
nvidia-smi cannot achieve this. The nvidia-smi command will take more time (up to seconds, e.g., 3s) to do a single query.
We can "refresh" the "fake" results every 10ms. But the results may have been queried seconds ago. They are not accurate.
Here are some benchmark results from my side. You can try hyperfine on your machine to see the latency.
It takes 2 seconds to do a single query. It cannot run under 10ms.
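The latency argument above can also be checked from Python. The following is a minimal sketch, not nvitop's own code: `measure_latency` and `stale_refreshes` are hypothetical helper names introduced here for illustration, and the NVML part assumes the `pynvml` module (from the nvidia-ml-py package) is installed and that a GPU with index 0 exists. It times a single query and estimates how many screen redraws would show the same sample if the display refreshed faster than the query can complete.

```python
import math
import statistics
import time
from typing import Callable


def measure_latency(query: Callable[[], object], repeat: int = 5) -> float:
    """Return the median wall-clock time (in seconds) of one call to `query`."""
    samples = []
    for _ in range(repeat):
        start = time.perf_counter()
        query()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def stale_refreshes(latency: float, interval: float) -> int:
    """Estimate how many redraws show the same sample.

    Hypothetical helper: if one query takes `latency` seconds and the
    screen redraws every `interval` seconds, roughly latency / interval
    consecutive redraws can only repeat the previous result.
    """
    return max(1, math.ceil(latency / interval))


def nvml_utilization_query() -> Callable[[], object]:
    """Build a query callable backed by NVML (assumes pynvml and GPU 0)."""
    import pynvml  # provided by the nvidia-ml-py package

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    return lambda: pynvml.nvmlDeviceGetUtilizationRates(handle)


if __name__ == "__main__":
    try:
        query = nvml_utilization_query()
    except Exception as exc:  # pynvml missing, no driver, or no GPU
        print(f"NVML not available: {exc}")
    else:
        latency = measure_latency(query)
        print(f"median latency of one utilization query: {latency * 1e3:.2f} ms")
        print(f"redraws per fresh sample at 10 ms: {stale_refreshes(latency, 0.01)}")
```

This sketch times only one lightweight NVML call. A full monitor refresh also gathers per-process information, so the measured number should be read as a lower bound on what a tool like nvitop pays per update.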
@BlueskyFR commented on GitHub (Apr 6, 2023):
You are maybe using it wrong 😊
You can see my post here for more details -> https://github.com/influxdata/telegraf/issues/8534#issue-761112264
@XuehaiPan commented on GitHub (Apr 6, 2023):
Thanks for the reference.
nvitop already uses sparse queries with nvidia-ml-py instead of a full query using nvidia-smi. But there are still many things that are slow here, such as gathering process information, especially when the number of processes is relatively large (up to hundreds). Also, as I mentioned above, if you don't enable the persistence mode, your nvidia-smi query will take a much longer time.
@BlueskyFR commented on GitHub (Apr 6, 2023):
So I think maybe it is more a design problem?
Maybe the same quantity of information cannot be achieved with nvidia-smi but I doubt it
@XuehaiPan commented on GitHub (Apr 6, 2023):
In your example, you are not querying process information, which is the key feature of nvitop. If you want accurate metrics data, I still think you should use a profiler instead. A day-to-day monitor should not run at a high sampling frequency 24/7. That will lead to high power consumption. If you want to monitor a process for only several minutes, why not use a profiler? It should be the more appropriate tool for your use case.
@BlueskyFR commented on GitHub (Apr 13, 2023):
Could be a solution, what profiler do you have in mind for instance?
@XuehaiPan commented on GitHub (Apr 14, 2023):
@BlueskyFR That depends on your use case because profilers need an in-process injection to add hooks to record kernel times. This may require users to update their code. If you are using PyTorch, you may try torch.profiler.profile (pytorch/kineto). It can collect fine-grained metrics and also comes with a web-based GUI integration. You may also try NVIDIA Nsight Systems, a profiling tool from NVIDIA.