[GH-ISSUE #83] [BUG] The sm % reported by nvidia-smi pmon and nvitop -o is inconsistent and differs significantly

gitea-mirror commented

2026-05-05 03:23:33 -06:00

Owner

Originally created by @hui-zhao-1 on GitHub (Aug 2, 2023).
Original GitHub issue: https://github.com/XuehaiPan/nvitop/issues/83

Originally assigned to: @XuehaiPan on GitHub.

Required prerequisites

I have read the documentation https://nvitop.readthedocs.io.
I have searched the Issue Tracker that this hasn't already been reported. (comment there if it has.)
I have tried the latest version of nvitop in a new isolated virtual environment.

What version of nvitop are you using?

1.2.0

Operating system and version

CentOS Linux 7 (Core)

NVIDIA driver version

470.129.06

NVIDIA-SMI

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:1A:00.0 Off |                    0 |
| N/A   53C    P0   254W / 300W |  17231MiB / 32510MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:1B:00.0 Off |                    0 |
| N/A   54C    P0   192W / 300W |  15995MiB / 32510MiB |     97%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:3D:00.0 Off |                    0 |
| N/A   41C    P0    70W / 300W |  10499MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:3E:00.0 Off |                    0 |
| N/A   37C    P0    69W / 300W |  10981MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:88:00.0 Off |                    0 |
| N/A   50C    P0   273W / 300W |  18073MiB / 32510MiB |     98%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   49C    P0   241W / 300W |  10141MiB / 32510MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:B2:00.0 Off |                    0 |
| N/A   41C    P0    71W / 300W |  10499MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:B3:00.0 Off |                    0 |
| N/A   36C    P0    70W / 300W |   6493MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Python environment

3.10.12 (main, Jul 5 2023, 18:54:27) [GCC 11.2.0] linux
gpustat==1.1
nvidia-ml-py==11.470.66
nvitop==1.2.0

Problem description

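I ran both nvidia-smi pmon -i 7 and nvitop -o 7 to watch sm% on GPU 7 and found that the two differ significantly.

nvidia-smi pmon -i 7 shows sm% at 0 most of the time, with occasional non-zero points that are always above 10%.
![image](https://github.com/XuehaiPan/nvitop/assets/19888114/cd7da56a-2789-436f-a9d0-14f39e9a8aeb)

nvitop -o 7 shows utilization below 10% most of the time, with occasional 0% readings.
![image](https://github.com/XuehaiPan/nvitop/assets/19888114/7caba4df-0155-4ac2-ab1a-78fbb196495d)

After reading the source code, I suspect this is related to https://github.com/XuehaiPan/nvitop/blob/main/nvitop/api/device.py line 1706,
where the timestamp calculation subtracts 2_000_000.

I extracted that part of the code into a standalone test case and found that the - 2_000_000 does affect the sm% returned by the query.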

Steps to Reproduce

import schedule
import time
import pynvml
timestamp = 0
def test():
    global timestamp
    gpu_device_count = pynvml.nvmlDeviceGetCount()
    for gpu_index in range(gpu_device_count):
        if gpu_index != 7:
            continue
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        try:
            processes_util = pynvml.nvmlDeviceGetProcessUtilization(handle,timestamp)
            for process in processes_util:
                print(gpu_index,str(process.pid),process.smUtil,process.timeStamp)
                timestamp = process.timeStamp - 2_000_000
                local_time = time.localtime(timestamp /1000 /1000)
                time_format = time.strftime('%Y-%m-%d %H:%M:%S',local_time)
            print("==============================",time_format,len(processes_util))
        except pynvml.NVMLError_NotFound:
            continue

if __name__ == "__main__":
    pynvml.nvmlInit()
    schedule.every(2).seconds.do(test)
    while True:
        schedule.run_pending()
        time.sleep(2)

Traceback

With - 2_000_000 kept, the printed timestamps contain many duplicates and are separated by large gaps:
With - 2_000_000 removed, the printed timestamps look normal.

Logs

# Log with - 2_000_000 kept:
7 212485 0 1690960204701376
============================== 2023-08-02 15:10:02 1
7 212485 0 1690960204701376
============================== 2023-08-02 15:10:02 1
7 212485 0 1690960204701376
============================== 2023-08-02 15:10:02 1
7 212485 0 1690960204702492
============================== 2023-08-02 15:10:02 1
7 212485 0 1690960221427376
============================== 2023-08-02 15:10:19 1
7 212485 0 1690960221427376
============================== 2023-08-02 15:10:19 1
7 212485 0 1690960221427375
============================== 2023-08-02 15:10:19 1
7 212485 0 1690960221427375
============================== 2023-08-02 15:10:19 1
-------------------------------------------------
# Log with - 2_000_000 removed:
7 212485 0 1690960372104743
============================== 2023-08-02 15:12:52 1
7 212485 0 1690960381974198
============================== 2023-08-02 15:13:01 1
7 212485 0 1690960383981202
============================== 2023-08-02 15:13:03 1
7 212485 0 1690960385988421
============================== 2023-08-02 15:13:05 1
7 212485 0 1690960387995655
============================== 2023-08-02 15:13:07 1
7 212485 0 1690960388832086
============================== 2023-08-02 15:13:08 1
7 212485 0 1690960392010563
============================== 2023-08-02 15:13:12 1
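
To make the difference between the two logs easier to see, the gaps between consecutive returned timestamps can be computed directly. This is a small sketch that only uses the timestamps copied from the logs above; the helper name `gaps_us` is made up for illustration.

```python
# Timestamps (microseconds) copied from the two logs above.
kept = [
    1690960204701376, 1690960204701376, 1690960204701376, 1690960204702492,
    1690960221427376, 1690960221427376, 1690960221427375, 1690960221427375,
]
removed = [
    1690960372104743, 1690960381974198, 1690960383981202, 1690960385988421,
    1690960387995655, 1690960388832086, 1690960392010563,
]


def gaps_us(timestamps_us):
    """Return the difference between each timestamp and the previous one."""
    return [b - a for a, b in zip(timestamps_us, timestamps_us[1:])]


kept_gaps = gaps_us(kept)
removed_gaps = gaps_us(removed)

for label, gaps in (("kept", kept_gaps), ("removed", removed_gaps)):
    # Print the gaps in seconds for readability.
    print(label, [f"{gap / 1_000_000:.3f}s" for gap in gaps])
```

With the offset kept, most gaps are zero (the same sample is returned again) and one gap is about 16.7 seconds; with the offset removed, every gap is positive and most are close to the 2-second polling interval.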

Expected behavior

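I don't understand why the timestamp calculation at https://github.com/XuehaiPan/nvitop/blob/main/nvitop/api/device.py line 1706 subtracts 2_000_000, so I'm not sure whether my understanding is correct, or whether this depends on the system or version.
All I have found is that the output of nvidia-smi pmon -i 7 and nvitop -o 7 is inconsistent; I expected the two to agree.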

Additional context

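Page 155 of https://www.clear.rice.edu/comp422/resources/cuda/pdf/nvml.pdf explains the meaning of the timestamp argument of nvmlDeviceGetProcessUtilization.
My guess is that NVIDIA maintains a buffer of the sm% samples from the last n seconds, and that a query behaves as follows:
If the timestamp passed in is 0, it returns the record with the smallest timestamp in the buffer.
If the timestamp passed in is x, it returns the record with the smallest timestamp that is greater than or equal to x.
So if 2_000_000 is subtracted from every returned timestamp, the next query returns the same record again,
until the buffer fills up and the previously returned record is flushed out; only then does the query return the newest data in the buffer at that moment.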
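
The guess above can be played through with a toy simulation. This is only a sketch of the guessed model, not of real NVML behavior: the class `SampleBuffer`, the function `poll`, the fixed capacity, and the one-sample-per-second rate are all invented for illustration. It also assumes a strict "greater than" comparison so that polling without an offset advances; the guess above says "greater than or equal".

```python
from collections import deque


class SampleBuffer:
    """Toy stand-in for the guessed driver-side buffer (hypothetical)."""

    def __init__(self, capacity):
        # The oldest sample is dropped once the buffer is full.
        self.samples = deque(maxlen=capacity)

    def push(self, timestamp_us):
        self.samples.append(timestamp_us)

    def query(self, last_seen_us):
        """Return the oldest buffered timestamp newer than last_seen_us."""
        for timestamp_us in self.samples:
            if timestamp_us > last_seen_us:
                return timestamp_us
        return None


def poll(capacity, offset_us, steps):
    """Poll once per simulated second and record the returned timestamps."""
    buffer = SampleBuffer(capacity)
    last_seen_us = 0
    returned = []
    for step in range(steps):
        buffer.push((step + 1) * 1_000_000)  # one new sample per second
        hit = buffer.query(last_seen_us)
        if hit is not None:
            returned.append(hit)
            last_seen_us = hit - offset_us
    return returned


with_offset = poll(capacity=4, offset_us=2_000_000, steps=8)
without_offset = poll(capacity=4, offset_us=0, steps=8)
print(with_offset)
print(without_offset)
```

Under these assumptions, the run with the offset keeps returning the first sample until it is evicted and then stays several seconds behind the newest sample, while the run without the offset returns a fresh sample on every poll.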

This is pure speculation; I'd appreciate your help clarifying it. Many thanks!!
Thanks for building such a useful tool!!


gitea-mirror

2026-05-05 03:23:33 -06:00

closed this issue
added the pynvml, enhancement, api, bug labels

gitea-mirror commented

2026-05-05 03:23:37 -06:00

Author

Owner

@XuehaiPan commented on GitHub (Aug 2, 2023):

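@2581543189 Thanks for the question!

> I don't understand why main/nvitop/api/device.py line 1706 subtracts 2_000_000 when computing the timestamp, so I'm not sure whether my understanding is correct, or whether this depends on the system or version.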

https://github.com/XuehaiPan/nvitop/blob/ec53de75b4579c319eb6e6b5c1e906d5cb90b561/nvitop/api/device.py#L1699-L1714

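The extra subtraction of 2_000_000 (i.e. 2 seconds) is there so that each API call returns samples whenever possible. This does mean that the utilization rate cannot fully reflect the real-time value. In addition, if no sample is returned for the process with a given pid, line 1714 sets all of its utilization rates to 0. If timeStamp is set too high, the GPU may have collected samples that the call nevertheless does not return.

Note: according to man nvidia-smi, the GPU utilization sample period is between 1 second and 1/6 second; I would guess that the process utilization sample period is of a similar order of magnitude.

NVIDIA NVML documentation: GRID Virtualization APIs nvmlDeviceGetProcessUtilization (https://docs.nvidia.com/deploy/nvml-api/group__nvmlGridQueries.html#group__nvmlGridQueries_1gb0ea5236f5e69e63bf53684a11c233bd)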
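
The trade-off described above can be sketched as a small polling helper. This is not nvitop's actual code: `poll_sm_util`, its parameters, and the fake query used in the demo are hypothetical names chosen for illustration. The helper looks back `lookback_us` before the last seen timestamp and reports 0 for any known pid that has no sample.

```python
from types import SimpleNamespace


def poll_sm_util(query, known_pids, last_seen_us, lookback_us,
                 no_sample_errors=(LookupError,)):
    """Return ({pid: sm_util}, new_last_seen_us) for one polling round.

    query(timestamp_us) is a hypothetical callable returning samples with
    .pid, .smUtil and .timeStamp attributes, or raising when none exist.
    """
    utilization = {pid: 0 for pid in known_pids}  # no sample reads as 0
    try:
        samples = query(max(last_seen_us - lookback_us, 0))
    except no_sample_errors:
        return utilization, last_seen_us
    for sample in samples:
        utilization[sample.pid] = sample.smUtil
        last_seen_us = max(last_seen_us, sample.timeStamp)
    return utilization, last_seen_us


def fake_query(timestamp_us):
    """Demo stand-in for the driver: one sample taken at t = 5 s."""
    samples = [SimpleNamespace(pid=1, smUtil=50, timeStamp=5_000_000)]
    hits = [s for s in samples if s.timeStamp > timestamp_us]
    if not hits:
        raise LookupError("no samples newer than the given timestamp")
    return hits


first, seen = poll_sm_util(fake_query, [1, 2], 0, 2_000_000)
no_lookback, _ = poll_sm_util(fake_query, [1, 2], seen, 0)
with_lookback, _ = poll_sm_util(fake_query, [1, 2], seen, 2_000_000)
print(first, no_lookback, with_lookback)
```

In the demo, the second poll without a look-back finds no new sample and reports 0 for every pid, while the poll with a 2-second look-back returns the old sample again. With pynvml, `query` could be `lambda ts: pynvml.nvmlDeviceGetProcessUtilization(handle, ts)` and `no_sample_errors` could be `(pynvml.NVMLError_NotFound,)`, as in the reproduction script in this issue.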


gitea-mirror commented

2026-05-05 03:23:38 -06:00

Author

Owner

@hui-zhao-1 commented on GitHub (Aug 3, 2023):

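> The extra subtraction of 2_000_000 (i.e. 2 seconds) is there so that each API call returns samples whenever possible.

I'd like to use an example to show the downside of doing this: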

// nvitop-test.cu
//
// nvcc nvitop-test.cu -o nvitop-test -std=c++11
//
#include<stdio.h>
#include<thread>
#include<chrono>
#include<iostream>
#include<cuda_runtime.h>

void sleep(int milliseconds) {
        std::cout << "start sleep()" << milliseconds << " ms" << std::endl;
        auto start = std::chrono::high_resolution_clock::now();
        std::this_thread::sleep_for(std::chrono::milliseconds(milliseconds));
        auto end = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double, std::milli> elapsed = end - start;
        std::cout << "stop sleep(): " << elapsed.count() << " ms" << std::endl;
}

void initialData(float* ip, int size) {
        // generate different seed for random number
        time_t t;
        srand((unsigned)time(&t));
        for (int i = 0; i < size; i++) {
                ip[i] = (float)(rand() & 0xFF) / 10.0f;
        }
}

__global__ void testMaxFlopsKernel(float* pData, long nRepeats, float v1, float v2)
{
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        float s = pData[tid], s2 = 10.0f - s, s3 = 9.0f - s, s4 = 9.0f - s2;
        for (long i = 0; i < nRepeats; i++)
        {
                s = v1 - s * v2;
                s2 = v1 - s * v2;
                s3 = v1 - s2 * v2;
                s4 = v1 - s3 * v2;
        }
        pData[tid] = ((s + s2) + (s3 + s4));
}


int main(int argc, char** argv) {
        // set up device
        int dev = 0;
        cudaSetDevice(dev);

        // set up data size of vectors
        int nElem = 1;
        printf("Vector size %d\n", nElem);
        long nRepeats = 1000000000;
        printf("nRepeats %ld\n", nRepeats);

        // malloc host memory
        size_t nBytes = nElem * sizeof(float);
        float* h_pData;
        h_pData = (float*)malloc(nBytes);

        // initialize data at host side
        initialData(h_pData, nElem);

        // malloc device global memory
        float* d_pData;
        cudaMalloc((float**)&d_pData, nBytes);

        // transfer data from host to device
        cudaMemcpy(d_pData, h_pData, nBytes, cudaMemcpyHostToDevice);

        // invoke kernel at host side
        dim3 block(1, 1, 1);
        dim3 grid(1, 1, 1);

        int index = 0;
        for (index = 0; index <= 1000000; index++) {

                std::cout << "start testMaxFlopsKernel()" << std::endl;
                auto start = std::chrono::steady_clock::now();
                testMaxFlopsKernel << < grid, block >> > (d_pData, nRepeats, 1.0f, 2.0f);
                cudaMemcpy(h_pData, d_pData, nBytes, cudaMemcpyDeviceToHost);
                auto end = std::chrono::steady_clock::now();
                auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
                double time = ms.count();
                std::cout << "stop testMaxFlopsKernel(): " << time << " ms" << std::endl;
                sleep(10000);
        }
        cudaFree(d_pData);
        free(h_pData);
        return(0);
}

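The above is a CUDA program; it can be compiled with nvcc nvitop-test.cu -o nvitop-test -std=c++11.
Its logic is: sleep for 10 s, then submit a kernel to the GPU. On my V100 test machine the kernel takes about 2 s to run.
The corresponding log is: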

start sleep()10000 ms
stop sleep(): 10000.1 ms
start testMaxFlopsKernel()
stop testMaxFlopsKernel(): 2286 ms
start sleep()10000 ms
stop sleep(): 10000.1 ms
start testMaxFlopsKernel()
stop testMaxFlopsKernel(): 2317 ms
start sleep()10000 ms
stop sleep(): 10000.1 ms
start testMaxFlopsKernel()
stop testMaxFlopsKernel(): 2299 ms
start sleep()10000 ms
stop sleep(): 10000.1 ms

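Next, I wrote a Python program that collects nvitop's sm% readings into Prometheus and used Grafana to plot them as a curve to illustrate the problem.
The collection and monitoring code is as follows: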

cat <<EOF | tee /etc/apt/sources.list
deb http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ focal main restricted universe multiverse
deb http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ focal-updates main restricted universe multiverse
deb http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ focal-backports main restricted universe multiverse
EOF

export http_proxy=http://opst:2C8nt8fVEn@10.1.8.50:33128
export https_proxy=http://opst:2C8nt8fVEn@10.1.8.50:33128
export no_proxy=localhost,127.0.0.1,.sensetime.com,.pjlab.org.cn,

pip install flask
pip install schedule
pip install nvitop

# prometheus.py

import os
import re
import threading
from flask import Response, Flask
import prometheus_client
from prometheus_client import Gauge,CollectorRegistry
import asyncio
import schedule
import time
import socket
from nvitop.api import libnvml
from nvitop.gui import Device, colored
import sys
import json

ip_addr=socket.gethostbyname(socket.gethostname())
registry = CollectorRegistry(auto_describe=False)
device_count = 0


def doUpdateMetrics():
    global registry
    newRegistry = CollectorRegistry(auto_describe=False)
    gpu_pid_sm_util =  Gauge("gpu_pid_sm_util", "gpu_pid_sm_util",["ip_addr","gpu_index","pid"], registry=newRegistry)
    gpu_pid_mem_used =  Gauge("gpu_pid_mem_used", "gpu_pid_mem_used",["ip_addr","gpu_index","pid"], registry=newRegistry)
    gpu_pid_mem_total =  Gauge("gpu_pid_mem_total", "gpu_pid_mem_total",["ip_addr","gpu_index","pid"], registry=newRegistry)

    mem_total={}
    sm_util={}
    mem_used={}
    indices = set(range(device_count))
    devices = Device.from_indices(sorted(indices))

    for device in devices:
        mem_total[str(device.index)] = int(device.memory_total() / 1024 / 1024)
        processes = device.processes().values()
        for process in processes:
            sm_util[(str(device.index),str(process.pid))] = process.gpu_sm_utilization()
            mem_used[(str(device.index),str(process.pid))] = int(process._gpu_memory / 1024 / 1024)

    for key in sm_util:
        pid = key[1]
        gpu_index = key[0]
        util = sm_util[key]
        gpu_pid_sm_util.labels(ip_addr,gpu_index,pid).set(util)

    for key in mem_used:
        pid = key[1]
        gpu_index = key[0]
        if gpu_index not in mem_total:
            continue
        total = mem_total[gpu_index]
        used = mem_used[key]
        gpu_pid_mem_total.labels(ip_addr,gpu_index,pid).set(total)
        gpu_pid_mem_used.labels(ip_addr,gpu_index,pid).set(used)
    registry = newRegistry



def updateMetrics():
    global device_count
    loop =  asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        device_count = Device.count()
    except libnvml.NVMLError_LibraryNotFound:
        print("libnvml.NVMLError_LibraryNotFound")
        return
    except libnvml.NVMLError as ex:
        print(
            '{} {}'.format(colored('NVML ERROR:', color='red', attrs=('bold',)), ex),
            file=sys.stderr,
        )
        return
    schedule.every(2).seconds.do(doUpdateMetrics)
    while True:
        schedule.run_pending()
        time.sleep(2)


app = Flask(__name__)
@app.route("/metrics")
def metrics():
    return Response(prometheus_client.generate_latest(registry),mimetype="text/plain")


if __name__ == "__main__":
    thread1 = threading.Thread(target=updateMetrics)
    thread1.start()
    app.run(host="0.0.0.0",port=5000)

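After collecting the data, the SM utilization monitoring for this process looks like the figure below:
![image](https://github.com/XuehaiPan/nvitop/assets/19888114/4e5da2c4-5a59-4fad-b197-a16161e863de)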


gitea-mirror commented

2026-05-05 03:23:38 -06:00

Author

Owner

@hui-zhao-1 commented on GitHub (Aug 3, 2023):

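The program clearly sleeps for 10 s and then works for 2 s, and GPU utilization is 100% while it works.
Yet nvitop reports that the GPU is busy the whole time, never idle, with utilization around 20%.
This does not reflect the GPU's real usage.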
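
As a rough cross-check, the long-run busy fraction implied by the CUDA program's own log can be worked out with a few lines of arithmetic. The numbers are copied from the log earlier in this thread; reading the result as an explanation for the roughly 20% curve is only a hypothesis, not a statement about how NVML computes its values.

```python
# Durations (milliseconds) copied from the CUDA program's log.
kernel_ms = [2286, 2317, 2299]  # "stop testMaxFlopsKernel()" lines
sleep_ms = 10000.1              # "stop sleep()" lines

mean_kernel_ms = sum(kernel_ms) / len(kernel_ms)

# Fraction of each sleep/work cycle during which the kernel is running.
duty_cycle = mean_kernel_ms / (mean_kernel_ms + sleep_ms)
print(f"{duty_cycle:.1%}")
```

The busy fraction comes out slightly below 20%, which is close to the flat level nvitop shows. That would be consistent with the reading being averaged over a whole sleep/work cycle rather than sampled instantaneously.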


gitea-mirror commented

2026-05-05 03:23:39 -06:00

Author

Owner

@hui-zhao-1 commented on GitHub (Aug 3, 2023):

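The figure below shows the result from nvidia-smi pmon:
![image](https://github.com/XuehaiPan/nvitop/assets/19888114/77231f40-483d-4a97-8f42-e62c4d369235)

The figure below shows the result collected by https://github.com/NVIDIA/dcgm-exporter. Because the collection interval is 10 s it is not precise, but the overall shape of the curve is correct:
![image](https://github.com/XuehaiPan/nvitop/assets/19888114/5fd4301e-7dca-4752-8418-1cafcf493e07)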


gitea-mirror commented

2026-05-05 03:23:39 -06:00

Author

Owner

@XuehaiPan commented on GitHub (Aug 3, 2023):

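@2581543189 Thanks for providing such a detailed reproduction script! (I updated the Markdown formatting of your comment to improve readability.)

Does removing - 2_000_000, or reducing its value, solve the problem you describe? I'll also test it locally.

> The program clearly sleeps for 10 s and then works for 2 s, and GPU utilization is 100% while it works.
> Yet nvitop reports that the GPU is busy the whole time, never idle, with utilization around 20%.
> This does not reflect the GPU's real usage.

The averaging here is an internal NVML mechanism, and I have not found detailed documentation about it. For use cases with a high sampling rate, the extra 2 s of smoothing can indeed introduce problems.

I also noticed that your reproduction script uses: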

from nvitop.gui import Device, colored

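The APIs in the nvitop.gui submodule are not part of the public interface, and that submodule is licensed under GPL-3.0 (nvitop.api is under Apache-2.0). You can use this instead: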

from nvitop import Device, colored
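
Building on the import suggested above, the collection loop from the reproduction script could be written against the public package roughly as follows. This is a minimal sketch: `collect_process_metrics` and `main` are hypothetical helper names, and the import is done lazily inside `main` so that the helper can be tried without an NVIDIA driver.

```python
def collect_process_metrics(devices):
    """Flatten per-process readings from device-like objects into rows."""
    rows = []
    for device in devices:
        for pid, process in device.processes().items():
            rows.append({
                'gpu_index': device.index,
                'pid': pid,
                'sm_util': process.gpu_sm_utilization(),
                'gpu_memory': process.gpu_memory(),  # bytes, as reported
            })
    return rows


def main():
    # Public import path, as suggested above (instead of nvitop.gui).
    from nvitop import Device

    for row in collect_process_metrics(Device.all()):
        print(row)
```

Calling `main()` on a machine with the NVIDIA driver installed prints one row per GPU process. Values that are unavailable may come back as nvitop's N/A placeholder rather than a number, so a real exporter would need to handle that case.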


gitea-mirror commented

2026-05-05 03:23:40 -06:00

Author

Owner

@hui-zhao-1 commented on GitHub (Aug 3, 2023):

You can verify it with the following code:

# test.py
import schedule
import time
import pynvml

# Microsecond timestamp passed to NVML as lastSeenTimeStamp; 0 requests all buffered samples.
timestamp = 0


def test():
    global timestamp
    gpu_device_count = pynvml.nvmlDeviceGetCount()
    for gpu_index in range(gpu_device_count):
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        try:
            processes_util = pynvml.nvmlDeviceGetProcessUtilization(handle, timestamp)
            for process in processes_util:
                print(gpu_index, str(process.pid), process.smUtil, process.timeStamp)
                # The next query asks for samples newer than this sample's timestamp minus 2 s.
                timestamp = process.timeStamp - 2_000_000
                local_time = time.localtime(timestamp / 1000 / 1000)
                time_format = time.strftime('%Y-%m-%d %H:%M:%S', local_time)
            print("==============================", time_format, processes_util[0].smUtil)
        except pynvml.NVMLError_NotFound:
            # No samples newer than `timestamp` on this GPU.
            continue


if __name__ == "__main__":
    pynvml.nvmlInit()
    # Poll once per second.
    schedule.every(1).seconds.do(test)
    while True:
        schedule.run_pending()
        time.sleep(1)

Running the earlier CUDA program with `- 2_000_000` kept produces the following output:

============================== 2023-08-03 17:07:26 0
============================== 2023-08-03 17:07:26 0
============================== 2023-08-03 17:07:35 0
============================== 2023-08-03 17:07:34 40
============================== 2023-08-03 17:07:34 44
============================== 2023-08-03 17:07:34 39
============================== 2023-08-03 17:07:34 33
============================== 2023-08-03 17:07:34 29
============================== 2023-08-03 17:07:42 26
============================== 2023-08-03 17:07:41 0
============================== 2023-08-03 17:07:41 0
============================== 2023-08-03 17:07:41 0
============================== 2023-08-03 17:07:41 0
============================== 2023-08-03 17:07:41 0
============================== 2023-08-03 17:07:41 0
============================== 2023-08-03 17:07:49 0
============================== 2023-08-03 17:07:48 64
============================== 2023-08-03 17:07:48 46
============================== 2023-08-03 17:07:48 38
============================== 2023-08-03 17:07:48 33
============================== 2023-08-03 17:07:48 29
============================== 2023-08-03 17:07:48 26
============================== 2023-08-03 17:07:48 23
============================== 2023-08-03 17:07:57 21
============================== 2023-08-03 17:07:56 0

With `- 2_000_000` removed, the output is:

============================== 2023-08-03 17:08:37 0
============================== 2023-08-03 17:08:38 0
============================== 2023-08-03 17:08:39 0
============================== 2023-08-03 17:08:40 85
============================== 2023-08-03 17:08:41 100
============================== 2023-08-03 17:08:42 45
============================== 2023-08-03 17:08:43 0
============================== 2023-08-03 17:08:44 0
============================== 2023-08-03 17:08:45 0
============================== 2023-08-03 17:08:46 0
============================== 2023-08-03 17:08:47 0
============================== 2023-08-03 17:08:48 0
============================== 2023-08-03 17:08:49 0
============================== 2023-08-03 17:08:50 0
============================== 2023-08-03 17:08:51 0
============================== 2023-08-03 17:08:52 57
============================== 2023-08-03 17:08:53 100
============================== 2023-08-03 17:08:54 70
============================== 2023-08-03 17:08:55 0
============================== 2023-08-03 17:08:56 0
============================== 2023-08-03 17:08:57 0
============================== 2023-08-03 17:08:58 0


gitea-mirror commented

2026-05-05 03:23:40 -06:00

Author

Owner

@hui-zhao-1 commented on GitHub (Aug 3, 2023):

![image](https://github.com/XuehaiPan/nvitop/assets/19888114/81b0128f-4995-477f-a66f-2ec028c51956)

gitea-mirror commented

2026-05-05 03:23:41 -06:00

Author

Owner

@XuehaiPan commented on GitHub (Aug 3, 2023):

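@2581543189 I opened a PR that removes this extra -2 s offset. You can try it:

pip3 install git+https://github.com/XuehaiPan/nvitop.git@process-utilization

Note that if `device.processes()` is called at time `t`, then `device._timestamp = t`. When `device.processes()` is called again after `δt`, the timestamp used is `t`, not a value near `t + δt`. When `δt` is large (e.g. > 2 s), the data will still be smoothed.

Also, running `nvidia-smi pmon` or starting a daemon process changes NVML's sampling rate. I'm not yet sure whether this affects the results.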
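
The effect of `δt` can be written down under one assumption that this thread cannot confirm: that the value returned for a query is the mean utilization over the interval since the timestamp that was passed in. Writing `u(τ)` for the instantaneous utilization (a symbol introduced here only for illustration):

```latex
\bar{u}(t, \delta t) = \frac{1}{\delta t} \int_{t}^{t + \delta t} u(\tau)\,\mathrm{d}\tau
```

Under that assumption, a burst of length `b` at 100% that falls entirely inside the window is reported as `100% · b / δt`, so a 2 s burst polled with `δt` = 4 s would show up as 50%.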


gitea-mirror commented

2026-05-05 03:23:41 -06:00

Author

Owner

@hui-zhao-1 commented on GitHub (Aug 4, 2023):

I reinstalled nvitop with

pip3 install git+https://github.com/XuehaiPan/nvitop.git@process-utilization

and collected monitoring data using the CUDA program from my earlier reply. Nothing changed; the collected data looks like this:

![image](https://github.com/XuehaiPan/nvitop/assets/19888114/b9943d92-4f8c-4070-a1e1-5cfeb44927b0)

Then I wondered: if every process query passed the current time as `_timestamp`, instead of the time of the previous sample, would that remove the effect of `δt`? So I forked the repository and made the corresponding change:

https://github.com/XuehaiPan/nvitop/compare/main...2581543189:nvitop:now-timestamp

After installing with the command below, the collected statistics match expectations:

pip3 install git+https://github.com/2581543189/nvitop.git@now-timestamp

The monitoring data is shown below:

![image](https://github.com/XuehaiPan/nvitop/assets/19888114/5b81fe44-0267-4a76-b401-90136300bbec)

gitea-mirror commented

2026-05-05 03:23:41 -06:00

Author

Owner

@XuehaiPan commented on GitHub (Aug 4, 2023):

@2581543189 Thanks for the new feedback. I updated the implementation in PR #85 so that it always calls the NVML API with an epoch timestamp:

samples = libnvml.nvmlQuery(
    'nvmlDeviceGetProcessUtilization',
    self.handle,
-   self._timestamp,
+   time.time_ns() // 1000,
    default=(),
    )

You can try it:

pip3 install git+https://github.com/XuehaiPan/nvitop.git@process-utilization

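It is not yet clear whether such an aggressive timestamp strategy (always using the timestamp at call time) will leave the sample buffer always, or almost always, empty. Internally, nvtop (https://github.com/Syllo/nvtop) uses the largest timestamp among all samples returned by the previous query as the timestamp for the next call:

https://github.com/Syllo/nvtop/blob/be47f8c560487efc6e6a419d59c69bfbdb819324/src/extract_gpuinfo_nvidia.c#L571-L608

Update: My local testing shows that using `timestamp = time.time_ns() // 1000` directly leaves the buffer empty most of the time, so no samples are returned. The latest commit adds an extra 1/4-second margin: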

samples = libnvml.nvmlQuery(
    'nvmlDeviceGetProcessUtilization',
    self.handle,
-   self._timestamp,
+   time.time_ns() // 1000 - 250_000,
    default=(),
    )
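
The timestamp strategies discussed here can be compared with a small toy model. This is only a sketch: `Sample`, `query`, and `last_seen` are hypothetical helpers, the buffer contents are made up, and `query` merely assumes that the NVML call returns samples strictly newer than the timestamp it is given. Real driver buffers and clocks will behave less neatly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    timestamp_us: int  # microseconds since the epoch
    sm_util: int       # SM utilization in percent


def query(buffer: list[Sample], last_seen_us: int) -> list[Sample]:
    """Toy stand-in for the NVML query: keep samples strictly newer than last_seen_us."""
    return [s for s in buffer if s.timestamp_us > last_seen_us]


def last_seen(strategy: str, now_us: int, previous: list[Sample],
              margin_us: int = 250_000) -> int:
    """Pick the timestamp to pass to the next query under each strategy."""
    if strategy == "max_sample":        # newest timestamp from the previous query
        return max((s.timestamp_us for s in previous), default=0)
    if strategy == "now":               # the call time itself
        return now_us
    if strategy == "now_minus_margin":  # the call time minus a fixed margin
        return now_us - margin_us
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    now_us = 1_000_000
    # Hypothetical buffer: one sample every 100 ms, the newest 100 ms before the call.
    buffer = [Sample(t, 50) for t in range(0, now_us, 100_000)]
    # Pretend the previous query returned everything up to the 500 ms mark.
    previous = [s for s in buffer if s.timestamp_us <= 500_000]
    for strategy in ("max_sample", "now", "now_minus_margin"):
        got = query(buffer, last_seen(strategy, now_us, previous))
        print(f"{strategy:17s} -> {len(got)} sample(s)")
```

In this model, passing the call time itself can never match a sample, because every buffered sample is older than the call. Subtracting a margin recovers the most recent samples without reaching far back into the buffer.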


gitea-mirror commented

2026-05-05 03:23:42 -06:00

Author

Owner

@hui-zhao-1 commented on GitHub (Aug 4, 2023):

My testing found the same problem. In the change at https://github.com/XuehaiPan/nvitop/compare/main...2581543189:nvitop:now-timestamp, the `int(datetime.datetime.now().timestamp())` call truncates to whole seconds, which effectively subtracts a random 0 to 999 ms from now. After monitoring for a longer period, I also saw many empty samples:

![image](https://github.com/XuehaiPan/nvitop/assets/19888114/705f10ef-48e7-4439-9b45-d6d065e5814d)

Testing with `pip3 install git+https://github.com/XuehaiPan/nvitop.git@process-utilization`, the empty samples are even more pronounced:

![image](https://github.com/XuehaiPan/nvitop/assets/19888114/fb6584b1-1d4b-4acf-b28a-43ffcfd12f74)

I'm curious why `nvidia-smi pmon` doesn't have this problem. I wanted to see how it is implemented, but a Google search showed that nvidia-smi is not open source.

For now, it seems that with `nvmlDeviceGetProcessUtilization`, no matter how `lastSeenTimeStamp` is passed, the result cannot accurately reflect the real SM utilization of the nvitop-test program.
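
To make the 0 to 999 ms point concrete: converting a float epoch time with `int()` drops the sub-second part. The instant below is an arbitrary example chosen for reproducibility; it is not taken from the fork's code.

```python
import datetime

# A fixed instant, 750 ms past a whole second.
now = datetime.datetime(2023, 8, 4, 12, 0, 0, 750_000, tzinfo=datetime.timezone.utc)

truncated_s = int(now.timestamp())           # whole seconds; the fraction is dropped
exact_us = int(now.timestamp() * 1_000_000)  # microseconds, fraction kept
lost_ms = (exact_us - truncated_s * 1_000_000) // 1000

print(lost_ms)  # 750
```

Depending on when within the second the call happens, the truncated value lags the real time by anywhere from 0 to just under 1000 ms.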

gitea-mirror commented

2026-05-05 03:23:42 -06:00

Author

Owner

@XuehaiPan commented on GitHub (Aug 4, 2023):

> Testing with `pip3 install git+https://github.com/XuehaiPan/nvitop.git@process-utilization`, the empty samples are even more pronounced

@2581543189 I added an extra 1/4-second margin in the new commit, and it works fairly well in my local tests. Below is a run of the test program you provided in https://github.com/XuehaiPan/nvitop/issues/83#issuecomment-1663404181 (with some parameters modified). Left: PR #85; right: main (v1.2.0):

<img width="1864" alt="image" src="https://github.com/XuehaiPan/nvitop/assets/16078332/2f39648f-f84a-4ce3-86c1-409ab640806e">

Compared with v1.2.0, the modified program has noticeably lower latency. The plot is closer to a square wave, with narrower ramps on both sides of each peak.

gitea-mirror commented

2026-05-05 03:23:43 -06:00

Author

Owner

@hui-zhao-1 commented on GitHub (Aug 4, 2023):

The code with the extra 1/4-second margin also works correctly in my tests:

![image](https://github.com/XuehaiPan/nvitop/assets/19888114/01f4422f-6b99-4313-8886-c35d82c61de7)

The earlier screenshot that showed no samples at all was caused by a mistake on my side; the tool works correctly.

gitea-mirror commented

2026-05-05 03:23:43 -06:00

Author

Owner

@XuehaiPan commented on GitHub (Aug 4, 2023):