[GH-ISSUE #128] [BUG] Memory leaking for Nvitop instances inside docker container #79

Open
opened 2026-05-05 03:24:46 -06:00 by gitea-mirror · 9 comments

Originally created by @kenvix on GitHub (Jun 26, 2024).
Original GitHub issue: https://github.com/XuehaiPan/nvitop/issues/128

Originally assigned to: @XuehaiPan on GitHub.

Required prerequisites

  • I have read the documentation https://nvitop.readthedocs.io.
  • I have searched the Issue Tracker that this hasn't already been reported. (comment there if it has.)
  • I have tried the latest version of nvitop in a new isolated virtual environment.

What version of nvitop are you using?

1.3.2

Operating system and version

Ubuntu 22.04.4 LTS

NVIDIA driver version

535.104.12

NVIDIA-SMI

```text
Wed Jun 26 20:18:32 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A40                     Off | 00000000:0E:00.0 Off |                    0 |
|  0%   27C    P8              23W / 300W |      4MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A40                     Off | 00000000:0F:00.0 Off |                    0 |
|  0%   40C    P0              75W / 300W |    413MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A40                     Off | 00000000:12:00.0 Off |                    0 |
|  0%   43C    P0              79W / 300W |  42629MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A40                     Off | 00000000:27:00.0 Off |                    0 |
|  0%   42C    P0              78W / 300W |  42567MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
```

Python environment

3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] linux
nvidia-ml-py==12.535.133
nvitop==1.3.2

Problem description

nvitop has a memory leak (RAM, not VRAM) when running inside a Docker container. It even takes the operating system about ten seconds to reclaim the memory after the process is killed with SIGKILL.

![image](https://github.com/XuehaiPan/nvitop/assets/4546175/3287d826-33d0-49e3-93af-3144ebe81911)

![image](https://github.com/XuehaiPan/nvitop/assets/4546175/89834204-6391-4352-a8fb-1efa917a7568)

Steps to Reproduce

Just keep nvitop running for a few months. You will see that nvitop has consumed a lot of system memory: about 300 GB in 77 days for my instance.

Is this caused by nvitop recording too much VRAM and GPU utilization history without releasing it?
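As an illustrative aside (not part of the original report): a minimal sketch, assuming psutil is installed and a single nvitop process is running, that logs nvitop's resident memory once a minute to quantify the growth. The file name rss_watch.py is made up.

```python
# rss_watch.py -- hypothetical helper, not part of nvitop
# Logs the resident set size (RSS) of a running nvitop process once a minute.
import time

import psutil


def find_nvitop() -> psutil.Process | None:
    """Return the first process whose command line mentions nvitop, if any."""
    for proc in psutil.process_iter(['pid', 'cmdline']):
        cmdline = proc.info['cmdline'] or []
        if any('nvitop' in part for part in cmdline):
            return proc
    return None


if __name__ == '__main__':
    proc = find_nvitop()
    if proc is None:
        raise SystemExit('no running nvitop process found')
    while True:
        rss = proc.memory_info().rss
        print(f'{time.strftime("%F %T")}  rss={rss / 1024**2:.1f} MiB', flush=True)
        time.sleep(60)
```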

Traceback

No response

Logs

No response

Expected behavior

No response

Additional context

No response

gitea-mirror added the bug label 2026-05-05 03:24:46 -06:00

@XuehaiPan commented on GitHub (Jul 9, 2024):

Hi @kenvix, thanks for raising this.

I tested it locally, but I cannot reproduce this. I used the following script to create and terminate 10k processes on the GPU:

```python
import time

import ray
import torch


@ray.remote(num_cpus=1, num_gpus=0.1)
def request_gpu():
    # Allocate a small tensor on the GPU, hold it for a moment, then exit.
    torch.zeros(1000, device='cuda')
    time.sleep(10)


ray.init()
# Launch and wait for 10k short-lived GPU tasks (each runs in its own worker process).
_ = ray.get([request_gpu.remote() for _ in range(10000)])
```
The memory consumption of nvitop stays stable at around 260 MB.

<img width="1097" alt="image" src="https://github.com/XuehaiPan/nvitop/assets/16078332/863f1216-df4b-4004-9ff3-09ae0a2fc1b6">

@kenvix commented on GitHub (Jul 10, 2024):

Hi, @XuehaiPan

The test code you provided does not seem relevant to this issue. In my case, keeping nvitop running inside tmux or screen, the memory (RAM, not GPU VRAM) usage of nvitop itself continues to slowly increase over time.

For my example below, I ran it for 12 hours:

![image](https://github.com/XuehaiPan/nvitop/assets/4546175/10c3805f-4a59-4d66-b4f7-04932cb5dc1c)

It used about 4.5 GB of RAM.


@XuehaiPan commented on GitHub (Jul 10, 2024):

@kenvix could you test nvitop using pipx? Maybe it is caused by a dependency (e.g., the unofficial pynvml package); see the check snippet after the screenshot below.

Running after 2 days:

<img width="1532" alt="image" src="https://github.com/XuehaiPan/nvitop/assets/16078332/4b3399f8-bbc8-4347-a782-6e61bd69c3c8">
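Not from the original comment, but an illustrative way to check which NVML binding distributions are installed in the environment nvitop runs from (both the official nvidia-ml-py and the unofficial pynvml ship a module named pynvml):

```python
# check_nvml_bindings.py -- illustrative helper, not part of nvitop
from importlib import metadata

import pynvml  # both distributions install a module with this name

print('pynvml module loaded from:', pynvml.__file__)
for dist in ('nvidia-ml-py', 'pynvml'):
    try:
        print(f'{dist}: {metadata.version(dist)}')
    except metadata.PackageNotFoundError:
        print(f'{dist}: not installed')
```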

@alexanderfrey commented on GitHub (Oct 25, 2024):

Similar problem here: latest version 1.3.2 with NVIDIA driver 560.35.03. The approximate RAM usage of nvitop is a whopping 30 GB. This happened after I installed Ubuntu 24.10; before that everything was fine. Strangely, the start-up time of nvitop went up to 30 seconds (!), and during start-up it allocates those 30 GB of RAM. I suspect some faulty library...


@XuehaiPan commented on GitHub (Oct 25, 2024):

Hi, thanks for the report. It would be very helpful if you could run the following code, so we can investigate where the memory leak is coming from (e.g., device query, process query, psutil, curses).

```console
$ python3 -m pip install nvitop
$ python3 query.py
```

<img width="664" alt="image" src="https://github.com/user-attachments/assets/165e3a27-fc81-48f7-a9df-da6e8cbae203">
```python
# query.py

import shutil
import time

from nvitop import Device, GpuProcess, NA, colored


def display(devices: list[Device], show_processes: bool = False) -> None:
    fmt = '    {pid:<5}  {username:<8} {cpu:>5}  {host_memory:>8} {time:>8}  {gpu_memory:>8}  {sm:>3}  {command:<}'.format

    lines = [colored(time.strftime('%a %b %d %H:%M:%S %Y'), attrs=('bold',))]
    for device in devices:
        with device.oneshot():
            lines.extend(
                [
                    colored(str(device), color='green', attrs=('bold',)),
                    colored('  - Fan speed:       ', color='blue', attrs=('bold',))
                    + f'{device.fan_speed():>7}%'
                    + colored('        - Total memory: ', color='blue', attrs=('bold',))
                    + f'{device.memory_total_human():>12}',
                    colored('  - Temperature:     ', color='blue', attrs=('bold',))
                    + f'{device.temperature():>7}C'
                    + colored('        - Used memory:  ', color='blue', attrs=('bold',))
                    + f'{device.memory_used_human():>12}',
                    colored('  - GPU utilization: ', color='blue', attrs=('bold',))
                    + f'{device.gpu_utilization():>7}%'
                    + colored('        - Free memory:  ', color='blue', attrs=('bold',))
                    + f'{device.memory_free_human():>12}',
                ],
            )

        if show_processes:
            processes = device.processes()
            if len(processes) > 0:
                processes = GpuProcess.take_snapshots(processes.values(), failsafe=True)
                processes.sort(key=lambda process: (process.username, process.pid))
                lines.extend(
                    [
                        colored(
                            f'  - Processes ({len(processes)}):',
                            color='blue',
                            attrs=('bold',),
                        ),
                        colored(
                            fmt(
                                pid='PID',
                                username='USERNAME',
                                cpu='CPU%',
                                host_memory='HOST-MEM',
                                time='TIME',
                                gpu_memory='GPU-MEM',
                                sm='SM%',
                                command='COMMAND',
                            ),
                            attrs=('bold',),
                        ),
                        *(
                            fmt(
                                pid=snapshot.pid,
                                username=snapshot.username[:7]
                                + ('+' if len(snapshot.username) > 8 else snapshot.username[7:8]),
                                cpu=snapshot.cpu_percent,
                                host_memory=snapshot.host_memory_human,
                                time=snapshot.running_time_human,
                                gpu_memory=(
                                    snapshot.gpu_memory_human
                                    if snapshot.gpu_memory_human is not NA
                                    else 'WDDM:N/A'
                                ),
                                sm=snapshot.gpu_sm_utilization,
                                command=snapshot.command,
                            )
                            for snapshot in processes
                        ),
                    ],
                )
            else:
                lines.append(colored('  - No Running Processes', attrs=('bold',)))

    cols, rows = shutil.get_terminal_size(fallback=(79, 24))
    print('\033[2J', end='', flush=True)
    print('\n'.join(lines[:rows]), end='', flush=True)
    print('\033[H', end='', flush=True)


def main() -> None:
    devices = Device.all()
    while True:
        display(devices)
        time.sleep(1)


if __name__ == '__main__':
    main()
```
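One possible follow-up, not from the original comment: have query.py also print its own resident memory each iteration, so any growth can be attributed to the device-query path versus the process-query path. A minimal sketch of a modified main(), assuming psutil is available:

```python
# Hypothetical variant of main() from query.py above.
import os

import psutil


def main() -> None:
    devices = Device.all()
    self_proc = psutil.Process(os.getpid())
    while True:
        # Flip show_processes to True to also exercise the per-process query path.
        display(devices, show_processes=False)
        rss = self_proc.memory_info().rss
        print(f'\nself RSS: {rss / 1024**2:.1f} MiB', flush=True)
        time.sleep(1)
```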

@alexanderfrey commented on GitHub (Oct 28, 2024):

Thanks for replying ! Here is the output:

```text
Mon Oct 28 12:23:22 2024
PhysicalDevice(index=0, name="NVIDIA GeForce RTX 4090", total_memory=23.99GiB)
  - Fan speed:             0%        - Total memory:     23.99GiB
  - Temperature:          41C        - Used memory:      23.36GiB
  - GPU utilization:      94%        - Free memory:      410.9MiB
PhysicalDevice(index=1, name="NVIDIA GeForce RTX 4090", total_memory=23.99GiB)
  - Fan speed:             0%        - Total memory:     23.99GiB
  - Temperature:          41C        - Used memory:      23.36GiB
  - GPU utilization:      89%        - Free memory:      412.9MiB
PhysicalDevice(index=2, name="NVIDIA GeForce RTX 2080 Ti", total_memory=11264MiB)
  - Fan speed:             0%        - Total memory:     11264MiB
  - Temperature:          35C        - Used memory:       3606MiB
  - GPU utilization:       7%        - Free memory:       7365MiB
```

@alexanderfrey commented on GitHub (Oct 28, 2024):

By the way, this did not result in excessive memory use.


@XuehaiPan commented on GitHub (Oct 28, 2024):

> btw. this did not result in excessive memory use.

Hey @alexanderfrey, thanks for the report. Could you change the value of show_processes to True in the script and rerun it (leave it running for several hours)?
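For clarity (the exact edit is not spelled out in the thread), the change amounts to passing show_processes=True in the loop of query.py:

```python
def main() -> None:
    devices = Device.all()
    while True:
        display(devices, show_processes=True)  # default was False
        time.sleep(1)
```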


@gthelding commented on GitHub (Dec 19, 2024):

@alexanderfrey I have the same trouble running nvitop and nvidia-smi. They eat 52 GB of RAM and take 30 seconds to start up. I'm running Ubuntu 24.10 and an RTX 3060 with the 560.35.05 driver.

I've discovered that this is an [issue](https://forums.developer.nvidia.com/t/nvidia-smi-uses-all-of-ram-and-swap/295639/16) with the nvidia-persistenced service.

If you `sudo chmod o-w /var/run/nvidia-persistenced/socket`, you can run nvitop as normal.

I thought I'd drop a note here so anyone seeing this will know the issue isn't with nvitop.
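As a side note (not from the original comment), the socket's permission bits can be checked before and after the chmod with a few lines of Python; the 0o777/0o775 values are only an example.

```python
import os
import stat

st = os.stat('/var/run/nvidia-persistenced/socket')
print(oct(stat.S_IMODE(st.st_mode)))  # e.g. 0o777 before, 0o775 after `chmod o-w`
```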
