mirror of
https://github.com/XuehaiPan/nvitop.git
synced 2026-05-15 14:15:55 -06:00
[GH-ISSUE #162] [Question] Can't use nvitop callback in pytorch lightning #102
Reference: github-starred/nvitop#102
Originally created by @leo1oel on GitHub (Apr 19, 2025).
Original GitHub issue: https://github.com/XuehaiPan/nvitop/issues/162
Originally assigned to: @XuehaiPan on GitHub.
Required prerequisites
Questions
When I use the nvitop (version 1.4.2) callback with the latest PyTorch Lightning, following the README, the following error occurs:
Error executing job with overrides: []
Traceback (most recent call last):
File "/pasteur/u/yiming/small-vlm/src/vlm/vlm.py", line 120, in main
vlm(cfg)
~~~^^^^^
File "/pasteur/u/yiming/small-vlm/src/vlm/vlm.py", line 106, in vlm
train(cfg.trainer, model, data_module)
~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/pasteur/u/yiming/small-vlm/src/vlm/train/trainer.py", line 55, in train
trainer.fit(
~~~~~~~~~~~^
model=model,
^^^^^^^^^^^^
datamodule=data_module,
^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/pasteur/u/yiming/small-vlm/.venv/lib/python3.13/site-packages/lightning/pytorch/trainer/trainer.py", line 561, in fit
call._call_and_handle_interrupt(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/pasteur/u/yiming/small-vlm/.venv/lib/python3.13/site-packages/lightning/pytorch/trainer/call.py", line 47, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/pasteur/u/yiming/small-vlm/.venv/lib/python3.13/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
return function(*args, **kwargs)
File "/pasteur/u/yiming/small-vlm/.venv/lib/python3.13/site-packages/lightning/pytorch/trainer/trainer.py", line 599, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/pasteur/u/yiming/small-vlm/.venv/lib/python3.13/site-packages/lightning/pytorch/trainer/trainer.py", line 1012, in _run
results = self._run_stage()
File "/pasteur/u/yiming/small-vlm/.venv/lib/python3.13/site-packages/lightning/pytorch/trainer/trainer.py", line 1056, in _run_stage
self.fit_loop.run()
~~~~~~~~~~~~~~~~~^^
File "/pasteur/u/yiming/small-vlm/.venv/lib/python3.13/site-packages/lightning/pytorch/loops/fit_loop.py", line 216, in run
self.advance()
~~~~~~~~~~~~^^
File "/pasteur/u/yiming/small-vlm/.venv/lib/python3.13/site-packages/lightning/pytorch/loops/fit_loop.py", line 455, in advance
self.epoch_loop.run(self._data_fetcher)
~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
File "/pasteur/u/yiming/small-vlm/.venv/lib/python3.13/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 150, in run
self.advance(data_fetcher)
~~~~~~~~~~~~^^^^^^^^^^^^^^
File "/pasteur/u/yiming/small-vlm/.venv/lib/python3.13/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 303, in advance
call._call_callback_hooks(trainer, "on_train_batch_start", batch, batch_idx)
~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/pasteur/u/yiming/small-vlm/.venv/lib/python3.13/site-packages/lightning/pytorch/trainer/call.py", line 227, in _call_callback_hooks
fn(trainer, trainer.lightning_module, *args, **kwargs)
~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/pasteur/u/yiming/small-vlm/.venv/lib/python3.13/site-packages/lightning_utilities/core/rank_zero.py", line 41, in wrapped_fn
return fn(*args, **kwargs)
TypeError: GpuStatsLogger.on_train_batch_start() takes 3 positional arguments but 5 were given
@XuehaiPan commented on GitHub (Apr 19, 2025):
Hi @leo1oel, thanks for raising this. I want to note that nvitop does not have a schedule to sync with the third-party dependencies. The callback support is going to be deprecated. Please feel free to copy the callback implementation to your project.
@XuehaiPan commented on GitHub (Apr 19, 2025):
Overall, thanks for using nvitop. The callback is going to be removed and unmaintained. Sorry about that.
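For readers who hit the same TypeError: the traceback shows Lightning calling the hook as fn(trainer, trainer.lightning_module, batch, batch_idx), while the error message says the callback's on_train_batch_start accepts only three positional arguments (self, trainer, pl_module). Below is a minimal sketch of one way to adapt a copied callback. It uses only the standard library, and every class and function name in it is a hypothetical stand-in, not the actual nvitop or Lightning code.

```python
class LegacyGpuStatsLogger:
    """Hypothetical stand-in for a callback written against an older hook signature."""

    def __init__(self):
        self.batches_seen = 0

    # Accepts only (self, trainer, pl_module) positionally, as the error message implies.
    def on_train_batch_start(self, trainer, pl_module, **kwargs):
        self.batches_seen += 1


class PatchedGpuStatsLogger(LegacyGpuStatsLogger):
    """Hypothetical fix: accept the extra positional arguments and forward the rest."""

    def on_train_batch_start(self, trainer, pl_module, batch, batch_idx, **kwargs):
        # batch and batch_idx are not needed for GPU stats, so they are dropped here.
        super().on_train_batch_start(trainer, pl_module, **kwargs)


def call_hook(callback, trainer, pl_module, batch, batch_idx):
    """Mimics the call shape seen in the traceback: fn(trainer, pl_module, batch, batch_idx)."""
    callback.on_train_batch_start(trainer, pl_module, batch, batch_idx)


def demo():
    # Placeholder objects; a real run would pass the Trainer and LightningModule.
    trainer, pl_module, batch = object(), object(), [1, 2, 3]

    try:
        call_hook(LegacyGpuStatsLogger(), trainer, pl_module, batch, 0)
        legacy_error = None
    except TypeError as exc:
        legacy_error = str(exc)

    patched = PatchedGpuStatsLogger()
    call_hook(patched, trainer, pl_module, batch, 0)
    return legacy_error, patched.batches_seen


if __name__ == "__main__":
    error, seen = demo()
    print("legacy callback:", error)
    print("patched callback batches seen:", seen)
```

In a real project the same idea would apply to the copy of the callback that the maintainer suggests keeping locally. Other batch-level hooks may need the same adjustment, so check each signature against the Callback documentation for the installed Lightning version.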