mirror of
https://github.com/XuehaiPan/nvitop.git
synced 2026-05-21 06:45:24 -06:00
docs(examples/monitor-web): refresh dashboard readme
parent
0eccaa5696
commit
b46727b4d5
4 changed files with 145 additions and 56 deletions
|
|
@ -609,6 +609,14 @@ formatting:
|
|||
|
||||
`nvitop` can be easily integrated into other applications. You can use `nvitop` to build your own monitoring tools. The full API reference is hosted at <https://nvitop.readthedocs.io>. Runnable reference scripts live in [`examples/`](./examples/).
|
||||
|
||||
<p align="center">
|
||||
<a href="./examples/monitor-web">
|
||||
<img width="100%" src="https://github.com/user-attachments/assets/a8688e16-52ec-4310-bc48-d5b303331481" alt="Web Monitor Dashboard">
|
||||
</a>
|
||||
<br/>
|
||||
A browser dashboard example built on top of <code>nvitop.collect_in_background</code>.
|
||||
</p>
|
||||
|
||||
#### Quick Start
|
||||
|
||||
A minimal script to monitor the GPU devices based on APIs from `nvitop`:
|
||||
|
|
|
|||
|
|
@ -1,13 +1,31 @@
|
|||
# Web Monitor (stdlib HTTP(S))
|
||||
# Web Monitor (HTTP(S) Dashboard)
|
||||
|
||||
A minimal browser dashboard for [`nvitop.collect_in_background`][cib]: the collector ticks on a daemon thread, samples are pushed into a rotating ring buffer (24h by default), and a tiny `http.server`-based router serves both a one-page HTML dashboard and JSON snapshots at `/metrics.json` and `/history.json`. Stdlib only — no Flask, no TensorBoard, no extra dependencies. Supports HTTPS and mutual TLS via the same flag names as [`nvitop-exporter`][exporter].
|
||||
`monitor_web.py` serves a small browser dashboard for [`nvitop.collect_in_background`][cib]. The Python side uses the standard-library `http.server` stack, stores collector samples in a rotating in-memory buffer, and exposes the same data through JSON endpoints. The browser side loads [Plotly] from a CDN for the time-series charts.
|
||||
|
||||
## APIs Used
|
||||
|
||||
- [`nvitop.collect_in_background`][cib]
|
||||
- [`nvitop.ResourceMetricCollector`][collector]
|
||||
- [`nvitop.Device.cuda.all()`][cuda-all]
|
||||
- [`nvitop.colored`][colored] (for the startup banner)
|
||||
- [`nvitop.Device.all()`][device-all]
|
||||
- [`nvitop.bytes2human()`][bytes2human]
|
||||
- [`nvitop.colored()`][colored] for the startup banner
|
||||
|
||||
## What It Shows
|
||||
|
||||
- Host CPU, host memory, swap, and buffer status badges.
|
||||
- A host history chart for CPU percent and host memory percent, labeled with memory usage.
|
||||
- One card per GPU, using the raw NVIDIA/NVML GPU index.
|
||||
- Per-GPU current bars for GPU utilization, memory bandwidth, GPU memory, and power.
|
||||
- One history chart per GPU under the cards, plotting the same four metrics.
|
||||
- History range buttons for `1m`, `5m`, `15m`, `30m`, `1h`, `3h`, `6h`, `12h`, and `24h`.
|
||||
|
||||
Cards display current values from the collector's `/last` metrics. Plot legends display the latest visible sample in the selected range, also using `/last` keys. The JSON payload still includes aggregate keys such as `/mean`, `/min`, `/max`, and `/last`.
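
As a rough illustration of how a client might separate the current values from the aggregates, the sketch below keeps only the entries whose key ends in `/last`. The helper name `latest_values` and the key names in `sample` are hypothetical; the real keys are whatever the collector emits.

```python
from __future__ import annotations


def latest_values(metrics: dict[str, float]) -> dict[str, float]:
    """Keep only the ``/last`` entries and strip that suffix from each key."""
    suffix = '/last'
    return {
        key[: -len(suffix)]: value
        for key, value in metrics.items()
        if key.endswith(suffix)
    }


# Hypothetical keys, for illustration only.
sample = {
    'host/cpu_percent/mean': 12.5,
    'host/cpu_percent/last': 14.0,
    'gpu:0/power_usage/max': 310.0,
    'gpu:0/power_usage/last': 295.0,
}
print(latest_values(sample))
```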
|
||||
|
||||
Process snapshots are disabled with `root_pids={}` so the dashboard tracks host and device metrics without collecting per-process GPU rows.
|
||||
|
||||
## Screenshot
|
||||
|
||||

|
||||
|
||||
## Run
|
||||
|
||||
|
|
@ -15,61 +33,80 @@ A minimal browser dashboard for [`nvitop.collect_in_background`][cib]: the colle
|
|||
python3 examples/monitor-web/monitor_web.py --port 5555
|
||||
```
|
||||
|
||||
The startup banner (printed to `stderr`, mirroring [`nvitop-exporter`][exporter]) reports the device count, per-GPU UUIDs, the retention/interval summary, and the three URLs:
|
||||
Open <http://127.0.0.1:5555/> in a browser.
|
||||
|
||||
The backend collector samples every `--interval` seconds, defaulting to `1.0`. The frontend polls `/metrics.json` every second and marks the dashboard stale if the latest sample is too old.
|
||||
|
||||
The startup banner is printed to `stderr`:
|
||||
|
||||
```text
|
||||
INFO: Found 1 device(s).
|
||||
INFO: GPU 0: NVIDIA RTX 6000 Ada Generation (UUID: GPU-...)
|
||||
INFO: Retention 1d at 1.0s interval (max 86400 samples).
|
||||
INFO: Found 4 device(s).
|
||||
INFO: GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-...)
|
||||
INFO: Retention 1d at 1s interval (max 86400 samples).
|
||||
INFO: Serving the dashboard at http://127.0.0.1:5555/
|
||||
INFO: - JSON snapshot: http://127.0.0.1:5555/metrics.json
|
||||
INFO: - JSON history: http://127.0.0.1:5555/history.json
|
||||
INFO: - JSON snapshot: http://127.0.0.1:5555/metrics.json
|
||||
INFO: - JSON history: http://127.0.0.1:5555/history.json
|
||||
```
|
||||
|
||||
The browser dashboard polls `/metrics.json` every `--interval` seconds and renders a card per visible GPU (utilization, memory, temperature, fan, power) plus a host footer.
|
||||
## JSON Endpoints
|
||||
|
||||
Inspect the raw JSON from the CLI:
|
||||
`/metrics.json` returns the latest sample plus metadata:
|
||||
|
||||
- `interval`: collector interval in seconds.
|
||||
- `server_time`: current server timestamp.
|
||||
- `sample_time`: timestamp for the latest collected sample.
|
||||
- `stale_seconds`: age of the latest sample.
|
||||
- `buffer`: count, max count, retention, oldest sample, and newest sample.
|
||||
- `devices`: raw GPU index, name, memory total, and UUID.
|
||||
- `metrics`: raw collector metric keys and numeric values.
|
||||
- `metrics_human`: human-readable memory values for finite MiB/GiB metrics.
|
||||
|
||||
Inspect it from the shell:
|
||||
|
||||
```bash
|
||||
curl -s http://127.0.0.1:5555/metrics.json | python3 -m json.tool | head -30
|
||||
curl -s 'http://127.0.0.1:5555/history.json?limit=10' | python3 -m json.tool | head -30
|
||||
curl -s http://127.0.0.1:5555/metrics.json | python3 -m json.tool | head -60
|
||||
```
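
The same snapshot can also be read from Python. The sketch below is a minimal client, not part of the example itself: `fetch_snapshot` and `summarize_snapshot` are hypothetical names, the field types (`devices` as a list, `metrics` as a mapping) are assumed from the field list above, and the staleness threshold of three collector intervals is an assumption, since the dashboard's actual threshold is not stated here.

```python
from __future__ import annotations

import json
import urllib.request


def fetch_snapshot(url: str = 'http://127.0.0.1:5555/metrics.json') -> dict:
    """Fetch and decode one snapshot (requires a running dashboard)."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)


def summarize_snapshot(payload: dict, stale_factor: float = 3.0) -> str:
    """Return a one-line summary of a ``/metrics.json`` payload."""
    interval = float(payload.get('interval') or 1.0)
    stale_seconds = payload.get('stale_seconds')
    # Assumption: "too old" means older than `stale_factor` collector intervals.
    stale = stale_seconds is None or stale_seconds > stale_factor * interval
    n_devices = len(payload.get('devices') or [])
    n_metrics = len(payload.get('metrics') or {})
    state = 'stale' if stale else 'fresh'
    return f'{n_devices} device(s), {n_metrics} metric key(s), {state}'
```

With the server running, `print(summarize_snapshot(fetch_snapshot()))` would print a single status line.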
|
||||
|
||||
`/history.json` accepts two optional query parameters:
|
||||
`/history.json` returns buffered samples:
|
||||
|
||||
- `?limit=N` — return only the most recent `N` samples.
|
||||
- `?since=EPOCH` — return samples strictly newer than the Unix timestamp `EPOCH`.
|
||||
```bash
|
||||
curl -s 'http://127.0.0.1:5555/history.json?limit=10' | python3 -m json.tool
|
||||
curl -s 'http://127.0.0.1:5555/history.json?since=1779270000' | python3 -m json.tool
|
||||
```
|
||||
|
||||
## Retention
|
||||
Supported query parameters:
|
||||
|
||||
Use `--retention` to size the rotating buffer. The flag accepts `s`/`m`/`h`/`d` suffixes; a bare number is treated as seconds.
|
||||
- `limit=N`: return only the most recent `N` samples.
|
||||
- `since=EPOCH`: return samples strictly newer than the Unix timestamp `EPOCH`.
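
The two parameters can be pictured as successive filters over the buffered samples. The sketch below assumes samples are `(epoch, metrics)` tuples in chronological order and that `since` is applied before `limit`; both the helper name `select_history` and that ordering are assumptions, not a description of the actual handler.

```python
from __future__ import annotations


def select_history(samples, *, limit=None, since=None):
    """Filter chronological ``(epoch, metrics)`` tuples by ``since`` and ``limit``."""
    selected = list(samples)
    if since is not None:
        # Strictly newer than the given Unix timestamp.
        selected = [item for item in selected if item[0] > since]
    if limit is not None:
        # Keep only the most recent `limit` samples.
        selected = selected[-limit:] if limit > 0 else []
    return selected
```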
|
||||
|
||||
JSON responses are strict JSON. Non-finite collector values such as `NaN` and `Infinity` are serialized as `null`.
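
One way to get this behavior with the standard library is to replace non-finite floats before calling `json.dumps(..., allow_nan=False)`. The sketch below uses a hypothetical helper, `to_finite`, as a stand-in; it is not necessarily how the example's own `_finite()` is written, and the payload keys are made up.

```python
from __future__ import annotations

import json
import math


def to_finite(value):
    """Recursively replace NaN and +/-Infinity with None."""
    if isinstance(value, float):
        return value if math.isfinite(value) else None
    if isinstance(value, dict):
        return {key: to_finite(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_finite(item) for item in value]
    return value


# Hypothetical payload, for illustration only.
payload = {'gpu_utilization': 42.0, 'fan_speed': math.nan, 'history': [1.0, math.inf]}
print(json.dumps(to_finite(payload), allow_nan=False))
```

Without the mapping step, `json.dumps(payload, allow_nan=False)` raises `ValueError` on the same input.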
|
||||
|
||||
## History And Retention
|
||||
|
||||
Use `--retention` to size the rotating buffer. The flag accepts `s`, `m`/`min`, `h`, and `d` suffixes; a bare number is treated as seconds.
|
||||
|
||||
```bash
|
||||
python3 examples/monitor-web/monitor_web.py --retention 12h
|
||||
python3 examples/monitor-web/monitor_web.py --retention 30min --interval 5
|
||||
python3 examples/monitor-web/monitor_web.py --retention 600 # 600 seconds
|
||||
python3 examples/monitor-web/monitor_web.py --retention 600
|
||||
```
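
A duration parser along these lines could look like the sketch below. `parse_retention` is a hypothetical name, and accepting fractional values and surrounding whitespace is an assumption; only the suffix set (`s`, `m`/`min`, `h`, `d`, or a bare number of seconds) comes from the description above.

```python
from __future__ import annotations

import re

# Seconds per accepted suffix; the empty suffix means "bare number".
_UNIT_SECONDS = {'': 1, 's': 1, 'm': 60, 'min': 60, 'h': 3600, 'd': 86400}
_DURATION = re.compile(r'(\d+(?:\.\d+)?)\s*([a-z]*)')


def parse_retention(text: str) -> float:
    """Convert a duration such as ``12h`` or ``30min`` into seconds."""
    match = _DURATION.fullmatch(text.strip().lower())
    if match is None or match.group(2) not in _UNIT_SECONDS:
        raise ValueError(f'invalid duration: {text!r}')
    return float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
```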
|
||||
|
||||
The buffer holds at most `int(retention / interval)` samples, so memory scales as `samples × keys_per_sample × 8 B`. Bump `--interval` (e.g. `--interval 5`) to keep the same retention at lower memory cost.
|
||||
The buffer holds at most `int(retention / interval)` samples. Each sample stores the collector's metric dictionary, including aggregate keys such as `mean`, `min`, `max`, and `last`, so memory use depends on the sample count, the exported metric-key count, and normal Python object overhead. Increase `--interval` to keep the same retention window with fewer stored samples.
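
The capacity rule maps directly onto `collections.deque` with `maxlen`, which drops the oldest entry on overflow. The class below is a simplified sketch rather than the example's `MetricStore`: the name `RingStore`, its method names, and the `max(1, ...)` floor are assumptions made for illustration.

```python
from __future__ import annotations

import threading
import time
from collections import deque


class RingStore:
    """Lock-protected buffer holding at most int(retention / interval) samples."""

    def __init__(self, *, retention: float, interval: float) -> None:
        self._lock = threading.Lock()
        # Assumption: keep at least one slot even for tiny retention values.
        self._samples = deque(maxlen=max(1, int(retention / interval)))

    def append(self, metrics: dict, timestamp: float | None = None) -> None:
        stamp = time.time() if timestamp is None else timestamp
        with self._lock:
            # When the deque is full, the oldest entry is evicted automatically.
            self._samples.append((stamp, dict(metrics)))

    def snapshot(self) -> list:
        with self._lock:
            return list(self._samples)
```

With `retention=86400` and `interval=1.0` the capacity is 86400 samples; raising the interval to `5` lowers it to 17280 for the same window.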
|
||||
|
||||
## HTTPS / mTLS
|
||||
## TLS And Mutual TLS
|
||||
|
||||
Generate a throw-away self-signed certificate for local testing:
|
||||
The server uses plain HTTP by default; pass a certificate and key to enable HTTPS:
|
||||
|
||||
```bash
|
||||
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
|
||||
-subj '/CN=localhost' \
|
||||
-keyout key.pem -out cert.pem
|
||||
```
|
||||
|
||||
Then serve over HTTPS:
|
||||
|
||||
```bash
|
||||
python3 examples/monitor-web/monitor_web.py --port 5555 \
|
||||
--certfile cert.pem --keyfile key.pem
|
||||
```
|
||||
|
||||
For mutual TLS (require the client to present a trusted certificate):
|
||||
To require client certificates, also provide a trusted client CA bundle or CA directory:
|
||||
|
||||
```bash
|
||||
python3 examples/monitor-web/monitor_web.py --port 5555 \
|
||||
|
|
@ -77,12 +114,23 @@ python3 examples/monitor-web/monitor_web.py --port 5555 \
|
|||
--client-cafile ca.pem --client-auth-required
|
||||
```
|
||||
|
||||
The TLS / mTLS flag names match [`nvitop-exporter`][exporter] so the same cert/key combo works for both tools.
|
||||
`--client-cafile` or `--client-capath` must be specified together with `--client-auth-required`.
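
In standard-library terms, these flags correspond roughly to an `ssl.SSLContext` configured for server-side use. The sketch below is an assumption about how such flags could be wired together, not the example's actual code: `build_ssl_context` is a hypothetical helper, and it reads the rule above as "requiring client certificates needs a CA bundle or CA directory".

```python
from __future__ import annotations

import ssl


def build_ssl_context(
    certfile=None,
    keyfile=None,
    *,
    client_cafile=None,
    client_capath=None,
    client_auth_required=False,
):
    """Return a server-side SSLContext, or None to serve plain HTTP."""
    if certfile is None:
        if client_auth_required or client_cafile or client_capath:
            raise ValueError('mutual TLS requires a server certificate and key')
        return None  # no certificate given: plain HTTP
    if client_auth_required and not (client_cafile or client_capath):
        raise ValueError('client authentication requires a client CA file or directory')
    # Purpose.CLIENT_AUTH yields a context suitable for the server side.
    context = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    context.load_cert_chain(certfile=certfile, keyfile=keyfile)
    if client_cafile or client_capath:
        context.load_verify_locations(cafile=client_cafile, capath=client_capath)
    if client_auth_required:
        # Reject clients that do not present a certificate signed by a trusted CA.
        context.verify_mode = ssl.CERT_REQUIRED
    return context
```

A server built on `http.server` could then wrap its listening socket with `context.wrap_socket(server.socket, server_side=True)` before calling `serve_forever()`.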
|
||||
|
||||
## Useful Flags
|
||||
|
||||
- `--bind-address ADDRESS`, `--bind ADDRESS`, `-B ADDRESS`: bind address, default `127.0.0.1`.
|
||||
- `--port PORT`, `-p PORT`: listen port, default `5555`.
|
||||
- `--interval SEC`: collector interval in seconds, minimum `0.25`, default `1.0`.
|
||||
- `--retention DURATION`: history retention, default `1d`.
|
||||
- `--certfile PATH` and `--keyfile PATH`: enable HTTPS.
|
||||
- `--client-cafile PATH` or `--client-capath PATH`: trusted client CAs for mutual TLS.
|
||||
- `--client-auth-required`: require a valid client certificate.
|
||||
|
||||
See [`../README.md`](../README.md) for the full example index.
|
||||
|
||||
[Plotly]: https://plotly.com/javascript/
|
||||
[bytes2human]: https://nvitop.readthedocs.io/en/latest/api/utils.html#nvitop.bytes2human
|
||||
[cib]: https://nvitop.readthedocs.io/en/latest/api/collector.html#nvitop.collect_in_background
|
||||
[collector]: https://nvitop.readthedocs.io/en/latest/api/collector.html#nvitop.ResourceMetricCollector
|
||||
[colored]: https://nvitop.readthedocs.io/en/latest/api/utils.html#nvitop.colored
|
||||
[cuda-all]: https://nvitop.readthedocs.io/en/latest/api/device.html#nvitop.CudaDevice.all
|
||||
[exporter]: https://github.com/XuehaiPan/nvitop/tree/main/nvitop-exporter
|
||||
[device-all]: https://nvitop.readthedocs.io/en/latest/api/device.html#nvitop.Device.all
|
||||
|
|
|
|||
|
|
@ -354,7 +354,9 @@
|
|||
|
||||
const trimChartSamples = (nowEpoch = Date.now() / 1000) => {
|
||||
const minEpoch = nowEpoch - historySeconds();
|
||||
chartSamples = chartSamples.filter((sample) => sample.epoch >= minEpoch);
|
||||
chartSamples = chartSamples.filter(
|
||||
(sample) => sample.epoch >= minEpoch,
|
||||
);
|
||||
};
|
||||
|
||||
const chartSampleCount = () =>
|
||||
|
|
@ -370,7 +372,8 @@
|
|||
const parsed = Date.parse(value);
|
||||
if (Number.isFinite(parsed)) return parsed / 1000;
|
||||
const numeric = Number(value);
|
||||
if (Number.isFinite(numeric)) return numeric > 1e12 ? numeric / 1000 : numeric;
|
||||
if (Number.isFinite(numeric))
|
||||
return numeric > 1e12 ? numeric / 1000 : numeric;
|
||||
return NaN;
|
||||
};
|
||||
|
||||
|
|
@ -378,7 +381,8 @@
|
|||
if (chartXRange === null) return chartSamples;
|
||||
const start = epochFromRangeValue(chartXRange[0]);
|
||||
const end = epochFromRangeValue(chartXRange[1]);
|
||||
if (!Number.isFinite(start) || !Number.isFinite(end)) return chartSamples;
|
||||
if (!Number.isFinite(start) || !Number.isFinite(end))
|
||||
return chartSamples;
|
||||
const minEpoch = Math.min(start, end);
|
||||
const maxEpoch = Math.max(start, end);
|
||||
return chartSamples.filter(
|
||||
|
|
@ -433,8 +437,11 @@
|
|||
latestVisibleDeviceMetric(scope, "power_usage");
|
||||
|
||||
const tracePercentName = (label, value) =>
|
||||
Number.isFinite(value) ? `${label} ${fmtPercentOne(value)}` : `${label} —`;
|
||||
const traceUsageName = (label, value) => `${label} ${fmtGibUsage(value)}`;
|
||||
Number.isFinite(value)
|
||||
? `${label} ${fmtPercentOne(value)}`
|
||||
: `${label} —`;
|
||||
const traceUsageName = (label, value) =>
|
||||
`${label} ${fmtGibUsage(value)}`;
|
||||
const traceMemoryName = (label, usage, pct) =>
|
||||
`${label} ${fmtMibUsage(usage)} (${fmtPercentOne(pct)})`;
|
||||
const tracePowerName = (label, watts) => `${label} ${fmtW(watts)}`;
|
||||
|
|
@ -467,7 +474,10 @@
|
|||
|
||||
const xRangeFromRelayout = (event) => {
|
||||
if (!event) return null;
|
||||
if (Array.isArray(event["xaxis.range"]) && event["xaxis.range"].length === 2) {
|
||||
if (
|
||||
Array.isArray(event["xaxis.range"]) &&
|
||||
event["xaxis.range"].length === 2
|
||||
) {
|
||||
return event["xaxis.range"];
|
||||
}
|
||||
if (
|
||||
|
|
@ -490,14 +500,16 @@
|
|||
const handleChartRelayout = (event) => {
|
||||
const range = xRangeFromRelayout(event);
|
||||
if (range !== null) {
|
||||
const changed = historyRange !== null || !xRangesEqual(chartXRange, range);
|
||||
const changed =
|
||||
historyRange !== null || !xRangesEqual(chartXRange, range);
|
||||
historyRange = null;
|
||||
chartXRange = range;
|
||||
setActiveHistoryRange(null);
|
||||
if (changed) renderAllCharts();
|
||||
} else if (event && event["xaxis.autorange"]) {
|
||||
const restoredRange = selectedHistoryRange || DEFAULT_HISTORY_RANGE;
|
||||
const changed = historyRange !== restoredRange || chartXRange !== null;
|
||||
const changed =
|
||||
historyRange !== restoredRange || chartXRange !== null;
|
||||
historyRange = selectedHistoryRange || DEFAULT_HISTORY_RANGE;
|
||||
chartXRange = null;
|
||||
setActiveHistoryRange(historyButton(historyRange));
|
||||
|
|
@ -578,7 +590,10 @@
|
|||
hovertemplate: "CPU %{y:.1f}%<extra></extra>",
|
||||
line: { color: "#4ade80", width: 2 },
|
||||
mode: "lines",
|
||||
name: tracePercentName("CPU", latestVisiblePercent("cpu_percent")),
|
||||
name: tracePercentName(
|
||||
"CPU",
|
||||
latestVisiblePercent("cpu_percent"),
|
||||
),
|
||||
type: "scatter",
|
||||
x,
|
||||
y: cpu,
|
||||
|
|
@ -588,7 +603,10 @@
|
|||
hovertemplate: "Host Memory %{text}<extra></extra>",
|
||||
line: { color: "#38bdf8", width: 2 },
|
||||
mode: "lines",
|
||||
name: traceUsageName("Host Memory", latestVisibleHostMemoryUsage()),
|
||||
name: traceUsageName(
|
||||
"Host Memory",
|
||||
latestVisibleHostMemoryUsage(),
|
||||
),
|
||||
text: memoryUsage,
|
||||
type: "scatter",
|
||||
x,
|
||||
|
|
@ -651,7 +669,8 @@
|
|||
nextIds.length !== currentIds.length ||
|
||||
nextIds.some((id, index) => id !== currentIds[index])
|
||||
) {
|
||||
for (const id of currentIds) chartRelayoutHandlersAttached.delete(id);
|
||||
for (const id of currentIds)
|
||||
chartRelayoutHandlersAttached.delete(id);
|
||||
chartsEl.replaceChildren(...devices.map(buildGpuChartPanel));
|
||||
}
|
||||
};
|
||||
|
|
@ -671,11 +690,19 @@
|
|||
power: "#f87171",
|
||||
};
|
||||
const gpu = chartSamples.map((sample) => {
|
||||
const value = sampleDeviceMetric(sample, scope, "gpu_utilization");
|
||||
const value = sampleDeviceMetric(
|
||||
sample,
|
||||
scope,
|
||||
"gpu_utilization",
|
||||
);
|
||||
return Number.isFinite(value) ? value : null;
|
||||
});
|
||||
const membw = chartSamples.map((sample) => {
|
||||
const value = sampleDeviceMetric(sample, scope, "memory_utilization");
|
||||
const value = sampleDeviceMetric(
|
||||
sample,
|
||||
scope,
|
||||
"memory_utilization",
|
||||
);
|
||||
return Number.isFinite(value) ? value : null;
|
||||
});
|
||||
const memory = chartSamples.map((sample) => {
|
||||
|
|
@ -733,7 +760,9 @@
|
|||
),
|
||||
},
|
||||
];
|
||||
const legend = document.querySelector(`[data-gpu-legend="${chartId}"]`);
|
||||
const legend = document.querySelector(
|
||||
`[data-gpu-legend="${chartId}"]`,
|
||||
);
|
||||
if (legend) setChartLegend(legend, legendItems);
|
||||
|
||||
const layout = {
|
||||
|
|
@ -822,7 +851,8 @@
|
|||
const sample = { epoch, metrics: metrics || {} };
|
||||
const last = chartSamples[chartSamples.length - 1];
|
||||
if (last && epoch <= last.epoch) {
|
||||
if (epoch === last.epoch) chartSamples[chartSamples.length - 1] = sample;
|
||||
if (epoch === last.epoch)
|
||||
chartSamples[chartSamples.length - 1] = sample;
|
||||
} else {
|
||||
chartSamples.push(sample);
|
||||
}
|
||||
|
|
@ -850,7 +880,9 @@
|
|||
epoch: Number(sample.epoch),
|
||||
metrics: sample.metrics || {},
|
||||
}))
|
||||
.filter((sample) => Number.isFinite(sample.epoch) && sample.epoch > 0);
|
||||
.filter(
|
||||
(sample) => Number.isFinite(sample.epoch) && sample.epoch > 0,
|
||||
);
|
||||
trimChartSamples();
|
||||
renderAllCharts();
|
||||
}
|
||||
|
|
@ -863,7 +895,8 @@
|
|||
historyRange = button.dataset.range || DEFAULT_HISTORY_RANGE;
|
||||
selectedHistoryRange = historyRange;
|
||||
historyWindowSeconds =
|
||||
HISTORY_RANGES[historyRange] || HISTORY_RANGES[DEFAULT_HISTORY_RANGE];
|
||||
HISTORY_RANGES[historyRange] ||
|
||||
HISTORY_RANGES[DEFAULT_HISTORY_RANGE];
|
||||
chartXRange = null;
|
||||
setActiveHistoryRange(button);
|
||||
syncHistory();
|
||||
|
|
@ -992,8 +1025,8 @@
|
|||
`${memUsedHuman} (${fmtPercentOne(memPct)})`;
|
||||
const tempEl = card.querySelector('[data-val="temp"]');
|
||||
tempEl.textContent = fmtC(temp);
|
||||
// Temperature is degrees Celsius, not a fraction — no bar, but keep the
|
||||
// warn/danger color cue on the numeric value (>=70 warn, >=90 danger).
|
||||
// Temperature is degrees Celsius, not a fraction — no bar, but keep the warn/danger color
|
||||
// cue on the numeric value (>=70 warn, >=90 danger).
|
||||
tempEl.style.color = !Number.isFinite(temp)
|
||||
? ""
|
||||
: temp >= 90
|
||||
|
|
|
|||
|
|
@ -134,9 +134,9 @@ def cprint(text: str = '', *, file: TextIO | None = None) -> None:
|
|||
class MetricStore:
|
||||
"""Lock-protected rotating buffer of collector samples.
|
||||
|
||||
Each entry is ``(epoch_timestamp, metrics_dict)``. The buffer keeps at most
|
||||
``int(retention_seconds / interval)`` samples; older entries are evicted automatically by
|
||||
:class:`deque`.
|
||||
Each entry is ``(timestamp, metrics_dict)``.
|
||||
The buffer keeps at most ``int(retention / interval)`` samples; older entries are evicted
|
||||
automatically by :class:`deque`.
|
||||
"""
|
||||
|
||||
def __init__(self, *, retention_seconds: float, interval: float) -> None:
|
||||
|
|
@ -267,9 +267,9 @@ class MonitorRequestHandler(http.server.BaseHTTPRequestHandler):
|
|||
self._send_json(payload)
|
||||
|
||||
def _send_json(self, payload: object) -> None:
|
||||
# `allow_nan=False` makes strict JSON; ``_finite()`` first maps
|
||||
# `math.nan`/`math.inf` (which the collector emits for missing samples)
|
||||
# to `None` so the browser's `JSON.parse` accepts the body.
|
||||
# `allow_nan=False` makes strict JSON; ``_finite()`` first maps `math.nan`/`math.inf` (which
|
||||
# the collector emits for missing samples) to `None` so the browser's `JSON.parse` accepts
|
||||
# the body.
|
||||
body = json.dumps(_finite(payload), allow_nan=False, default=float).encode('utf-8')
|
||||
self.send_response(200)
|
||||
self.send_header('Content-Type', 'application/json; charset=utf-8')
|
||||
|
|
|
|||