Duplicated, missing or wrong metrics if using MIG, Grafana dashboard showing wrong duplicated / false values #353

frittentheke · 2024-07-05T09:46:12Z

What is the version?

3.3.5-3.4.1

What happened?

When activating MIG we saw duplicated and plain wrong metrics in the provided Grafana dashboard (https://github.com/NVIDIA/dcgm-exporter/tree/main/grafana).

The issue seems to be two-fold, with Grafana as well as the raw metrics themselves:

Firstly the dashboard: Legends, ... and PromQL queries used to fetch metrics do not take MIG into account. So metrics returning MIG subdevices (GPU_I_ID) are not considered.
GPU metrics regarding have not been up
Secondly the metrics:

Even if the queries were updated via e.g some aggregations like max(), avg() or sum() to avoid duplication, there are some metrics reported back per GPU_I_ID, that do not have this granularity. See me comment Attributing GPU power among MIG instances. #257 (comment). So if the power draw is not measured per GPU_I_ID you cannot return it individually as you would be returning false values.
Reading DCGM_FI_DEV_GPU_UTIL with MIG devices DCGM#80 (comment) it seems the GPU metrics should be replaced by DCGM_FI_PROF_* .
But apparently there are even more open discussions around that: DCGM_FI_PROF_GR_ENGINE_ACTIVE and MIG DCGM#138, Question about DCGM fields DCGM#64 and Questions about fields identifiers DCGM#48.
This comment by @bstollenvidia seems to sum up quite nicely how things work: Question about DCGM fields DCGM#64 (comment)

What did you expect to happen?

Provided MIG and other ways of partitioning GPUs (vGPU, time-slicing, ...) is quite common, I'd expect the exporter and the provided dashboard to take those into account.

Metrics that are available per-subdevice should be returned, if they are just duplicates of each other, they should be dropped and only returned per "main" GPU.

What is the GPU model?

H100s, using different MIG profiles and whole GPUs

What is the environment?

Kubernetes

How did you deploy the dcgm-exporter and what is the configuration?

Kubernetes with GPU-Operator

How to reproduce the issue?

Enable MIG on a GPU and look at the dashboard.

Anything else we need to know?

There are multiple issues with DCGM or the operator open:

The text was updated successfully, but these errors were encountered:

…name) * Change PromQL queries to take MIG subdevices into account (see NVIDIA#353) * Update all panels to use Timeseries panels (instead of deprecated Graph) * Switch from instance to Hostname to select individual systems to avoid duplicated timeseries for Kubernetes daemonsets and their Pod names * Use DCGM_FI_DEV_GPU_UTIL instead of DCGM_FI_DEV_GPU_TEMP to also cover vGPU (PR NVIDIA#240) Fixes: NVIDIA#353, NVIDIA#236 Signed-off-by: Christian Rohmann <[email protected]>

…name) * Use PromQL aggregations to take MIG subdevices into account (see NVIDIA#353) * Update all panels to use Timeseries panels (instead of deprecated Graph) * Switch from instance to Hostname to select individual systems to avoid duplicated timeseries for Kubernetes daemonsets and their Pod names * Use DCGM_FI_DEV_GPU_UTIL instead of DCGM_FI_DEV_GPU_TEMP to also cover vGPU (PR NVIDIA#240) * Use DCGM_FI_PROF_GR_ENGINE_ACTIVE to determine utilization to cover MIG (and vGPU) Fixes: NVIDIA#353, NVIDIA#236 Signed-off-by: Christian Rohmann <[email protected]>

…name) * Use PromQL aggregations to take MIG subdevices into account (see NVIDIA#353) * Update all panels to use Timeseries panels (instead of deprecated Graph) * Switch from instance to Hostname to select individual systems to avoid duplicated timeseries for Kubernetes daemonsets and their Pod names * Use DCGM_FI_DEV_FB_FREE instead of DCGM_FI_DEV_GPU_TEMP to also cover vGPU (~ PR NVIDIA#240) * Use DCGM_FI_PROF_GR_ENGINE_ACTIVE to determine utilization to cover MIG (and vGPU) Fixes: NVIDIA#353, NVIDIA#236 Signed-off-by: Christian Rohmann <[email protected]>

nvvfedorov · 2024-07-08T14:35:19Z

@frittentheke, Thank you for reporting about the issue. Am I right that the main request is the following: Metrics that are available per-subdevice should be returned; if they are just duplicates of each other, they should be dropped and only returned per "main" GPU.

frittentheke · 2024-07-08T14:53:17Z

Thanks @nvvfedorov for your fast response!

@frittentheke, Thank you for reporting about the issue. Am I right that the main request is the following: Metrics that are available per-subdevice should be returned; if they are just duplicates of each other, they should be dropped and only returned per "main" GPU.

Yes. Please also see my PR ([dashboard] Rework dashboard (MIG support, Grafana deprecations, Hostname) #355) in which I (had to) apply aggregations like max() to work around this for the dashboard. If you'd consider removing those duplicated metrics, I gladly simplify the PromQL queries for the dashboard / my PR (again).
If I may add another matter to my findings (which I also hit during the dashboard rework) - the exported labels are somewhat mixed-case with about all variants possible: Hostname vs. DCGM_FI_DRIVER_VERSION vs gpu vs. modelName. Please also consider cleaning this up. Especially when trying to join time-series via a set of labels it's really painful.

frittentheke added the bug Something isn't working label Jul 5, 2024

frittentheke mentioned this issue Jul 5, 2024

Attributing GPU power among MIG instances. #257

Closed

frittentheke changed the title ~~Duplicated / wrong metrics if using MIG, Grafana dashboard showing wrong duplicated / false values~~ Duplicated/missing and wrong metrics if using MIG, Grafana dashboard showing wrong duplicated / false values Jul 5, 2024

frittentheke changed the title ~~Duplicated/missing and wrong metrics if using MIG, Grafana dashboard showing wrong duplicated / false values~~ Duplicated, missing or wrong metrics if using MIG, Grafana dashboard showing wrong duplicated / false values Jul 5, 2024

This was referenced Jul 5, 2024

How to configure dcgm metrics for MIG? NVIDIA/gpu-operator#798

Closed

Questions about fields identifiers NVIDIA/DCGM#48

Open

This was referenced Jul 8, 2024

[dashboard] Rework dashboard (MIG support, Grafana deprecations, Hostname) #355

Open

Fix power calculation #288

Open

glowkey mentioned this issue Jul 9, 2024

Seeking community feedback on potential new feature: Standardize labels for next major release #356

Open

frittentheke mentioned this issue Jul 16, 2024

exported_namespace lable conflict with Prometheus metrics NVIDIA/gpu-operator#811

Open

schwesig mentioned this issue Sep 19, 2024

tb decided on: How to calculate GPUs when sliced nerc-project/operations#643

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Duplicated, missing or wrong metrics if using MIG, Grafana dashboard showing wrong duplicated / false values #353

Duplicated, missing or wrong metrics if using MIG, Grafana dashboard showing wrong duplicated / false values #353

frittentheke commented Jul 5, 2024 •

edited

Loading

nvvfedorov commented Jul 8, 2024

frittentheke commented Jul 8, 2024 •

edited

Loading

Duplicated, missing or wrong metrics if using MIG, Grafana dashboard showing wrong duplicated / false values #353

Duplicated, missing or wrong metrics if using MIG, Grafana dashboard showing wrong duplicated / false values #353

Comments

frittentheke commented Jul 5, 2024 • edited Loading

What is the version?

What happened?

What did you expect to happen?

What is the GPU model?

What is the environment?

How did you deploy the dcgm-exporter and what is the configuration?

How to reproduce the issue?

Anything else we need to know?

nvvfedorov commented Jul 8, 2024

frittentheke commented Jul 8, 2024 • edited Loading

frittentheke commented Jul 5, 2024 •

edited

Loading

frittentheke commented Jul 8, 2024 •

edited

Loading