Duplicated, missing or wrong metrics if using MIG, Grafana dashboard showing wrong duplicated / false values #353

Open
frittentheke opened this issue Jul 5, 2024 · 2 comments · May be fixed by #355
Labels
bug Something isn't working

Comments

@frittentheke

frittentheke commented Jul 5, 2024

What is the version?

3.3.5-3.4.1

What happened?

When activating MIG we saw duplicated and plain wrong metrics in the provided Grafana dashboard (https://github.com/NVIDIA/dcgm-exporter/tree/main/grafana).

The issue seems to be twofold, affecting the Grafana dashboard as well as the raw metrics themselves:

  1. Firstly, the dashboard: legends, ... and the PromQL queries used to fetch metrics do not take MIG into account, so metrics reported per MIG subdevice (GPU_I_ID) are not considered.
    GPU metrics regarding have not been up

  2. Secondly the metrics:

What did you expect to happen?

Given that MIG and other ways of partitioning GPUs (vGPU, time-slicing, ...) are quite common, I'd expect the exporter and the provided dashboard to take those into account.

Metrics that are available per subdevice should be returned; if they are just duplicates of each other, they should be dropped and only returned per "main" GPU.
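
To make the expectation above concrete, here is a minimal Python sketch of the rule: keep the per-subdevice series when their values differ, and collapse them into one series on the "main" GPU when they are pure duplicates. The sample values and the function name `collapse_duplicates` are hypothetical; only the label names (Hostname, gpu, GPU_I_ID) come from this issue. Comparing values is just a heuristic for illustration, and this is not a description of how dcgm-exporter works.

```python
from collections import defaultdict

# Hypothetical samples as (labels, value) pairs; the values are made up.
samples = [
    ({"Hostname": "node-a", "gpu": "0", "GPU_I_ID": "1"}, 41.0),
    ({"Hostname": "node-a", "gpu": "0", "GPU_I_ID": "2"}, 41.0),
    ({"Hostname": "node-a", "gpu": "1", "GPU_I_ID": "1"}, 10.0),
    ({"Hostname": "node-a", "gpu": "1", "GPU_I_ID": "2"}, 30.0),
]


def collapse_duplicates(samples, sub_label="GPU_I_ID"):
    """Drop the subdevice label where all subdevices of a GPU report the same value."""
    groups = defaultdict(list)
    for labels, value in samples:
        # Group by every label except the subdevice label.
        parent = tuple(sorted((k, v) for k, v in labels.items() if k != sub_label))
        groups[parent].append((labels, value))

    result = []
    for parent, members in groups.items():
        values = {value for _, value in members}
        if len(members) > 1 and len(values) == 1:
            # Pure duplicates: keep a single series on the "main" GPU.
            result.append((dict(parent), values.pop()))
        else:
            # Values differ (or there is only one series): keep them per subdevice.
            result.extend(members)
    return result


collapsed = collapse_duplicates(samples)
```

With the sample data, GPU 0 ends up as a single series without GPU_I_ID, while GPU 1 keeps both of its subdevice series.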

What is the GPU model?

H100s, using different MIG profiles and whole GPUs

What is the environment?

Kubernetes

How did you deploy the dcgm-exporter and what is the configuration?

Kubernetes with GPU-Operator

How to reproduce the issue?

Enable MIG on a GPU and look at the dashboard.

Anything else we need to know?

There are multiple issues with DCGM or the operator open:

@frittentheke frittentheke added the bug Something isn't working label Jul 5, 2024
@frittentheke frittentheke changed the title Duplicated / wrong metrics if using MIG, Grafana dashboard showing wrong duplicated / false values Duplicated/missing and wrong metrics if using MIG, Grafana dashboard showing wrong duplicated / false values Jul 5, 2024
@frittentheke frittentheke changed the title Duplicated/missing and wrong metrics if using MIG, Grafana dashboard showing wrong duplicated / false values Duplicated, missing or wrong metrics if using MIG, Grafana dashboard showing wrong duplicated / false values Jul 5, 2024
frittentheke added a commit to frittentheke/dcgm-exporter that referenced this issue Jul 8, 2024
…name)

* Change PromQL queries to take MIG subdevices into account (see NVIDIA#353)
* Update all panels to use Timeseries panels (instead of deprecated Graph)
* Switch from instance to Hostname to select individual systems to avoid
  duplicated timeseries for Kubernetes daemonsets and their Pod names
* Use DCGM_FI_DEV_GPU_UTIL instead of DCGM_FI_DEV_GPU_TEMP to also cover vGPU (PR NVIDIA#240)

Fixes: NVIDIA#353, NVIDIA#236

Signed-off-by: Christian Rohmann <[email protected]>
frittentheke added a commit to frittentheke/dcgm-exporter that referenced this issue Jul 8, 2024
…name)

* Use PromQL aggregations to take MIG subdevices into account (see NVIDIA#353)
* Update all panels to use Timeseries panels (instead of deprecated Graph)
* Switch from instance to Hostname to select individual systems to avoid
  duplicated timeseries for Kubernetes daemonsets and their Pod names
* Use DCGM_FI_DEV_GPU_UTIL instead of DCGM_FI_DEV_GPU_TEMP to also cover vGPU (PR NVIDIA#240)
* Use DCGM_FI_PROF_GR_ENGINE_ACTIVE to determine utilization to cover MIG (and vGPU)

Fixes: NVIDIA#353, NVIDIA#236

Signed-off-by: Christian Rohmann <[email protected]>
frittentheke added a commit to frittentheke/dcgm-exporter that referenced this issue Jul 8, 2024
…name)

* Use PromQL aggregations to take MIG subdevices into account (see NVIDIA#353)
* Update all panels to use Timeseries panels (instead of deprecated Graph)
* Switch from instance to Hostname to select individual systems to avoid
  duplicated timeseries for Kubernetes daemonsets and their Pod names
* Use DCGM_FI_DEV_FB_FREE instead of DCGM_FI_DEV_GPU_TEMP to also cover vGPU (~ PR NVIDIA#240)
* Use DCGM_FI_PROF_GR_ENGINE_ACTIVE to determine utilization to cover MIG (and vGPU)

Fixes: NVIDIA#353, NVIDIA#236

Signed-off-by: Christian Rohmann <[email protected]>
@nvvfedorov
Collaborator

@frittentheke, Thank you for reporting the issue. Am I right that the main request is the following: Metrics that are available per-subdevice should be returned; if they are just duplicates of each other, they should be dropped and only returned per "main" GPU.

@frittentheke
Author

frittentheke commented Jul 8, 2024

Thanks @nvvfedorov for your fast response!

> @frittentheke, Thank you for reporting the issue. Am I right that the main request is the following: Metrics that are available per-subdevice should be returned; if they are just duplicates of each other, they should be dropped and only returned per "main" GPU.

  1. Yes. Please also see my PR ([dashboard] Rework dashboard (MIG support, Grafana deprecations, Hostname) #355), in which I (had to) apply aggregations like max() to work around this for the dashboard. If you'd consider removing those duplicated metrics, I'd gladly simplify the PromQL queries for the dashboard / my PR (again).

  2. If I may add another matter to my findings (which I also hit during the dashboard rework): the exported labels use inconsistent casing, with just about every variant present: Hostname vs. DCGM_FI_DRIVER_VERSION vs. gpu vs. modelName. Please also consider cleaning this up. It's really painful, especially when trying to join time series on a set of labels.
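
As an illustration of the label cleanup asked for in point 2, here is a small Python sketch that maps the mixed-case label names onto one lower snake_case convention. The helper names `to_snake_case` and `normalize_labels` and the target convention are assumptions for this example, not something the exporter provides; the label names are the ones quoted above and the values are placeholders.

```python
import re


def to_snake_case(name: str) -> str:
    """Convert a label name such as 'modelName' or 'Hostname' to lower snake_case."""
    # Insert "_" where a lowercase letter or digit is followed by an uppercase
    # letter (modelName -> model_Name), then lowercase the whole name.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


def normalize_labels(labels: dict) -> dict:
    """Return a copy of the label set with all label names normalized."""
    return {to_snake_case(key): value for key, value in labels.items()}


# Label names taken from this issue; the values are placeholders.
raw = {
    "Hostname": "node-a",
    "DCGM_FI_DRIVER_VERSION": "example",
    "gpu": "0",
    "modelName": "example-model",
}
normalized = normalize_labels(raw)
```

With one consistent convention, joins across time series no longer need to remember which label happens to use which casing.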
