Wrong query result from aggregation Operators #703

Open
paulojmdias opened this issue Dec 13, 2024 · 1 comment · May be fixed by #704

paulojmdias commented Dec 13, 2024

I have the following setup with 3 server groups:

promxy
  -> Grafana mimir dc1
      -> dc1
      -> google-us-dc1
      -> google-us-dc2
  -> Grafana mimir dc2
      -> dc2
      -> google-us-dc1
      -> google-us-dc2
  -> Grafana mimir dc3
      -> dc3
      -> google-eu-dc1
      -> google-eu-dc2

When we run the query count(up{label_key="label_value"}) by (region), we get the following results:

{region="dc1"}              342
{region="google-us-dc1"}    31
{region="google-us-dc2"}    31
{region="dc2"}              341
{region="dc3"}              30
{region="google-eu-dc1"}    25
{region="google-eu-dc2"}    24
{region="google-eu-dc3"}    29
{region="google-eu-dc4"}    36
{region="google-eu-dc5"}    29

If we remove the by (region) grouping and run the query count(up{label_key="label_value"}), I expect the value 918, but promxy is returning the max value from the 3 server groups we have, which is 404. In this case, that value is the sum of the data that resides on Grafana mimir dc1:

{region="dc1"}              342
{region="google-us-dc1"}    31
{region="google-us-dc2"}    31
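
As a quick sanity check on the two numbers, the arithmetic can be sketched in Python. The dictionary only restates the per-region counts reported in this issue; the variable names are illustrative and not part of promxy.

```python
# Per-region counts copied from the `count(...) by (region)` result in this issue.
counts_by_region = {
    "dc1": 342,
    "google-us-dc1": 31,
    "google-us-dc2": 31,
    "dc2": 341,
    "dc3": 30,
    "google-eu-dc1": 25,
    "google-eu-dc2": 24,
    "google-eu-dc3": 29,
    "google-eu-dc4": 36,
    "google-eu-dc5": 29,
}

# Regions served by the "Grafana mimir dc1" server group, per the setup described above.
mimir_dc1_regions = ["dc1", "google-us-dc1", "google-us-dc2"]

# Expected result of the un-grouped count: every region counted exactly once.
expected_total = sum(counts_by_region.values())

# What a single server group (mimir dc1) would report on its own.
mimir_dc1_total = sum(counts_by_region[r] for r in mimir_dc1_regions)

print(expected_total)   # 918
print(mimir_dc1_total)  # 404
```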

I also ran a test: I added a dedicated label, named __dc__, to each server group. With the query count(count(up{stack="persistence"}) without (__dc__)), we get the desired value, which is 918.
However, if we run the expected query count(up{stack="persistence"}), we get the value 980, because the series from google-us-dc1 and google-us-dc2 are counted twice. When we add custom labels per server group, we are saying that the data in each server group is unique, which is not the case.
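
The double counting can be reproduced with a small model in Python. This is only a sketch under stated assumptions: the region membership of mimir dc1 and mimir dc2 is taken from the setup above, while mimir dc3 is assumed to hold every google-eu-* region that appears in the results, and the overlapping google-us-* regions are assumed to expose identical series in both groups. The group names are hypothetical.

```python
# Per-region counts copied from the `count(...) by (region)` result in this issue.
counts_by_region = {
    "dc1": 342, "google-us-dc1": 31, "google-us-dc2": 31,
    "dc2": 341,
    "dc3": 30, "google-eu-dc1": 25, "google-eu-dc2": 24,
    "google-eu-dc3": 29, "google-eu-dc4": 36, "google-eu-dc5": 29,
}

# ASSUMPTION: which regions each server group can see. The first two entries
# follow the setup in this issue; the third is a guess that covers all
# google-eu-* regions present in the results.
server_groups = {
    "mimir-dc1": ["dc1", "google-us-dc1", "google-us-dc2"],
    "mimir-dc2": ["dc2", "google-us-dc1", "google-us-dc2"],
    "mimir-dc3": ["dc3", "google-eu-dc1", "google-eu-dc2",
                  "google-eu-dc3", "google-eu-dc4", "google-eu-dc5"],
}

# With a unique __dc__ label per server group, the same underlying series seen
# through two groups becomes two distinct series: (group, region, index).
labelled_series = {
    (group, region, i)
    for group, regions in server_groups.items()
    for region in regions
    for i in range(counts_by_region[region])
}

# Models count(up{...}): every labelled series is counted.
count_with_label = len(labelled_series)

# Models count(count(up{...}) without (__dc__)): drop the group label first,
# so duplicates collapse, then count what is left.
count_without_label = len({(region, i) for _, region, i in labelled_series})

print(count_with_label)     # 980
print(count_without_label)  # 918
```

The difference of 62 is exactly the 31 + 31 series from google-us-dc1 and google-us-dc2 that are visible through two server groups in this model.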

Although we are using Mimir, in the end it is a Prometheus query API that we are using, so I don't think it is related.

We are not overriding the prefer_max option and we are using the version v0.0.91.

I already tried to debug the Promxy code, but I ran out of ideas and decided to open this issue. I'm open to contributing either way if I find something 🙌

jacksontj (Owner) commented

Thanks for reaching out, let's jump into it!

> I have the following setup with 3 server groups

I believe there may be a typo in this example; as described, this configuration has some overlapping DCs (google-us-dc1 is in both mimir dc1 and mimir dc2). Given that the example below has eu-dc1..5 -- I'm assuming mimir dc2 was supposed to be eu? (Otherwise I don't see where eu dc3, dc4, and dc5 come from.)

> but promxy is returning the max value from the 3 server groups we have,

This sounds like the servergroup configuration may not be quite right, as the NodeReplacer (which does the max/rewrite) runs at the top level, while all of the servergroup merging is done lower down. So this does sound like an issue with the servergroup configuration rather than with the aggregation rewrite in NodeReplacer.

> Although we are using Mimir, in the end it is a Prometheus query API that we are using, so I don't think it is related.

That seems correct; this looks like an issue with the promxy servergroup config not quite matching your setup.

> We are not overriding the prefer_max option and we are using the version v0.0.91.

If we are running into prefer_max, we are definitely hitting a servergroup configuration issue. The prefer_max option is intended to handle merging of data within a servergroup (defined as a set of API endpoints that "have the same data").
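
To illustrate what "merging data within a servergroup" means, here is a toy Python model of combining two endpoints that are supposed to hold the same series. It is not promxy's implementation: the function name, the gap-filling rule, and the behavior when prefer_max is off are all assumptions made for this sketch.

```python
def merge_replicas(a, b, prefer_max=False):
    """Merge two {timestamp: value} series that should contain the same data.

    Hypothetical helper for illustration only, not promxy code.
    """
    merged = dict(a)
    for ts, value in b.items():
        if ts not in merged:
            # Fill a gap using the sample from the other endpoint.
            merged[ts] = value
        elif prefer_max:
            # Both endpoints have a sample: keep the larger one.
            merged[ts] = max(merged[ts], value)
        # Otherwise keep the sample from the first endpoint (assumed default).
    return dict(sorted(merged.items()))


# Two endpoints that mostly agree; each is missing one sample.
replica_1 = {0: 10.0, 15: 10.0}
replica_2 = {0: 12.0, 30: 12.0}

merged_default = merge_replicas(replica_1, replica_2)
merged_max = merge_replicas(replica_1, replica_2, prefer_max=True)

print(merged_default)  # {0: 10.0, 15: 10.0, 30: 12.0}
print(merged_max)      # {0: 12.0, 15: 10.0, 30: 12.0}
```

The point of the sketch is that this kind of merge picks one value per timestamp and never adds values together, so it only makes sense when the endpoints really do hold the same data.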

> I ran out of ideas and decided to open this issue

I'd be happy to give a hand here! Could you provide your promxy config, or at least the servergroup configuration? It would also help to reiterate the downstreams, their data, and the desired merging behavior. I think from there we'll be able to make some progress :)
