Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add balancer meter fixes #4470 #4471

Closed
wants to merge 8 commits into from

Conversation

EdColeman
Copy link
Contributor

@EdColeman EdColeman commented Apr 18, 2024

Adds two gauges that can be used to monitor balancing

  • manager.balancer.balancing.total - snapshot of the number of current migrations being performed.
  • manager.balancer.need.balancing.total - snapshot of the total number of migrations needed to balancing.

This fixes #4470

@EdColeman EdColeman requested a review from ddanielr April 18, 2024 20:46
@EdColeman EdColeman self-assigned this Apr 18, 2024
@EdColeman
Copy link
Contributor Author

I confirmed that with uno instance configured to use a LoggingProvider, the meters appear in the manager metrics log:

METRICS: 2024-04-19T12:08:37,303, manager.balancer.balancing.total{host=localhost,instance.name=uno,port=9999,process.name=manager} value=0
METRICS: 2024-04-19T12:08:37,303, manager.balancer.need.balancing.total{host=localhost,instance.name=uno,port=9999,process.name=manager} value=0

@EdColeman EdColeman force-pushed the add_balancer_meter branch from 4598f9c to d0d9997 Compare April 24, 2024 15:51
@EdColeman EdColeman marked this pull request as ready for review April 24, 2024 17:01
Comment on lines +946 to +947
balancerMetrics.setMigratingCount(params.migrationsOut().size());
balancerMetrics.setNeedMigrationCount(migrations.size());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that the migrations variable in the Manager contains all of the in-flight migrations. The params.migrationsOut contains the tablets that need to be migrated from this call to the balancer. I think based on what I'm seeing here, it's double counting. I wonder if we just want the size of the migrations list as that will encompass newly added, newly removed, and migrations waiting to happen.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to capture the newly added migrations, then it might make sense to capture the migration completions (which are in the TabletGroupWatcher) and cancellations (which happen in a couple places in the Manager).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So to complete the #4470 ask, just recording migrations.size() would be sufficient.
migrationsOut().size() seems useful as a secondary metric for alerting if balancer progress stalls.

Doing some general tests and with a single balancer call, both of the currently implemented metrics come back with the same values.

I think keeping migrations.size() under the migrations.needed metric makes sense.
in-progress wording may need to be tweaked a bit to better describe last set of migrations generated.

For users, a possible alert condition would be migrationsOut().size() is 0 while migrations.size() is still greater than 0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put up #4699 as a continuation of these changes.

Ed Coleman added 2 commits May 28, 2024 19:27
@ddanielr ddanielr self-assigned this May 29, 2024
@ctubbsii ctubbsii modified the milestones: 3.1.0, 2.1.3 Jul 12, 2024
@ddanielr
Copy link
Contributor

Closed in favor of #4699

@ddanielr ddanielr closed this Jul 27, 2024
@ctubbsii ctubbsii removed this from the 2.1.3 milestone Jul 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants