
compute,storage: expose wallclock lag history in SQL #29449

Merged 5 commits into MaterializeInc:main on Sep 12, 2024

Conversation

@teskje (Contributor) commented Sep 9, 2024

This PR adds a new builtin source, mz_wallclock_lag_history, that reports the minute-by-minute maximum wallclock lag (the difference between the write frontier and the wallclock time) observed for each storage and compute collection. Both controllers take part in populating the history, each contributing lags for the collections they are responsible for.

mz_wallclock_lag_history reports per-replica lag for compute objects and global lag (based on the persisted frontier) for storage collections. Where there is overlap (i.e., for materialized views), both are reported. There is also a builtin view, mz_wallclock_global_lag_history, that exposes only global lags for all collections (by taking the minimum of all recorded lags). This structure was previously discussed here.
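For illustration, queries against the new relations might look like the following. This is a sketch only: the mz_internal schema and the column names (object_id, replica_id, lag, occurred_at) are assumptions based on the description above, not confirmed by this PR.

-- Per-replica lag history for compute objects (column names assumed):
SELECT object_id, replica_id, lag, occurred_at
FROM mz_internal.mz_wallclock_lag_history
ORDER BY occurred_at DESC
LIMIT 10;

-- Global lag (minimum across recorded lags) for all collections:
SELECT object_id, lag, occurred_at
FROM mz_internal.mz_wallclock_global_lag_history
ORDER BY occurred_at DESC
LIMIT 10;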

To give the console fast access to the wallclock lag data, an indexed view, mz_wallclock_global_lag_recent_history, is also provided. This view has a temporal filter that retains only the last day of data, to keep the index's memory usage in check.
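As a sketch of how such a temporal filter is typically expressed in Materialize (the actual view definition in this PR may differ, and occurred_at is an assumed column name):

-- Sketch of the temporal-filter pattern, not the exact definition from this PR.
CREATE VIEW mz_wallclock_global_lag_recent_history AS
SELECT *
FROM mz_wallclock_global_lag_history
WHERE mz_now() <= occurred_at + INTERVAL '1 day';

-- Indexing the view then keeps only about one day of records in memory.
CREATE DEFAULT INDEX ON mz_wallclock_global_lag_recent_history;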

The first commit is from #29423 and can be ignored here.

Motivation

  • This PR adds a known-desirable feature.

Part of MaterializeInc/database-issues#8235

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@teskje force-pushed the mz_wallclock_lag_history branch 2 times, most recently from 9921713 to 4262a02 on September 10, 2024 12:11
@teskje marked this pull request as ready for review September 10, 2024 14:29
@teskje requested review from a team and benesch as code owners September 10, 2024 14:29
shepherdlybot (bot) commented Sep 10, 2024

Risk Score: 83 / 100 · Bug Hotspots: 2 · Resilience Coverage: 50%

Mitigations

Completing required mitigations increases Resilience Coverage.

  • (Required) Code Review 🔍 Detected
  • (Required) Feature Flag
  • (Required) Integration Test 🔍 Detected
  • (Required) Observability
  • (Required) QA Review 🔍 Detected
  • (Required) Run Nightly Tests
  • Unit Test
Risk Summary:

The pull request carries a high risk score of 83, driven by predictors such as the average line count in files and executable lines within files, with two file hotspots involved. Historically, PRs with these predictors have been 152% more likely to cause a bug compared to the repository baseline. Notably, the observed bug trend in the repository is currently decreasing.

Note: The risk score is not based on semantic analysis but on historical predictors of bug occurrence in the repository. The attributes above were deemed the strongest predictors based on that history. Predictors and the score may change as the PR evolves in code, time, and review activity.

Bug Hotspots:

  File                    Percentile
  ../src/lib.rs           98
  ../src/controller.rs    96

@teskje force-pushed the mz_wallclock_lag_history branch 2 times, most recently from b269186 to e01f772 on September 10, 2024 17:43
src/storage-client/src/healthcheck.rs (review thread, outdated, resolved)
$ postgres-connect name=mz_system url=postgres://mz_system:materialize@${testdrive.materialize-internal-sql-addr}

$ postgres-execute connection=mz_system
ALTER SYSTEM SET wallclock_lag_refresh_interval = '1s'
A member commented on the test snippet above:

:chefskiss:
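A hypothetical follow-up assertion in the same testdrive style (not part of this PR's diff; the mz_internal schema is an assumption) could then verify that lag events appear quickly under the shortened refresh interval:

> SELECT count(*) > 0 FROM mz_internal.mz_wallclock_global_lag_recent_history
true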

@benesch (Member) left a comment:

This looks awesome, @teskje — thank you! I didn't review the storage controller changes in full detail, so ideally @petrosagg has time to take a second look there tomorrow, but if not I'm comfortable with you merging anyway so that this makes the release cut.

@teskje (Contributor, Author) commented Sep 12, 2024

I deployed this branch on my staging env and let it run for a day to observe the memory usage of the new index.

This is without any user collections, so just the system sources/tables/indexes/subscribes. There are 303 of these currently (plus a bunch of slow-path SELECTs randomly showing up).

The arrangement size stabilizes at 26 MiB on my env. That's roughly 88 KiB per collection. The environment with the most collections right now has 1258, which would end up at ~108 MiB.

If we assume twice that amount, to account for the higher per-record usage after the switch from uint8 to Interval, as well as other overheads our arrangement size metrics don't track, we end up at 216 MiB in the worst case (and 65 MiB for the median). All of the production environments easily have that much to spare, so it's safe to merge this PR.

Even if my estimates somehow turn out to be wildly off, we still have two means to adjust this in prod: scaling up the size of mz_catalog_server, and reducing the frequency at which new wallclock lag events are emitted.
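For reference, here is a sketch of how an arrangement size figure like the 26 MiB above can be inspected in SQL. It assumes the mz_internal.mz_dataflow_arrangement_sizes introspection relation with name, records, and size columns, and is not taken from this PR:

-- Hedged sketch: inspect arrangement memory for the new index.
SELECT name, records, size
FROM mz_internal.mz_dataflow_arrangement_sizes
WHERE name LIKE '%wallclock_global_lag_recent_history%'
ORDER BY size DESC;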

@teskje force-pushed the mz_wallclock_lag_history branch from e01f772 to 68d6735 on September 12, 2024 09:16
@teskje requested a review from petrosagg September 12, 2024 09:18
@teskje force-pushed the mz_wallclock_lag_history branch 2 times, most recently from 0b23628 to 1b91cf0 on September 12, 2024 13:08
This commit introduces the `mz_wallclock_lag_history` source, a view + index that retains the last day of events, and the corresponding `IntrospectionType`.
This commit makes the compute controller write the maximum wallclock lag
of all compute collections to introspection every minute.
This commit makes the storage controller write the maximum wallclock lag to introspection for all storage collections every minute.
Also adds a dyncfg to configure the refresh interval, to avoid having
the test take several minutes.
@teskje force-pushed the mz_wallclock_lag_history branch from 1b91cf0 to 807d26e on September 12, 2024 13:13
@petrosagg (Contributor) left a comment:

LGTM!

@teskje (Contributor, Author) commented Sep 12, 2024

TFTRs!

@teskje merged commit eb78cb0 into MaterializeInc:main on Sep 12, 2024
86 checks passed
@teskje deleted the mz_wallclock_lag_history branch September 12, 2024 14:47
@github-actions (bot) locked and limited conversation to collaborators Sep 12, 2024