
compute,storage: expose wallclock lag history in SQL #29449

Merged 5 commits into MaterializeInc:main on Sep 12, 2024

Conversation

@teskje (Contributor) commented Sep 9, 2024

This PR adds a new builtin source, mz_wallclock_lag_history, that reports the minute-by-minute maximum wallclock lag (the difference between the write frontier and the wallclock time) observed for each storage and compute collection. Both controllers take part in populating the history, each contributing lags for the collections they are responsible for.

mz_wallclock_lag_history reports per-replica lag for compute objects and global lag (based on the persisted frontier) for storage collections. Where there is overlap (i.e., for materialized views), both are reported. There is also a builtin view, mz_wallclock_global_lag_history, that exposes only global lags for all collections (by taking the minimum of all recorded lags). This structure was previously discussed here.
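For illustration, queries against the new relations might look like the following. This is a sketch only: the mz_internal schema and the column names (object_id, replica_id, lag, occurred_at) are assumptions based on the description above, not confirmed by this PR.

-- Per-replica lag history for compute objects (column names assumed):
SELECT object_id, replica_id, lag, occurred_at
FROM mz_internal.mz_wallclock_lag_history
ORDER BY occurred_at DESC
LIMIT 10;

-- Global lag (minimum across recorded lags) for all collections:
SELECT object_id, lag, occurred_at
FROM mz_internal.mz_wallclock_global_lag_history
ORDER BY occurred_at DESC
LIMIT 10;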

To give the console fast access to the wallclock lag data, an indexed view, mz_wallclock_global_lag_recent_history, is also provided. This view has a temporal filter that retains only the last day of data, to keep the index's memory usage in check.
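As a sketch of how such a temporal filter is typically expressed in Materialize (the actual view definition in this PR may differ, and occurred_at is an assumed column name):

-- Sketch of the temporal-filter pattern, not the exact definition from this PR.
CREATE VIEW mz_wallclock_global_lag_recent_history AS
SELECT *
FROM mz_wallclock_global_lag_history
WHERE mz_now() <= occurred_at + INTERVAL '1 day';

-- Indexing the view then keeps only about one day of records in memory.
CREATE DEFAULT INDEX ON mz_wallclock_global_lag_recent_history;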

The first commit is from #29423 and can be ignored here.

Motivation

  • This PR adds a known-desirable feature.

Part of MaterializeInc/database-issues#8235

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@teskje force-pushed the mz_wallclock_lag_history branch 2 times, most recently from 9921713 to 4262a02 on September 10, 2024 12:11
@teskje marked this pull request as ready for review September 10, 2024 14:29
@teskje requested review from a team and benesch as code owners September 10, 2024 14:29
shepherdlybot (bot) commented Sep 10, 2024

Risk Score: 83 / 100 · Bug Hotspots: 2 · Resilience Coverage: 50%

Mitigations

Completing required mitigations increases Resilience Coverage.

  • (Required) Code Review 🔍 Detected
  • (Required) Feature Flag
  • (Required) Integration Test 🔍 Detected
  • (Required) Observability
  • (Required) QA Review 🔍 Detected
  • (Required) Run Nightly Tests
  • Unit Test
Risk Summary:

The pull request carries a high risk score of 83, driven by predictors such as the average line count in files and executable lines within files, with two file hotspots involved. Historically, PRs with these predictors have been 152% more likely to cause a bug compared to the repository baseline. Notably, the observed bug trend in the repository is currently decreasing.

Note: The risk score is not based on semantic analysis but on historical predictors of bug occurrence in the repository. The attributes above were deemed the strongest predictors based on that history. Predictors and the score may change as the PR evolves in code, time, and review activity.

Bug Hotspots:

  File                    Percentile
  ../src/lib.rs           98
  ../src/controller.rs    96

@teskje force-pushed the mz_wallclock_lag_history branch 2 times, most recently from b269186 to e01f772 on September 10, 2024 17:43
src/storage-client/src/healthcheck.rs (review thread, outdated, resolved)
$ postgres-connect name=mz_system url=postgres://mz_system:materialize@${testdrive.materialize-internal-sql-addr}

$ postgres-execute connection=mz_system
ALTER SYSTEM SET wallclock_lag_refresh_interval = '1s'
A member commented on the test snippet above:

:chefskiss:
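A hypothetical follow-up assertion in the same testdrive style (not part of this PR's diff; the mz_internal schema is an assumption) could then verify that lag events appear quickly under the shortened refresh interval:

> SELECT count(*) > 0 FROM mz_internal.mz_wallclock_global_lag_recent_history
true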

@benesch (Member) left a comment:

This looks awesome, @teskje — thank you! I didn't review the storage controller changes in full detail, so ideally @petrosagg has time to take a second look there tomorrow, but if not I'm comfortable with you merging anyway so that this makes the release cut.

@teskje (Contributor, Author) commented Sep 12, 2024

I deployed this branch on my staging env and let it run for a day to observe the memory usage of the new index.

This is without any user collections, so just the system sources/tables/indexes/subscribes. There are 303 of these currently (plus a bunch of slow-path SELECTs randomly showing up).

The arrangement size stabilizes at 26 MiB on my env. That's roughly 88 KiB per collection. The environment with the most collections right now has 1258, which would end up at ~108 MiB.

If we assume twice that amount, to account for the higher per-record usage after the switch from uint8 to Interval, as well as other overheads our arrangement size metrics don't track, we end up at 216 MiB in the worst case (and 65 MiB for the median). All of the production environments easily have that much to spare, so it's safe to merge this PR.

Even if my estimates somehow turn out to be wildly off, we still have two means to adjust this in prod: scaling up the size of mz_catalog_server, and reducing the frequency at which new wallclock lag events are emitted.
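For reference, here is a sketch of how an arrangement size figure like the 26 MiB above can be inspected in SQL. It assumes the mz_internal.mz_dataflow_arrangement_sizes introspection relation with name, records, and size columns, and is not taken from this PR:

-- Hedged sketch: inspect arrangement memory for the new index.
SELECT name, records, size
FROM mz_internal.mz_dataflow_arrangement_sizes
WHERE name LIKE '%wallclock_global_lag_recent_history%'
ORDER BY size DESC;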

@teskje force-pushed the mz_wallclock_lag_history branch from e01f772 to 68d6735 on September 12, 2024 09:16
@teskje requested a review from petrosagg September 12, 2024 09:18
@teskje force-pushed the mz_wallclock_lag_history branch 2 times, most recently from 0b23628 to 1b91cf0 on September 12, 2024 13:08
This commit introduces the `mz_wallclock_lag_history` source, a view + index that retains the last day of events, and the corresponding `IntrospectionType`.
This commit makes the compute controller write the maximum wallclock lag
of all compute collections to introspection every minute.
This commit makes the storage controller write the maximum wallclock lag to introspection for all storage collections every minute.
Also adds a dyncfg to configure the refresh interval, to avoid having
the test take several minutes.
@teskje force-pushed the mz_wallclock_lag_history branch from 1b91cf0 to 807d26e on September 12, 2024 13:13
@petrosagg (Contributor) left a comment:

LGTM!

@teskje (Contributor, Author) commented Sep 12, 2024

TFTRs!

@teskje merged commit eb78cb0 into MaterializeInc:main on Sep 12, 2024
86 checks passed
@teskje deleted the mz_wallclock_lag_history branch September 12, 2024 14:47
@github-actions (bot) locked and limited conversation to collaborators Sep 12, 2024