compute,storage: expose wallclock lag history in SQL #29449
Conversation
(force-pushed from 9921713 to 4262a02)
Mitigations: Completing required mitigations increases Resilience Coverage.

Risk Summary: The pull request carries a high risk score of 83, driven by predictors such as the average line count in files and executable lines within files, with two file hotspots involved. Historically, PRs with these predictors have been 152% more likely to cause a bug compared to the repository baseline. Notably, the observed bug trend in the repository is currently decreasing.

Note: The risk score is not based on semantic analysis but on historical predictors of bug occurrence in the repository. The attributes above were deemed the strongest predictors based on that history. Predictors and the score may change as the PR evolves in code, time, and review activity.
(force-pushed from b269186 to e01f772)
$ postgres-connect name=mz_system url=postgres://mz_system:materialize@${testdrive.materialize-internal-sql-addr}

$ postgres-execute connection=mz_system
ALTER SYSTEM SET wallclock_lag_refresh_interval = '1s'
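For orientation, here is a hypothetical smoke check that could follow the snippet above. The `mz_internal` schema placement is an assumption, not something confirmed on this page; the idea is simply that once the shortened refresh interval fires, the history source should contain rows.

```sql
-- Hypothetical check (mz_internal placement is an assumption): after the
-- 1s refresh interval fires, the source should have accumulated records.
SELECT count(*) > 0 FROM mz_internal.mz_wallclock_lag_history;
```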
:chefskiss:
This looks awesome, @teskje — thank you! I didn't review the storage controller changes in full detail, so ideally @petrosagg has time to take a second look there tomorrow, but if not I'm comfortable with you merging anyway so that this makes the release cut.
I deployed this branch on my staging env and let it run for a day to observe the memory usage of the new index. This is without any user collections, so just the system sources/tables/indexes/subscribes. There are 303 of these currently (plus a bunch of slow-path […]).

The arrangement size stabilizes at 26 MiB on my env. That's roughly 88 KiB per collection. The environment with the most collections right now has 1258, which would end up at ~108 MiB. If we assume twice that amount, to account for the higher per-record usage for the switch from […].

Even if somehow my estimations are wildly off, we still have two means to adjust this in prod: scaling up the size of […].
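For what it's worth, the arithmetic above checks out. A quick way to reproduce the extrapolation, using only the numbers quoted in the comment (the query is just a calculator, not part of the PR):

```sql
-- Back-of-the-envelope: 26 MiB over 303 collections, extrapolated to 1258.
SELECT
    26.0 * 1024 / 303 AS kib_per_collection,          -- ≈ 88 KiB
    1258 * 26.0 * 1024 / 303 / 1024 AS mib_for_1258;  -- ≈ 108 MiB
```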
(force-pushed from e01f772 to 68d6735)
(force-pushed from 0b23628 to 1b91cf0)
This commit introduces the `mz_wallclock_lag_history` source and its corresponding `IntrospectionType`, along with a view and index that retain the last day of events.
This commit makes the compute controller write the maximum wallclock lag of all compute collections to introspection every minute.
This commit makes the storage controller write the maximum wallclock lag to introspection for all storage collections every minute.
Also adds a dyncfg to configure the refresh interval, to avoid having the test take several minutes.
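As a rough illustration of the "retain the last day of events" part, below is a minimal sketch in the style of a Materialize temporal filter. The actual builtin view definition is not shown on this page; the `occurred_at` column name and the `mz_internal` schema are assumptions.

```sql
-- A sketch only: keep roughly one day of lag history.
-- occurred_at and mz_internal are assumed names.
CREATE VIEW wallclock_lag_last_day AS
SELECT *
FROM mz_internal.mz_wallclock_lag_history
WHERE mz_now() <= occurred_at + INTERVAL '1 day';

-- Indexing the view keeps the filtered window resident in memory.
CREATE INDEX wallclock_lag_last_day_idx ON wallclock_lag_last_day (occurred_at);
```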
(force-pushed from 1b91cf0 to 807d26e)
LGTM!
TFTRs!
This PR adds a new builtin source, `mz_wallclock_lag_history`, that reports the minute-by-minute maximum wallclock lag (the difference between write frontier and wallclock time) observed for each storage and compute collection. Both controllers take part in populating the history, each contributing lags for the collections they are responsible for.

`mz_wallclock_lag_history` reports per-replica lag for compute objects and global lag (based on the persisted frontier) for storage collections. Where there is overlap (i.e. for materialized views), both are reported. There is also a builtin view, `mz_wallclock_global_lag_history`, that exposes only global lags for all collections (by taking the minimum of all recorded lags). This structure was previously discussed here.

To give the console fast access to the wallclock lag data, an indexed view, `mz_wallclock_global_lag_recent_history`, is also provided. This view has a temporal filter of 1 day, to keep the index's memory usage in check.

The first commit is from #29423 and can be ignored here.
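To make the "minimum of all recorded lags" idea concrete, here is a hedged sketch of the aggregation such a global-lag view could perform. All column names (`object_id`, `occurred_at`, `lag`) and the `mz_internal` schema are assumptions, since the actual view definition is not shown on this page.

```sql
-- Sketch: collapse per-replica records into one global lag per
-- collection and minute. Column names are assumed, not confirmed.
SELECT object_id, occurred_at, min(lag) AS global_lag
FROM mz_internal.mz_wallclock_lag_history
GROUP BY object_id, occurred_at;
```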
Motivation
Part of MaterializeInc/database-issues#8235
Checklist

- If this PR evolves an existing `$T ⇔ Proto$T` mapping (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label.