REFRESH options for MVs -- Compute #23819

Merged · 1 commit merged into main from refresh-every on Jan 4, 2024

Conversation

ggevay
Contributor

@ggevay ggevay commented Dec 11, 2023

This PR is the Compute part of the first stage of the REFRESH options for materialized views epic. It implements all the refresh options mentioned in the design doc, but it doesn't implement automatic replica management yet.
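
As a quick illustration of the syntax being added, a materialized view can be given refresh options like the following (a minimal sketch; the table and view names are made up):

-- Hypothetical names, for illustration only.
CREATE MATERIALIZED VIEW orders_by_status
  WITH (
    REFRESH EVERY '1 hour',                   -- periodic refreshes
    REFRESH AT mz_now()::text::int8 + 60000   -- one extra refresh ~1 minute after creation
  )
  AS SELECT status, count(*) FROM orders GROUP BY status;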

@jkosh44, could you please review the changes in interval.rs?

@teskje, could you please review sequence_create_materialized_view + the Compute parts + refresh_schedule.rs?

Note that in sequence_create_materialized_view I moved the timestamp selection outside of the catalog_transact_with_side_effects, as discussed with Matt here.

Motivation

Temporary Limitations

No automated replica management yet, as noted above.

If there is an index on the materialized view, then we have a problem after envd restarts: the index's since will be bootstrapped to 1 sec before the next refresh, which is most likely in the future, so queries on the MV block until the next refresh. We should enhance the frontier selection in the bootstrapping at some point, but for now I'll just ask customers not to create indexes on these MVs.
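
As a hypothetical illustration (all names below are made up), this is the pattern to avoid for now:

-- Hypothetical example; avoid indexes on REFRESH EVERY MVs for now.
CREATE MATERIALIZED VIEW daily_totals
  WITH (REFRESH EVERY '1 day')
  AS SELECT region, sum(amount) AS total FROM sales GROUP BY region;

-- After an envd restart, this index's since would be bootstrapped to 1 sec
-- before the (likely future) next refresh, so queries hitting the index
-- would block until that refresh completes.
CREATE INDEX daily_totals_idx ON daily_totals (region);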

This first version simply relies on Adapter read holds to keep the sinces close to the present moment. Eventually, we'll want custom compaction windows on REFRESH EVERY MVs (and their indexes), because relying on Adapter read holds prevents SERIALIZABLE reads from REFRESH EVERY MVs from selecting somewhat older timestamps, making SERIALIZABLE reads no faster than STRICT SERIALIZABLE reads.

Tips for reviewer

Checklist

@ggevay ggevay force-pushed the refresh-every branch 5 times, most recently from 32289ce to f421c70 Compare December 12, 2023 12:15
@ggevay ggevay marked this pull request as ready for review December 12, 2023 12:19
@ggevay ggevay requested a review from a team as a code owner December 12, 2023 12:19
@ggevay ggevay requested review from a team December 12, 2023 12:19
@ggevay ggevay requested a review from benesch as a code owner December 12, 2023 12:19
@ggevay ggevay requested review from maddyblue, jkosh44 and teskje and removed request for maddyblue December 12, 2023 12:19

shepherdlybot bot commented Dec 12, 2023

Risk Score: 83/100 · Bug Hotspots: 4 · Resilience Coverage: 80%

Mitigations

Completing required mitigations increases Resilience Coverage.

  • (Required) Code Review 🔍 Detected
  • (Required) Feature Flag
  • (Required) Integration Test 🔍 Detected
  • (Required) Observability
  • (Required) QA Review 🔍 Detected
  • Unit Test 🔍 Detected
Risk Summary:

The pull request poses a high risk with a score of 83, driven by predictors such as the average line count and executable lines within files, which historically are associated with a 138% higher likelihood of introducing bugs compared to the repository baseline. The repository has seen a decreasing trend in bugs, but with recent spikes in bug fixes, and a similar pattern is predicted with spikes in riskier pull requests. Additionally, the changes involve 3 files that are known hotspots for bugs.

Note: The risk score is not based on semantic analysis but on historical predictors of bug occurrence in the repository. The attributes above were deemed the strongest predictors based on that history. Predictors and the score may change as the PR evolves in code, time, and review activity.

Bug Hotspots:
What's This?

File Percentile
../src/coord.rs 98
../sequencer/inner.rs 97
../coord/command_handler.rs 92
../statement/ddl.rs 97

@nrainer-materialize nrainer-materialize requested a review from a team December 12, 2023 13:08
@ggevay ggevay added the A-compute (Area: compute) and A-ADAPTER (Topics related to the ADAPTER layer) labels on Dec 12, 2023
src/sql/src/names.rs (outdated review thread, resolved)
@benesch
Member

benesch commented Dec 12, 2023

Exciting that you've gotten this working end to end, Gabor! This PR is at a size that's really hard to handle in GitHub reviews, though. Would recommend trying to split out the parsing, planning, and actual implementation into separate PRs. The parser stuff looks basically ready to go, for example, and could merge basically ASAP (just modulo the comments I dropped above). Totally okay for a parser-only PR to immediately error in the planner with "REFRESH not yet supported", and then for the next planning-only PR to immediately error in the coordinator with "REFRESH not yet supported".

@ggevay
Contributor Author

ggevay commented Dec 12, 2023

Ok, I'll split this into 3 PRs.

Thanks for the comments!

@ggevay ggevay marked this pull request as draft December 12, 2023 15:23
@philip-stoev
Contributor

@ggevay is this ready for testing at this time?

@ggevay
Contributor Author

ggevay commented Dec 13, 2023

@philip-stoev, not yet. Unfortunately, there are several issues with it at the moment, which I'm fixing now, but I got a bit sick, so it's going slowly.

@ggevay
Contributor Author

ggevay commented Dec 13, 2023

Status: There are several minor issues that should be easy to fix, but there is one serious panic in SLTs. I figured out what's causing the panic with Matt's help, but it's not entirely clear how to fix it. Working on it.

@ggevay ggevay force-pushed the refresh-every branch 2 times, most recently from 1c5a0e5 to dee4de6 Compare December 13, 2023 17:58
@ggevay ggevay force-pushed the refresh-every branch 4 times, most recently from 912ebd7 to 991fdbb Compare January 3, 2024 20:33
Contributor

@def- def- left a comment

I found 2 separate panics with my one-off draft that hackily converts all MVs to REFRESH EVERY '2 seconds':

thread 'timely:work-0' panicked at /cargo/git/checkouts/timely-dataflow-70b80d81d6cabd62/de20aa8/timely/src/dataflow/operators/capability.rs:139:13: Attempted to downgrade Capability { time: 1704332460774, internal: "..." } to 774, which is not beyond the capability's time.

and

thread 'coordinator' panicked at src/sql/src/plan/statement/ddl.rs:2148:29: ALIGNED TO should have been filled in by purification

I'm guessing those are both unexpected. The non-matching rows mostly don't look like wrong results, but I'm not sure about the controller-frontiers one:

controller-frontiers.td:305:1: error: non-matching rows: expected:
[["0", "<null>"]]
got:
[["751", "<null>"]]
Poor diff:
- 0 <null>
+ 751 <null>

Test results are in https://buildkite.com/materialize/tests/builds/72256 and https://buildkite.com/materialize/nightlies/builds/5842

@ggevay
Contributor Author

ggevay commented Jan 4, 2024

I found 2 separate panics with my #24210 that hackily converts all MVs to REFRESH EVERY '2 seconds':

@def- thank you very much for this test!

I've fixed both panics:

  • The Attempted to downgrade Capability panic was fortunately already fixed yesterday evening (caught by @antiguru from looking at the code) by 56d8e12
  • The ALIGNED TO panic was caused by forgetting to purify EXPLAIN CREATE MATERIALIZED VIEW, fixed by 4e9f070

I pushed the fixes to your PR, and I'm running CI again. Hopefully, there will be no panics now.

The non-matching rows might be ok, but I'll definitely investigate them.

Contributor

@def- def- left a comment

Is it expected that the REFRESH option can be repeated? Only the last will have an effect, but the first one is still evaluated, so this panics:

CREATE SOURCE counter        
  FROM LOAD GENERATOR COUNTER
  (TICK INTERVAL '1s');      

CREATE MATERIALIZED VIEW mv8 WITH (REFRESH AT mz_unsafe.mz_panic('heh'), REFRESH EVERY '1 second') AS SELECT * FROM counter;

Here I'm only seeing the refresh every 10 seconds, not at 2 seconds after the current time:

CREATE MATERIALIZED VIEW mv7 WITH (REFRESH AT mz_now()::text::int8 + 2000, REFRESH EVERY '10 seconds') AS SELECT * FROM counter;
COPY (SUBSCRIBE (SELECT * FROM mv8)) TO STDOUT;

I'm not sure if this is how we handle options in general or if we want to block this off since someone could accidentally use it and expect both refreshes to have an effect.

@ggevay ggevay marked this pull request as ready for review January 4, 2024 13:05
@ggevay
Contributor Author

ggevay commented Jan 4, 2024

Is it expected that the REFRESH option can be repeated? Only the last will have an effect, but the first one is still evaluated, so this panics:

@def- Definitely both refresh options should take effect! I'll investigate.

@ggevay
Contributor Author

ggevay commented Jan 4, 2024

@def-

The first one is expected to panic, because both REFRESH options are evaluated, so the mz_unsafe.mz_panic will be called.

The second one creates mv7, but then reads from mv8. If I make the SUBSCRIBE read from mv7, then the refresh 2 seconds after creation does happen for me. However, there is actually a bug here: I treated the last REFRESH AT as the last refresh ever, so the SUBSCRIBE terminates, when it should actually run indefinitely. I'll fix this in a moment.
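
For reference, the second example with the SUBSCRIBE reading from mv7 (as presumably intended) would be:

CREATE MATERIALIZED VIEW mv7 WITH (REFRESH AT mz_now()::text::int8 + 2000, REFRESH EVERY '10 seconds') AS SELECT * FROM counter;
COPY (SUBSCRIBE (SELECT * FROM mv7)) TO STDOUT;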

Edit: I pushed a fix: 6c1f168

src/adapter/src/coord.rs (outdated review thread, resolved)
Contributor

@teskje teskje left a comment

I still have some questions about what setting until actually guarantees about the times that arrive at a dataflow output, and I'd like to see a test for the bootstrap as-of selection. But given that this will be behind a feature flag and disabled by default, I'm fine with merging this version.

src/compute/src/sink/persist_sink.rs (outdated review threads, resolved)
Contributor

@def- def- left a comment

Thanks, not seeing any panics anymore.

The non-matching rows might be ok, but I'll definitely investigate them.

I took a quick look based on the latest runs in https://buildkite.com/materialize/tests/builds/72270 and https://buildkite.com/materialize/nightlies/builds/5848
These look worrying:

controller-frontiers.td:305:1: error: non-matching rows: expected:
[["0", "<null>"]]
got:
[["603", "<null>"]]
Poor diff:
- 0 <null>
+ 603 <null>
kafka-avro-upsert-sinks.td:186:1: error: missing record 0: Record {
    headers: [
        "1",
    ],
    key: Some(
        {
            (
                "a",
                Long(1),
            ),
        },
    ),
    value: Some(
        {
            (
                "a",
                Long(1),
            ),
            (
                "b",
                Long(1),
            ),
        },
    ),
}
temporal.td:18:1: error: executing query failed: db error: ERROR: Timestamp (0) is not valid for all inputs: [Antichain { elements: [700] }]: ERROR: Timestamp (0) is not valid for all inputs: [Antichain { elements: [700] }]
     |
  17 |
  18 | > SELECT * FROM one_bound1 AS OF 0
     | ^
> SELECT * FROM foo;
rows didn't match; sleeping to see if dataflow catches up 50ms 75ms 113ms 169ms 253ms 380ms 570ms 854ms 1s 2s 3s 4s 6s 10s 15s 22s 33s 22s
^^^ +++
timestamps-debezium-kafka.td:160:1: error: non-matching rows: expected:
[["1", "1"], ["2", "2"]]
got:
[]
Poor diff:
- 1 1
- 2 2
> SELECT
  frontiers.read_frontier,
  frontiers.write_frontier
FROM mz_internal.mz_frontiers frontiers
JOIN mz_materialized_views mvs
  ON frontiers.object_id = mvs.id
WHERE
  mvs.name = 'mv2'
rows didn't match; sleeping to see if dataflow catches up 50ms 75ms 113ms 169ms 253ms 380ms 570ms 854ms 1s 2s 3s 4s 6s 10s 15s 22s 33s 22s
^^^ +++
controller-frontiers.td:305:1: error: non-matching rows: expected:
[["0", "<null>"]]
got:
[["6", "<null>"]]
Poor diff:
- 0 <null>
+ 6 <null>

@ggevay
Contributor Author

ggevay commented Jan 4, 2024

@def-

These look worrying:

Thanks, I'll check them!

Btw., we just made some changes with @antiguru and @teskje, so I'll rebase the testing PR now to re-run. Saving the links to the old runs that your above comment refers to:

@ggevay
Contributor Author

ggevay commented Jan 4, 2024

@def- The non-matching rows where mz_internal.mz_frontiers is involved are expected, because REFRESH options modify how the frontiers move. So the non-matching rows in controller-frontiers.td can be ignored.

@ggevay
Contributor Author

ggevay commented Jan 4, 2024

@def- The

temporal.td:18:1: error: executing query failed: db error: ERROR: Timestamp (0) is not valid for all inputs:

is also expected: REFRESH options modify which times are valid to read from. Originally, 0 was valid because this MV has only constant inputs, but with REFRESH EVERY the first valid read time is the time of the first refresh, which is mz_now() by default.
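
A hypothetical sketch of the difference (view names are made up):

-- Without REFRESH options, an MV over constant inputs is readable at time 0:
CREATE MATERIALIZED VIEW const_mv AS SELECT 1 AS x;
SELECT * FROM const_mv AS OF 0;              -- ok

-- With REFRESH EVERY, the earliest readable time is the first refresh
-- (mz_now() at creation by default), so reading AS OF 0 is rejected:
CREATE MATERIALIZED VIEW const_refresh_mv WITH (REFRESH EVERY '1 hour') AS SELECT 1 AS x;
SELECT * FROM const_refresh_mv AS OF 0;      -- ERROR: Timestamp (0) is not valid for all inputs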

@ggevay
Contributor Author

ggevay commented Jan 4, 2024

@def-
kafka-avro-upsert-sinks.td: The non-matching rows here are also expected: the distribution of data among those 3 kafka-verify-data calls is based on data timestamps in the MV, which are modified by the REFRESH options (times are rounded up to the nearest refresh).

@ggevay
Contributor Author

ggevay commented Jan 4, 2024

@def-
timestamps-debezium-kafka.td: This one is tricky, but also expected: this test manually controls the progress of the Kafka topics. A REFRESH EVERY materialized view reads from such a topic and gets stuck, because the topic never makes enough progress to reach the time of the first refresh. (Testdrive waiting for the rows to match up doesn't help, because the passage of wall-clock time does not advance the Kafka topic's progress.)

@ggevay
Contributor Author

ggevay commented Jan 4, 2024

@def-
I went through all of the results you flagged as "These look worrying". Thanks for flagging them!

Member

@antiguru antiguru left a comment

I think this is fine as-is. There are several places that could panic, but all of them seem to be triggerable only when something upstream fails, which isn't ideal, but is something we could revisit later.

I did not review adapter and testing changes.

@ggevay
Contributor Author

ggevay commented Jan 4, 2024

Thanks @antiguru!

I'm making some cosmetic changes, then I'll run Nightly and merge.

We'll address @def-'s coverage concerns in a follow-up PR that will add lots of new tests.

@ggevay ggevay merged commit e62c192 into MaterializeInc:main Jan 4, 2024
145 of 155 checks passed
@ggevay
Contributor Author

ggevay commented Jan 4, 2024

Merged! Thank you very much for all the discussions and reviews!

Labels: A-compute (Area: compute)