-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement predicate pruning for like
expressions (prefix matching)
#12978
Conversation
cc @alamb |
This is very clever -- I will review it tomorrow |
(false, true) => Operator::ILikeMatch, | ||
(true, true) => Operator::NotILikeMatch, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is dead code as if like_expr.case_insensitive() {
catches the case insensitive case
I think this code would be clearer if it just matched on like_expr.negated()
(or alternately returned unhandled_hook.handle(expr);
directly for these last 2 cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure will change it to do so. I think I was getting a bit ahead of myself to implement ILIKE
support, which as per the comment should be possible, maybe you can show me how to construct the physical expression to call lower()
and upper()
on another expression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @adriangb -- this is very cool.
I think there are a few more tests needed but otherwise the implementation looks very solid. Thank you
@@ -1610,6 +1625,93 @@ fn build_statistics_expr( | |||
Ok(statistics_expr) | |||
} | |||
|
|||
fn extract_string_literal(expr: &Arc<dyn PhysicalExpr>) -> Result<&String> { | |||
if let Some(lit) = expr.as_any().downcast_ref::<phys_expr::Literal>() { | |||
if let ScalarValue::Utf8(Some(s)) = lit.value() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should probably also handle the cases ScalarValue::LargeUtf8
, ScalarValue::Utff8View
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And Dictionary
!
if prefix.is_empty() { | ||
return plan_err!("Empty prefix in LIKE expression"); | ||
} | ||
Ok(Arc::new(phys_expr::Literal::new(ScalarValue::Utf8(Some( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we also have to test if there are other occurences of %
in the string 🤔 (like foo%bar%
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic is pretty simple (truncate at the first one) but I agree another test would be nice.
// column LIKE '%foo%' => min <= '' && '' <= max => true | ||
// column LIKE 'foo' => min <= 'foo' && 'foo' <= max | ||
|
||
// I *think* that ILIKE could be handled by making the min lowercase and max uppercase |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 I agree. Figuring out how to make those call would be the trick
fn build_like_match( | ||
expr_builder: &mut PruningExpressionBuilder, | ||
) -> Result<Arc<dyn PhysicalExpr>> { | ||
// column LIKE literal => (min, max) LIKE literal split at % => min <= split literal && split literal <= max |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the LIKE literal split at %
part
column LIKE literal
is the same as column = literal
if there are no wild cards, so you should be able to use the same rules as equality I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right that's the point, by splitting it at the first %
we are able to apply the same rules as equality:
column LIKE literal -> (min, max) LIKE (literal split at %) -> min <= split literal && split literal <= max
vs
column = literal -> (min, max) = literal -> min <= literal && literal <= max
let expected_ret = &[true, true, false, false, true, true]; | ||
|
||
prune_with_expr( | ||
// s1 LIKE 'A%' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the comments say s1 LIKE A%
but the code builds a different expressions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay I've tried to get all of these comments right, I think they can be removed tbh the expression is pretty self explanatory, but left them for now, let me know if you'd prefer that I remove them or if you want to keep them if there are any obviously wrong
let expected_ret = &[true, true, false, false, true, true]; | ||
|
||
prune_with_expr( | ||
// s1 LIKE 'A%' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also please add tests for other combinations:
s1 LIKE 'A'
s1 LIKE '%A%'
I think it is important the matching is doing the right thing
I also think it is important to to cover cases for NOT LIKE as well
let expected_ret = &[true, true, true, true, true, true]; | ||
|
||
prune_with_expr( | ||
// s1 LIKE 'A%' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// s1 LIKE 'A%' | |
// s1 LIKE '%A' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this one is still wrong
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I haven't pushed pending discussing the correctness of the general approach.
@alamb can these stats be truncated? I know stats in pages truncate large strings, e.g. if the min value is |
@adriangb in theory I think parquet statistics can be truncated.
I don't think we can use statistics for substring match -- we can only use statistics for equality and prefix matching so like The predicate would be transformed into |
Consider the values |
Okay @alamb I pushed a pretty big rework. Lots of new test cases, lots of comments explaining what's going on. I removed the I will note that I am a bit concerned about the interaction of truncated stats and how we apply these filters. Take the (absurd) case of stats that were truncated so that all you have is This is important because for the case of a max stat of |
Argh the only ClickBench query this maybe could improve is 23: WHERE "Title" LIKE '%Google%' AND "URL" NOT LIKE '%.google.%' Since the |
I am wondering if simple |
Good question. Based on performance of some queries I saw I'd say no, but it it's worth double checking. Any suggestions as to a definitive easy way to check? I guess I can run I don't see where |
I would take a look at the (physical plan) of queries involving like first to see if it still uses like or is transformed to another function. |
I made a big parquet file as follows: import random
import string
import polars as pl
df = pl.DataFrame({'col': ["A" + "".join(random.choices(string.ascii_letters, k=1_000)) for _ in range(1_000_000)]})
df.write_parquet('data.parquet', compression='uncompressed') This came out to ~1GB. I then uploaded it to a GCS bucket. I ran queries The explain plans reflect that as well:
The like query also has:
So it doesn't seem like it's being transformed into another expression. It probably would be smart to do so as a general optimization outside of pruning. I also think pruning should handle whatever that produces ( |
like
expressions
That certainly makes sense to me Perhaps we could implement some sort of simplification in https://github.com/apache/datafusion/blob/main/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs for Then we can implement the rules in pruning predicate for starts_with 🤔 |
At most we could simplify the Do you have any thoughts on my concerns for possibly truncated stats, in particular how |
I think it’s fine to support Of course, the pruning needs to be correct :). We could add (negative) test cases / add issues if the already implemented logic for |
Well there's no tests for hypothetical cases with truncated stats. All of the tests are against the stats themselves with no indication of how those are meant to correspond with the original data. There were no unit tests of Utf8 filtering at all as far as I can tell. The current implementation of |
I suggest we (I can help tomorrow):
|
Sounds good. All of that said, I think this PR is currently as correct as |
let (min_lit, max_lit) = if let Some(wildcard_index) = first_wildcard_index { | ||
let prefix = &s[..wildcard_index]; | ||
let prefix_min_lit = Arc::new(phys_expr::Literal::new(ScalarValue::Utf8(Some( | ||
format!("{prefix}\u{10ffff}"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 'highest character' should be appended to the max range, not the min.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When implementing similar thing for Trino (trinodb/trino@6aea881), i decided to stay within ASCII characters to avoid any potential issues due to misinterpretation of "difficult" code points.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm my intuition was that you want to add the highest character to the upper lower bound of the min value such that 'A%'
can match 'AB'
. Assuming a column with only 1 row "AB"
and the query 'AB' like 'A%'
:
'AB' <= 'A\u{10ffff}' and 'A' <= 'AB'
->t
'AB' <= 'A' and 'A\u{10ffff}' <= 'AB'
->f
Right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does it mean to stay within ASCII
? As far as I know everything here is Utf8 so I'm not sure how we can restrict it to ASCII?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if comment is related to the confusing naming in https://github.com/apache/datafusion/pull/12978/files#r1810513072?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does it mean to
stay within ASCII
? As far as I know everything here is Utf8 so I'm not sure how we can restrict it to ASCII?
We can if only we want to. The code will do whatever we ask it to do.
Then the question is whether we want to. If we apply the predicate locally in memory only, then no need to be cautious, no need for "stay within ASCII". If we later interop with other systems (eg send a plan somewhere or TableProvider calls remote system), then it might be beneficial to restrict ourselves.
Hmm my intuition was that you want to add the highest character to the upper lower bound
i agree with this
... of the min value
min value of what?
all column values need to be in the range [like_constant_prefix, like_constant_prefix[0..-1] + \u10ffff)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... of the min value
min value of what?
i get it now
So for a column we have stats: min value and max value. Let's call them col_min
and col_max
.
For like AB%
we derive lower and upper bound (AB
and AB\u10ffff
which is actually incorrect, will comment about this elsewhere).
For pruning we need to check whether [col_min, col_max] ∩ [lower_bound, upper_bound)
is non-empty (note the upper_bound will be non-inclusive)
It's empty when upper_bound <= col_min OR col_max < lower_bound
It's non-empty when upper_bound > col_min AND col_max >= lower_bound
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For correct upper bound and why exclusive see #12978 (comment)
// Otherwise 'AB' <= 'A' AND 'A' <= 'AB' would be *wrong* because 'AB' LIKE 'A%' is should be true! | ||
// Credit to https://stackoverflow.com/a/35881551 for inspiration on this approach. | ||
// ANSI SQL specifies two wildcards: % and _. % matches zero or more characters, _ matches exactly one character. | ||
let first_wildcard_index = s.find(['%', '_']); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we do not support escape characters, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
> SELECT '' LIKE '' ESCAPE '%';
Execution error: LIKE does not support escape_char
So I think not.
But just for reference, how would you suggest escape characters be handled? I've never used them in practice (I think at that point I'd just go for a regex).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But just for reference, how would you suggest escape characters be handled?
simple state automata
(eg https://github.com/trinodb/trino/blob/8f279a84bfe7f6f8301a44d26b01c981fc2156f9/core/trino-main/src/main/java/io/trino/type/LikeFunctions.java#L87-L102)
// **IMPORTANT** we need to make sure that the min and max are in the range of the prefix | ||
// If we truncate 'A%' to 'A', we need to make sure that 'A' is less than 'AB' so that | ||
// when we make this a range query we get 'AB' <= 'A\u{10ffff}' AND 'A' <= 'AB'. | ||
// Otherwise 'AB' <= 'A' AND 'A' <= 'AB' would be *wrong* because 'AB' LIKE 'A%' is should be true! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This important, but a bit difficult to follow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm open to clarifications on the wording. The point I'm trying to make is why we have to append characters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd remove this whole comment. If someone understand how LIKE works, they will get the code without comment.
If someone doesn't understand how LIKE works, they won't understand the code even with the comment.
let min_expr = Arc::new(phys_expr::BinaryExpr::new( | ||
min_column_expr.clone(), | ||
Operator::LtEq, | ||
min_lit, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
min_lit
actually represents max value (upper bound)
would you consider swapping the naming of the variables?
Also, the upper bound has added "Z" ('highest codepoint') at the end, so can be compared with Lt without Eq part
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
min_lit actually represents max value (upper bound)
would you consider swapping the naming of the variables?
I'm open to any suggestions on naming but I do think it is confusing because min_lit
is the upper bound on col_min
and max_lit
is the lower bound on col_max
🤯
Also, the upper bound has added "Z" ('highest codepoint') at the end, so can be compared with Lt without Eq part
👍🏻
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is confusing because
min_lit
is the upper bound oncol_min
andmax_lit
is the lower bound oncol_max
🤯
Yes, it is.
i understand now that you're fitting this into terminology of existing code.
i am not sure what the right naming would be. maybe neutral: lower_bound and upper_bound?
let prefix_lit = | ||
Arc::new(phys_expr::Literal::new(ScalarValue::Utf8(Some(s.clone())))); | ||
(prefix_lit.clone(), prefix_lit) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In such case we should produce single Eq check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what you mean by a single eq check. We are basically saying col like 'constant'
-> col = 'constant'
-> col_min <= 'constant' and 'constant' <= col_max
let s = extract_string_literal(scalar_expr)?; | ||
// **IMPORTANT** we need to make sure that the min and max are in the range of the prefix | ||
// If we truncate 'A%' to 'A', we need to make sure that 'A' is less than 'AB' so that | ||
// when we make this a range query we get 'AB' <= 'A\u{10ffff}' AND 'A' <= 'AB'. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for A%
pattern the lower bound is A
(obvious)
what should be the upper bound?
A\u{10ffff}
is not a correct upper bound since A\u{10ffff}\u{10ffff}
is even bigger but still matches A%
input.
The correct upper bound would be:
A\u{10ffff}\u{10ffff}\u{10ffff}...\u{10ffff}
include -- up to max length of the column, so potentially very very long, so absolutely not practicalB
(exclusive).
Thus to calculate upper bound you need (pseudo-code)
let s = extract_string_literal(scalar_expr)?;
let first_wildcard_index = ...;
let prefix = &s[..wildcard_index];
let last_incrementable_character = /* find last code point of `prefix` that can be incremented
if we choose to stay within ascii, this will be a code point < 127
otherwise it will be any code point != the max code point (0x10FFFF) */;
if last_incrementable_character not found {
// For `%`, or `\u{10ffff}...\u{10ffff}%` patterns, we cannot calculate an upper bound
return None
}
let upper_bound =
prefix[..last_incrementable_character-1] + // take prefix of the prefix up to and excluding the last character that can be incremented
str(prefix[last_incrementable_character] + 1) // take last character and increment it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused about this explanation. Maybe you can provide a failing test case that would help me understand? There is already a test case for 'A%'
and it is as far as I can tell doing the correct thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try add a test case for 'A%'
where stats are min=AB
max=A\u{10ffff}\u{10ffff}\u{10ffff}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion. I added a test case in e29ed50. It worked as expected, let me know if I got the expected outcomes wrong or missed something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #12978 (comment) on how to make the test expose the problem
// If we truncate 'A%' to 'A', we need to make sure that 'A' is less than 'AB' so that | ||
// when we make this a range query we get 'AB' <= 'A\u{10ffff}' AND 'A' <= 'AB'. | ||
// Otherwise 'AB' <= 'A' AND 'A' <= 'AB' would be *wrong* because 'AB' LIKE 'A%' is should be true! | ||
// Credit to https://stackoverflow.com/a/35881551 for inspiration on this approach. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This link isn't useful. That page conveniently avoids any details that are important. Please remove the link.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have a better resource or name for this transformation? I'm happy to point at any documentation or prior art in Trino.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am just not finding this link useful being being inspirational. But source of inspiration doesn't need to be reflected in the code comment 🙂
I am not asking for replacing if any other link.
I personally find trinodb/trino@6aea881 valuable because (1) I know this code as its author and (2) it might actually be correct. Once we get the code here correct, that link wouldn't be useful either.
a0a1c37
to
ae3426d
Compare
@alamb I re-arranged some of the comments on assertions in ae3426d which I feel like helped a lot with readability of the tests. There's a couple other tests with a similar pattern that I think could benefit. I was also thinking about doing some more black-box testing: I think given any |
Hi @alamb I took a stab at fuzz tests in f6c5314. |
Thanks @adriangb -- I have put this PR on my review queue and will attempt to re-review it carefully tomorrow |
Clearly I failed to review this -- I will do so hopefully later today but may be tomorrow |
This is still on my list, hopefully other people can check it out too |
This is my top priority after DF 44 is released: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @adriangb -- I think this PR is ready to go
One thing I noticed is that the fuzz test takes over a minute on my machine:
SLOW [> 60.000s] datafusion::fuzz fuzz_cases::pruning::test_fuzz_utf8
PASS [ 65.772s] datafusion::fuzz fuzz_cases::pruning::test_fuzz_utf8
------------
Summary [ 72.749s] 47 tests run: 47 passed (1 slow), 0 skipped
andrewlamb@Mac:~/Software/datafusion$
Is there some way to make it faster? Maybe with multiple threads or crank down the number of things to teset?
"~", | ||
"ß", | ||
"℣", | ||
"%", // this one is useful for like/not like tests since it will result in randomly inserted wildcards |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
/// of "fo" that may have originally been "foz" or anything else with the prefix "fo". | ||
/// E.g. `increment_utf8("foo") >= "foo"` and `increment_utf8("foo") >= "fooz"` | ||
/// In this example `increment_utf8("foo") == "fop" | ||
fn increment_utf8(data: &str) -> Option<String> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it be ok to potentially replace this with the implementation from @etseidl in apache/arrow-rs#6870 ?
If so, I can file a ticket to do so as a follow on
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't reviewed that implementation but yes I think we should consider it!
Co-authored-by: Andrew Lamb <[email protected]>
Yeah this is what I was hinting at in #12978 (comment). I'm happy to throw threads at it for a start, and restricting the search space might be necessary but I think requires a more careful eye to minimize how much valuable testing is discarded. The other thing that I think we can do is speed up the tests themselves, in particular minimizing unnecessary round trips to Parquet, but I'm not sure where the right places to hook in would be that still give us a realistic test but remove the need to re-parse the same data over and over again. |
Awesome -- I'll try and find time later today or tomorrow to give it a critical eye. Otherwise I'll plan to merge this PR later today or tomorrow as well. |
|
Thank you again @adriangb for bearing with us -- I know this took a long time However, I am pretty stoked that we now have this optimization and it is an example of the very careful engineering required for this kind of optimization. The fact we are at this point in DataFusion is pretty sweet in my mind 🚀 |
The idea is that we can push certain
like
expressions down into statistics pruning.For example, a filter like
url LIKE 'https://www.google.com%'
can be (basically, some caveats and other tricks) used to filterurl_min <= 'https://www.google.com' and 'https://www.google.com' <= url_max
such that a row group that only hashttps://www.example.com
would be excluded.Closes #507
Closes #13253