Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add =~ operator for where #5607

Open
chrismo opened this issue Jan 27, 2025 · 2 comments
Open

Add =~ operator for where #5607

chrismo opened this issue Jan 27, 2025 · 2 comments

Comments

@chrismo
Copy link

chrismo commented Jan 27, 2025

In the Slack, I asked, is there a way to use regex with where? search /foobar/ works - is there a where equivalent? Steve mentioned

$ echo '"foo" "bar" "baz"' | super -c 'where grep(/f|z/)' -
"foo"
"baz"

What about adding a =~ operator so where =~ /(f|z)/ could work too?

@philrz
Copy link
Contributor

philrz commented Jan 27, 2025

@chrismo: Thanks for the suggestion. I'll discuss it with the Dev team, but here's some initial thoughts.

Was its use in bash what had you thinking of =~ specifically? If you use other tools that use the operator similarly it'd be good to know just so we know which kinds of users it would resonate with. Looking at bash specifically, I suspect that if we added this it might look a little different from what you proposed. To use that [[ $line =~ [[:space:]]*(a)?b ]] example from the bash docs as a starting point, it's effectively a test-style expression, so, test is playing the same role as where does in the Zed/SuperSQL language. Then there's the explicit reference to what's being searched, which is the contents of the line variable in that case. So if we followed that pattern, I expect it would look like where line =~ /(f|z)/, so, searching for the regexp (f|z) inside the contents of a field named line, or if the goal were to search within every value in the input stream then it'd be where this =~ /(f|z)/. I point all this out because your proposed where =~ /(f|z)/ doesn't feel quite right because there's nothing on the left side of the operator. We often have implicit this in other spots in the language (e.g., the second argument of grep), but optional arguments seem a bit more common in function calls than with equality-style operators. So my examples effectively become shorthand for where grep(/(f|z)/, line) and where grep(/(f|z)/), respectively. Do you have any reactions to this?

Given that the direction of the project is toward SQL compliance with helpful extensions and shorthand, it's also worthwhile to survey what the various SQL solutions have here. For example, DuckDB's regexp support looks roughly similar to what we've got today in that it's function centric, and there's actually some direct overlap such as how we both have regexp_replace() (though ours has some compatibility differences vs. theirs and Postgres, per #5564), and their regexp_matches() looks similar to what we're doing with grep() (but once again with some differences in specifics). I see Postgres and DuckDB both have the SIMILAR TO which is described as similar to LIKE but with regexp support, so if/when we add support for that (as we very likely will, since one of our goals is to eventually be a drop-in alternative to Postgres) it seems that could be a place to also offer a shorthand like =~. Postgres also has operators for POSIX Regular Expressions like ~ ("String matches regular expression, case sensitively") and ~* ("String matches regular expression, case-insensitively") which sadly seems to add to the confusion since it also uses the ~ but in a different way than in bash.

In conclusion, there's definitely still work for us to do with regexp in general to be more in line with what users would expect coming from another SQL solution like Postgres, and as part of that effort it seems TBD which shorthands we'll adopt first and if we'd go even further by also supporting our own shorthands like =~ as a courtesy to users coming from tools like bash. Happy to hear additional thoughts from you or anyone else watching this issue so we have that to think about while discussing our plans to address.

@chrismo
Copy link
Author

chrismo commented Jan 27, 2025

Was its use in bash what had you thinking of =~ specifically?

Ruby uses it, for one - not sure about other languages.

if the goal were to search within every value in the input stream then it'd be where this =~ /(f|z)/

Yeah, sorry - I didn't think about this ... that makes more sense. Really just the focus is on =~ in place of == to enable using an inlined regex.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants