-
Notifications
You must be signed in to change notification settings - Fork 465
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: Change Ident
to enforce a max length
#23082
sql: Change Ident
to enforce a max length
#23082
Conversation
9853c76
to
eb815c4
Compare
This PR has higher risk. Make sure to carefully review the file hotspots. In addition to having a knowledgeable reviewer, it may be useful to add observability and/or a feature flag. What's This?
Buggy File Hotspots:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome! Most of my comments are nits, but I had a couple of questions/comments.
name: Some(Ident::new_unchecked(index_name)), | ||
on_name: RawItemName::Name(mz_sql::normalize::unresolve(view_name)), | ||
in_cluster: Some(RawClusterName::Resolved(cluster_id.to_string())), | ||
key_parts: Some( | ||
keys.iter() | ||
.map(|i| match view_desc.get_unambiguous_name(*i) { | ||
Some(n) => Expr::Identifier(vec![Ident::new(n.to_string())]), | ||
Some(n) => Expr::Identifier(vec![Ident::new_unchecked(n.to_string())]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably create a follow-up GH issue to try and remove these types of new_unchecked
uses. It's not obvious to me that these are guaranteed to fit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed https://github.com/MaterializeInc/materialize/issues/23191 to track this. It's not obvious to me either, but with the soft_assert!
in place, I feel much more confident that we'll atleast catch these issues earlier
@@ -339,6 +339,13 @@ impl RustType<ProtoColumnName> for ColumnName { | |||
} | |||
} | |||
|
|||
impl From<ColumnName> for mz_sql_parser::ast::Ident { | |||
fn from(value: ColumnName) -> Self { | |||
// Note: ColumnNames are known to be less than the max length of an Ident (I think?). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking through the code base, I see 80 instances of initializing a ColumnName
I have no idea if we check in all those places that the name is less than the max allowed characters. We probably want to do something similar to ColumnName
where we add a constructor that checks the length. We should probably add a followup GH issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed https://github.com/MaterializeInc/materialize/issues/23192 to track this. I agree, no idea if we check all, or even any, ColumnName
s to make sure they're under our max length. I figured starting with Ident
was a good place though
src/sql-lexer/src/lexer.rs
Outdated
/// Newtype wrapper around [`String`] whose _byte_ length is guaranteed to be less than or equal to | ||
/// [`MAX_IDENTIFIER_LENGTH`]. | ||
#[derive(Debug, Clone, PartialEq)] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean to put this empty newline in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, removed!
src/sql-lexer/src/lexer.rs
Outdated
if s.len() > MAX_IDENTIFIER_LENGTH { | ||
return Err(s); | ||
} | ||
|
||
Ok(SmallString(s)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reason for doing this check in both the Lexer and Parser?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lexer is what creates Token
s, so it's cleanest to do the check there for SQL parsing. Then I was met with a dilemma about do we move the Ident
struct one layer lower into mz_sql_lexer
or create this second type?
We could probably move it one layer lower, and then re-export it from the sql-parser
crate, but this felt a little cleaner? Not sure, what do you think?
candidate.append_lossy(suffix.clone()); | ||
if is_valid(&candidate)? { | ||
return Ok(candidate); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The doc-comments don't mention that we might truncate the prefix. We should probably add that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added!
src/sql-parser/src/ast/defs/name.rs
Outdated
const MAX_SUFFIX_LENGTH: usize = Ident::MAX_LENGTH - 8; | ||
|
||
let mut suffix: String = suffix.into(); | ||
mz_ore::soft_assert!(suffix.len() <= MAX_SUFFIX_LENGTH); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may want to add this requirement to the doc-comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added!
})) | ||
} | ||
|
||
/// Append the provided `suffix`, truncating `self` as necessary to satisfy our invariants. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add that suffix
can be truncated too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a doc test to show this
src/sql-parser/src/ast/defs/name.rs
Outdated
.chars() | ||
.take_while(|c| { | ||
byte_length += c.len_utf8(); | ||
byte_length < Self::MAX_LENGTH |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be MAX_SUFFIX_LENGTH
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, updated!
src/sql-parser/src/ast/defs/name.rs
Outdated
// Note: using unchecked here is okay because SmallString is known to be less than or equal | ||
// to our max length. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SmallString
and Ident
use different constant for their max lengths. They happen to be equal now, but they might diverge in the future by accident. Would it be possible to use the same constant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated SmallString
to be MaxLenString
, which takes a maximum length as a generic constant. Then updated the From<...>
impl on Ident
so we only implement it for MaxLenString
s with a max len of 255, this should prevent skew in the future.
Now that I think about it, we could probably push down all of these methods onto MaxLenString
, and then Ident
becomes a type wrapper around it. If you don't mind I'll probably save this as a followup though?
eb815c4
to
5d34654
Compare
5d34654
to
192f6cc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
192f6cc
to
fd735b9
Compare
This PR adds a max length to the
Ident
struct. In #20999 we added the concept of a "max_identifier_length" which was enforced when parsing SQL, but it was not enforced for identifiers created internally, e.g. when generating a name for a progress subsource by appending "_progress" to the source name.The largest changes in this PR are the following APIs:
Ident::new(...)
will return anErr(IdentError)
if the provided string is longer than our max, which is 255 bytes.ident!("some static str")
macro, which enforces our invariants at compile time. This prevents a pattern likeIdent::new("some static str").expect("known correct")
from polluting the code base.Ident::new_unchecked(...)
checks the invariants with asoft_assert!
. This pattern is an escape hatch for cases we know are valid, but can't express in the type system, e.g. appending a number to a static string.Wherever possible I used
Ident::new(...)
orident!(...)
, but there were some callsites that didn't already return an error, and didn't have a&'static str
as an argument. For those cases I usedIdent::new_unchecked(...)
to prevent this PR from getting too large. At the very least by usingnew_unchecked(...)
our tests will catch invalidIdent
s via soft asserts.Motivation
Fixes https://github.com/MaterializeInc/database-issues/issues/6813
But also has the goal of eliminating these kinds of bugs entirely. Chatted about this on Slack and it seemed desirable.
Checklist
$T ⇔ Proto$T
mapping (possibly in a backwards-incompatible way), then it is tagged with aT-proto
label.