sql: Change Ident to enforce a max length #23082

Merged
merged 3 commits into from
Nov 15, 2023

Conversation

@ParkMyCar ParkMyCar commented Nov 9, 2023

This PR adds a max length to the Ident struct. In #20999 we added the concept of a "max_identifier_length" which was enforced when parsing SQL, but it was not enforced for identifiers created internally, e.g. when generating a name for a progress subsource by appending "_progress" to the source name.

The largest changes in this PR are the following APIs:

  • Ident::new(...) will return an Err(IdentError) if the provided string is longer than our max, which is 255 bytes.
  • ident!("some static str") macro, which enforces our invariants at compile time. This prevents a pattern like Ident::new("some static str").expect("known correct") from polluting the code base.
  • Ident::new_unchecked(...) checks the invariants with a soft_assert!. This pattern is an escape hatch for cases we know are valid, but can't express in the type system, e.g. appending a number to a static string.

Wherever possible I used Ident::new(...) or ident!(...), but there were some callsites that didn't already return an error, and didn't have a &'static str as an argument. For those cases I used Ident::new_unchecked(...) to prevent this PR from getting too large. At the very least by using new_unchecked(...) our tests will catch invalid Idents via soft asserts.
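
For readers who want a concrete picture of the three entry points described above, here is a minimal sketch. It is not the code from this PR: the `IdentError::TooLong` variant, the `as_str` accessor, and the macro body are assumptions made for illustration, and `debug_assert!` stands in for `mz_ore::soft_assert!`.

```rust
/// Maximum identifier length in bytes (255, per the PR description).
const MAX_IDENTIFIER_LENGTH: usize = 255;

/// Hypothetical error type; the real `IdentError` may carry different variants.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum IdentError {
    TooLong(String),
}

/// Newtype around `String` whose byte length never exceeds `MAX_LENGTH`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Ident(String);

impl Ident {
    pub const MAX_LENGTH: usize = MAX_IDENTIFIER_LENGTH;

    /// Fallible constructor: rejects strings longer than `MAX_LENGTH` bytes.
    pub fn new<S: Into<String>>(s: S) -> Result<Self, IdentError> {
        let s = s.into();
        if s.len() > Self::MAX_LENGTH {
            return Err(IdentError::TooLong(s));
        }
        Ok(Ident(s))
    }

    /// Escape hatch for callers that know the input fits.
    /// `debug_assert!` is an assumed stand-in for `mz_ore::soft_assert!`.
    pub fn new_unchecked<S: Into<String>>(s: S) -> Self {
        let s = s.into();
        debug_assert!(s.len() <= Self::MAX_LENGTH, "identifier too long");
        Ident(s)
    }

    /// Illustrative accessor, not taken from the PR.
    pub fn as_str(&self) -> &str {
        &self.0
    }
}

/// Sketch of an `ident!`-style macro: the length of a string literal is
/// checked in a `const` context, so an oversized literal fails to compile.
macro_rules! ident {
    ($s:literal) => {{
        const _: () = assert!(
            $s.len() <= Ident::MAX_LENGTH,
            "identifier literal too long"
        );
        Ident::new_unchecked($s)
    }};
}
```

With this shape, `Ident::new(name)?` fits call sites that already return an error, `ident!("mz_progress")` covers static strings with no runtime check to unwrap, and `new_unchecked` is left for the remaining cases.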

Motivation

Fixes https://github.com/MaterializeInc/database-issues/issues/6813

This PR also aims to eliminate these kinds of bugs entirely. We chatted about this on Slack and it seemed desirable.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:
    • Internal only change

@ParkMyCar ParkMyCar force-pushed the ident/enforce-max-len branch 5 times, most recently from 9853c76 to eb815c4 Compare November 13, 2023 22:08
@ParkMyCar ParkMyCar marked this pull request as ready for review November 13, 2023 22:22
@ParkMyCar ParkMyCar requested a review from a team November 13, 2023 22:22
@ParkMyCar ParkMyCar requested a review from a team as a code owner November 13, 2023 22:22
@ParkMyCar ParkMyCar requested review from a team and jkosh44 November 13, 2023 22:22

shepherdlybot bot commented Nov 13, 2023

This PR has higher risk. Make sure to carefully review the file hotspots. In addition to having a knowledgeable reviewer, it may be useful to add observability and/or a feature flag.

Risk Score     Probability Buggy    File Hotspots
🔴 81 / 100    61%                  13

Buggy File Hotspots:

File                     Percentile
../plan/error.rs         90
../src/parser.rs         95
../src/util.rs           97
../src/rbac.rs           95
../session/vars.rs       99
../src/names.rs          90
../src/catalog.rs        95
../ast/transform.rs      92
../plan/statement.rs     92
../src/protocol.rs       93
../statement/dml.rs      93
../src/pure.rs           94
../statement/ddl.rs      98

Contributor

@jkosh44 jkosh44 left a comment

This is awesome! Most of my comments are nits, but I had a couple of questions/comments.

Comment on lines +192 to +198
name: Some(Ident::new_unchecked(index_name)),
on_name: RawItemName::Name(mz_sql::normalize::unresolve(view_name)),
in_cluster: Some(RawClusterName::Resolved(cluster_id.to_string())),
key_parts: Some(
keys.iter()
.map(|i| match view_desc.get_unambiguous_name(*i) {
- Some(n) => Expr::Identifier(vec![Ident::new(n.to_string())]),
+ Some(n) => Expr::Identifier(vec![Ident::new_unchecked(n.to_string())]),

Contributor

We should probably create a follow-up GH issue to try and remove these types of new_unchecked uses. It's not obvious to me that these are guaranteed to fit.

Member Author

@ParkMyCar ParkMyCar Nov 14, 2023

Filed https://github.com/MaterializeInc/materialize/issues/23191 to track this. It's not obvious to me either, but with the soft_assert! in place, I feel much more confident that we'll at least catch these issues earlier.

@@ -339,6 +339,13 @@ impl RustType<ProtoColumnName> for ColumnName {
}
}

impl From<ColumnName> for mz_sql_parser::ast::Ident {
fn from(value: ColumnName) -> Self {
// Note: ColumnNames are known to be less than the max length of an Ident (I think?).

Contributor

Looking through the code base, I see 80 instances of initializing a ColumnName. I have no idea if we check in all those places that the name is less than the max allowed length. We probably want to do something similar for ColumnName, where we add a constructor that checks the length. We should probably add a follow-up GH issue.

Member Author

Filed https://github.com/MaterializeInc/materialize/issues/23192 to track this. I agree, no idea if we check all, or even any, ColumnNames to make sure they're under our max length. I figured starting with Ident was a good place though

/// Newtype wrapper around [`String`] whose _byte_ length is guaranteed to be less than or equal to
/// [`MAX_IDENTIFIER_LENGTH`].
#[derive(Debug, Clone, PartialEq)]

Contributor

Did you mean to put this empty newline in?

Member Author

Whoops, removed!

Comment on lines 104 to 108
if s.len() > MAX_IDENTIFIER_LENGTH {
return Err(s);
}

Ok(SmallString(s))

Contributor

What's the reason for doing this check in both the Lexer and Parser?

Member Author

The lexer is what creates Tokens, so it's cleanest to do the check there for SQL parsing. Then I faced a dilemma: do we move the Ident struct one layer lower into mz_sql_lexer, or create this second type?

We could probably move it one layer lower, and then re-export it from the sql-parser crate, but this felt a little cleaner? Not sure, what do you think?

src/sql-parser/src/ast/defs/name.rs (outdated)
Comment on lines +155 to +158
candidate.append_lossy(suffix.clone());
if is_valid(&candidate)? {
return Ok(candidate);
}

Contributor

The doc-comments don't mention that we might truncate the prefix. We should probably add that.

Member Author

Added!

const MAX_SUFFIX_LENGTH: usize = Ident::MAX_LENGTH - 8;

let mut suffix: String = suffix.into();
mz_ore::soft_assert!(suffix.len() <= MAX_SUFFIX_LENGTH);

Contributor

You may want to add this requirement to the doc-comments.

Member Author

Added!

}))
}

/// Append the provided `suffix`, truncating `self` as necessary to satisfy our invariants.

Contributor

Maybe add that suffix can be truncated too.

Member Author

Added a doc test to show this

.chars()
.take_while(|c| {
byte_length += c.len_utf8();
byte_length < Self::MAX_LENGTH

Contributor

Should this be MAX_SUFFIX_LENGTH?

Member Author

Good catch, updated!
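
The truncation discussed in this thread and the previous two can be sketched as a pair of free functions. This is an illustrative reconstruction, not the PR's implementation: `truncate_to_bytes` is a hypothetical helper, and `append_lossy` is written here over plain `&str` values rather than as a method on `Ident`. The constants mirror the ones quoted in the diff.

```rust
/// Maximum identifier length in bytes.
const MAX_LENGTH: usize = 255;
/// Suffix budget, mirroring `Ident::MAX_LENGTH - 8` from the diff.
const MAX_SUFFIX_LENGTH: usize = MAX_LENGTH - 8;

/// Hypothetical helper: the longest prefix of `s` that is at most
/// `max_bytes` bytes long and ends on a UTF-8 character boundary.
fn truncate_to_bytes(s: &str, max_bytes: usize) -> &str {
    let mut end = 0;
    for (idx, c) in s.char_indices() {
        let next = idx + c.len_utf8();
        if next > max_bytes {
            break;
        }
        end = next;
    }
    &s[..end]
}

/// Appends `suffix` to `base`, truncating the suffix first (to
/// `MAX_SUFFIX_LENGTH`) and then the base, so that the result never
/// exceeds `MAX_LENGTH` bytes.
pub fn append_lossy(base: &str, suffix: &str) -> String {
    let suffix = truncate_to_bytes(suffix, MAX_SUFFIX_LENGTH);
    let base = truncate_to_bytes(base, MAX_LENGTH - suffix.len());
    let mut out = String::with_capacity(base.len() + suffix.len());
    out.push_str(base);
    out.push_str(suffix);
    out
}
```

Counting bytes per character, rather than slicing at a fixed byte offset, keeps the result valid UTF-8 when the name contains multi-byte characters. Capping the suffix below the full maximum leaves at least 8 bytes for the base, so the original name is never erased completely.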

Comment on lines 262 to 263
// Note: using unchecked here is okay because SmallString is known to be less than or equal
// to our max length.

Contributor

SmallString and Ident use different constants for their max lengths. They happen to be equal now, but they might diverge in the future by accident. Would it be possible to use the same constant?

Member Author

I updated SmallString to be MaxLenString, which takes a maximum length as a generic constant. I then updated the From<...> impl on Ident so we only implement it for MaxLenStrings with a max len of 255. This should prevent skew in the future.

Now that I think about it, we could probably push down all of these methods onto MaxLenString, and then Ident becomes a type wrapper around it. If you don't mind I'll probably save this as a followup though?
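
As a rough illustration of the design described in this reply, the sketch below shows a string type parameterized by its maximum length, with the conversion into `Ident` implemented only for the matching length. The field layout, `into_inner`, and `as_str` are assumptions; only the names `MaxLenString` and `Ident`, the `Err(s)` return on overflow, and the 255-byte limit come from the thread.

```rust
/// Sketch of a string whose byte length is bounded by the const generic `MAX`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct MaxLenString<const MAX: usize>(String);

impl<const MAX: usize> MaxLenString<MAX> {
    /// Hands the original string back as the error when it is too long,
    /// matching the `Err(s)` shape shown in the lexer diff.
    pub fn new(s: String) -> Result<Self, String> {
        if s.len() > MAX {
            return Err(s);
        }
        Ok(MaxLenString(s))
    }

    /// Illustrative accessor, not taken from the PR.
    pub fn into_inner(self) -> String {
        self.0
    }
}

/// Stand-in for the parser's `Ident`, reduced to what this sketch needs.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Ident(String);

impl Ident {
    pub const MAX_LENGTH: usize = 255;

    /// Illustrative accessor, not taken from the PR.
    pub fn as_str(&self) -> &str {
        &self.0
    }
}

/// The conversion exists only for `MaxLenString<{ Ident::MAX_LENGTH }>`, so a
/// change to either limit becomes a compile error instead of silent skew.
impl From<MaxLenString<{ Ident::MAX_LENGTH }>> for Ident {
    fn from(s: MaxLenString<{ Ident::MAX_LENGTH }>) -> Self {
        Ident(s.into_inner())
    }
}
```

Because the limit is part of the type, a `MaxLenString<128>` cannot be converted into an `Ident` by accident; the compiler rejects it for lack of a matching `From` impl.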

@ParkMyCar ParkMyCar force-pushed the ident/enforce-max-len branch from eb815c4 to 5d34654 Compare November 14, 2023 16:18
@ParkMyCar ParkMyCar requested a review from benesch as a code owner November 14, 2023 16:18
@ParkMyCar ParkMyCar force-pushed the ident/enforce-max-len branch from 5d34654 to 192f6cc Compare November 14, 2023 16:26
Contributor

@jkosh44 jkosh44 left a comment

Nice!

@ParkMyCar ParkMyCar force-pushed the ident/enforce-max-len branch from 192f6cc to fd735b9 Compare November 15, 2023 14:15
@ParkMyCar ParkMyCar enabled auto-merge (squash) November 15, 2023 14:15
@ParkMyCar ParkMyCar merged commit 1e51779 into MaterializeInc:main Nov 15, 2023
@ParkMyCar ParkMyCar deleted the ident/enforce-max-len branch June 17, 2024 20:37