Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Spanned to retrieve source locations on AST nodes #1435

Merged
merged 38 commits into from
Nov 26, 2024

Conversation

Nyrox
Copy link
Contributor

@Nyrox Nyrox commented Sep 19, 2024

This PR adds a new trait Spanned to retrieve source spans on AST nodes, see #161 , by recursively traversing the AST and combining spans. This approach is in contrast to the one taken in #839 and #790 by trying to minimise the amount of breaking changes for downstream users, by avoiding wrapping everything in WithSpan<T>.

Main Changes

  • Spanned trait and Span type added
  • TokenWithLocation now stores Span, instead of Location
  • Ident now stores a source span (omitted from Partial/-EQ for compatibility and tests)
  • Some AST nodes store source tokens where not intrusive
  • Parser::parse_keyword_token added

The general philosophy of the PR is to be "good enough" without breaking things. As a result certain expressions will have broken or incorrect spans. I imagine these can be cleaned up in future PRs (which might require breaking changes).

f.e. many expressions do not include keywords in their span i.e.

<expr> IS NOT NULL
|....|

will have it's source span simply reported as <expr>::span and there is many such cases, some of which are easier to fix than others. For expressions we can't generate spans for, I use Span::EMPTY as a sort of sentinel value to indicate missing information.

With this approach the only downstream changes a user should have to do to upgrade, should be adding additional fields when matching on AST nodes.

Future Work

  • Store spans for ast::value::Value. This seems like a breaking change to me, which would require a WithSpan<T> like type
  • Store keyword TokenWithLocation for expressions that currently don't have them
  • Implement spans for the rest of the AST, namely Statements. I focused on getting Queries done.

@Nyrox Nyrox changed the title Implement Spanned to retrieve sourcec locations on AST nodes Implement Spanned to retrieve source locations on AST nodes Sep 19, 2024
@Nyrox
Copy link
Contributor Author

Nyrox commented Oct 2, 2024

@alamb Hey, we've started using this functionality internally with pretty great success so far. It's still a draft for now, because of the missing todo!'s, but I would appreciate feedback on the overall design and if you think this can get merged in the foreseeable future once issues are addressed. 😄

@alamb
Copy link
Contributor

alamb commented Oct 3, 2024

Thanks @Nyrox -- I'll try and take a look over the next few days

cc @lovasoa @iffyio and @jmhain

src/parser/mod.rs Outdated Show resolved Hide resolved
@yuyang-ok
Copy link
Contributor

How is this PR going? is it going to merge or what?

@Nyrox Nyrox marked this pull request as ready for review October 8, 2024 10:50
@Nyrox
Copy link
Contributor Author

Nyrox commented Oct 8, 2024

Alright, I have now gotten rid of all the todo!s and warnings and un-drafted the PR. There is a lot of missing implementations of spans and I have documented those (in hindsight maybe a derive macro would have been the way to go here, someone better at writing those than me is free to take a shot at it 😅 ).

Copy link
Contributor

@iffyio iffyio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tackling this @Nyrox! took a quick look and left some comments inline, I'll make some time to do another pass

src/ast/mod.rs Show resolved Hide resolved
src/ast/spans.rs Outdated Show resolved Hide resolved
src/ast/spans.rs Outdated Show resolved Hide resolved
src/ast/spans.rs Outdated Show resolved Hide resolved
src/ast/spans.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@iffyio iffyio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Nyrox! The changes look reasonable to me overall given the discussion in the GH issue. Left some comments, one mostly wondering around the equality behavior now that the token location is embedded within the AST

src/ast/mod.rs Outdated Show resolved Hide resolved
src/ast/spans.rs Outdated Show resolved Hide resolved
src/tokenizer.rs Show resolved Hide resolved
tests/sqlparser_common.rs Outdated Show resolved Hide resolved
src/tokenizer.rs Outdated Show resolved Hide resolved
src/ast/spans.rs Outdated Show resolved Hide resolved
src/ast/spans.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much Thank you so much @Nyrox and @iffyio and @lovasoa -- this is epic work.

Also, thank you to @lustefaniak @yuyang-ok for your comments

Many people have tried this feature but non have prevailed. 👏 . If we are ever colocated I totally owe you an in person meet up / 🍻 with a beverage of your choice

In terms of next steps:

  1. I will file a ticket with the current state of the project / spans which can hopefully let us spread out the work for adding span information to the rest of parse tree over time.
  2. I think it would be good to consider changing the offsets in Location to be u32 rather than usize which would reduce the memory requirements significantly I thin.
  3. I have a few ideas to improve the documentation, but I will propose some follow on PRs to do that.

BTW I tried updating DataFusion to use this change here apache/datafusion#13546 and it went quite smoothly

Let's leave this PR open for a few more days to get any more feedback and then plan to merge it in 🚀

@alamb
Copy link
Contributor

alamb commented Nov 25, 2024

I started organizing follow on work here:

I also have been going through the code and adding docs / examples. So far I am quite pleased

@alamb
Copy link
Contributor

alamb commented Nov 25, 2024

I plan to merge this PR tomorrow unless there are any other comments

@Dandandan
Copy link
Contributor

Did we run some benchmarks (e.g. cargo bench)?

@alamb
Copy link
Contributor

alamb commented Nov 26, 2024

Did we run some benchmarks (e.g. cargo bench)?

No, I did not (it is not clear to me we have such a thing). Let me look

@alamb
Copy link
Contributor

alamb commented Nov 26, 2024

cd sqlparser_bench
cargo bench
git remote add Nyrox https://github.com/Nyrox/sqlparser-rs.git
git fetch Nyrox
git checkout Nyrox/main
cargo bench

Here is the benchmark result (it appears to be about 10%-15% slower according to the benchmark):

sqlparser-rs parsing benchmark/sqlparser::select
                        time:   [3.1255 µs 3.1265 µs 3.1276 µs]
                        change: [+15.235% +17.353% +19.853%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high severe
sqlparser-rs parsing benchmark/sqlparser::with_select
                        time:   [21.672 µs 21.683 µs 21.694 µs]
                        change: [+11.446% +11.549% +11.653%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

@alamb
Copy link
Contributor

alamb commented Nov 26, 2024

Here is the flamegraph for anyone who is interested (you can download it locally to get zoom / etc):

sqlparser-bench-flamegraph

sqlparser-bench-flamegraph

@Dandandan
Copy link
Contributor

Dandandan commented Nov 26, 2024

Thanks, looks great. I think 15% degradation is fully worth it (and might be gained back if someone looks at optimizing sqlparser-rs) :)

@alamb
Copy link
Contributor

alamb commented Nov 26, 2024

Thanks, looks great. I think 15% degradation is fully worth it (and might be gained back if someone looks at optimizing sqlparser-rs) :)

Yeah, I was looking at the flamegraph and there are a bunch of obvious things to improve performance (like changing next_token not to copy each token)

@alamb
Copy link
Contributor

alamb commented Nov 26, 2024

I filed a ticket to discuss improving performance:

@Nyrox
Copy link
Contributor Author

Nyrox commented Nov 26, 2024

I also noticed that there is a bunch of calls to <Location as Display>::fmt which seems quite strange to me. Not sure what the test is doing, but I don't think the parser should be calling that on a non-error path? :p

@alamb
Copy link
Contributor

alamb commented Nov 26, 2024

I also noticed that there is a bunch of calls to ::fmt which seems quite strange to me. Not sure what the test is doing, but I don't think the parser should be calling that on a non-error path? :p

I agree -- it is strange that Parser::expected woud show up so much. One place I see it used is generating errors 🤔

https://github.com/apache/datafusion-sqlparser-rs/blob/2e90e105a74bf9f50f2bad6c22992759ddb06880/src/parser/mod.rs#L3424-L3429

Screenshot 2024-11-26 at 9 51 11 AM

@alamb
Copy link
Contributor

alamb commented Nov 26, 2024

Found the benchmark problem 🤦

And fix:

I will rerun the benchmarks with actually parsing queriers

@alamb
Copy link
Contributor

alamb commented Nov 26, 2024

Amusingly when I ran with the fixed benchmarks the result is basically the same (15% slower)

sqlparser-rs parsing benchmark/sqlparser::select
                        time:   [6.5723 µs 6.5856 µs 6.6010 µs]
                        change: [+14.669% +15.020% +15.364%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  11 (11.00%) high mild
  1 (1.00%) high severe
sqlparser-rs parsing benchmark/sqlparser::with_select
                        time:   [32.234 µs 32.253 µs 32.277 µs]
                        change: [+14.270% +14.402% +14.538%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe

fixed-flamegraph

fixed-flamegraph

@alamb
Copy link
Contributor

alamb commented Nov 26, 2024

I am more convinced than ever that we could make a huge performance improvement by doing something like:

@alamb
Copy link
Contributor

alamb commented Nov 26, 2024

🚀 -- thanks again @Nyrox @iffyio @lovasoa @lustefaniak @mkarbo @Dandandan and @yuyang-ok for helping push this along. It is pretty amazing we'll finally get new things

cc @ankrgyl it's finally happening

@alamb
Copy link
Contributor

alamb commented Nov 26, 2024

🚀

@alamb alamb merged commit 3c8fd74 into apache:main Nov 26, 2024
8 checks passed
@ankrgyl
Copy link
Contributor

ankrgyl commented Nov 26, 2024

Amazing!!!

@alamb
Copy link
Contributor

alamb commented Nov 26, 2024

BTW if anyone has time to help review another PR, this one adds a bunch of documentation and examples for this feature:

@alamb
Copy link
Contributor

alamb commented Dec 12, 2024

BTW here is a PR from @davisp that recovers all the performance lost adding tokens (and then some) ❤️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants