Implement `Spanned` to retrieve source locations on AST nodes #1435

Nyrox · 2024-09-19T11:28:35Z

This PR adds a new trait Spanned to retrieve source spans on AST nodes (see #161) by recursively traversing the AST and combining spans. In contrast to the approach taken in #839 and #790, it tries to minimise breaking changes for downstream users by avoiding wrapping everything in WithSpan<T>.
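To make the "recursively traversing the AST and combining spans" idea concrete, here is a minimal, self-contained sketch. It does not use the sqlparser crate: `Location`, `Span`, `Spanned`, and the toy `Expr` enum below are simplified stand-ins written for illustration, and the helper names `loc` and `example_span` are hypothetical.

```rust
#![allow(dead_code)]

/// A position in the source text (line and column), illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Location {
    line: u64,
    column: u64,
}

/// A region of source text from `start` to `end`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Span {
    start: Location,
    end: Location,
}

impl Span {
    /// Smallest span that covers both `self` and `other`.
    fn union(self, other: Span) -> Span {
        Span {
            start: self.start.min(other.start),
            end: self.end.max(other.end),
        }
    }
}

/// Anything that can report where it came from in the source.
trait Spanned {
    fn span(&self) -> Span;
}

/// Toy AST: only leaves store a span; inner nodes derive theirs.
enum Expr {
    Ident { name: String, span: Span },
    BinaryOp { left: Box<Expr>, op: char, right: Box<Expr> },
}

impl Spanned for Expr {
    fn span(&self) -> Span {
        match self {
            // Leaf: the span was recorded by the parser.
            Expr::Ident { span, .. } => *span,
            // Inner node: recurse into the children and combine.
            Expr::BinaryOp { left, right, .. } => left.span().union(right.span()),
        }
    }
}

fn loc(line: u64, column: u64) -> Location {
    Location { line, column }
}

/// Builds the tree for `a + b` (all on line 1) and returns its span.
fn example_span() -> Span {
    let a = Expr::Ident { name: "a".into(), span: Span { start: loc(1, 1), end: loc(1, 2) } };
    let b = Expr::Ident { name: "b".into(), span: Span { start: loc(1, 5), end: loc(1, 6) } };
    let sum = Expr::BinaryOp { left: Box::new(a), op: '+', right: Box::new(b) };
    sum.span()
}
```

The point of the design is that `BinaryOp` needs no extra field: its span falls out of its children, which is what keeps the change less intrusive than a WithSpan<T> wrapper.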

Main Changes

Spanned trait and Span type added
TokenWithLocation now stores Span, instead of Location
Ident now stores a source span (omitted from PartialEq/Eq comparisons for compatibility and tests)
Some AST nodes store source tokens where not intrusive
Parser::parse_keyword_token added
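Leaving the span out of equality, as described for Ident above, can be sketched as follows. This is an assumption-laden illustration rather than the code in this PR: the `Ident` and `Span` types here are simplified stand-ins, and `hash_of` and `ident_at` are hypothetical helpers. The sketch also keeps `Hash` consistent with `PartialEq`, since two values that compare equal must hash equally.

```rust
#![allow(dead_code)]

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Simplified span: (line, column) pairs, illustrative only.
#[derive(Debug, Clone, Copy, Default)]
struct Span {
    start: (u64, u64),
    end: (u64, u64),
}

/// Stand-in identifier that remembers where it was parsed.
#[derive(Debug, Clone)]
struct Ident {
    value: String,
    quote_style: Option<char>,
    span: Span,
}

// Equality deliberately ignores `span`, so identifiers parsed at
// different positions still compare equal (existing tests keep passing).
impl PartialEq for Ident {
    fn eq(&self, other: &Self) -> bool {
        self.value == other.value && self.quote_style == other.quote_style
    }
}

impl Eq for Ident {}

// Hash must agree with PartialEq, so it skips `span` as well.
impl Hash for Ident {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.value.hash(state);
        self.quote_style.hash(state);
    }
}

/// Hypothetical helper: hash a single identifier.
fn hash_of(ident: &Ident) -> u64 {
    let mut hasher = DefaultHasher::new();
    ident.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical helper: an unquoted identifier starting at `column` on line 1.
fn ident_at(value: &str, column: u64) -> Ident {
    let len = value.len() as u64;
    Ident {
        value: value.to_string(),
        quote_style: None,
        span: Span { start: (1, column), end: (1, column + len) },
    }
}
```

With this in place, `ident_at("id", 1)` and `ident_at("id", 40)` compare equal and hash identically even though their spans differ.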

The general philosophy of the PR is to be "good enough" without breaking things. As a result certain expressions will have broken or incorrect spans. I imagine these can be cleaned up in future PRs (which might require breaking changes).

For example, many expressions do not include keywords in their span, i.e.

<expr> IS NOT NULL
|....|

will have its source span simply reported as <expr>::span, and there are many such cases, some of which are easier to fix than others. For expressions we can't generate spans for, I use Span::EMPTY as a sort of sentinel value to indicate missing information.
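One way a sentinel like Span::EMPTY can coexist with span combination is to treat it as "no information" and skip it when taking a union. The sketch below assumes that behaviour; it is self-contained and not taken from the crate. `Span::EMPTY` is modelled as an all-zero constant, and `union_spans` is named after the helper mentioned in this PR's commits, but its body here is a guess at the idea rather than the actual implementation.

```rust
#![allow(dead_code)]

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Location {
    line: u64,
    column: u64,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Span {
    start: Location,
    end: Location,
}

impl Span {
    /// Sentinel meaning "no source information available" (assumed all-zero).
    const EMPTY: Span = Span {
        start: Location { line: 0, column: 0 },
        end: Location { line: 0, column: 0 },
    };

    /// Union that ignores the sentinel, so a node with a missing span
    /// does not drag the combined span back to position 0:0.
    fn union(self, other: Span) -> Span {
        if self == Span::EMPTY {
            return other;
        }
        if other == Span::EMPTY {
            return self;
        }
        Span {
            start: self.start.min(other.start),
            end: self.end.max(other.end),
        }
    }
}

/// Combine any number of spans; an empty input yields the sentinel.
fn union_spans(spans: impl IntoIterator<Item = Span>) -> Span {
    spans.into_iter().fold(Span::EMPTY, Span::union)
}

/// Convenience constructor for a single-line span (hypothetical helper).
fn on_line_1(start_col: u64, end_col: u64) -> Span {
    Span {
        start: Location { line: 1, column: start_col },
        end: Location { line: 1, column: end_col },
    }
}
```

Under these assumptions, `<expr> IS NOT NULL` reports exactly the span of `<expr>`: the keywords contribute nothing (or the sentinel), so the union collapses to the inner expression's span.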

With this approach, the only downstream change a user should need to make when upgrading is adding the additional fields when matching on AST nodes.

Future Work

Store spans for ast::value::Value. This seems like a breaking change to me, which would require a WithSpan<T> like type
Store keyword TokenWithLocation for expressions that currently don't have them
Implement spans for the rest of the AST, namely Statements. I focused on getting Queries done.

Nyrox · 2024-10-02T11:49:53Z

@alamb Hey, we've started using this functionality internally with pretty great success so far. It's still a draft for now, because of the missing todo!'s, but I would appreciate feedback on the overall design and if you think this can get merged in the foreseeable future once issues are addressed. 😄

alamb · 2024-10-03T11:09:14Z

Thanks @Nyrox -- I'll try and take a look over the next few days

cc @lovasoa @iffyio and @jmhain

src/parser/mod.rs

yuyang-ok · 2024-10-08T06:22:47Z

How is this PR going? is it going to merge or what?

Nyrox · 2024-10-08T10:52:47Z

Alright, I have now gotten rid of all the todo!s and warnings and un-drafted the PR. There are a lot of missing implementations of spans, and I have documented those (in hindsight maybe a derive macro would have been the way to go here, someone better at writing those than me is free to take a shot at it 😅 ).

iffyio

Thanks for tackling this @Nyrox! took a quick look and left some comments inline, I'll make some time to do another pass

src/ast/mod.rs

src/ast/spans.rs

Co-authored-by: Ifeanyi Ubah <[email protected]>

iffyio

Thanks @Nyrox! The changes look reasonable to me overall given the discussion in the GH issue. Left some comments, one mostly wondering about the equality behavior now that the token location is embedded within the AST

src/ast/mod.rs

src/ast/spans.rs

src/tokenizer.rs

tests/sqlparser_common.rs

src/tokenizer.rs

src/ast/spans.rs

alamb

Thank you so much @Nyrox and @iffyio and @lovasoa -- this is epic work.

Also, thank you to @lustefaniak @yuyang-ok for your comments

Many people have tried this feature but none have prevailed. 👏 If we are ever colocated I totally owe you an in person meet up / 🍻 with a beverage of your choice

In terms of next steps:

I will file a ticket with the current state of the project / spans which can hopefully let us spread out the work for adding span information to the rest of parse tree over time.
I think it would be good to consider changing the offsets in Location to be u32 rather than usize, which would reduce the memory requirements significantly, I think.
I have a few ideas to improve the documentation, but I will propose some follow on PRs to do that.
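The memory argument for narrower offsets can be checked with `std::mem::size_of`. The sketch below is an illustration only: the four structs are hypothetical stand-ins, and the "wide" variant uses u64 fields to represent what usize would be on a typical 64-bit target.

```rust
#![allow(dead_code)]

use std::mem::size_of;

/// Location with 64-bit offsets (what `usize` is on 64-bit targets).
struct LocationWide {
    line: u64,
    column: u64,
}

/// Location with 32-bit offsets.
struct LocationNarrow {
    line: u32,
    column: u32,
}

/// A span is two locations, so any saving is doubled.
struct SpanWide {
    start: LocationWide,
    end: LocationWide,
}

struct SpanNarrow {
    start: LocationNarrow,
    end: LocationNarrow,
}

/// Returns (bytes per wide span, bytes per narrow span).
fn span_sizes() -> (usize, usize) {
    (size_of::<SpanWide>(), size_of::<SpanNarrow>())
}
```

In this sketch a span shrinks from 32 bytes to 16. Since a span is attached to every token and to many AST nodes, that halving applies across the whole parse, at the cost of capping lines and columns at about 4.29 billion.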

BTW I tried updating DataFusion to use this change here apache/datafusion#13546 and it went quite smoothly

Let's leave this PR open for a few more days to get any more feedback and then plan to merge it in 🚀

alamb · 2024-11-25T15:49:36Z

I started organizing follow on work here:

[EPIC] Complete Span (source location) information / feature #1548

I also have been going through the code and adding docs / examples. So far I am quite pleased

alamb · 2024-11-25T15:49:58Z

I plan to merge this PR tomorrow unless there are any other comments

Dandandan · 2024-11-26T12:26:24Z

Did we run some benchmarks (e.g. cargo bench)?

alamb · 2024-11-26T12:33:47Z

Did we run some benchmarks (e.g. cargo bench)?

No, I did not (it is not clear to me we have such a thing). Let me look

alamb · 2024-11-26T12:51:01Z

cd sqlparser_bench
cargo bench
git remote add Nyrox https://github.com/Nyrox/sqlparser-rs.git
git fetch Nyrox
git checkout Nyrox/main
cargo bench

Here is the benchmark result (it appears to be about 10%-15% slower according to the benchmark):

sqlparser-rs parsing benchmark/sqlparser::select
                        time:   [3.1255 µs 3.1265 µs 3.1276 µs]
                        change: [+15.235% +17.353% +19.853%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high severe
sqlparser-rs parsing benchmark/sqlparser::with_select
                        time:   [21.672 µs 21.683 µs 21.694 µs]
                        change: [+11.446% +11.549% +11.653%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

alamb · 2024-11-26T12:54:50Z

Here is the flamegraph for anyone who is interested (you can download it locally to get zoom / etc):

sqlparser-bench-flamegraph

Dandandan · 2024-11-26T13:19:35Z

Thanks, looks great. I think 15% degradation is fully worth it (and might be gained back if someone looks at optimizing sqlparser-rs) :)

alamb · 2024-11-26T13:45:55Z

Thanks, looks great. I think 15% degradation is fully worth it (and might be gained back if someone looks at optimizing sqlparser-rs) :)

Yeah, I was looking at the flamegraph and there are a bunch of obvious things to improve performance (like changing next_token not to copy each token)
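The "not to copy each token" idea can be sketched with a toy parser. Nothing below is the crate's code: `Token`, `Parser`, `next_token_cloned`, and `next_token_ref` are hypothetical names chosen to contrast a cloning accessor with a borrowing one.

```rust
#![allow(dead_code)]

#[derive(Debug, Clone, PartialEq)]
enum Token {
    Word(String),
    Comma,
    EOF,
}

/// Shared end-of-input token that the borrowing accessor can point at.
static EOF_TOKEN: Token = Token::EOF;

struct Parser {
    tokens: Vec<Token>,
    index: usize,
}

impl Parser {
    fn new(tokens: Vec<Token>) -> Self {
        Parser { tokens, index: 0 }
    }

    /// Cloning variant: every `Word` allocates and copies a fresh `String`.
    fn next_token_cloned(&mut self) -> Token {
        let token = self.tokens.get(self.index).cloned().unwrap_or(Token::EOF);
        self.index += 1;
        token
    }

    /// Borrowing variant: advances the cursor and hands out a reference
    /// into the token buffer, so no allocation happens per token.
    fn next_token_ref(&mut self) -> &Token {
        let current = self.index;
        self.index += 1;
        self.tokens.get(current).unwrap_or(&EOF_TOKEN)
    }
}
```

The trade-off is that callers of the borrowing variant hold a reference tied to the parser, so they must clone explicitly in the places where they really need an owned token.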

alamb · 2024-11-26T14:08:41Z

I filed a ticket to discuss improving performance:

[EPIC] Improve sqlparser performance #1557

Nyrox · 2024-11-26T14:09:11Z

I also noticed that there is a bunch of calls to <Location as Display>::fmt which seems quite strange to me. Not sure what the test is doing, but I don't think the parser should be calling that on a non-error path? :p

alamb · 2024-11-26T15:08:16Z

I also noticed that there is a bunch of calls to <Location as Display>::fmt which seems quite strange to me. Not sure what the test is doing, but I don't think the parser should be calling that on a non-error path? :p

I agree -- it is strange that Parser::expected would show up so much. One place I see it used is generating errors 🤔

https://github.com/apache/datafusion-sqlparser-rs/blob/2e90e105a74bf9f50f2bad6c22992759ddb06880/src/parser/mod.rs#L3424-L3429

alamb · 2024-11-26T15:50:05Z

Found the benchmark problem 🤦

Sqlparser Benchmarks are erroring #1559

And fix:

Fix error in benchmark queries #1560

I will rerun the benchmarks with queries that actually parse

alamb · 2024-11-26T16:04:41Z

Amusingly when I ran with the fixed benchmarks the result is basically the same (15% slower)

sqlparser-rs parsing benchmark/sqlparser::select
                        time:   [6.5723 µs 6.5856 µs 6.6010 µs]
                        change: [+14.669% +15.020% +15.364%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  11 (11.00%) high mild
  1 (1.00%) high severe
sqlparser-rs parsing benchmark/sqlparser::with_select
                        time:   [32.234 µs 32.253 µs 32.277 µs]
                        change: [+14.270% +14.402% +14.538%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe

fixed-flamegraph

alamb · 2024-11-26T16:09:49Z

I am more convinced than ever that we could make a huge performance improvement by doing something like:

Improve performance by not copying Tokens as much #1558

alamb · 2024-11-26T16:22:22Z

🚀 -- thanks again @Nyrox @iffyio @lovasoa @lustefaniak @mkarbo @Dandandan and @yuyang-ok for helping push this along. It is pretty amazing we'll finally get new things

cc @ankrgyl it's finally happening

alamb · 2024-11-26T16:22:26Z

🚀

ankrgyl · 2024-11-26T17:13:41Z

Amazing!!!

alamb · 2024-11-26T18:51:33Z

BTW if anyone has time to help review another PR, this one adds a bunch of documentation and examples for this feature:

Update comments / docs for Spanned #1549

alamb · 2024-12-12T15:25:19Z

BTW here is a PR from @davisp that recovers all the performance lost adding tokens (and then some) ❤️

Improve parsing performance by reducing token cloning #1587

Nyrox added 7 commits September 16, 2024 10:36

feat(tokenizer): add source location spans to tokens

1a77bac

feat: begin work on trait Spanned

d818012

implement a bunch more stuff

079a4e2

Merge branch 'feat/ast-source-locations'

b97a781

fix: restore old behaviour of location display

df9ab1e

implement spans for eveeeeen more ast nodes

b718c76

feat: more ast nodes

8986a1e

Nyrox changed the title ~~Implement Spanned to retrieve sourcec locations on AST nodes~~ Implement Spanned to retrieve source locations on AST nodes Sep 19, 2024

start working on better tests

aeb4f3a

alamb mentioned this pull request Sep 23, 2024

Syntax highlight for keywords datafusion-contrib/datafusion-dft#149

Merged

Nyrox added 3 commits September 25, 2024 14:31

feat: implement spans for Wildcard projections

4de3209

make union_spans public

a04888a

enable serde feat for spans and locations

1b2b03d

Nyrox force-pushed the main branch from b7335cb to 1b2b03d Compare September 30, 2024 08:55

lovasoa reviewed Oct 3, 2024

src/parser/mod.rs (outdated, resolved)

feat: implement remaining ast nodes

5f60bdc

Nyrox added 2 commits October 8, 2024 12:38

fix unused variable warnings

ea8a6b1

undo parse_keyword signature change

6a9250a

Nyrox marked this pull request as ready for review October 8, 2024 10:50

fix: diverging hash and partialeq implementations

0804e99

iffyio reviewed Oct 8, 2024

src/ast/mod.rs (resolved)

src/ast/spans.rs (4 threads, outdated, resolved)

Nyrox and others added 4 commits October 9, 2024 09:53

Update src/ast/spans.rs

eb9ff9a

Co-authored-by: Ifeanyi Ubah <[email protected]>

improve docs & un-pub union_spans

a93cebc

move union_spans to top of file

734264a

replace old tests

441ceb1

iffyio reviewed Oct 10, 2024

alamb approved these changes Nov 24, 2024

This was referenced Nov 25, 2024

[EPIC] Complete Span (source location) information / feature #1548

Open

Update comments / docs for Spanned #1549

Merged

Merge branch 'main' of https://github.com/apache/datafusion-sqlparser-rs

b24c9fe

This was referenced Nov 26, 2024

Document micro benchmarks #1555

Merged

[EPIC] Improve sqlparser performance #1557

Open

alamb mentioned this pull request Nov 26, 2024

Sqlparser Benchmarks are erroring #1559

Closed

alamb merged commit 3c8fd74 into apache:main Nov 26, 2024
8 checks passed

This was referenced Nov 26, 2024

Rename TokenWithLocation to TokenWithSpan, in backwards compatible way #1562

Merged

Create test pattern for Spans #1563

Open

eliaperantoni mentioned this pull request Dec 5, 2024

Add related source code locations to errors apache/datafusion#13662

Open
