Replace multiple calls to `withColumn` with single `select` to simplify query plans #888
very cool stuff!! lgtm will let others stamp
lgtm
…fy query plans * Define implicit tableUtils to fix test
Thanks for the review!
@thomaschow -- need another stamp, I had to rebase
LGTM!
@thomaschow @nikhilsimha -- the branch got out of date and I had to merge main again, so this needs another stamp. Thanks!
LGTM, purely a review
Summary
Refactors some of the join code to avoid multiple calls to `withColumn` and `withColumnRenamed`.

Why / Goal
Performance. Multiple calls to `withColumn` (and `withColumnRenamed`) are known to cause performance issues when applied to too many columns, as they may generate very complex query plans. At Stripe, some of the jobs with too many columns on the RHS (e.g. ~2k) may fail due to `StackOverflowError`s on the driver when generating the query plan.

The issue can be illustrated with a DF with 2k columns: in a notebook, code that called `withColumn` repeatedly took me ~3 minutes to run, while the equivalent single `select` took ~1 second.
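
The snippet below is not the notebook code from this PR and does not use Spark. It is a minimal toy model in plain Python, assuming only the documented behaviour that each `withColumn` call introduces one more projection on top of the existing plan, while a single `select` introduces one projection in total. The names `Plan`, `with_column`, `select`, and `depth` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Plan:
    """Toy stand-in for a logical plan node (not a real Spark class)."""
    columns: Tuple[str, ...]
    child: Optional["Plan"] = None


def with_column(plan: Plan, name: str) -> Plan:
    # Assumption: mimics withColumn by stacking one new projection per call.
    return Plan(columns=plan.columns + (name,), child=plan)


def select(plan: Plan, names: Iterable[str]) -> Plan:
    # A single projection that lists every output column at once.
    return Plan(columns=tuple(names), child=plan)


def depth(plan: Plan) -> int:
    # Count nested plan nodes iteratively (no recursion needed).
    count = 0
    node: Optional[Plan] = plan
    while node is not None:
        count += 1
        node = node.child
    return count


base = Plan(columns=("key",))
new_cols = [f"c{i}" for i in range(2000)]

# Pattern 1: one call per column, so the plan grows by one node each time.
chained = base
for name in new_cols:
    chained = with_column(chained, name)

# Pattern 2: build the full column list first, then project once.
single = select(base, list(base.columns) + new_cols)

print(depth(chained), depth(single))
```

Both patterns end up with the same output columns, but in this model the chained version nests one node per added column while the single projection stays flat. That nesting is the kind of plan growth the refactor is meant to avoid; the real cost in Spark depends on the analyzer and optimizer and is not captured here.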
Test Plan
Checklist
Reviewers
@pengyu-hou @jbrooks-stripe