Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace multiple calls to withColumn with single select to simplify query plans #888

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

earangol-stripe
Copy link
Contributor

@earangol-stripe earangol-stripe commented Nov 27, 2024

Summary

Refactors some of the join code to avoid multiple calls to withColumn and withColumnRenamed.

Why / Goal

Performance. Multiple calls to withColumn (and withColumnRenamed) are known to cause performance issues when done on too many columns, as it may generate very complex query plans. At Stripe, some of the jobs with too many columns on the RHS (e.g. ~2k) may fail due to StackOverflowErrors on the driver when generating the query plan.

The following code snippet illustrates the issue, given a DF with 2k columns:

val range = 0 until 2000
val columns = range.map(i => s"column$i")
val schema = columns.map(c => StructField(c, IntegerType, true))
val df = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], StructType(schema))

On a notebook, this code took me ~3 minutes to run:

val renamedDf = df.columns.foldLeft(df) {
  case (partialDf, c) =>
    df.withColumnRenamed(c, s"renamed_$c")
}
renamedDf.explain() // slow

While this code took ~1 second:

val renamedColumns = df.columns.map(c => col(c).as(s"renamed_$c"))
val renamedDf = df.select(renamedColumns:_*)
renamedDf.explain() // fast

Test Plan

  • Added Unit Tests
  • Covered by existing CI
  • Integration tested

Checklist

  • Documentation update

Reviewers

@pengyu-hou @jbrooks-stripe

@earangol-stripe earangol-stripe force-pushed the earangol--prefix-column-names-select branch 2 times, most recently from eab287b to 26c1031 Compare December 3, 2024 19:17
@earangol-stripe earangol-stripe marked this pull request as ready for review December 3, 2024 19:57
Copy link
Contributor

@nikhilsimha nikhilsimha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very cool stuff!! lgtm will let others stamp

@earangol-stripe earangol-stripe force-pushed the earangol--prefix-column-names-select branch from 71b2a6c to 4b752bb Compare December 6, 2024 14:44
@thomaschow
Copy link

lgtm

@earangol-stripe earangol-stripe force-pushed the earangol--prefix-column-names-select branch from e88e47e to c9b22d0 Compare December 9, 2024 15:59
@earangol-stripe
Copy link
Contributor Author

Thanks for the review!

@earangol-stripe earangol-stripe force-pushed the earangol--prefix-column-names-select branch from c9b22d0 to 5f8401c Compare December 9, 2024 16:58
@earangol-stripe
Copy link
Contributor Author

@thomaschow -- need another stamp, I had to rebase main.

Copy link
Contributor

@nikhilsimha nikhilsimha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@earangol-stripe
Copy link
Contributor Author

@thomaschow @nikhilsimha -- the branch got out of date and I had to merge main again so this needs another stamp. Thanks!

Copy link

@thomaschow thomaschow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, purely a review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants