Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance regressions since DataFusion 15.x #5060

Closed
andygrove opened this issue Jan 25, 2023 · 5 comments
Closed

Performance regressions since DataFusion 15.x #5060

andygrove opened this issue Jan 25, 2023 · 5 comments
Labels
bug Something isn't working performance Make DataFusion faster

Comments

@andygrove
Copy link
Member

andygrove commented Jan 25, 2023

Describe the bug

I noticed that some benchmark queries are slower in DF 16.

image

q17 is a good example. DF 16 uses a cross join instead of an inner join.

DataFusion 15.0.0

Projection: CAST(SUM(lineitem.l_extendedprice) AS Float64) / Float64(7) AS avg_yearly
  Aggregate: groupBy=[[]], aggr=[[SUM(lineitem.l_extendedprice)]]
    Filter: CAST(lineitem.l_quantity AS Decimal128(30, 15)) < CAST(__sq_1.__value AS Decimal128(30, 15))
      Inner Join: part.p_partkey = __sq_1.l_partkey, lineitem.l_partkey = __sq_1.l_partkey
        Inner Join: lineitem.l_partkey = part.p_partkey
          TableScan: lineitem projection=[l_partkey, l_quantity, l_extendedprice]
          Filter: part.p_brand = Utf8("Brand#51") AND part.p_container = Utf8("JUMBO BAG")
            TableScan: part projection=[p_partkey, p_brand, p_container], partial_filters=[part.p_brand = Utf8("Brand#51"), part.p_container = Utf8("JUMBO BAG")]
        SubqueryAlias: __sq_1
          Projection: lineitem.l_partkey, Float64(0.2) * CAST(AVG(lineitem.l_quantity) AS Float64) AS __value
            Aggregate: groupBy=[[lineitem.l_partkey]], aggr=[[AVG(lineitem.l_quantity)]]
              TableScan: lineitem projection=[l_partkey, l_quantity, l_extendedprice]

DataFusion 16.0.0

Projection: SUM(lineitem.l_extendedprice) / Float64(7) AS avg_yearly
  Aggregate: groupBy=[[]], aggr=[[SUM(lineitem.l_extendedprice)]]
    Filter: part.p_partkey = lineitem.l_partkey AND part.p_brand = Utf8("Brand#51") AND part.p_container = Utf8("JUMBO BAG") AND lineitem.l_quantity < (<subquery>)
      Subquery:
        Projection: Float64(0.2) * AVG(lineitem.l_quantity)
          Aggregate: groupBy=[[]], aggr=[[AVG(lineitem.l_quantity)]]
            Filter: lineitem.l_partkey = part.p_partkey
              TableScan: lineitem
      CrossJoin:
        TableScan: lineitem
        TableScan: part

To Reproduce
Steps to reproduce the behavior:

Expected behavior
A clear and concise description of what you expected to happen.

Additional context
Configuration settings used:

    "datafusion.explain.logical_plan_only": "false",
    "datafusion.optimizer.repartition_aggregations": "true",
    "datafusion.catalog.has_header": "false",
    "datafusion.execution.parquet.pruning": "true",
    "datafusion.catalog.default_schema": "public",
    "datafusion.optimizer.top_down_join_key_reordering": "true",
    "datafusion.execution.batch_size": "8192",
    "datafusion.optimizer.enable_round_robin_repartition": "true",
    "datafusion.optimizer.filter_null_join_keys": "true",
    "datafusion.optimizer.hash_join_single_partition_threshold": "1048576",
    "datafusion.execution.coalesce_batches": "true",
    "datafusion.catalog.create_default_catalog_and_schema": "true",
    "datafusion.execution.collect_statistics": "false",
    "datafusion.execution.parquet.skip_metadata": "true",
    "datafusion.optimizer.repartition_joins": "true",
    "datafusion.optimizer.repartition_windows": "true",
    "datafusion.execution.time_zone": "+00:00",
    "datafusion.execution.parquet.reorder_filters": "true",
    "datafusion.explain.physical_plan_only": "false",
    "datafusion.catalog.default_catalog": "datafusion",
    "datafusion.execution.target_partitions": "24",
    "datafusion.catalog.information_schema": "false",
    "datafusion.optimizer.prefer_hash_join": "true",
    "datafusion.execution.parquet.pushdown_filters": "true",
    "datafusion.execution.parquet.enable_page_index": "true",
    "datafusion.optimizer.skip_failed_rules": "true",
    "datafusion.optimizer.max_passes": "3"
@andygrove andygrove added the bug Something isn't working label Jan 25, 2023
@andygrove andygrove changed the title Performance regressions in DataFusion Performance regressions in DataFusion 16.x Jan 25, 2023
@andygrove andygrove added the performance Make DataFusion faster label Jan 25, 2023
@andygrove andygrove changed the title Performance regressions in DataFusion 16.x Performance regressions since DataFusion 15.x Jan 25, 2023
@Dandandan
Copy link
Contributor

The expected plan in
https://github.com/apache/arrow-datafusion/blob/master/benchmarks/expected-plans/q17.txt

is still showing the inner join?

@andygrove
Copy link
Member Author

Yes, I checked that as well. I wonder if it this bug is triggered by the specific options I am setting (noted in the issue)

@andygrove
Copy link
Member Author

andygrove commented Jan 25, 2023

oh, wait .. the API changed between 15 and 16 and df.logical_plan now returns the unoptimized plan ...

@mingmwang
Copy link
Contributor

mingmwang commented Mar 27, 2023

There is something wrong to the q17 plan in the latest DataFusion.

Projection: CAST(SUM(lineitem.l_extendedprice) AS Float64) / Float64(7) AS avg_yearly
  Aggregate: groupBy=[[]], aggr=[[SUM(lineitem.l_extendedprice)]]
    Projection: lineitem.l_extendedprice
      Filter: CAST(lineitem.l_quantity AS Decimal128(30, 15)) < CAST(__scalar_sq_1.__value AS Decimal128(30, 15)) AND __scalar_sq_1.l_partkey = lineitem.l_partkey
        Projection: lineitem.l_partkey, lineitem.l_quantity, lineitem.l_extendedprice, __scalar_sq_1.l_partkey, __scalar_sq_1.__value
          Inner Join: part.p_partkey = __scalar_sq_1.l_partkey
            Filter: part.p_partkey = lineitem.l_partkey AND lineitem.l_partkey = part.p_partkey
              Inner Join: lineitem.l_partkey = part.p_partkey
                TableScan: lineitem projection=[l_partkey, l_quantity, l_extendedprice]
                Projection: part.p_partkey
                  Filter: part.p_brand = Utf8("Brand#23") AND part.p_container = Utf8("MED BOX")
                    TableScan: part projection=[p_partkey, p_brand, p_container]
            SubqueryAlias: __scalar_sq_1
              Projection: lineitem.l_partkey, Float64(0.2) * CAST(AVG(lineitem.l_quantity) AS Float64) AS __value
                Aggregate: groupBy=[[lineitem.l_partkey]], aggr=[[AVG(lineitem.l_quantity)]]
                  TableScan: lineitem projection=[l_partkey, l_quantity]

@mingmwang
Copy link
Contributor

#5646

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working performance Make DataFusion faster
Projects
None yet
Development

No branches or pull requests

3 participants