Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect results returned for TPC-H Query 8 #6794

Closed
osawyerr opened this issue Jun 28, 2023 · 11 comments · Fixed by #7233
Closed

Incorrect results returned for TPC-H Query 8 #6794

osawyerr opened this issue Jun 28, 2023 · 11 comments · Fixed by #7233
Labels
bug Something isn't working

Comments

@osawyerr
Copy link

osawyerr commented Jun 28, 2023

Describe the bug

Datafusion gives incorrect results when running TPC-H Query 8 with parquet files.

To Reproduce

  1. Generate TPC-H parquet files for scale factor 10
  2. Open datafusion-cli and create external tables pointing to files
create external table lineitem stored as parquet location '/path/to/lineitem/lineitem_1687987398_default/';
create external table customer stored as parquet location '/path/to/customer/customer_1687987384_default/';
create external table nation stored as parquet location '/path/to/nation/nation_1687988005_default/';
create external table orders stored as parquet location '/path/to/orders/orders_1687988005_default/';
create external table region stored as parquet location '/path/to/region/region_1687988223_default/';
create external table supplier stored as parquet location '/path/to/supplier/supplier_1687988223_default/';
create external table part stored as parquet location '/path/to/part/part_1687988133_default/';
  1. Run TPC-H Query 8
select
  o_year, sum(case when nation = 'BRAZIL' then volume else 
0
 end) / sum(volume) as mkt_share
from
  (
    select
      extract(year from o_orderdate) as o_year,
      l_extendedprice * (
1
 - l_discount) as volume,
      n2.n_name as nation
    from part, supplier, lineitem, orders, customer, nation n1, nation n2, region
    where
      p_partkey = l_partkey
      and s_suppkey = l_suppkey
      and l_orderkey = o_orderkey
      and o_custkey = c_custkey
      and c_nationkey = n1.n_nationkey
      and n1.n_regionkey = r_regionkey
      and r_name = 'AMERICA'
      and s_nationkey = n2.n_nationkey
      and o_orderdate between date '1995-01-01' and date '1996-12-31'
      and p_type = 'ECONOMY ANODIZED STEEL'
  ) as all_nations
group by o_year
order by o_year;
  1. Incorrect results displayed below
+--------+-------------------------------------------+
| o_year | mkt_share                                 |
+--------+-------------------------------------------+
| 1995.0 | -0.00000000000011380044067220119495060732 |
| 1996.0 | 0.00000000000019588288500717383285218261  |
+--------+-------------------------------------------+
2 rows in set. Query took 1.131 seconds.

Expected behavior

The correct results should be:

  1. From Postgres:
 o_year |       mkt_share        
--------+------------------------
   1995 | 0.03882014251433219622
   1996 | 0.03948968749183991638
(2 rows)

Time: 11925.416 ms (00:11.925)
  1. From DuckDb (with same parquet files):
┌────────┬─────────────────────┐
│ o_year │      mkt_share      │
│ int64  │       double        │
├────────┼─────────────────────┤
│   1995 │  0.0388201425143322 │
│   1996 │ 0.03948968749183992 │
└────────┴─────────────────────┘
Run Time (s): real 1.328 user 8.095730 sys 0.358842

Additional context

No response

@osawyerr osawyerr added the bug Something isn't working label Jun 28, 2023
@mingmwang
Copy link
Contributor

I will take a look.

@mingmwang
Copy link
Contributor

mingmwang commented Jun 29, 2023

Might be related to Decimal division.

@viirya
Copy link
Member

viirya commented Jun 29, 2023

That is because overflow happened in decimal divide kernel, see previous comment https://github.com/apache/arrow-datafusion/pull/5675/files#r1152896889. I should provide a scalar version of fixed point decimal multiplication kernel to fix it but haven't find time working on it.

@viirya
Copy link
Member

viirya commented Jun 29, 2023

Currently you can only get a meaningful result by adding cast on the division, see benchmarks/queries/q8.sql.

@viirya
Copy link
Member

viirya commented Jun 29, 2023

I'm going to add scalar version of fixed point decimal multiplication kernel at the upstream. We can use it to fix this once it is available.

@tustvold
Copy link
Contributor

tustvold commented Jul 3, 2023

I think this is actually a bug in the way that decimal type coercion is currently performed, which causes computations to overflow when they shouldn't - #6828

This is something I hope to fix upstream as part of apache/arrow-rs#3999

Edit: To clarify this can't easily be fixed without apache/arrow-rs#1047 as there is currently no way to propagate the precision and scale of a scalar

@viirya
Copy link
Member

viirya commented Jul 3, 2023

That is why the internal scaling in the division kernel should be fixed point computation to allow precision loss instead of overflow.

@alamb
Copy link
Contributor

alamb commented Aug 8, 2023

Can someone test this again now that #6832 has been merged?

@viirya
Copy link
Member

viirya commented Aug 8, 2023

Opened #7233 to verify it.

@osawyerr
Copy link
Author

osawyerr commented Aug 8, 2023

Sure. Just ran Query 8. its working well now with the latest build on main branch. Nice speed boost as well.

+--------+------------+
| o_year | mkt_share  |
+--------+------------+
| 1995.0 | 0.03882014 |
| 1996.0 | 0.03948968 |
+--------+------------+
2 rows in set. Query took 0.985 seconds.

The o_year comes back as a decimal though. But I think thats related to the extract function in extract(year from o_orderdate) as o_year of the query.

@alamb
Copy link
Contributor

alamb commented Aug 9, 2023

🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants