Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[EPIC] Run full sqllogic / sqlite test suite against DataFusion #13811

Open
7 of 12 tasks
Tracked by #13648
alamb opened this issue Dec 17, 2024 · 7 comments
Open
7 of 12 tasks
Tracked by #13648

[EPIC] Run full sqllogic / sqlite test suite against DataFusion #13811

alamb opened this issue Dec 17, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Dec 17, 2024

Is your feature request related to a problem or challenge?

As DataFusion matures and becomes the foundation for more and more systems, ensuring it works bug free with many SQL queries is even more important. We have our own substantial test suites, but we still find issues from time to time, and more systematic testing would be even better

This ticket attempts to

Describe the solution you'd like

As described on #13470, there is a well known, very large corpus of sql test files originally from sqlite in sqllogictest format. It is common for other sql engines to run these scripts

The main description page is: https://sqlite.org/sqllogictest/file?name=about.wiki&ci=tip

@Omega359 has undertaken a substantial effort to create a DataFusion test runner to run these tests

I believe the code is here

Describe alternatives you've considered

Infrastructure Tasks

Fix bugs discovered by the initial testing

Additional context

No response

@Omega359
Copy link
Contributor

A quick grep and cleaning of DF errors that resulted in a query being marked as 'skipif Datafusion':

uniq -c errors_sorted.txt 
      5 # Datafusion - DataFusion error: Arrow error: Cast error: Can't cast value 2154602500.0 to type Int32
      5 # Datafusion - DataFusion error: Arrow error: Cast error: Can't cast value 2546884400.0 to type Int32
      1 # Datafusion - DataFusion error: Arrow error: Cast error: Can't cast value -2986724764 to type Int32
      5 # Datafusion - DataFusion error: Arrow error: Cast error: Can't cast value 48526918000.0 to type Int32
    203 # Datafusion - DataFusion error: Arrow error: Divide by zero error
   7075 # Datafusion - DataFusion error: Error during planning: Unary operator '+' only supports numeric, interval and timestamp types
   6543 # Datafusion - DataFusion error: Execution error: avg(DISTINCT) aggregations are not available
      5 # Datafusion - DataFusion error: External error: Arrow error: Cast error: Can't cast value 16106546000.0 to type Int32
      7 # Datafusion - DataFusion error: External error: Arrow error: Divide by zero error
     14 # Datafusion - DataFusion error: External error: External error: Arrow error: Divide by zero error
      3 # Datafusion - DataFusion error: External error: External error: External error: Arrow error: Divide by zero error
      2 # Datafusion - DataFusion error: External error: External error: Internal error: predicate did not evaluate to an array.
     56 # Datafusion - DataFusion error: Internal error: predicate did not evaluate to an array.
      1 # Datafusion - DataFusion error: Schema error: Schema contains duplicate unqualified field name "count(cor0.col1)"
      1 # Datafusion - DataFusion error: Schema error: Schema contains duplicate unqualified field name "count((- cor0.col2))"
      1 # Datafusion - DataFusion error: Schema error: Schema contains duplicate unqualified field name "count(NULL)"
      1 # Datafusion - DataFusion error: Schema error: Schema contains duplicate unqualified field name "max(Int64(33))"
      1 # Datafusion - DataFusion error: Schema error: Schema contains duplicate unqualified field name "sum((- cor0.col2))"
      1 # Datafusion - DataFusion error: Schema error: Schema contains duplicate unqualified field name "sum(NULL)"
      1 # Datafusion - DataFusion error: Schema error: Schema contains duplicate unqualified field name "sum((- tab0.col1))"
   6235 # Datafusion - DataFusion error: Schema error: Schema contains qualified field name
    275 # Datafusion - DataFusion error: SQL error: RecursionLimitExceeded
   1707 # Datafusion - DataFusion error: This feature is not implemented: Physical plan does not support logical expression ScalarSubquery(<subquery>)
   2602 # Datafusion - DataFusion error: This feature is not implemented: Unsupported logical plan: CreateIndex
  12691 # Datafusion - DataFusion error: type_coercion
   1743 # Datafusion - Error during planning: Cannot automatically convert Float32 to Null
     17 # Datafusion - Error during planning: Cannot automatically convert Float64 to Null
   5739 # Datafusion - Error during planning: Cannot automatically convert Int32 to Null
   5192 # Datafusion - Error during planning: Cannot automatically convert Int64 to Null
    386 # Datafusion - Error during planning: HAVING clause references non-aggregate values
    888 # Datafusion - Error during planning: Projection references non-aggregate values:
  39706 # Datafusion - Error during planning: Projections require unique expression names

@2010YOUY01
Copy link
Contributor

2010YOUY01 commented Jan 2, 2025

A quick grep and cleaning of DF errors that resulted in a query being marked as 'skipif Datafusion':

uniq -c errors_sorted.txt 
      5 # Datafusion - DataFusion error: Arrow error: Cast error: Can't cast value 2154602500.0 to type Int32
      5 # Datafusion - DataFusion error: Arrow error: Cast error: Can't cast value 2546884400.0 to type Int32
      1 # Datafusion - DataFusion error: Arrow error: Cast error: Can't cast value -2986724764 to type Int32
      5 # Datafusion - DataFusion error: Arrow error: Cast error: Can't cast value 48526918000.0 to type Int32
    203 # Datafusion - DataFusion error: Arrow error: Divide by zero error
   7075 # Datafusion - DataFusion error: Error during planning: Unary operator '+' only supports numeric, interval and timestamp types
   6543 # Datafusion - DataFusion error: Execution error: avg(DISTINCT) aggregations are not available
      5 # Datafusion - DataFusion error: External error: Arrow error: Cast error: Can't cast value 16106546000.0 to type Int32
      7 # Datafusion - DataFusion error: External error: Arrow error: Divide by zero error
     14 # Datafusion - DataFusion error: External error: External error: Arrow error: Divide by zero error
      3 # Datafusion - DataFusion error: External error: External error: External error: Arrow error: Divide by zero error
      2 # Datafusion - DataFusion error: External error: External error: Internal error: predicate did not evaluate to an array.
     56 # Datafusion - DataFusion error: Internal error: predicate did not evaluate to an array.
      1 # Datafusion - DataFusion error: Schema error: Schema contains duplicate unqualified field name "count(cor0.col1)"
      1 # Datafusion - DataFusion error: Schema error: Schema contains duplicate unqualified field name "count((- cor0.col2))"
      1 # Datafusion - DataFusion error: Schema error: Schema contains duplicate unqualified field name "count(NULL)"
      1 # Datafusion - DataFusion error: Schema error: Schema contains duplicate unqualified field name "max(Int64(33))"
      1 # Datafusion - DataFusion error: Schema error: Schema contains duplicate unqualified field name "sum((- cor0.col2))"
      1 # Datafusion - DataFusion error: Schema error: Schema contains duplicate unqualified field name "sum(NULL)"
      1 # Datafusion - DataFusion error: Schema error: Schema contains duplicate unqualified field name "sum((- tab0.col1))"
   6235 # Datafusion - DataFusion error: Schema error: Schema contains qualified field name
    275 # Datafusion - DataFusion error: SQL error: RecursionLimitExceeded
   1707 # Datafusion - DataFusion error: This feature is not implemented: Physical plan does not support logical expression ScalarSubquery(<subquery>)
   2602 # Datafusion - DataFusion error: This feature is not implemented: Unsupported logical plan: CreateIndex
  12691 # Datafusion - DataFusion error: type_coercion
   1743 # Datafusion - Error during planning: Cannot automatically convert Float32 to Null
     17 # Datafusion - Error during planning: Cannot automatically convert Float64 to Null
   5739 # Datafusion - Error during planning: Cannot automatically convert Int32 to Null
   5192 # Datafusion - Error during planning: Cannot automatically convert Int64 to Null
    386 # Datafusion - Error during planning: HAVING clause references non-aggregate values
    888 # Datafusion - Error during planning: Projection references non-aggregate values:
  39706 # Datafusion - Error during planning: Projections require unique expression names

Those skipifs might also need manual inspections for potential bugs 🤔

Done:

  1. Error during planning: Cannot automatically convert ... to Null Incorrect NULL handling in BETWEEN expression #13976
  2. All errors for duplicate field names like Schema error: Schema contains duplicate unqualified field name ... Referencing a column from select and order by clauses triggers duplicate expression error #13558

@alamb
Copy link
Contributor Author

alamb commented Jan 2, 2025

Thank you @2010YOUY01

@Omega359
Copy link
Contributor

Ticket filed for RecursionLimitExceeded: #13811

@Omega359
Copy link
Contributor

I regenerated the sqlite slt files (apache/datafusion-testing#4) to switch REAL -> FLOAT8 and change how DF errors are handled. Instead of skipif Datafusion they are now query error/statement error with the expected results move into the comments. This should mean that any changes in DF that either break/change/fix the sqlite tests will show immediately.

After the above PR is approved I believe the only blocker left is #13784 but I am not 100% certain on that.

@Omega359
Copy link
Contributor

Issue for avg(distinct) support - #2408

This I believe would be a useful addition if someone is able to take it on.

@Omega359
Copy link
Contributor

Many of the cast to Int issues are caused by the fact that in sqlite int is a flexible width data type - 0, 1, 2, 3, 4, 6, or 8 bytes depending on the magnitude of the value.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants