Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Type Coercion for UDF Arguments #14268

Open
wants to merge 23 commits into
base: main
Choose a base branch
from

Conversation

shehabgamin
Copy link
Contributor

@shehabgamin shehabgamin commented Jan 24, 2025

Which issue does this PR close?

Closes #14230

Rationale for this change

A bug was introduced in DataFusion v43.0.0 that affects type coercion for UDF arguments. Sail's tests uncovered several of these regressions, which required explicit casting in multiple areas as a workaround during the upgrade to DataFusion 43.0.0.

The regressions identified by Sail's tests include the following functions:

  1. ascii
  2. bit_length
  3. contains
  4. ends_with
  5. starts_with
  6. octet_length

Upon digging into the code, I discovered the following:

  1. The above functions didn't have Signature::coercible.
  2. Signature::coercible was incomplete. Coercion would only happen if logical_type == target_type, logical_type == NativeType::Null, or target_type.is_integer() && logical_type.is_integer().
    1. It seems like this was unintentional because the doc comments specified behavior that was not consistent with the logic implemented:

      For example, Coercible(vec![logical_float64()]) accepts arguments like vec![Int32] or vec![Float32] since i32 and f32 can be cast to f64.

What changes are included in this PR?

  1. Update functions from the list above to have Signature::coercible and port the relevant tests from Sail.
  2. Fix TypeSignature::Coercible.

Are these changes tested?

Yes.

Are there any user-facing changes?

signature: Signature::one_of(
vec![
TypeSignature::String(1),
TypeSignature::Coercible(vec![TypeSignatureClass::Native(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use coercible(string), we don't need string since it is a more strict rule.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I initially did, but after testing on Sail, I discovered new test failures related to coercing input that's all String (e.g. func(Utf8, Utf8View)).

The plan is to port all the relevant tests from Sail into this PR!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although it was not my intention to apply this pattern on single arg functions. I'll get that fixed!

Copy link
Contributor

@jayzhan211 jayzhan211 Jan 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current design for coercion may still have room for improvement. It would be beneficial to represent the function signature in a simpler and more concise manner, rather than relying on complex combinations of multiple, similar signatures.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed! I'll add that in.

@shehabgamin shehabgamin changed the title Fix DF 43 Regression: Coerce Various Scalar Func Args to String Fix DF 43 Coercion Regression: Improve TypeSignature::Coercible Jan 25, 2025
@github-actions github-actions bot added the common Related to common crate label Jan 25, 2025
@shehabgamin shehabgamin changed the title Fix DF 43 Coercion Regression: Improve TypeSignature::Coercible Fix DF 43 Type Coercion Regression: Improve TypeSignature::Coercible Jan 25, 2025
@shehabgamin shehabgamin changed the title Fix DF 43 Type Coercion Regression: Improve TypeSignature::Coercible Fix Type Coercion for UDF Arguments Jan 25, 2025
@shehabgamin shehabgamin mentioned this pull request Jan 25, 2025
21 tasks
@github-actions github-actions bot added the optimizer Optimizer rules label Jan 25, 2025
}

#[test]
fn test_ascii_expr() -> Result<()> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have preferred to place the various UDF tests within their respective files, but I couldn't due to circular dependencies.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ending up putting the tests in the .slt file, but figured we can still leave this test here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove this test if the purpose of the test is covered already in slt

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jan 26, 2025
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jan 26, 2025
@shehabgamin shehabgamin marked this pull request as ready for review January 26, 2025 09:45

/// Recursively traverses [`DataType::Dictionary`] to get the underlying value [`DataType`].
/// For non-dictionary types, returns the default [`DataType`].
fn base_dictionary_type_or_default_type(data_type: &DataType) -> DataType {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can add dictionary support to datafusion/common/src/utils/mod.rs base_type

50

query I
SELECT ascii(2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In DuckDB, this doesn't work

D select ascii(2);
Binder Error: No function matches the given name and argument types 'ascii(INTEGER_LITERAL)'. You might need to add explicit type casts.
	Candidate functions:
	ascii(VARCHAR) -> INTEGER

LINE 1: select ascii(2);
               ^
D select ascii('2');
┌────────────┐
│ ascii('2') │
│   int32    │
├────────────┤
│         50 │
└────────────┘

So does Postgres

postgres=# select ascii(2);
ERROR:  function ascii(integer) does not exist
LINE 1: select ascii(2);
               ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
postgres=# select ascii('2');
 ascii 
-------
    50
(1 row)

This used to work in DataFusion, but it is not consistent with other systems. I'm not sure whether we should introduce this behaviour back

Copy link
Contributor Author

@shehabgamin shehabgamin Jan 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behavior is consistent with Spark and the tests also come from the Spark codebase (which we ported to Sail then here lol).

spark-sql (default)> SELECT ASCII(2);
50
Time taken: 0.952 seconds, Fetched 1 row(s)
spark-sql (default)> SELECT ASCII('2');
50
Time taken: 0.044 seconds, Fetched 1 row(s)
spark-sql (default)>

@@ -571,8 +601,10 @@ select repeat('-1.2', arrow_cast(3, 'Int32'));
----
-1.2-1.2-1.2

query error DataFusion error: Error during planning: Internal error: Expect TypeSignatureClass::Native\(LogicalType\(Native\(Int64\), Int64\)\) but received Float64
query T
select repeat('-1.2', 3.2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be error, the 2nd argument should be integer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the 2nd arg should be TypeSignatureClass::Integer instead of TypeSignatureClass::Native(logical_int64())? It's currently unimplemented with a TODO:

// TODO:
// Numeric
// Integer

Otherwise, it would seem strange to deviate from the behavior specified in the implementation of default_cast_forfor NativeType.

I do think that we should prioritize being permissive and general-purpose though as I was discussing here: #14296 (comment)

}

#[test]
fn test_ascii_expr() -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove this test if the purpose of the test is covered already in slt

@@ -584,23 +544,7 @@ fn get_valid_types(
match target_type_class {
TypeSignatureClass::Native(native_type) => {
let target_type = native_type.native();
if &logical_type == target_type {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that mean others function that used Coercible String now also cast integer to string?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's TypeSignature::Coercible with a TypeSignatureClass::Native(logical_string()), then yes. Any function that specifies TypeSignature::Coercible with a TypeSignatureClass::Native should coerce according to the behavior implemented in the default_cast_for function for NativeType in order to be consistent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate documentation Improvements or additions to documentation functions logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DataFusion Regression (Starting in v43): Type Coercion for UDF Arguments
2 participants