-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Type Coercion for UDF Arguments #14268
base: main
Are you sure you want to change the base?
Conversation
signature: Signature::one_of( | ||
vec![ | ||
TypeSignature::String(1), | ||
TypeSignature::Coercible(vec![TypeSignatureClass::Native( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we use coercible(string)
, we don't need string
since it is a more strict rule.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's what I initially did, but after testing on Sail, I discovered new test failures related to coercing input that's all String (e.g. func(Utf8, Utf8View)
).
The plan is to port all the relevant tests from Sail into this PR!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jayzhan211 You can find the test failures here if interested!
lakehq/sail@372e13b#diff-bb36d996163d98235e107f8203c9c24be34ef71c84f8afa4a420a8e483102e3e
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although it was not my intention to apply this pattern on single arg functions. I'll get that fixed!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current design for coercion may still have room for improvement. It would be beneficial to represent the function signature in a simpler and more concise manner, rather than relying on complex combinations of multiple, similar signatures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed! I'll add that in.
} | ||
|
||
#[test] | ||
fn test_ascii_expr() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have preferred to place the various UDF tests within their respective files, but I couldn't due to circular dependencies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ending up putting the tests in the .slt
file, but figured we can still leave this test here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can remove this test if the purpose of the test is covered already in slt
|
||
/// Recursively traverses [`DataType::Dictionary`] to get the underlying value [`DataType`]. | ||
/// For non-dictionary types, returns the default [`DataType`]. | ||
fn base_dictionary_type_or_default_type(data_type: &DataType) -> DataType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can add dictionary support to datafusion/common/src/utils/mod.rs
base_type
50 | ||
|
||
query I | ||
SELECT ascii(2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In DuckDB, this doesn't work
D select ascii(2);
Binder Error: No function matches the given name and argument types 'ascii(INTEGER_LITERAL)'. You might need to add explicit type casts.
Candidate functions:
ascii(VARCHAR) -> INTEGER
LINE 1: select ascii(2);
^
D select ascii('2');
┌────────────┐
│ ascii('2') │
│ int32 │
├────────────┤
│ 50 │
└────────────┘
So does Postgres
postgres=# select ascii(2);
ERROR: function ascii(integer) does not exist
LINE 1: select ascii(2);
^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
postgres=# select ascii('2');
ascii
-------
50
(1 row)
This used to work in DataFusion, but it is not consistent with other systems. I'm not sure whether we should introduce this behaviour back
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The behavior is consistent with Spark and the tests also come from the Spark codebase (which we ported to Sail then here lol).
spark-sql (default)> SELECT ASCII(2);
50
Time taken: 0.952 seconds, Fetched 1 row(s)
spark-sql (default)> SELECT ASCII('2');
50
Time taken: 0.044 seconds, Fetched 1 row(s)
spark-sql (default)>
@@ -571,8 +601,10 @@ select repeat('-1.2', arrow_cast(3, 'Int32')); | |||
---- | |||
-1.2-1.2-1.2 | |||
|
|||
query error DataFusion error: Error during planning: Internal error: Expect TypeSignatureClass::Native\(LogicalType\(Native\(Int64\), Int64\)\) but received Float64 | |||
query T | |||
select repeat('-1.2', 3.2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be error, the 2nd argument should be integer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the 2nd arg should be TypeSignatureClass::Integer
instead of TypeSignatureClass::Native(logical_int64())
? It's currently unimplemented with a TODO
:
datafusion/datafusion/expr-common/src/signature.rs
Lines 216 to 218 in f775791
// TODO: | |
// Numeric | |
// Integer |
Otherwise, it would seem strange to deviate from the behavior specified in the implementation of default_cast_for
for NativeType
.
I do think that we should prioritize being permissive and general-purpose though as I was discussing here: #14296 (comment)
} | ||
|
||
#[test] | ||
fn test_ascii_expr() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can remove this test if the purpose of the test is covered already in slt
@@ -584,23 +544,7 @@ fn get_valid_types( | |||
match target_type_class { | |||
TypeSignatureClass::Native(native_type) => { | |||
let target_type = native_type.native(); | |||
if &logical_type == target_type { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does that mean others function that used Coercible String now also cast integer to string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's TypeSignature::Coercible
with a TypeSignatureClass::Native(logical_string())
, then yes. Any function that specifies TypeSignature::Coercible
with a TypeSignatureClass::Native
should coerce according to the behavior implemented in the default_cast_for
function for NativeType
in order to be consistent.
Which issue does this PR close?
Closes #14230
Rationale for this change
A bug was introduced in DataFusion v43.0.0 that affects type coercion for UDF arguments. Sail's tests uncovered several of these regressions, which required explicit casting in multiple areas as a workaround during the upgrade to DataFusion 43.0.0.
The regressions identified by Sail's tests include the following functions:
ascii
bit_length
contains
ends_with
starts_with
octet_length
Upon digging into the code, I discovered the following:
Signature::coercible
.Signature::coercible
was incomplete. Coercion would only happen iflogical_type == target_type
,logical_type == NativeType::Null
, ortarget_type.is_integer() && logical_type.is_integer()
.What changes are included in this PR?
Signature::coercible
and port the relevant tests from Sail.TypeSignature::Coercible
.Are these changes tested?
Yes.
Are there any user-facing changes?