Add two new methods in ScalarFunction `return_type_from_args` and `is_nullable_from_args_nullable` #14094

jayzhan211 · 2025-01-12T09:25:15Z

Which issue does this PR close?

Rationale for this change

Add `return_type_from_args`, which depends less on Expr itself and instead takes the computed properties of Expr and Schema, namely data type and nullability.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

TODO

Combine return_type_from_args and is_nullable_from_args_nullable

Signed-off-by: Jay Zhan <[email protected]>

jayzhan211 · 2025-01-12T09:27:07Z

datafusion/expr/src/udf.rs

    pub fn is_nullable(&self, args: &[Expr], schema: &dyn ExprSchema) -> bool {
        self.inner.is_nullable(args, schema)
    }

+    pub fn is_nullable_from_args_nullable(&self, args_nullables: &[bool]) -> bool {


Remove Expr dependency

jayzhan211 · 2025-01-12T09:28:11Z

datafusion/expr/src/udf.rs

+    /// The data types of the arguments to the function
+    pub arg_types: &'a [DataType],
+    /// The Utf8 arguments to the function, if the expression is not Utf8, it will be empty string
+    pub arguments: &'a [String],


better name 🤔 ?


jayzhan211 · 2025-01-12T09:28:41Z

datafusion/functions/src/core/arrow_cast.rs

@@ -86,22 +87,36 @@ impl ScalarUDFImpl for ArrowCastFunc {
    }

    fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {


If this change looks good, we can deprecate this too

jayzhan211 · 2025-01-12T09:29:51Z

datafusion/functions/src/core/named_struct.rs

            let name_column = &chunk[0];
            let name = match name_column {
-                ColumnarValue::Scalar(ScalarValue::Utf8(Some(name_scalar))) => name_scalar,
-                _ => return exec_err!("named_struct even arguments must be string literals, got {name_column:?} instead at position {}", i * 2)


With this change, name_column prints as a less readable array in the error message, so I removed it from the message for now.

findepi · 2025-01-12T10:09:23Z

As stated in #13717 (comment), this new method doesn't necessarily simplify anything.

Can you please fill in "Rationale for this change"? What problem are we solving?

alamb

Thanks @jayzhan211 -- I think this is a step in the right direction, but I am worried it just makes the API more complicated (adds as many functions as it deprecates)

Challenge: Exprs / constants

It seems to me one challenge is that different information is known for computing return types at different points in the plan (e.g. sometimes we have Expr and sometimes we don't)

What would you think about making this more explicit in ReturnTypeArgs by making it an enum:

#[derive(Debug)]
pub enum ReturnTypeArgs<'a> {
    /// Information known at logical planning time.
    /// Note you can get the type and nullability of each arg
    /// using the specified ExprSchema.
    Planning {
        args: &'a [Expr],
        schema: &'a dyn ExprSchema,
    },
    /// Information known during execution
    Execution {
        /// The data types of the arguments to the function
        arg_types: &'a [DataType],
        /// Whether each argument is nullable
        arg_nullability: &'a [bool],
    },
}

Challenge: Multiple APIs (Nullability and return type)

It is somewhat awkward to have two functions, one for nullability and one for return type. Also, I can imagine that the nullability calculation depends on the input types of the arguments too (not just the input nullability). I wonder if we can combine them into a single API:

Maybe something like

/// Information about the output of the function,
/// including the data type and nullability:
struct ReturnTypeInfo {
    data_type: DataType,
    nullable: bool,
}

trait ScalarUDFImpl {
    /// Returns the output data type and nullability for the given arguments
    fn return_type_from_args(&self, args: ReturnTypeArgs) -> Result<ReturnTypeInfo>;
}
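
To make the combined API concrete, here is a minimal sketch of how an implementation might look. It is not the DataFusion API: `DataType` is a two-variant stand-in for arrow's type, the error type is a plain `String`, `ReturnTypeArgs` is reduced to types plus nullability flags, and `FirstArg` is a hypothetical function invented for illustration.

```rust
/// Stand-in for arrow's `DataType` (assumption: only the variants this sketch needs).
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int64,
    Utf8,
}

/// Output of the function: data type plus nullability, as proposed above.
#[derive(Debug, Clone, PartialEq)]
pub struct ReturnTypeInfo {
    pub data_type: DataType,
    pub nullable: bool,
}

/// Simplified argument bundle (assumption): one type and one flag per argument.
pub struct ReturnTypeArgs<'a> {
    pub arg_types: &'a [DataType],
    pub arg_nullability: &'a [bool],
}

pub trait ScalarUDFImpl {
    /// Single entry point that answers both "what type?" and "nullable?".
    fn return_type_from_args(&self, args: ReturnTypeArgs) -> Result<ReturnTypeInfo, String>;
}

/// Hypothetical function that returns its first argument's type and is
/// nullable whenever any input is nullable.
pub struct FirstArg;

impl ScalarUDFImpl for FirstArg {
    fn return_type_from_args(&self, args: ReturnTypeArgs) -> Result<ReturnTypeInfo, String> {
        let data_type = args
            .arg_types
            .first()
            .cloned()
            .ok_or_else(|| "expected at least one argument".to_string())?;
        let nullable = args.arg_nullability.iter().any(|n| *n);
        Ok(ReturnTypeInfo { data_type, nullable })
    }
}
```

With one method, the nullability rule can look at argument types as well as argument nullability without a second trait method.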

alamb · 2025-01-12T11:46:57Z

datafusion/expr/src/udf.rs

+    /// The data types of the arguments to the function
+    pub arg_types: &'a [DataType],
+    /// The Utf8 arguments to the function, if the expression is not Utf8, it will be empty string
+    pub arguments: &'a [String],


Would it be possible to unify the argument handling so that both return type and nullability are returned the same?

I wonder if it would somehow be possible to add the input nullable information here too 🤔

alamb · 2025-01-12T11:48:20Z

datafusion/expr/src/udf.rs

+    /// The data types of the arguments to the function
+    pub arg_types: &'a [DataType],
+    /// The Utf8 arguments to the function, if the expression is not Utf8, it will be empty string
+    pub arguments: &'a [String],


I am also not sure about only supporting string args, that is likely a regression in behavior for some users (For example, maybe they look for constant integers as well)

jayzhan211 · 2025-01-12T14:10:48Z

Multiple APIs (Nullability and return type)

Great, I want this too.

jayzhan211 · 2025-01-12T14:14:37Z

I am also not sure about only supporting string args, that is likely a regression in behavior for some users (For example, maybe they look for constant integers as well)

Yes, it might be. I assumed we don't really need Expr, only String. In theory users can still achieve what they need with String, but of course this is a breaking change.

If constant integers are the only concern, ScalarValue or ColumnarValue looks good to me too!
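
As a rough illustration of the ScalarValue option, the sketch below passes one optional constant per argument instead of a String. Everything here is a stand-in rather than the real API: `DataType` and `ScalarValue` are cut-down enums, and `cast_like_return_type` is a hypothetical helper for an `arrow_cast`-style function whose second argument must be a constant.

```rust
/// Stand-in for arrow's `DataType` (assumption: only what the sketch needs).
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int64,
    Utf8,
    FixedSizeBinary(i32),
}

/// Stand-in for DataFusion's `ScalarValue`, reduced to two variants.
#[derive(Debug, Clone, PartialEq)]
pub enum ScalarValue {
    Utf8(String),
    Int32(i32),
}

/// Return type of a hypothetical `cast_like(value, target)`.
/// `constants[i]` is `Some` only when argument `i` is a literal.
pub fn cast_like_return_type(constants: &[Option<ScalarValue>]) -> Result<DataType, String> {
    match constants.get(1) {
        // A string constant names the target type.
        Some(Some(ScalarValue::Utf8(name))) => match name.as_str() {
            "Int64" => Ok(DataType::Int64),
            "Utf8" => Ok(DataType::Utf8),
            other => Err(format!("unsupported type name {other}")),
        },
        // An integer constant is usable too, here as a byte width.
        Some(Some(ScalarValue::Int32(width))) => Ok(DataType::FixedSizeBinary(*width)),
        // Missing or non-constant second argument.
        _ => Err("second argument must be a constant".to_string()),
    }
}
```

Keeping the constant typed means users who look for integer literals do not have to parse them back out of a string.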

jayzhan211 · 2025-01-12T14:17:11Z

#[derive(Debug)]
pub enum ReturnTypeArgs<'a> {
    /// Information known at logical planning time.
    /// Note you can get the type and nullability of each arg
    /// using the specified ExprSchema.
    Planning {
        args: &'a [Expr],
        schema: &'a dyn ExprSchema,
    },
    /// Information known during execution
    Execution {
        /// The data types of the arguments to the function
        arg_types: &'a [DataType],
        /// Whether each argument is nullable
        arg_nullability: &'a [bool],
    },
}

One good thing about this PR is that we no longer need Expr: the data type and nullability are computed in DataFusion core, and that computation is not "public" for customization.

Do we really need Planning? My thought is that whenever we have an Expr and a Schema, we can compute the corresponding DataType and nullability. Therefore, even at the planning stage, we can still provide DataType + nullability.
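
A minimal sketch of that idea, assuming simplified stand-ins: `Expr` here is only a column or a string literal, `Schema` is a plain map from column name to (type, nullable), and `resolve_args` is a hypothetical helper, not an existing DataFusion function.

```rust
use std::collections::HashMap;

/// Stand-in for arrow's `DataType` (assumption: only what the sketch needs).
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int64,
    Utf8,
}

/// Stand-in for DataFusion's `Expr`: a column reference or a string literal.
pub enum Expr {
    Column(String),
    Literal(String),
}

/// Stand-in for `ExprSchema`: column name -> (data type, nullable).
pub type Schema = HashMap<String, (DataType, bool)>;

/// Resolve every argument to its data type and nullability up front, so the
/// function implementation only ever sees these computed properties.
pub fn resolve_args(
    args: &[Expr],
    schema: &Schema,
) -> Result<(Vec<DataType>, Vec<bool>), String> {
    let mut types = Vec::with_capacity(args.len());
    let mut nullables = Vec::with_capacity(args.len());
    for arg in args {
        let (data_type, nullable) = match arg {
            Expr::Column(name) => schema
                .get(name)
                .cloned()
                .ok_or_else(|| format!("unknown column {name}"))?,
            // In this sketch a literal is always a non-null string.
            Expr::Literal(_) => (DataType::Utf8, false),
        };
        types.push(data_type);
        nullables.push(nullable);
    }
    Ok((types, nullables))
}
```

The planner would call something like this once and hand the two vectors to the function, which is why a separate Planning variant may not be needed.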

jayzhan211 added 9 commits January 12, 2025 13:57

switch func

6b00b9a

Signed-off-by: Jay Zhan <[email protected]>

fix test

b079be3

Signed-off-by: Jay Zhan <[email protected]>

fix test

8c9ee8c

Signed-off-by: Jay Zhan <[email protected]>

deprecate old

6df7476

Signed-off-by: Jay Zhan <[email protected]>

add try new

fe7f6a5

Signed-off-by: Jay Zhan <[email protected]>

deprecate

4da4c71

Signed-off-by: Jay Zhan <[email protected]>

rm deprecate

de4b484

Signed-off-by: Jay Zhan <[email protected]>

reaplce deprecated func

02a64ce

Signed-off-by: Jay Zhan <[email protected]>

cleanup

f26ce70

Signed-off-by: Jay Zhan <[email protected]>

github-actions bot added the logical-expr, physical-expr, core, sqllogictest, and functions labels · Jan 12, 2025
