Add two new methods in ScalarFunction return_type_from_args and is_nullable_from_args_nullable #14094

Open
wants to merge 9 commits into
base: main

@jayzhan211 jayzhan211 commented Jan 12, 2025

Which issue does this PR close?

Part of #13717

Rationale for this change

Add return_type_from_args, which depends less on Expr itself and more on the computed properties of Expr and Schema, namely data type and nullability.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

TODO

Combine return_type_from_args and is_nullable_from_args_nullable

Signed-off-by: Jay Zhan <[email protected]>
@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) functions labels Jan 12, 2025
pub fn is_nullable(&self, args: &[Expr], schema: &dyn ExprSchema) -> bool {
self.inner.is_nullable(args, schema)
}

pub fn is_nullable_from_args_nullable(&self, args_nullables: &[bool]) -> bool {
Contributor Author

Remove Expr dependency
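
To illustrate what deciding nullability without Expr could look like, here is a minimal sketch. It assumes a hypothetical stand-in trait `NullableFromArgs` (not the real ScalarUDFImpl), and the default rule shown is an assumption chosen for illustration, not the behavior of the PR.

```rust
/// Hypothetical stand-in for a scalar function trait; not the DataFusion API.
pub trait NullableFromArgs {
    /// Decide output nullability from the nullability of each argument only.
    /// Assumed default for this sketch: nullable if any input is nullable.
    fn is_nullable_from_args_nullable(&self, args_nullables: &[bool]) -> bool {
        args_nullables.iter().any(|&nullable| nullable)
    }
}

/// A function that keeps the assumed default rule.
pub struct UpperLike;
impl NullableFromArgs for UpperLike {}

/// A coalesce-like function: the output can be NULL only if every
/// argument can be NULL.
pub struct CoalesceLike;
impl NullableFromArgs for CoalesceLike {
    fn is_nullable_from_args_nullable(&self, args_nullables: &[bool]) -> bool {
        args_nullables.iter().all(|&nullable| nullable)
    }
}
```

Neither implementation needs an Expr or a schema; the caller is expected to have resolved each argument's nullability beforehand.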

/// The data types of the arguments to the function
pub arg_types: &'a [DataType],
/// The Utf8 arguments to the function, if the expression is not Utf8, it will be empty string
pub arguments: &'a [String],
Contributor Author

better name 🤔 ?

Contributor

Would it be possible to unify the argument handling so that both return type and nullability are returned the same way?

I wonder if it would somehow be possible to add the input nullable information here too 🤔

Contributor

I am also not sure about only supporting string args; that is likely a regression in behavior for some users (for example, they may look for constant integers as well).

@@ -86,22 +87,36 @@ impl ScalarUDFImpl for ArrowCastFunc {
}

fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
Contributor Author

@jayzhan211 jayzhan211 Jan 12, 2025

If this change looks good, we can deprecate this too

let name_column = &chunk[0];
let name = match name_column {
ColumnarValue::Scalar(ScalarValue::Utf8(Some(name_scalar))) => name_scalar,
_ => return exec_err!("named_struct even arguments must be string literals, got {name_column:?} instead at position {}", i * 2)
Contributor Author

With this change, name_column prints as a less readable array, so it is removed from the error message for now.

Member

findepi commented Jan 12, 2025

As stated in #13717 (comment), this new method doesn't necessarily simplify anything.

Can you please fill in "Rationale for this change"? What problem are we solving?

Contributor

@alamb alamb left a comment

Thanks @jayzhan211 -- I think this is a step in the right direction, but I am worried it just makes the API more complicated (adds as many functions as it deprecates)

Challenge: Exprs / constants

It seems to me one challenge is that different information is known for computing return types at different points in the plan (e.g. sometimes we have Expr and sometimes we don't)

What would you think about making this more explicit in ReturnTypeArgs by making it an enum:

#[derive(Debug)]
pub enum ReturnTypeArgs<'a> {
    /// Information known at logical planning time.
    /// Note you can get the type and nullability for each arg
    /// using the specified ExprSchema.
    Planning {
        args: &'a [Expr],
        schema: &'a dyn ExprSchema,
    },
    /// Information known during execution
    Execution {
        /// The data types of the arguments to the function
        arg_types: &'a [DataType],
        /// Whether each argument to the function can be NULL
        arg_nullability: &'a [bool],
    },
}

Challenge: Multiple APIs (Nullability and return type)

It is somewhat awkward to have two functions, one for nullability and one for return type. Also, I can imagine that the nullability calculation depends on the input types of the arguments too (not just the input nullability). I wonder if we can combine them into a single API:

Maybe something like

/// Information about the output of the function
/// including the data type and nullability:
struct ReturnTypeInfo {
    data_type: DataType,
    nullable: bool,
}

trait ScalarUDFImpl {
    /// Returns the data type and nullability of the function's output
    fn return_type_from_args(&self, args: ReturnTypeArgs) -> Result<ReturnTypeInfo>;
}
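
As a rough illustration of the combined API, the sketch below implements it for a hypothetical abs-like function. `DataType`, `ScalarUdfSketch`, and `AbsLike` are simplified stand-ins introduced here so the example is self-contained; they are not the Arrow or DataFusion types, and errors are plain strings for brevity.

```rust
/// Simplified stand-in for arrow's DataType (assumption for this sketch).
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int64,
    Float64,
    Utf8,
}

/// Argument information, following the shape discussed in this thread.
pub struct ReturnTypeArgs<'a> {
    pub arg_types: &'a [DataType],
    pub arg_nullability: &'a [bool],
}

/// Data type and nullability of the output, returned together.
#[derive(Debug, Clone, PartialEq)]
pub struct ReturnTypeInfo {
    pub data_type: DataType,
    pub nullable: bool,
}

/// Hypothetical trait with a single combined method.
pub trait ScalarUdfSketch {
    fn return_type_from_args(&self, args: ReturnTypeArgs) -> Result<ReturnTypeInfo, String>;
}

/// abs-like function: the output has the input's type and nullability.
pub struct AbsLike;

impl ScalarUdfSketch for AbsLike {
    fn return_type_from_args(&self, args: ReturnTypeArgs) -> Result<ReturnTypeInfo, String> {
        if args.arg_types.len() != 1 || args.arg_nullability.len() != 1 {
            return Err("abs_like expects exactly one argument".to_string());
        }
        match &args.arg_types[0] {
            DataType::Int64 | DataType::Float64 => Ok(ReturnTypeInfo {
                data_type: args.arg_types[0].clone(),
                nullable: args.arg_nullability[0],
            }),
            other => Err(format!("abs_like does not support {other:?}")),
        }
    }
}
```

Because the method sees both the argument types and their nullability, a function whose nullability depends on the input type can express that in one place.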

@jayzhan211
Contributor Author

Multiple APIs (Nullability and return type)

Great, I want this too.

Contributor Author

jayzhan211 commented Jan 12, 2025

I am also not sure about only supporting string args; that is likely a regression in behavior for some users (for example, they may look for constant integers as well).

Yes, it might be, since I assumed we don't really need Expr, only String. In theory users can still achieve what they need with String, but of course this is a breaking change.

If constant integers are the only concern, ScalarValue or ColumnarValue looks good to me too!
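
A minimal sketch of what typed constant arguments might look like, assuming a hypothetical `scalar_arguments` field in place of the string-only `arguments`. `DataType` and `ScalarValue` here are simplified stand-ins, not the Arrow or DataFusion types, and both functions are invented for illustration.

```rust
/// Simplified stand-in for arrow's DataType (assumption for this sketch).
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int64,
    Utf8,
    /// Width in bytes; a plain i64 here to keep the sketch simple.
    FixedSizeBinary(i64),
}

/// Simplified stand-in for DataFusion's ScalarValue.
#[derive(Debug, Clone, PartialEq)]
pub enum ScalarValue {
    Int64(Option<i64>),
    Utf8(Option<String>),
}

pub struct ReturnTypeArgs<'a> {
    pub arg_types: &'a [DataType],
    /// Hypothetical field: Some(value) when the argument is a constant,
    /// None when it is not (for example, a column reference).
    pub scalar_arguments: &'a [Option<ScalarValue>],
}

/// arrow_cast-like: the second argument names the target type as a string.
pub fn cast_like_return_type(args: &ReturnTypeArgs) -> Result<DataType, String> {
    match args.scalar_arguments.get(1) {
        Some(Some(ScalarValue::Utf8(Some(name)))) => match name.as_str() {
            "Int64" => Ok(DataType::Int64),
            "Utf8" => Ok(DataType::Utf8),
            other => Err(format!("unsupported type name: {other}")),
        },
        _ => Err("second argument must be a constant string".to_string()),
    }
}

/// Integer-constant case: the second argument gives the output width.
pub fn fixed_width_return_type(args: &ReturnTypeArgs) -> Result<DataType, String> {
    match args.scalar_arguments.get(1) {
        Some(Some(ScalarValue::Int64(Some(width)))) if *width > 0 => {
            Ok(DataType::FixedSizeBinary(*width))
        }
        _ => Err("second argument must be a positive constant integer".to_string()),
    }
}
```

With typed constants, the integer case does not need to round-trip through a string, which is the regression concern raised above.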

Contributor Author

jayzhan211 commented Jan 12, 2025

#[derive(Debug)]
pub enum ReturnTypeArgs<'a> {
    /// Information known at logical planning time.
    /// Note you can get the type and nullability for each arg
    /// using the specified ExprSchema.
    Planning {
        args: &'a [Expr],
        schema: &'a dyn ExprSchema,
    },
    /// Information known during execution
    Execution {
        /// The data types of the arguments to the function
        arg_types: &'a [DataType],
        /// Whether each argument to the function can be NULL
        arg_nullability: &'a [bool],
    },
}

One good thing in this PR is that we don't need Expr anymore: we compute the data type and nullability in DataFusion core, and that computation is not "public" for customization.

Do we really need Planning? My thought is that whenever we have Expr and Schema, we can compute the corresponding DataType and nullability. Therefore, even at the planning stage, we can still provide DataType + nullability.
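
The argument above can be sketched as follows: given expressions and a schema, the caller resolves each argument's type and nullability once, and the function only sees the resolved values. `Expr`, `Schema`, and `resolve_args` below are simplified, hypothetical stand-ins, not the DataFusion types.

```rust
use std::collections::HashMap;

/// Simplified stand-in for arrow's DataType (assumption for this sketch).
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int64,
    Utf8,
}

/// Simplified stand-in for a logical expression.
pub enum Expr {
    Column(String),
    LiteralInt64(Option<i64>),
    LiteralUtf8(Option<String>),
}

/// Hypothetical schema: column name -> (data type, nullable).
pub type Schema = HashMap<String, (DataType, bool)>;

/// Resolve every argument to its data type and nullability up front, so
/// that return-type logic no longer needs Expr or the schema.
pub fn resolve_args(
    args: &[Expr],
    schema: &Schema,
) -> Result<(Vec<DataType>, Vec<bool>), String> {
    let mut types = Vec::with_capacity(args.len());
    let mut nullability = Vec::with_capacity(args.len());
    for arg in args {
        let (data_type, nullable) = match arg {
            Expr::Column(name) => schema
                .get(name)
                .cloned()
                .ok_or_else(|| format!("unknown column: {name}"))?,
            // A literal is nullable only when it is the NULL literal.
            Expr::LiteralInt64(value) => (DataType::Int64, value.is_none()),
            Expr::LiteralUtf8(value) => (DataType::Utf8, value.is_none()),
        };
        types.push(data_type);
        nullability.push(nullable);
    }
    Ok((types, nullability))
}
```

The resulting vectors correspond to the `arg_types` and `arg_nullability` fields of the Execution variant, which is why a separate Planning variant may be unnecessary under this view.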
