Modified exception objects being thrown when converting Pyarrow tables #1498

Open
wants to merge 4 commits into
base: main

Conversation

DevChrisCross

@DevChrisCross DevChrisCross commented Jan 8, 2025

This changes the exception raised when converting PyArrow types to Iceberg types. The exception now also names the field involved, if applicable, and the field can be accessed on the exception when it is caught.

closes #860

Contributor

@kevinjqliu kevinjqliu left a comment

Thanks for the PR!

As I mentioned in the comments, I'm not sure if primitive is where we want to print out the error message given that it does not have information about the field itself.

The visitor has a field function, which might be a better place to do this:

def field(self, field: pa.Field, field_result: IcebergType) -> NestedField:
    field_id = self._field_id(field)
    field_doc = doc_str.decode() if (field.metadata and (doc_str := field.metadata.get(PYARROW_FIELD_DOC_KEY))) else None
    field_type = field_result
    return NestedField(field_id, field.name, field_type, required=not field.nullable, doc=field_doc)

For example, here's the difference:

# a primitive
pyarrow_type = pa.time64("ns")

# a field
pyarrow_field = pa.field("foo", pa.string(), nullable=True, metadata={"PARQUET:field_id": "1", "doc": "foo doc"})

Field has information about the field name.

Let me know what you think!

@DevChrisCross
Author

DevChrisCross commented Jan 8, 2025

@kevinjqliu Ah yes, I noticed that as well. I initially placed it in primitive because, as I understand it, the visitor traverses the schema until it reaches the primitive types, so it made sense to me to put the check and the error message in the same place where the actual data type checking is done :)

I reused the existing functionality in visit_pyarrow for StructTypes: it sets the field name before the data type is checked, where applicable, and removes it again afterwards. That made the change feasible and gives some guarantee about when the code is working with "fields".

for field in obj:
    visitor.before_field(field)
    result = visit_pyarrow(field.type, visitor)
    results.append(visitor.field(field, result))
    visitor.after_field(field)

So, given the code above, I think it is the right place to do it.
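
To make the before_field/after_field idea concrete, here is a minimal stdlib-only sketch. The Field and StructType classes, the NamingVisitor class, the SUPPORTED set, and the visit function are hypothetical stand-ins, not PyArrow's or PyIceberg's real classes. It only illustrates how a visitor can remember the current field so that the primitive check can name it.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Field:
    """Hypothetical stand-in for a named field."""
    name: str
    type: Union[str, "StructType"]


@dataclass
class StructType:
    """Hypothetical stand-in for a struct type holding fields."""
    fields: List[Field]


# Assumption for this sketch: only these type names are convertible.
SUPPORTED = {"string", "long"}


class NamingVisitor:
    def __init__(self) -> None:
        self._field_names: List[str] = []

    def before_field(self, field: Field) -> None:
        # Remember which field is being visited.
        self._field_names.append(field.name)

    def after_field(self, field: Field) -> None:
        # Forget the field once its type has been visited.
        self._field_names.pop()

    def primitive(self, type_name: str) -> str:
        if type_name not in SUPPORTED:
            column = ".".join(self._field_names) or "<no field>"
            raise TypeError(f"Column '{column}' has an unsupported type: {type_name}")
        return type_name


def visit(obj, visitor: NamingVisitor):
    if isinstance(obj, StructType):
        results = []
        for field in obj.fields:
            visitor.before_field(field)
            try:
                results.append((field.name, visit(field.type, visitor)))
            finally:
                visitor.after_field(field)
        return results
    return visitor.primitive(obj)


schema = StructType([Field("id", "long"), Field("point", StructType([Field("ts", "timestamp_ns")]))])
try:
    visit(schema, NamingVisitor())
except TypeError as exc:
    message = str(exc)
```

In this sketch the primitive check can name the column only because the visitor carries state between calls; a bare type passed to visit has no field name available.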

I'm not sure whether I'm missing something about how the field method is used by PyArrowSchemaVisitor classes if I were to place the check there. It doesn't make much sense to me at the moment, so I hope you can point me in the right direction :) Thank you!

@kevinjqliu
Contributor

I think generally we want this kind of error message when calling the pyarrow_to_schema function, which is used in the create table path and the write path.

Column '{field_name}' has an unsupported type: {field_type}

Let's take a look at pyarrow_to_schema.
It takes schema: pa.Schema as input, which is a PyArrow Schema object. A Schema has a list of Fields, and each Field has a corresponding name and type.

It uses visit_pyarrow, which dispatches based on the type of the Arrow object. Handlers are registered with visit_pyarrow.register:
https://github.com/search?q=repo%3Aapache%2Ficeberg-python%20visit_pyarrow.register&type=code
Dispatch starts at visit_pyarrow.register(pa.Schema) and continues down until it hits the primitives (visit_pyarrow.register(pa.DataType)).
Before the primitives, it should dispatch to pa.field. I think this is where we want to emit the error, since the field has name, type, optionality, etc.

Currently there's no visit_pyarrow registration for pa.field, so we'd need to add one.
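
A rough sketch of that dispatch chain, assuming it behaves like functools.singledispatch. The Schema, Field, and DataType dataclasses, the visit function, and UnsupportedTypeError are hypothetical stand-ins so the example runs without PyArrow; they are not the real PyIceberg code.

```python
from dataclasses import dataclass
from functools import singledispatch
from typing import List


@dataclass
class DataType:
    """Hypothetical stand-in for a primitive type."""
    name: str


@dataclass
class Field:
    """Hypothetical stand-in for a named field."""
    name: str
    type: DataType


@dataclass
class Schema:
    """Hypothetical stand-in for a schema holding fields."""
    fields: List[Field]


class UnsupportedTypeError(Exception):
    """Hypothetical exception that keeps the offending field."""

    def __init__(self, field: Field, message: str) -> None:
        super().__init__(message)
        self.field = field


@singledispatch
def visit(obj):
    raise NotImplementedError(f"Cannot visit: {obj!r}")


@visit.register(Schema)
def _visit_schema(obj):
    return [visit(field) for field in obj.fields]


@visit.register(Field)
def _visit_field(obj):
    # The field level knows the column name, so the error is emitted here.
    try:
        return (obj.name, visit(obj.type))
    except TypeError as e:
        raise UnsupportedTypeError(
            obj, f"Column '{obj.name}' has an unsupported type: {obj.type.name}"
        ) from e


@visit.register(DataType)
def _visit_primitive(obj):
    # The primitive level only sees the type, not the column it belongs to.
    if obj.name not in {"string", "long"}:
        raise TypeError(f"Unsupported type: {obj.name}")
    return obj.name


schema = Schema([Field("id", DataType("long")), Field("ts", DataType("timestamp_ns"))])
try:
    visit(schema)
except UnsupportedTypeError as exc:
    failed = exc
```

Here failed.field gives the caller programmatic access to the offending field, while the primitive handler stays unaware of column names.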

@DevChrisCross
Author

@kevinjqliu Thank you for the insight! I was also hesitant at first about adding a visit_pyarrow registration for pa.field. I've committed the necessary changes :)

@DevChrisCross DevChrisCross force-pushed the feat/improved-pyarrow-error-message branch from b405ec9 to cce35c7 Compare January 11, 2025 06:49
Signed-off-by: Christian Molina <[email protected]>
Contributor

@kevinjqliu kevinjqliu left a comment

LGTM! Thanks for adding this

exception_cause = exc_info.value.__cause__
assert isinstance(exception_cause, TypeError)
assert (
"Iceberg does not yet support 'ns' timestamp precision. Use 'downcast-ns-timestamp-to-us-on-write' configuration property to automatically downcast 'ns' to 'us' on write."
Contributor

Is this part of the error message? If so, can we add it to the pytest.raises to be explicit?

Author

It is no longer part of the error message, and pytest.raises can no longer match it, since UnsupportedPyArrowTypeException is now raised instead. :)

Contributor

I think for this error, we actually want to propagate it upwards. We want to let the users know there is the downcast-ns-timestamp-to-us-on-write configuration.

Maybe we can try re-raising the underlying error to preserve its message:

    try:
        result = visit_pyarrow(field_type, visitor)
    except TypeError as e:
        # Raise a custom exception while preserving the original error message and traceback
        raise UnsupportedPyArrowTypeException(
            obj, 
            f"Column '{obj.name}' has an unsupported type: {field_type}"
        ) from e

Author

@DevChrisCross DevChrisCross Jan 12, 2025

The original exception is still preserved and propagated, and it appears in the traceback when the error is raised, because the new exception is raised from e. I also want to make clear to the user that the exception really originated in primitive() rather than visit_pyarrow(). The traceback will look something like this:

<... TypeError traceback indicating 'downcast-ns-timestamp-to-us-on-write' configuration ...>
The above exception was the direct cause of the following exception:
<... UnsupportedPyArrowTypeException ...>

I think that is still sufficient for the user to know what's going on. Let me know if you think otherwise. :)
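
The chaining behavior discussed in this thread can be checked with a small stdlib-only sketch. UnsupportedColumnTypeError, primitive, and convert_field are hypothetical stand-ins for the PR's exception and visitor code; only the TypeError message is taken from the test quoted above.

```python
import traceback


class UnsupportedColumnTypeError(Exception):
    """Hypothetical stand-in for the PR's custom exception."""


def primitive(type_name: str) -> str:
    # Stand-in for the low-level check that knows about the downcast option.
    raise TypeError(
        "Iceberg does not yet support 'ns' timestamp precision. "
        "Use 'downcast-ns-timestamp-to-us-on-write' configuration property "
        "to automatically downcast 'ns' to 'us' on write."
    )


def convert_field(name: str, type_name: str) -> str:
    try:
        return primitive(type_name)
    except TypeError as e:
        # "from e" stores the original error on __cause__.
        raise UnsupportedColumnTypeError(
            f"Column '{name}' has an unsupported type: {type_name}"
        ) from e


try:
    convert_field("ts", "timestamp[ns]")
except UnsupportedColumnTypeError as exc:
    cause = exc.__cause__
    # Render the full chained traceback as the user would see it.
    rendered = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
```

The rendered text contains both messages, with the TypeError first. A caller that only reads str(exc) would see just the column message, which is the trade-off raised in this thread.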

@kevinjqliu kevinjqliu requested a review from Fokko January 12, 2025 19:30
@@ -14,6 +14,9 @@
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
from typing import Any

import pyarrow as pa
Contributor

Oof, for those who use PyIceberg with just s3fs, this import will be problematic. We should move this into pyarrow.py.

Author

Okay, I will move this into pyarrow.py. :)
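
For context on why a top-level import is a problem: a module that imports an optional dependency unconditionally fails to import when that dependency is not installed. The fix agreed on here is to move the code into pyarrow.py. The sketch below shows a different, commonly used technique for illustration only, where the import is guarded so it is seen only by type checkers. The class name and constructor signature are assumptions, not the PR's actual code.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by static type checkers, never at runtime,
    # so this module stays importable without pyarrow installed.
    import pyarrow as pa


class UnsupportedFieldTypeError(Exception):
    """Hypothetical exception that remembers the offending field."""

    def __init__(self, field: "pa.Field", *args: object) -> None:
        # The string annotation is not evaluated at runtime.
        self.field = field
        super().__init__(*args)
```

Catching code can then read the field attribute without the exceptions module ever importing pyarrow at runtime.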

Development

Successfully merging this pull request may close these issues.

Better error messages when creating a table with unsupported types
3 participants