Modified exception objects being thrown when converting Pyarrow tables #1498

DevChrisCross · 2025-01-08T07:01:28Z

This PR changes the exception raised when converting PyArrow types to Iceberg types. The exception now names the field involved, when applicable, and exposes that field on the exception object so it can be inspected when the exception is caught.

closes #860

kevinjqliu

Thanks for the PR!

As I mentioned in the comments, I'm not sure primitive is where we want to emit the error message, given that it has no information about the field itself.

The visitor has a field function, which might be a better place to do this:

iceberg-python/pyiceberg/io/pyarrow.py

Lines 1118 to 1122 in 3b58011

    def field(self, field: pa.Field, field_result: IcebergType) -> NestedField:
        field_id = self._field_id(field)
        field_doc = doc_str.decode() if (field.metadata and (doc_str := field.metadata.get(PYARROW_FIELD_DOC_KEY))) else None
        field_type = field_result
        return NestedField(field_id, field.name, field_type, required=not field.nullable, doc=field_doc)

For example, here's the difference:

# a primitive
pyarrow_type = pa.time64("ns")

# a field
pyarrow_field = pa.field("foo", pa.string(), nullable=True, metadata={"PARQUET:field_id": "1", "doc": "foo doc"})

The field carries the field name; the primitive does not.

Let me know what you think!

pyiceberg/io/pyarrow.py

DevChrisCross · 2025-01-08T21:02:27Z

@kevinjqliu Ah yes, I noticed that as well. I initially placed it in primitive because, as I understand it, the visitor traverses the schema until it reaches the primitive, so it made sense to me to put the error message in the same place where the actual data type check is done :)

I've used the existing mechanism in visit_pyarrow for StructTypes: it records the field name before visiting the field's data type and removes it afterwards. That made it feasible for me to work with, and it gives some guarantee that the code is working with "fields" at that point.

iceberg-python/pyiceberg/io/pyarrow.py

Lines 964 to 968 in 3b58011

    for field in obj:
        visitor.before_field(field)
        result = visit_pyarrow(field.type, visitor)
        results.append(visitor.field(field, result))
        visitor.after_field(field)

So, from my point of view, it is the right place to do it, given the code above.

I'm not sure whether I'm missing something about how the field method is used by the PyArrowSchemaVisitor classes if I were to place it there. It doesn't quite make sense to me at the moment, so I hope you can point me in the right direction :) Thank you!
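The before_field / after_field bookkeeping described above can be sketched roughly as follows. This is a minimal illustration, not the PyIceberg implementation: Primitive, Field, Struct, SUPPORTED, PathTrackingVisitor, and visit are hypothetical stand-ins for the PyArrow types and for visit_pyarrow, and the try/finally is an addition of this sketch.

```python
from dataclasses import dataclass
from typing import List, Union


# Hypothetical stand-ins for pa.DataType, pa.Field and pa.StructType.
@dataclass
class Primitive:
    name: str


@dataclass
class Field:
    name: str
    type: "Union[Primitive, Struct]"


@dataclass
class Struct:
    fields: List[Field]


# Assumed set of convertible primitive types, for illustration only.
SUPPORTED = {"string", "int64"}


class PathTrackingVisitor:
    """Keeps a stack of field names so primitive() knows which column it is in."""

    def __init__(self) -> None:
        self.field_names: List[str] = []

    def before_field(self, field: Field) -> None:
        self.field_names.append(field.name)

    def after_field(self, field: Field) -> None:
        self.field_names.pop()

    def primitive(self, primitive: Primitive) -> str:
        if primitive.name not in SUPPORTED:
            path = ".".join(self.field_names) or "<no field>"
            raise TypeError(f"Column '{path}' has an unsupported type: {primitive.name}")
        return primitive.name


def visit(obj, visitor):
    """Walk a Struct depth-first, pushing and popping field names around each child."""
    if isinstance(obj, Struct):
        results = []
        for field in obj.fields:
            visitor.before_field(field)
            try:
                results.append((field.name, visit(field.type, visitor)))
            finally:
                # Pop even when the child raises, so the stack stays balanced.
                visitor.after_field(field)
        return results
    return visitor.primitive(obj)
```

With this approach the primitive handler only learns the field name indirectly, through visitor state, which is the trade-off under discussion in this thread.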

kevinjqliu · 2025-01-09T17:55:21Z

I think generally we want this kind of error message when calling the pyarrow_to_schema function, which is used in the create table path and the write path.

Column '{field_name}' has an unsupported type: {field_type}

Let's take a look at pyarrow_to_schema.
It takes schema: pa.Schema as input, which is a PyArrow Schema. A Schema holds a list of Fields, and each Field has a name and a type.

It uses visit_pyarrow, which dispatches on the type of the Arrow object.
https://github.com/search?q=repo%3Aapache%2Ficeberg-python%20visit_pyarrow.register&type=code
Dispatch starts at visit_pyarrow.register(pa.Schema) and continues down until it hits the primitives (visit_pyarrow.register(pa.DataType)).
Before reaching the primitives, it should dispatch on pa.Field. I think this is where we want to emit the error, since the field has the name, type, optionality, etc.

Currently there's no visit_pyarrow registration for pa.Field, so we'd need to add one.
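The dispatch chain described here (schema, then field, then primitive) can be sketched with functools.singledispatch. This is a rough illustration under stated assumptions: DataType, Field, Schema, SUPPORTED, UnsupportedTypeError, and visit are hypothetical stand-ins, not PyArrow or PyIceberg names, and the real visit_pyarrow also takes a visitor argument that is omitted here.

```python
from dataclasses import dataclass
from functools import singledispatch
from typing import List


# Hypothetical stand-ins for pa.DataType, pa.Field and pa.Schema.
@dataclass
class DataType:
    name: str


@dataclass
class Field:
    name: str
    type: DataType


@dataclass
class Schema:
    fields: List[Field]


# Assumed set of convertible types, for illustration only.
SUPPORTED = {"string", "int64"}


class UnsupportedTypeError(Exception):
    """Hypothetical exception that carries the offending field."""

    def __init__(self, field: Field, message: str) -> None:
        self.field = field
        super().__init__(message)


@singledispatch
def visit(obj):
    raise NotImplementedError(f"Cannot visit: {obj!r}")


@visit.register(Schema)
def visit_schema(obj: Schema):
    return [visit(field) for field in obj.fields]


@visit.register(Field)
def visit_field(obj: Field):
    # The field level knows the column name, so the error is enriched here.
    try:
        return (obj.name, visit(obj.type))
    except TypeError as e:
        raise UnsupportedTypeError(
            obj, f"Column '{obj.name}' has an unsupported type: {obj.type.name}"
        ) from e


@visit.register(DataType)
def visit_primitive(obj: DataType):
    # The primitive level only sees the type, not the column it belongs to.
    if obj.name not in SUPPORTED:
        raise TypeError(f"Unsupported type: {obj.name}")
    return obj.name
```

Catching at the field level keeps the primitive handler unaware of column names while still producing the "Column '{field_name}' has an unsupported type" message.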

DevChrisCross · 2025-01-09T20:24:30Z

@kevinjqliu Thank you for the insight! I was hesitant at first to add a visit_pyarrow registration for pa.Field. I've committed the necessary changes :)

pyiceberg/io/pyarrow.py

pyiceberg/exceptions.py

tests/io/test_pyarrow_visitor.py

Signed-off-by: Christian Molina <[email protected]>

kevinjqliu

LGTM! Thanks for adding this

kevinjqliu · 2025-01-12T19:21:56Z

tests/integration/test_add_files.py

+    exception_cause = exc_info.value.__cause__
+    assert isinstance(exception_cause, TypeError)
+    assert (
+        "Iceberg does not yet support 'ns' timestamp precision. Use 'downcast-ns-timestamp-to-us-on-write' configuration property to automatically downcast 'ns' to 'us' on write."


kevinjqliu

Is this part of the error message? If so, can we add it to the pytest.raises to be explicit?

DevChrisCross

It is no longer part of the error message, and pytest.raises can no longer match on it now that the UnsupportedPyArrowTypeException is in place. :)

kevinjqliu

I think for this error, we actually want to propagate it upwards. We want to let users know about the downcast-ns-timestamp-to-us-on-write configuration.

Maybe we can try re-raising the underlying error to preserve its message:

    try:
        result = visit_pyarrow(field_type, visitor)
    except TypeError as e:
        # Raise a custom exception while preserving the original error message and traceback
        raise UnsupportedPyArrowTypeException(
            obj, f"Column '{obj.name}' has an unsupported type: {field_type}"
        ) from e

DevChrisCross

The original exception is still preserved and propagated, and it shows up in the traceback once the error is raised, because the new exception is raised from e. I also want to make it clear to the user that the exception really originated in primitive() rather than in visit_pyarrow(). The traceback will read something like this:

    <... TypeError traceback indicating 'downcast-ns-timestamp-to-us-on-write' configuration ...>

    The above exception was the direct cause of the following exception:

    <... UnsupportedPyArrowTypeException ...>

I think that's sufficient for the user to know what's going on. Let me know if you think otherwise. :)
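The chaining behavior discussed in this thread can be checked with a small standalone sketch. UnsupportedTypeError, convert_primitive, convert_field, and describe are hypothetical names, and the inner message is shortened for illustration; only the `raise ... from e` mechanics are the point.

```python
import traceback


class UnsupportedTypeError(Exception):
    """Hypothetical stand-in for the field-level exception."""


def convert_primitive(type_name: str) -> str:
    # Illustrative message only; it mentions the configuration property by name.
    raise TypeError(
        f"{type_name} is not supported. "
        "Use 'downcast-ns-timestamp-to-us-on-write' to downcast on write."
    )


def convert_field(name: str, type_name: str) -> str:
    try:
        return convert_primitive(type_name)
    except TypeError as e:
        # `from e` stores the original error on __cause__ and links the tracebacks.
        raise UnsupportedTypeError(
            f"Column '{name}' has an unsupported type: {type_name}"
        ) from e


def describe(exc: BaseException) -> str:
    """Render the full chained traceback, as the interpreter would print it."""
    return "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
```

Note that the configuration hint lives only on `__cause__` and in the rendered traceback; `str()` of the outer exception does not contain it. Since `pytest.raises(..., match=...)` matches against the string of the raised exception, a test has to inspect `exc_info.value.__cause__` to assert on the hint, as the integration test in this PR does.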

tests/io/test_pyarrow_visitor.py

Fokko · 2025-01-13T13:54:20Z

pyiceberg/exceptions.py

@@ -14,6 +14,9 @@
 #  KIND, either express or implied.  See the License for the
 #  specific language governing permissions and limitations
 #  under the License.
+from typing import Any
+
+import pyarrow as pa


Oof, for those who use PyIceberg with just s3fs, this import will be problematic. We should move this into pyarrow.py.

DevChrisCross

Okay, I will move this into pyarrow.py. :)
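For context on why a top-level `import pyarrow` in a shared module is a problem: importing that module fails for anyone who has not installed the optional dependency. The fix agreed on here is to move the exception into pyarrow.py. A different, commonly used technique, shown below only as a sketch and not what this PR does, is to import the optional package under typing.TYPE_CHECKING so it is needed only by type checkers. UnsupportedFieldError is a hypothetical name.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated by type checkers only; never executed at runtime,
    # so this module imports fine without pyarrow installed.
    import pyarrow as pa


class UnsupportedFieldError(Exception):
    """Hypothetical exception that keeps a reference to the offending field."""

    def __init__(self, field: "pa.Field", *args: object) -> None:
        # The annotation is a string, so `pa` is not looked up at runtime.
        self.field = field
        super().__init__(*args)
```

Moving the class next to the code that already requires pyarrow, as decided in this review, avoids the question entirely and keeps exceptions.py free of optional dependencies.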

kevinjqliu reviewed Jan 8, 2025

kevinjqliu reviewed Jan 11, 2025

DevChrisCross added 2 commits January 11, 2025 14:46

Modified exception objects being thrown when converting Pyarrow tables

f1b8f50

Signed-off-by: Christian Molina <[email protected]>

Added visit_pyarrow dispatch for pyarrow field

cce35c7

Signed-off-by: Christian Molina <[email protected]>

DevChrisCross force-pushed the feat/improved-pyarrow-error-message branch from b405ec9 to cce35c7 Compare January 11, 2025 06:49

Removed unnecessary codes and modified testing

217e142

Signed-off-by: Christian Molina <[email protected]>

kevinjqliu reviewed Jan 11, 2025

Fixed integration test

e7736ce

Signed-off-by: Christian Molina <[email protected]>

kevinjqliu approved these changes Jan 12, 2025

kevinjqliu requested a review from Fokko January 12, 2025 19:30

Fokko reviewed Jan 13, 2025
