-
Notifications
You must be signed in to change notification settings - Fork 198
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Modified exception objects being thrown when converting Pyarrow tables #1498
base: main
Are you sure you want to change the base?
Modified exception objects being thrown when converting Pyarrow tables #1498
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR!
As I mentioned in the comments, I'm not sure if primitive
is where we want to print out the error message given that it does not have information about the field itself.
The visitor has a field
function which might be a better place to do this
iceberg-python/pyiceberg/io/pyarrow.py
Lines 1118 to 1122 in 3b58011
def field(self, field: pa.Field, field_result: IcebergType) -> NestedField: | |
field_id = self._field_id(field) | |
field_doc = doc_str.decode() if (field.metadata and (doc_str := field.metadata.get(PYARROW_FIELD_DOC_KEY))) else None | |
field_type = field_result | |
return NestedField(field_id, field.name, field_type, required=not field.nullable, doc=field_doc) |
For example, here's the difference
# a primitive
pyarrow_type = pa.time64("ns")
# a field
pa.field("foo", pa.string(), nullable=True, metadata={"PARQUET:field_id": "1", "doc": "foo doc"}),
Field has information about the field name.
Let me know what you think!
@kevinjqliu Ah yes I've noticed that part as well, I've initially placed on the I've just utilized the existing functionality here in iceberg-python/pyiceberg/io/pyarrow.py Lines 964 to 968 in 3b58011
So I think, in my POV, it is the right place to do it, given the code above. I'm not sure if I'm missing out regarding on how the |
I think generally we want this kind of error message when calling the
Let's take a look at It uses the Currently theres no |
@kevinjqliu thank you for the insight! I kind of somehow hesitated also at first in adding the |
Signed-off-by: Christian Molina <[email protected]>
Signed-off-by: Christian Molina <[email protected]>
b405ec9
to
cce35c7
Compare
Signed-off-by: Christian Molina <[email protected]>
Signed-off-by: Christian Molina <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for adding this
exception_cause = exc_info.value.__cause__ | ||
assert isinstance(exception_cause, TypeError) | ||
assert ( | ||
"Iceberg does not yet support 'ns' timestamp precision. Use 'downcast-ns-timestamp-to-us-on-write' configuration property to automatically downcast 'ns' to 'us' on write." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this part of the error message? if so, can we add it to the pytest.raises
to be explicit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is no longer part of the error message and pytest.raises
won't be able to catch the value any further since the UnsupportedPyArrowTypeException is now in place. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for this error, we actually want to propagate it upwards. We want to let the users know there is the downcast-ns-timestamp-to-us-on-write
configuration.
Maybe we can try reraising the underlying error to preserve its message
try:
result = visit_pyarrow(field_type, visitor)
except TypeError as e:
# Raise a custom exception while preserving the original error message and traceback
raise UnsupportedPyArrowTypeException(
obj,
f"Column '{obj.name}' has an unsupported type: {field_type}"
) from e
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The exception is still preserved and propagated still and can also be seen in the traceback once the error is raised, given the exception is raised from e
. Perhaps, I also want to emphasize to the user that the exception really originated from primitive()
instead of visit_pyarrow()
which will state something like this:
<... TypeError traceback indicating 'downcast-ns-timestamp-to-us-on-write' configuration ...>
The above exception was the direct cause of the following exception:
<... UnsupportedPyArrowTypeException ...>
I think it's still sufficient enough for the user to know what's going on, let me know if you still think otherwise. :)
@@ -14,6 +14,9 @@ | |||
# KIND, either express or implied. See the License for the | |||
# specific language governing permissions and limitations | |||
# under the License. | |||
from typing import Any | |||
|
|||
import pyarrow as pa |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oof, for those that use PyIceberg with just s3fs
, this import will be problematic. We should move this into pyarrow.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay I will move this into the pyarrow.py
. :)
This modifies the exception being thrown when converting PyArrow types to Iceberg types, it also now declares the field being involved, if applicable, and can also be accessed when catching the exception.
closes #860