[BUG] ArrowTypeError: "Could not convert" Error in inspect._files method #1477

Open
1 of 3 tasks
xsfa opened this issue Dec 28, 2024 · 3 comments
Comments


xsfa commented Dec 28, 2024

Apache Iceberg version

0.8.1 (latest release)

Please describe the bug 🐞

I think PyArrow is receiving malformed data from the file metadata, which prevents me from calling any of the file-inspection functions. Could this be caused by my Iceberg table format, or is it a genuine bug? I have confirmed that my table is a valid, readable Iceberg V2 table.

Code:

test_table = catalog.load_table("test.table")
current_snapshot_id = test_table.metadata.current_snapshot_id
test_table.inspect.files(current_snapshot_id)  # raises ArrowTypeError (full trace below)

Full Stack Trace:

---------------------------------------------------------------------------
ArrowTypeError                            Traceback (most recent call last)
Input In [32], in <cell line: 17>()
     14 current_snapshot_id = test_table.metadata.current_snapshot_id
     15 print(current_snapshot_id)
---> 17 test_table.inspect.files(current_snapshot_id)

File ~/opt/miniconda3/lib/python3.9/site-packages/pyiceberg/table/inspect.py:582, in InspectTable.files(self, snapshot_id)
    581 def files(self, snapshot_id: Optional[int] = None) -> "pa.Table":
--> 582     return self._files(snapshot_id)

File ~/opt/miniconda3/lib/python3.9/site-packages/pyiceberg/table/inspect.py:576, in InspectTable._files(self, snapshot_id, data_file_filter)
    541         readable_metrics = {
    542             schema.find_column_name(field.field_id): {
    543                 "column_size": column_sizes.get(field.field_id),
    (...)
    554             for field in self.tbl.metadata.schema().fields
    555         }
    556         files.append({
    557             "content": data_file.content,
    558             "file_path": data_file.file_path,
    (...)
    573             "readable_metrics": readable_metrics,
    574         })
--> 576 return pa.Table.from_pylist(
    577     files,
    578     schema=files_schema,
    579 )

File ~/opt/miniconda3/lib/python3.9/site-packages/pyarrow/table.pxi:3700, in pyarrow.lib.Table.from_pylist()

File ~/opt/miniconda3/lib/python3.9/site-packages/pyarrow/table.pxi:5228, in pyarrow.lib._from_pylist()

File ~/opt/miniconda3/lib/python3.9/site-packages/pyarrow/table.pxi:3575, in pyarrow.lib.Table.from_arrays()

File ~/opt/miniconda3/lib/python3.9/site-packages/pyarrow/table.pxi:1398, in pyarrow.lib._sanitize_arrays()

File ~/opt/miniconda3/lib/python3.9/site-packages/pyarrow/array.pxi:350, in pyarrow.lib.asarray()

File ~/opt/miniconda3/lib/python3.9/site-packages/pyarrow/array.pxi:320, in pyarrow.lib.array()

File ~/opt/miniconda3/lib/python3.9/site-packages/pyarrow/array.pxi:39, in pyarrow.lib._sequence_to_array()

File ~/opt/miniconda3/lib/python3.9/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

File ~/opt/miniconda3/lib/python3.9/site-packages/pyarrow/error.pxi:123, in pyarrow.lib.check_status()

ArrowTypeError: Could not convert {1: 145, 2: 545, 3: 132, 4: 91, 5: 92, 6: 80, 7: 42, 8: 118, 9: 146, 10: 108, 11: 188, 12: 112, 13: 169, 14: 42, 15: 166, 16: 1248, 17: 57, 18: 38, 19: 81, 20: 120, 21: 42, 22: 129, 23: 90, 24: 38, 25: 38, 26: 80, 27: 544, 28: 112, 29: 79, 30: 131, 31: 71, 32: 70, 33: 70} with type dict: was not a sequence or recognized null for conversion to list type
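
For what it's worth, the error class is reproducible outside pyiceberg whenever a Python dict is handed to a pyarrow field declared as a list type (a minimal sketch, not necessarily the schema pyiceberg builds):

import pyarrow as pa

# A dict converts cleanly when the target field is a pyarrow map type...
pa.Table.from_pylist(
    [{"column_sizes": {1: 145, 2: 545}}],
    schema=pa.schema([pa.field("column_sizes", pa.map_(pa.int32(), pa.int64()))]),
)

# ...but the same dict against a list-typed field reproduces the error above:
# ArrowTypeError: Could not convert {1: 145, 2: 545} with type dict: was not
# a sequence or recognized null for conversion to list type
pa.Table.from_pylist(
    [{"column_sizes": {1: 145, 2: 545}}],
    schema=pa.schema([pa.field("column_sizes", pa.list_(pa.int64()))]),
)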

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
xsfa changed the title from ArrowTypeError: "Could not convert" Error in inspect._files method to [BUG] ArrowTypeError: "Could not convert" Error in inspect._files method on Dec 28, 2024
kevinjqliu (Contributor) commented:

Thanks for reporting this @xsfa. It looks like the issue happens when the underlying data is transformed into an Arrow table:

        return pa.Table.from_pylist(
            files,
            schema=files_schema,
        )

Could you provide more information so we can debug this?
For example, what is the content of files (the list passed to pa.Table.from_pylist)?
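
One way to dump that without patching inspect.py is to walk the snapshot's manifests directly (a sketch against the pyiceberg 0.8 API; catalog and table names taken from your report):

# Sketch: print the raw per-file metadata that _files() feeds to pyarrow.
test_table = catalog.load_table("test.table")
snapshot = test_table.snapshot_by_id(test_table.metadata.current_snapshot_id)
for manifest in snapshot.manifests(test_table.io):
    for entry in manifest.fetch_manifest_entry(test_table.io):
        print(entry.data_file.file_path)
        print(entry.data_file.column_sizes)  # keys should be integer field IDs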


xsfa commented Jan 2, 2025

{
  "content": DataFileContent.DATA,
  "file_path": "s3a://dataplatform/silver/iceberg/spark/dbname/tablename/data/00001-3933-e97b5082-3b9e-4c4e-b965-f290205bcf3a-0-00001.parquet",
  "file_format": "PARQUET",
  "spec_id": 0,
  "record_count": 16718742,
  "file_size_in_bytes": 474139920,
  "column_sizes": {
    "1": 44833933,
    "2": 39592909,
    "3": 26025570,
    "4": 21604711,
    "5": 27511454,
    "6": 930995,
    "7": 5173236,
    "8": 4051761,
    "9": 4944629,
    "10": 24729094
  },
  "value_counts": {
    "1": 16718742,
    "2": 16718742,
    "3": 16718742,
    "4": 16718742,
    "5": 16718742,
    "6": 16718742,
    "7": 16718742,
    "8": 16718742,
    "9": 16718742,
    "10": 16718742
  },
  "null_value_counts": {
    "1": 0,
    "2": 0,
    "3": 0,
    "4": 3910423,
    "5": 7637,
    "6": 0,
    "7": 0,
    "8": 10289423,
    "9": 0,
    "10": 0
  },
  "split_offsets": [4, 138429859, 276834527, 415238280],
  "sort_order_id": 0,
  . . . 
}

Here's an example entry from the files array with the column-level information removed. While it doesn't match the original stack trace I posted, I believe the core issue is with the parsing of the column_sizes sub-dictionary.
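
The string keys in the JSON above are just an artifact of serializing the dict for this comment; in memory the keys are ints, matching the stack trace. A quick way to rule out key-type issues (a sketch; the map type here is my assumption for a column_sizes field, not pyiceberg's confirmed schema):

import pyarrow as pa

map_type = pa.map_(pa.int32(), pa.int64())
pa.array([{1: 44833933}], type=map_type)    # int keys: converts fine
pa.array([{"1": 44833933}], type=map_type)  # str keys: fails, pyarrow won't coerce str to int32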

kevinjqliu (Contributor) commented:

"I believe the core issue is with the parsing of the column_sizes sub-dictionary"

I don't see anything out of the ordinary. Is there a particular reason you think it's due to column_sizes?

It would be helpful to print out readable_metrics_struct and readable_metrics, as well as files_schema and files.

Can you also try test_table.inspect.entries()? It uses the same pyarrow schema logic.
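
For instance (a sketch, using the same test_table as above):

# entries() shares the pyarrow schema construction with files(), so it's a
# cheap way to check whether the same conversion fails there too.
entries = test_table.inspect.entries()
print(entries.schema)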
