[BUG] ArrowTypeError: "Could not convert" Error in inspect._files method #1477

Open
1 of 3 tasks
xsfa opened this issue Dec 28, 2024 · 3 comments
Comments


xsfa commented Dec 28, 2024

Apache Iceberg version

0.8.1 (latest release)

Please describe the bug 🐞

I think PyArrow is receiving malformed data from the file metadata, which prevents me from calling any of the file-inspection functions. Could this be caused by my Iceberg table format, or is it a genuine bug? I have confirmed that my table is a valid, readable Iceberg V2 table.

Code:

test_table = catalog.load_table("test.table")
current_snapshot_id = test_table.metadata.current_snapshot_id
test_table.inspect.files(current_snapshot_id)  # raises ArrowTypeError (full trace below)

Full Stack Trace:

---------------------------------------------------------------------------
ArrowTypeError                            Traceback (most recent call last)
Input In [32], in <cell line: 17>()
     14 current_snapshot_id = test_table.metadata.current_snapshot_id
     15 print(current_snapshot_id)
---> 17 test_table.inspect.files(current_snapshot_id)

File ~/opt/miniconda3/lib/python3.9/site-packages/pyiceberg/table/inspect.py:582, in InspectTable.files(self, snapshot_id)
    581 def files(self, snapshot_id: Optional[int] = None) -> "pa.Table":
--> 582     return self._files(snapshot_id)

File ~/opt/miniconda3/lib/python3.9/site-packages/pyiceberg/table/inspect.py:576, in InspectTable._files(self, snapshot_id, data_file_filter)
    541         readable_metrics = {
    542             schema.find_column_name(field.field_id): {
    543                 "column_size": column_sizes.get(field.field_id),
    (...)
    554             for field in self.tbl.metadata.schema().fields
    555         }
    556         files.append({
    557             "content": data_file.content,
    558             "file_path": data_file.file_path,
    (...)
    573             "readable_metrics": readable_metrics,
    574         })
--> 576 return pa.Table.from_pylist(
    577     files,
    578     schema=files_schema,
    579 )

File ~/opt/miniconda3/lib/python3.9/site-packages/pyarrow/table.pxi:3700, in pyarrow.lib.Table.from_pylist()

File ~/opt/miniconda3/lib/python3.9/site-packages/pyarrow/table.pxi:5228, in pyarrow.lib._from_pylist()

File ~/opt/miniconda3/lib/python3.9/site-packages/pyarrow/table.pxi:3575, in pyarrow.lib.Table.from_arrays()

File ~/opt/miniconda3/lib/python3.9/site-packages/pyarrow/table.pxi:1398, in pyarrow.lib._sanitize_arrays()

File ~/opt/miniconda3/lib/python3.9/site-packages/pyarrow/array.pxi:350, in pyarrow.lib.asarray()

File ~/opt/miniconda3/lib/python3.9/site-packages/pyarrow/array.pxi:320, in pyarrow.lib.array()

File ~/opt/miniconda3/lib/python3.9/site-packages/pyarrow/array.pxi:39, in pyarrow.lib._sequence_to_array()

File ~/opt/miniconda3/lib/python3.9/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

File ~/opt/miniconda3/lib/python3.9/site-packages/pyarrow/error.pxi:123, in pyarrow.lib.check_status()

ArrowTypeError: Could not convert {1: 145, 2: 545, 3: 132, 4: 91, 5: 92, 6: 80, 7: 42, 8: 118, 9: 146, 10: 108, 11: 188, 12: 112, 13: 169, 14: 42, 15: 166, 16: 1248, 17: 57, 18: 38, 19: 81, 20: 120, 21: 42, 22: 129, 23: 90, 24: 38, 25: 38, 26: 80, 27: 544, 28: 112, 29: 79, 30: 131, 31: 71, 32: 70, 33: 70} with type dict: was not a sequence or recognized null for conversion to list type
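
For what it's worth, the error class is reproducible outside pyiceberg whenever a Python dict is handed to a pyarrow field declared as a list type (a minimal sketch, not necessarily the schema pyiceberg builds):

import pyarrow as pa

# A dict converts cleanly when the target field is a pyarrow map type...
pa.Table.from_pylist(
    [{"column_sizes": {1: 145, 2: 545}}],
    schema=pa.schema([pa.field("column_sizes", pa.map_(pa.int32(), pa.int64()))]),
)

# ...but the same dict against a list-typed field reproduces the error above:
# ArrowTypeError: Could not convert {1: 145, 2: 545} with type dict: was not
# a sequence or recognized null for conversion to list type
pa.Table.from_pylist(
    [{"column_sizes": {1: 145, 2: 545}}],
    schema=pa.schema([pa.field("column_sizes", pa.list_(pa.int64()))]),
)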

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
xsfa changed the title from ArrowTypeError: "Could not convert" Error in inspect._files method to [BUG] ArrowTypeError: "Could not convert" Error in inspect._files method on Dec 28, 2024
kevinjqliu (Contributor) commented:

Thanks for reporting this @xsfa. It looks like the issue happens when the underlying data is transformed into an Arrow table:

        return pa.Table.from_pylist(
            files,
            schema=files_schema,
        )

Could you provide more information so we can debug this?
For example, what is the content of files (the list passed to pa.Table.from_pylist)?
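
One way to dump that without patching inspect.py is to walk the snapshot's manifests directly (a sketch against the pyiceberg 0.8 API; catalog and table names taken from your report):

# Sketch: print the raw per-file metadata that _files() feeds to pyarrow.
test_table = catalog.load_table("test.table")
snapshot = test_table.snapshot_by_id(test_table.metadata.current_snapshot_id)
for manifest in snapshot.manifests(test_table.io):
    for entry in manifest.fetch_manifest_entry(test_table.io):
        print(entry.data_file.file_path)
        print(entry.data_file.column_sizes)  # keys should be integer field IDs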


xsfa commented Jan 2, 2025

{
  "content": DataFileContent.DATA,
  "file_path": "s3a://dataplatform/silver/iceberg/spark/dbname/tablename/data/00001-3933-e97b5082-3b9e-4c4e-b965-f290205bcf3a-0-00001.parquet",
  "file_format": "PARQUET",
  "spec_id": 0,
  "record_count": 16718742,
  "file_size_in_bytes": 474139920,
  "column_sizes": {
    "1": 44833933,
    "2": 39592909,
    "3": 26025570,
    "4": 21604711,
    "5": 27511454,
    "6": 930995,
    "7": 5173236,
    "8": 4051761,
    "9": 4944629,
    "10": 24729094
  },
  "value_counts": {
    "1": 16718742,
    "2": 16718742,
    "3": 16718742,
    "4": 16718742,
    "5": 16718742,
    "6": 16718742,
    "7": 16718742,
    "8": 16718742,
    "9": 16718742,
    "10": 16718742
  },
  "null_value_counts": {
    "1": 0,
    "2": 0,
    "3": 0,
    "4": 3910423,
    "5": 7637,
    "6": 0,
    "7": 0,
    "8": 10289423,
    "9": 0,
    "10": 0
  },
  "split_offsets": [4, 138429859, 276834527, 415238280],
  "sort_order_id": 0,
  . . . 
}

Here's an example entry from the files array with the column-level information removed. While it doesn't match the original stack trace I posted, I believe the core issue is with the parsing of the column_sizes sub-dictionary.
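
The string keys in the JSON above are just an artifact of serializing the dict for this comment; in memory the keys are ints, matching the stack trace. A quick way to rule out key-type issues (a sketch; the map type here is my assumption for a column_sizes field, not pyiceberg's confirmed schema):

import pyarrow as pa

map_type = pa.map_(pa.int32(), pa.int64())
pa.array([{1: 44833933}], type=map_type)    # int keys: converts fine
pa.array([{"1": 44833933}], type=map_type)  # str keys: fails, pyarrow won't coerce str to int32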

kevinjqliu (Contributor) commented:

"I believe the core issue is with the parsing of the column_sizes sub-dictionary"

I don't see anything out of the ordinary. Is there a particular reason you think it's due to column_sizes?

It would be helpful to print out readable_metrics_struct and readable_metrics, as well as files_schema and files.

Can you also try test_table.inspect.entries()? It uses the same pyarrow schema logic.
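
For instance (a sketch, using the same test_table as above):

# entries() shares the pyarrow schema construction with files(), so it's a
# cheap way to check whether the same conversion fails there too.
entries = test_table.inspect.entries()
print(entries.schema)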
