
PanicException after simple filter operation with LazyFrame on big dataset #20894

MaximilianHess opened this issue Jan 24, 2025 · 0 comments
MaximilianHess commented Jan 24, 2025

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
from polars import col as c

path_to_data = "<path_to_data>/hs_code_version=*/*"
lf = pl.scan_parquet(path_to_data, hive_partitioning=True).drop("quantity")
df = pl.read_parquet(path_to_data, hive_partitioning=True).drop("quantity")


region = [40, 100]

# This works as expected
df = df.filter(c("importer").is_in(region))
print(df)

# This throws an exception:
# "PanicException: The column lengths in the DataFrame are not equal."
lf = lf.filter(c("importer").is_in(region)).collect()
print(lf)

I couldn't generate random data that still reproduces the bug, so please find the data I used here (8 % of the original dataset size):

https://drive.proton.me/urls/XQWVS448J4#1yibTP9fy9ny

You have to keep the folder structure for hive_partitioning to work.
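
For context, here is a minimal sketch of the hive layout the glob above expects (one folder per hs_code_version value). The column names are taken from the query above; the values are made up and, as noted, synthetic data like this does not reproduce the panic.

import os

import polars as pl

# Assumed layout: <base>/hs_code_version=<value>/<file>.parquet
base = "data"
for version in [2012, 2017]:
    part_dir = os.path.join(base, f"hs_code_version={version}")
    os.makedirs(part_dir, exist_ok=True)
    # Made-up columns mirroring the query above ("importer", "quantity")
    pl.DataFrame(
        {"importer": [40, 100, 250], "quantity": [1.0, 2.0, 3.0]}
    ).write_parquet(os.path.join(part_dir, "part-0.parquet"))

# With hive_partitioning=True, hs_code_version is exposed as a column
# derived from the folder names.
print(pl.scan_parquet(f"{base}/hs_code_version=*/*", hive_partitioning=True).collect())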

Log output

thread '<unnamed>' panicked at crates/polars-core/src/fmt.rs:582:13:
The column lengths in the DataFrame are not equal.
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: polars_core::fmt::<impl core::fmt::Display for polars_core::frame::DataFrame>::fmt
   3: core::fmt::write
   4: alloc::fmt::format::format_inner
   5: polars_python::dataframe::general::<impl polars_python::dataframe::PyDataFrame>::__pymethod_as_str__
   6: pyo3::impl_::trampoline::trampoline
   7: polars_python::dataframe::general::_::__INVENTORY::trampoline
   8: _PyEval_EvalFrameDefault
             at /home/conda/feedstock_root/build_artifacts/python-split_1733407224341/work/build-static/Python/bytecodes.c:3159:19
   9: _PyEval_EvalFrame
             at /usr/local/src/conda/python-3.12.8/Include/internal/pycore_ceval.h:89:16
  10: _PyEval_Vector
             at /usr/local/src/conda/python-3.12.8/Python/ceval.c:1683:12
  11: _PyFunction_Vectorcall
             at /usr/local/src/conda/python-3.12.8/Objects/call.c:419:16
  12: _PyObject_VectorcallTstate
             at /usr/local/src/conda/python-3.12.8/Include/internal/pycore_call.h:92:11
  13: vectorcall_unbound
             at /usr/local/src/conda/python-3.12.8/Objects/typeobject.c:2236
  14: vectorcall_method
             at /usr/local/src/conda/python-3.12.8/Objects/typeobject.c:2267
  15: slot_tp_str
             at /usr/local/src/conda/python-3.12.8/Objects/typeobject.c:8726:0
  16: PyObject_Str
             at /usr/local/src/conda/python-3.12.8/Objects/object.c:630:12
  17: PyFile_WriteObject
             at /usr/local/src/conda/python-3.12.8/Objects/fileobject.c:124:17
  18: builtin_print_impl
             at /usr/local/src/conda/python-3.12.8/Python/bltinmodule.c:2057:15
  19: builtin_print
             at /usr/local/src/conda/python-3.12.8/Python/clinic/bltinmodule.c.h:1121:20
  20: cfunction_vectorcall_FASTCALL_KEYWORDS
             at /usr/local/src/conda/python-3.12.8/Objects/methodobject.c:438:24
  21: _PyObject_VectorcallTstate
             at /usr/local/src/conda/python-3.12.8/Include/internal/pycore_call.h:92:11
  22: PyObject_Vectorcall
             at /usr/local/src/conda/python-3.12.8/Objects/call.c:325:12
  23: _PyEval_EvalFrameDefault
             at /home/conda/feedstock_root/build_artifacts/python-split_1733407224341/work/build-static/Python/bytecodes.c:2715:19
  24: PyEval_EvalCode
             at /usr/local/src/conda/python-3.12.8/Python/ceval.c:578:21
  25: run_eval_code_obj
             at /usr/local/src/conda/python-3.12.8/Python/pythonrun.c:1722:0
  26: run_mod
             at /usr/local/src/conda/python-3.12.8/Python/pythonrun.c:1743:0
  27: pyrun_file
             at /usr/local/src/conda/python-3.12.8/Python/pythonrun.c:1643:0
  28: _PyRun_SimpleFileObject
             at /usr/local/src/conda/python-3.12.8/Python/pythonrun.c:433:0
  29: _PyRun_AnyFileObject
             at /usr/local/src/conda/python-3.12.8/Python/pythonrun.c:78:0
  30: pymain_run_file_obj
             at /usr/local/src/conda/python-3.12.8/Modules/main.c:360:0
  31: pymain_run_file
             at /usr/local/src/conda/python-3.12.8/Modules/main.c:379
  32: pymain_run_python
             at /usr/local/src/conda/python-3.12.8/Modules/main.c:633
  33: Py_RunMain
             at /usr/local/src/conda/python-3.12.8/Modules/main.c:713
  34: Py_BytesMain
             at /usr/local/src/conda/python-3.12.8/Modules/main.c:767:12
  35: __libc_start_main
  36: <unknown>
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Traceback (most recent call last):
  File "/home/hess/development/projects/ascii/research-space/src/research/maximilian/200_supply_chain_indicators/220_scan_worldwide/reproduce.py", line 20, in <module>
    print(lf)
  File "/home/hess/development/projects/ascii/research-space/.pixi/envs/scan/lib/python3.12/site-packages/polars/dataframe/frame.py", line 1201, in __str__
    return self._df.as_str()
           ^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: The column lengths in the DataFrame are not equal.

Issue description

I have a big dataset in which I filter for a list of importers before further processing. This works in eager mode but throws an exception in lazy mode: "PanicException: The column lengths in the DataFrame are not equal." The issue does not persist if I downsample the data further.
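
If it helps with triage, the optimized plan of the failing lazy query can be inspected like this (a sketch reusing path_to_data and region from the example above):

import polars as pl
from polars import col as c

# Print the optimized logical plan for the lazy filter query.
lf = pl.scan_parquet(path_to_data, hive_partitioning=True).drop("quantity")
print(lf.filter(c("importer").is_in(region)).explain())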

Expected behavior

Eager mode and lazy mode should return the exact same results and not fail.
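
A self-contained way to check that equivalence once the panic is fixed (a sketch reusing path_to_data and region from the example above):

from polars.testing import assert_frame_equal

# Eager result
eager = (
    pl.read_parquet(path_to_data, hive_partitioning=True)
    .drop("quantity")
    .filter(c("importer").is_in(region))
)
# Lazy result, which currently panics when the collected frame is printed
lazy = (
    pl.scan_parquet(path_to_data, hive_partitioning=True)
    .drop("quantity")
    .filter(c("importer").is_in(region))
    .collect()
)
# Row order from a parallel scan may differ, so it is not checked here.
assert_frame_equal(eager, lazy, check_row_order=False)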

Installed versions

--------Version info---------
Polars:              1.20.0
Index type:          UInt32
Platform:            Linux-5.10.0-32-amd64-x86_64-with-glibc2.31
Python:              3.12.8 | packaged by conda-forge | (main, Dec  5 2024, 14:24:40) [GCC 13.3.0]
LTS CPU:             False

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  <not installed>
altair               <not installed>
azure.identity       <not installed>
boto3                <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           <not installed>
nest_asyncio         1.6.0
numpy                2.2.2
openpyxl             <not installed>
pandas               <not installed>
pyarrow              <not installed>
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
MaximilianHess added the labels bug, needs triage, and python on Jan 24, 2025