
PanicException after simple filter operation with LazyFrame on big dataset #20894

MaximilianHess opened this issue Jan 24, 2025 · 0 comments
MaximilianHess commented Jan 24, 2025

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
from polars import col as c

path_to_data = "<path_to_data>/hs_code_version=*/*"
lf = pl.scan_parquet(path_to_data, hive_partitioning=True).drop("quantity")
df = pl.read_parquet(path_to_data, hive_partitioning=True).drop("quantity")


region = [40, 100]

# This works as expected
df = df.filter(c("importer").is_in(region))
print(df)

# This throws an exception:
# "PanicException: The column lengths in the DataFrame are not equal."
lf = lf.filter(c("importer").is_in(region)).collect()
print(lf)

I couldn't generate random data that still reproduces the bug, so please find the data I used here (8 % of the original dataset size):

https://drive.proton.me/urls/XQWVS448J4#1yibTP9fy9ny

You have to keep the folder structure for hive_partitioning to work.
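
For context, here is a minimal sketch of the hive layout the glob above expects (one folder per hs_code_version value). The column names are taken from the query above; the values are made up and, as noted, synthetic data like this does not reproduce the panic.

import os

import polars as pl

# Assumed layout: <base>/hs_code_version=<value>/<file>.parquet
base = "data"
for version in [2012, 2017]:
    part_dir = os.path.join(base, f"hs_code_version={version}")
    os.makedirs(part_dir, exist_ok=True)
    # Made-up columns mirroring the query above ("importer", "quantity")
    pl.DataFrame(
        {"importer": [40, 100, 250], "quantity": [1.0, 2.0, 3.0]}
    ).write_parquet(os.path.join(part_dir, "part-0.parquet"))

# With hive_partitioning=True, hs_code_version is exposed as a column
# derived from the folder names.
print(pl.scan_parquet(f"{base}/hs_code_version=*/*", hive_partitioning=True).collect())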

Log output

thread '<unnamed>' panicked at crates/polars-core/src/fmt.rs:582:13:
The column lengths in the DataFrame are not equal.
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: polars_core::fmt::<impl core::fmt::Display for polars_core::frame::DataFrame>::fmt
   3: core::fmt::write
   4: alloc::fmt::format::format_inner
   5: polars_python::dataframe::general::<impl polars_python::dataframe::PyDataFrame>::__pymethod_as_str__
   6: pyo3::impl_::trampoline::trampoline
   7: polars_python::dataframe::general::_::__INVENTORY::trampoline
   8: _PyEval_EvalFrameDefault
             at /home/conda/feedstock_root/build_artifacts/python-split_1733407224341/work/build-static/Python/bytecodes.c:3159:19
   9: _PyEval_EvalFrame
             at /usr/local/src/conda/python-3.12.8/Include/internal/pycore_ceval.h:89:16
  10: _PyEval_Vector
             at /usr/local/src/conda/python-3.12.8/Python/ceval.c:1683:12
  11: _PyFunction_Vectorcall
             at /usr/local/src/conda/python-3.12.8/Objects/call.c:419:16
  12: _PyObject_VectorcallTstate
             at /usr/local/src/conda/python-3.12.8/Include/internal/pycore_call.h:92:11
  13: vectorcall_unbound
             at /usr/local/src/conda/python-3.12.8/Objects/typeobject.c:2236
  14: vectorcall_method
             at /usr/local/src/conda/python-3.12.8/Objects/typeobject.c:2267
  15: slot_tp_str
             at /usr/local/src/conda/python-3.12.8/Objects/typeobject.c:8726:0
  16: PyObject_Str
             at /usr/local/src/conda/python-3.12.8/Objects/object.c:630:12
  17: PyFile_WriteObject
             at /usr/local/src/conda/python-3.12.8/Objects/fileobject.c:124:17
  18: builtin_print_impl
             at /usr/local/src/conda/python-3.12.8/Python/bltinmodule.c:2057:15
  19: builtin_print
             at /usr/local/src/conda/python-3.12.8/Python/clinic/bltinmodule.c.h:1121:20
  20: cfunction_vectorcall_FASTCALL_KEYWORDS
             at /usr/local/src/conda/python-3.12.8/Objects/methodobject.c:438:24
  21: _PyObject_VectorcallTstate
             at /usr/local/src/conda/python-3.12.8/Include/internal/pycore_call.h:92:11
  22: PyObject_Vectorcall
             at /usr/local/src/conda/python-3.12.8/Objects/call.c:325:12
  23: _PyEval_EvalFrameDefault
             at /home/conda/feedstock_root/build_artifacts/python-split_1733407224341/work/build-static/Python/bytecodes.c:2715:19
  24: PyEval_EvalCode
             at /usr/local/src/conda/python-3.12.8/Python/ceval.c:578:21
  25: run_eval_code_obj
             at /usr/local/src/conda/python-3.12.8/Python/pythonrun.c:1722:0
  26: run_mod
             at /usr/local/src/conda/python-3.12.8/Python/pythonrun.c:1743:0
  27: pyrun_file
             at /usr/local/src/conda/python-3.12.8/Python/pythonrun.c:1643:0
  28: _PyRun_SimpleFileObject
             at /usr/local/src/conda/python-3.12.8/Python/pythonrun.c:433:0
  29: _PyRun_AnyFileObject
             at /usr/local/src/conda/python-3.12.8/Python/pythonrun.c:78:0
  30: pymain_run_file_obj
             at /usr/local/src/conda/python-3.12.8/Modules/main.c:360:0
  31: pymain_run_file
             at /usr/local/src/conda/python-3.12.8/Modules/main.c:379
  32: pymain_run_python
             at /usr/local/src/conda/python-3.12.8/Modules/main.c:633
  33: Py_RunMain
             at /usr/local/src/conda/python-3.12.8/Modules/main.c:713
  34: Py_BytesMain
             at /usr/local/src/conda/python-3.12.8/Modules/main.c:767:12
  35: __libc_start_main
  36: <unknown>
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Traceback (most recent call last):
  File "/home/hess/development/projects/ascii/research-space/src/research/maximilian/200_supply_chain_indicators/220_scan_worldwide/reproduce.py", line 20, in <module>
    print(lf)
  File "/home/hess/development/projects/ascii/research-space/.pixi/envs/scan/lib/python3.12/site-packages/polars/dataframe/frame.py", line 1201, in __str__
    return self._df.as_str()
           ^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: The column lengths in the DataFrame are not equal.

Issue description

I have a big dataset in which I filter for a list of importers before further processing. This works in eager mode but throws an exception in lazy mode: "PanicException: The column lengths in the DataFrame are not equal." The issue does not persist if I downsample the data further.
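
If it helps with triage, the optimized plan of the failing lazy query can be inspected like this (a sketch reusing path_to_data and region from the example above):

import polars as pl
from polars import col as c

# Print the optimized logical plan for the lazy filter query.
lf = pl.scan_parquet(path_to_data, hive_partitioning=True).drop("quantity")
print(lf.filter(c("importer").is_in(region)).explain())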

Expected behavior

Eager mode and lazy mode should return the exact same results and not fail.
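
A self-contained way to check that equivalence once the panic is fixed (a sketch reusing path_to_data and region from the example above):

from polars.testing import assert_frame_equal

# Eager result
eager = (
    pl.read_parquet(path_to_data, hive_partitioning=True)
    .drop("quantity")
    .filter(c("importer").is_in(region))
)
# Lazy result, which currently panics when the collected frame is printed
lazy = (
    pl.scan_parquet(path_to_data, hive_partitioning=True)
    .drop("quantity")
    .filter(c("importer").is_in(region))
    .collect()
)
# Row order from a parallel scan may differ, so it is not checked here.
assert_frame_equal(eager, lazy, check_row_order=False)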

Installed versions

--------Version info---------
Polars:              1.20.0
Index type:          UInt32
Platform:            Linux-5.10.0-32-amd64-x86_64-with-glibc2.31
Python:              3.12.8 | packaged by conda-forge | (main, Dec  5 2024, 14:24:40) [GCC 13.3.0]
LTS CPU:             False

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  <not installed>
altair               <not installed>
azure.identity       <not installed>
boto3                <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           <not installed>
nest_asyncio         1.6.0
numpy                2.2.2
openpyxl             <not installed>
pandas               <not installed>
pyarrow              <not installed>
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
MaximilianHess added the labels bug, needs triage, and python on Jan 24, 2025