
Delta Stats for binary columns are not truncated #1805

Open
emcake opened this issue Nov 4, 2023 · 0 comments
Labels: bug (Something isn't working)

emcake (Contributor) commented Nov 4, 2023

Environment

Delta-rs version: master

Binding: rust

Environment: local test


Bug

What happened:

When writing a file with large binary columns, the Delta log JSON for the commit is very large, because the statistics object embeds the full column values.
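
For illustration, this is the shape of the stats payload embedded in the commit's add action (field names per the Delta protocol; the values shown are hypothetical placeholders):

    // Sketch: the add action's `stats` string carries full min/max values,
    // so a long binary value inflates every commit entry.
    let stats = serde_json::json!({
        "numRecords": 10,
        "minValues": { "long_binary": "<full 640-byte value, untruncated>" },
        "maxValues": { "long_binary": "<full 640-byte value, untruncated>" },
        "nullCount": { "long_binary": 0 }
    });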

What you expected to happen:

These columns are expected to receive truncated statistics, since truncation support was added to arrow-rs (apache/arrow-rs#4389).
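
For reference, column-index truncation is already configurable on the parquet writer; a minimal sketch using properties that exist in the parquet crate (the 64-byte default comes from DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH):

    use parquet::basic::Compression;
    use parquet::file::properties::{WriterProperties, DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH};

    // Column-index min/max values are truncated to this length (Some(64) by
    // default); the column chunk metadata statistics are a separate code path.
    let props = WriterProperties::builder()
        .set_compression(Compression::SNAPPY)
        .set_column_index_truncate_length(DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH)
        .build();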

How to reproduce it:

A test case (to put in stats.rs):

    // Assumes the imports already in scope in stats.rs (RecordBatchWriter,
    // WriterProperties, Compression, Arc, StructArray, ColumnValueStat).
    #[tokio::test]
    async fn test_delta_stats_truncation() -> Result<(), crate::DeltaTableError> {
        let temp_dir = tempfile::tempdir().unwrap();
        let table_path = temp_dir.path().to_owned();

        // Schema: one long string column and one long binary column.
        let schema_fields = vec![
            crate::schema::SchemaField::new(
                "long_string".to_owned(),
                crate::SchemaDataType::primitive("string".to_owned()),
                false,
                Default::default(),
            ),
            crate::schema::SchemaField::new(
                "long_binary".to_owned(),
                crate::SchemaDataType::primitive("binary".to_owned()),
                false,
                Default::default(),
            ),
        ];

        let table = crate::operations::create::CreateBuilder::new()
            .with_table_name("temp")
            .with_location(table_path.to_str().unwrap())
            .with_columns(schema_fields.clone())
            .await?;
        let mut writer = RecordBatchWriter::for_table(&table).unwrap();
        // Properties for the underlying parquet writer.
        writer = writer.with_writer_properties(
            WriterProperties::builder()
                .set_compression(Compression::SNAPPY)
                .set_max_row_group_size(128)
                .build(),
        );

        let fields = arrow::datatypes::Fields::from(
            schema_fields
                .into_iter()
                .map(|f| arrow::datatypes::Field::try_from(&f).unwrap())
                .collect::<Vec<_>>(),
        );

        // Each value is 10x the default truncation length (64 bytes), so a
        // truncated statistic is clearly shorter than the full value.
        let long_field_len =
            10 * parquet::file::properties::DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH.unwrap();

        const ROW_COUNT: usize = 10;

        let mut string_builder = arrow::array::StringBuilder::new();
        let mut binary_builder = arrow::array::BinaryBuilder::new();

        for i in 0..ROW_COUNT {
            // Each row holds the row index repeated long_field_len times.
            let long_string = i.to_string().repeat(long_field_len);
            string_builder.append_value(&long_string);

            let long_binary = vec![i as u8; long_field_len];
            binary_builder.append_value(&long_binary);
        }

        let arrays: Vec<Arc<dyn arrow::array::Array>> = vec![
            Arc::new(string_builder.finish()),
            Arc::new(binary_builder.finish()),
        ];

        let file_contents: arrow::record_batch::RecordBatch =
            StructArray::new(fields, arrays, None).into();

        writer.write(file_contents).await?;

        let mut actions = writer.flush().await?;

        // A single add action is expected from the flush.
        assert_eq!(actions.len(), 1);

        let action = actions.remove(0);

        let stats = action.get_stats()?.expect("stats");

        // The string column's min value is truncated to the default length.
        match stats.min_values.get("long_string").unwrap() {
            ColumnValueStat::Value(serde_json::Value::String(s)) => {
                assert_eq!(
                    s.len(),
                    parquet::file::properties::DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH.unwrap()
                );
            }
            x => panic!("invalid stats format: {x:?}"),
        }

        // The binary column's min value should be truncated to the same
        // length, but currently is not -- this is the failing assertion.
        match stats.min_values.get("long_binary").unwrap() {
            ColumnValueStat::Value(serde_json::Value::String(s)) => {
                assert_eq!(
                    s.len(),
                    parquet::file::properties::DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH.unwrap()
                );
            }
            x => panic!("invalid stats format: {x:?}"),
        }

        Ok(())
    }
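
Running this against master, the long_binary assertion is the one that fails (which is what the issue title describes); the string statistics come back truncated as expected.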

More details:

I think this is because the underlying parquet writer truncates the values it uses for the column index, but not the column chunk metadata statistics. I'm going to open a companion issue against arrow-rs to track that. (EDIT: apache/arrow-rs#5037)
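
A quick way to see where the untruncated values live (a sketch against the parquet crate's footer API as of this writing; the file path is a placeholder):

    use parquet::file::reader::{FileReader, SerializedFileReader};

    // Print the byte lengths of the min/max stored in the column chunk
    // metadata; unlike the column index, these are not truncated.
    fn dump_stat_lengths(path: &str) -> parquet::errors::Result<()> {
        let file = std::fs::File::open(path).unwrap(); // placeholder path
        let reader = SerializedFileReader::new(file)?;
        for rg in reader.metadata().row_groups() {
            for col in rg.columns() {
                if let Some(stats) = col.statistics() {
                    if stats.has_min_max_set() {
                        println!(
                            "{}: min {} bytes, max {} bytes",
                            col.column_path(),
                            stats.min_bytes().len(),
                            stats.max_bytes().len()
                        );
                    }
                }
            }
        }
        Ok(())
    }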
