From 2f7620bde85ae6f542416824159778c92a2a9400 Mon Sep 17 00:00:00 2001 From: AlenkaF Date: Wed, 4 Sep 2024 14:20:13 +0200 Subject: [PATCH 01/18] Add Compression paragraph to the Columnar.rst IPC section --- docs/source/format/Columnar.rst | 30 ++++++++++++++++++++++++++++++ 1 file changed, 30 insertions(+) diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst index c5f822f41643f..c5c41f7ad9f73 100644 --- a/docs/source/format/Columnar.rst +++ b/docs/source/format/Columnar.rst @@ -1385,6 +1385,36 @@ have two entries in each RecordBatch. For a RecordBatch of this schema with buffer 13: col2 data +Compression +----------- + +There are three different options for record batch body +buffers compression: buffers can be uncompressed, can use +``lz4`` or ``zstd`` compression codec. All buffers in the flat +sequence of the message body are compressed separately with the +same codec. + +The difference between compressed and uncompressed buffers in the +serialized form is as follows: + +* If the buffers in the ``RecordBatch`` message are **compressed** + + - the ``data header`` includes the length and memory offset + of each **compressed buffer** in the record batch's body + + - the ``body`` includes a flat sequence of **compressed memory + buffers** together with the **length of uncompressed buffer** + stored in the first 8 bytes for each buffer in the sequence + +* If the buffers in the ``RecordBatch`` message are **uncompressed** + + - the ``data header`` includes the length and memory offset + of each **uncompressed buffer** in the record batch's body + + - the ``body`` includes a flat sequence of **uncompressed memory + buffers** with the first 8 bytes empty or equal to ``-1`` to + indicate that the buffer is uncompressed + Byte Order (`Endianness`_) --------------------------- From bc75ce59c16044f5c08d618b9672c227168ec71c Mon Sep 17 00:00:00 2001 From: Alenka Frim Date: Wed, 4 Sep 2024 18:54:59 +0200 Subject: [PATCH 02/18] Update 
 docs/source/format/Columnar.rst

Co-authored-by: Ian Cook
---
 docs/source/format/Columnar.rst | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst
index c5c41f7ad9f73..2ca0009f018c3 100644
--- a/docs/source/format/Columnar.rst
+++ b/docs/source/format/Columnar.rst
@@ -1388,11 +1388,12 @@ have two entries in each RecordBatch. For a RecordBatch of this schema with
 Compression
 -----------
 
-There are three different options for record batch body
-buffers compression: buffers can be uncompressed, can use
-``lz4`` or ``zstd`` compression codec. All buffers in the flat
-sequence of the message body are compressed separately with the
-same codec.
+There are three different options for compression of record batch
+body buffers: Buffers can be uncompressed, buffers can be
+compressed with the``lz4`` compression codec, or buffers can
+be compressed with the ``zstd`` compression codec. Buffers in
+the flat sequence of a message body must be either all
+uncompressed or all compressed separately using the same codec.
 
 The difference between compressed and uncompressed buffers in the
 serialized form is as follows:

From b4375f5d56a6e3542179d034b11945f2873f528a Mon Sep 17 00:00:00 2001
From: Alenka Frim
Date: Thu, 5 Sep 2024 06:25:34 +0200
Subject: [PATCH 03/18] Update docs/source/format/Columnar.rst

Co-authored-by: Ian Cook
---
 docs/source/format/Columnar.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst
index 2ca0009f018c3..096322e8e8f8c 100644
--- a/docs/source/format/Columnar.rst
+++ b/docs/source/format/Columnar.rst
@@ -1390,7 +1390,7 @@ Compression
 
 There are three different options for compression of record batch
 body buffers: Buffers can be uncompressed, buffers can be
-compressed with the``lz4`` compression codec, or buffers can
+compressed with the ``lz4`` compression codec, or buffers can
 be compressed with the ``zstd`` compression codec. Buffers in
 the flat sequence of a message body must be either all
 uncompressed or all compressed separately using the same codec.
From 01aa4c8e23601ba211c0b856cd57be3dd4179af0 Mon Sep 17 00:00:00 2001
From: AlenkaF
Date: Thu, 5 Sep 2024 11:29:32 +0200
Subject: [PATCH 04/18] Use only buffer instead of memory buffer

---
 docs/source/format/Columnar.rst | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst
index 096322e8e8f8c..8488c64de8f2f 100644
--- a/docs/source/format/Columnar.rst
+++ b/docs/source/format/Columnar.rst
@@ -1403,18 +1403,18 @@ serialized form is as follows:
   - the ``data header`` includes the length and memory offset
     of each **compressed buffer** in the record batch's body
 
-  - the ``body`` includes a flat sequence of **compressed memory
-    buffers** together with the **length of uncompressed buffer**
-    stored in the first 8 bytes for each buffer in the sequence
+  - the ``body`` includes a flat sequence of **compressed buffers**
+    together with the **length of uncompressed buffer** stored in
+    the first 8 bytes for each buffer in the sequence
 
 * If the buffers in the ``RecordBatch`` message are **uncompressed**
 
   - the ``data header`` includes the length and memory offset
     of each **uncompressed buffer** in the record batch's body
 
-  - the ``body`` includes a flat sequence of **uncompressed memory
-    buffers** with the first 8 bytes empty or equal to ``-1`` to
-    indicate that the buffer is uncompressed
+  - the ``body`` includes a flat sequence of **uncompressed buffers**
+    with the first 8 bytes empty or equal to ``-1`` to indicate that
+    the buffer is uncompressed
 
 Byte Order (`Endianness`_)
 ---------------------------

From 83257b302da2a091d9df95f5f3c8c8bf39839df7 Mon Sep 17 00:00:00 2001
From: AlenkaF
Date: Thu, 5 Sep 2024 11:30:49 +0200
Subject: [PATCH 05/18] Add info about endian and signed for the length in the msg body

---
 docs/source/format/Columnar.rst | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst
index 8488c64de8f2f..d7c0c9250256e 100644 --- a/docs/source/format/Columnar.rst +++ b/docs/source/format/Columnar.rst @@ -1404,8 +1404,9 @@ serialized form is as follows: of each **compressed buffer** in the record batch's body - the ``body`` includes a flat sequence of **compressed buffers** - together with the **length of uncompressed buffer** stored in - the first 8 bytes for each buffer in the sequence + together with the **length of uncompressed buffer** as a 64-bit + little-endian signed integer stored in the first 8 bytes for each + buffer in the sequence * If the buffers in the ``RecordBatch`` message are **uncompressed** From d9ad1b2a9f1293dd3e43db47b7fee05054b0a941 Mon Sep 17 00:00:00 2001 From: AlenkaF Date: Thu, 5 Sep 2024 11:42:32 +0200 Subject: [PATCH 06/18] Fix links, add link to RecordBatch msg section and add note about lz4 codec --- docs/source/format/Columnar.rst | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst index d7c0c9250256e..a13147a2982a8 100644 --- a/docs/source/format/Columnar.rst +++ b/docs/source/format/Columnar.rst @@ -1284,6 +1284,8 @@ We additionally provide both schema-level and field-level ``custom_metadata`` attributes allowing for systems to insert their own application defined metadata to customize behavior. +.. _ipc-recordbatch-message: + RecordBatch message ------------------- @@ -1395,10 +1397,17 @@ be compressed with the ``zstd`` compression codec. Buffers in the flat sequence of a message body must be either all uncompressed or all compressed separately using the same codec. +.. note:: + + ``lz4`` compression codec means the + `LZ4 frame format `_ + and should not to be confused with + `"raw" (also called "block") format `_. 
+ The difference between compressed and uncompressed buffers in the serialized form is as follows: -* If the buffers in the ``RecordBatch`` message are **compressed** +* If the buffers in the :ref:`ipc-recordbatch-message` are **compressed** - the ``data header`` includes the length and memory offset of each **compressed buffer** in the record batch's body @@ -1408,7 +1417,7 @@ serialized form is as follows: little-endian signed integer stored in the first 8 bytes for each buffer in the sequence -* If the buffers in the ``RecordBatch`` message are **uncompressed** +* If the buffers in the :ref:`ipc-recordbatch-message` are **uncompressed** - the ``data header`` includes the length and memory offset of each **uncompressed buffer** in the record batch's body From cb50c9dbeacc097a0506064dc7cda8a607dabac0 Mon Sep 17 00:00:00 2001 From: Alenka Frim Date: Mon, 9 Sep 2024 15:00:37 +0200 Subject: [PATCH 07/18] Update docs/source/format/Columnar.rst Co-authored-by: Ian Cook --- docs/source/format/Columnar.rst | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst index a13147a2982a8..67385d3be3c3b 100644 --- a/docs/source/format/Columnar.rst +++ b/docs/source/format/Columnar.rst @@ -1426,6 +1426,19 @@ serialized form is as follows: with the first 8 bytes empty or equal to ``-1`` to indicate that the buffer is uncompressed +.. note:: + + Some Arrow implementations lack support for producing and consuming + IPC data with compressed buffers using one or either of the codecs + listed above. See :doc:`../status` for details. + + Some applications might apply compression in the protocol they use + to store or transport Arrow IPC data. (For example, an HTTP server + might serve gzip-compressed Arrow IPC streams.) Applications that + already use compression in their storage or transport protocols + should avoid using buffer compression. 
Double compression typically + worsens performance and does not substantially improve compression + ratios. Byte Order (`Endianness`_) --------------------------- From 3eb37929caa472c4b20a53f023946c0a2cc62b75 Mon Sep 17 00:00:00 2001 From: Ian Cook Date: Mon, 9 Sep 2024 09:20:03 -0400 Subject: [PATCH 08/18] Add missing newline --- docs/source/format/Columnar.rst | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst index 67385d3be3c3b..2a728e65108f6 100644 --- a/docs/source/format/Columnar.rst +++ b/docs/source/format/Columnar.rst @@ -1439,6 +1439,7 @@ serialized form is as follows: should avoid using buffer compression. Double compression typically worsens performance and does not substantially improve compression ratios. + Byte Order (`Endianness`_) --------------------------- From 6df2797c6c223ef62406ff1bc3f82dc2e6ed1a2f Mon Sep 17 00:00:00 2001 From: Ian Cook Date: Mon, 9 Sep 2024 09:24:35 -0400 Subject: [PATCH 09/18] Add missing newline --- docs/source/format/Columnar.rst | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst index 2a728e65108f6..ae31a84955b5d 100644 --- a/docs/source/format/Columnar.rst +++ b/docs/source/format/Columnar.rst @@ -1439,7 +1439,8 @@ serialized form is as follows: should avoid using buffer compression. Double compression typically worsens performance and does not substantially improve compression ratios. 
- + + Byte Order (`Endianness`_) --------------------------- From 4e06abb6fce2a93fa6c1fa84bfe153ea4e06b98d Mon Sep 17 00:00:00 2001 From: Alenka Frim Date: Tue, 10 Sep 2024 06:31:28 +0200 Subject: [PATCH 10/18] Apply suggestions from code review Co-authored-by: Sutou Kouhei --- docs/source/format/Columnar.rst | 31 +++++++++++++++---------------- 1 file changed, 15 insertions(+), 16 deletions(-) diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst index ae31a84955b5d..cdfbd4701d4ec 100644 --- a/docs/source/format/Columnar.rst +++ b/docs/source/format/Columnar.rst @@ -1399,10 +1399,10 @@ uncompressed or all compressed separately using the same codec. .. note:: - ``lz4`` compression codec means the - `LZ4 frame format `_ - and should not to be confused with - `"raw" (also called "block") format `_. + ``lz4`` compression codec means the + `LZ4 frame format `_ + and should not to be confused with + `"raw" (also called "block") format `_. The difference between compressed and uncompressed buffers in the serialized form is as follows: @@ -1428,18 +1428,17 @@ serialized form is as follows: .. note:: - Some Arrow implementations lack support for producing and consuming - IPC data with compressed buffers using one or either of the codecs - listed above. See :doc:`../status` for details. - - Some applications might apply compression in the protocol they use - to store or transport Arrow IPC data. (For example, an HTTP server - might serve gzip-compressed Arrow IPC streams.) Applications that - already use compression in their storage or transport protocols - should avoid using buffer compression. Double compression typically - worsens performance and does not substantially improve compression - ratios. - + Some Arrow implementations lack support for producing and consuming + IPC data with compressed buffers using one or either of the codecs + listed above. See :doc:`../status` for details. 
+ + Some applications might apply compression in the protocol they use + to store or transport Arrow IPC data. (For example, an HTTP server + might serve gzip-compressed Arrow IPC streams.) Applications that + already use compression in their storage or transport protocols + should avoid using buffer compression. Double compression typically + worsens performance and does not substantially improve compression + ratios. Byte Order (`Endianness`_) --------------------------- From bcb614d0acd207568f2de1e1463b94e1c628f8fc Mon Sep 17 00:00:00 2001 From: Alenka Frim Date: Wed, 11 Sep 2024 11:43:24 +0200 Subject: [PATCH 11/18] Update docs/source/format/Columnar.rst Co-authored-by: Joris Van den Bossche --- docs/source/format/Columnar.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst index cdfbd4701d4ec..45fc79a16e317 100644 --- a/docs/source/format/Columnar.rst +++ b/docs/source/format/Columnar.rst @@ -1413,7 +1413,7 @@ serialized form is as follows: of each **compressed buffer** in the record batch's body - the ``body`` includes a flat sequence of **compressed buffers** - together with the **length of uncompressed buffer** as a 64-bit + together with the **length of the uncompressed buffer** as a 64-bit little-endian signed integer stored in the first 8 bytes for each buffer in the sequence From 6fb80b0cc3679e856b708e83c9d6ea40ddabcab5 Mon Sep 17 00:00:00 2001 From: AlenkaF Date: Wed, 11 Sep 2024 12:50:21 +0200 Subject: [PATCH 12/18] Add info about where compression is stored --- docs/source/format/Columnar.rst | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst index 45fc79a16e317..af7125e005ccb 100644 --- a/docs/source/format/Columnar.rst +++ b/docs/source/format/Columnar.rst @@ -1397,6 +1397,10 @@ be compressed with the ``zstd`` compression codec. 
Buffers in the flat sequence of a message body must be either all uncompressed or all compressed separately using the same codec. +The codec or the compression type used is defined in the ``data header``` +of the :ref:`ipc-recordbatch-message` in the optional ``compression`` +field. + .. note:: ``lz4`` compression codec means the @@ -1410,7 +1414,8 @@ serialized form is as follows: * If the buffers in the :ref:`ipc-recordbatch-message` are **compressed** - the ``data header`` includes the length and memory offset - of each **compressed buffer** in the record batch's body + of each **compressed buffer** in the record batch's body together + with the compression type - the ``body`` includes a flat sequence of **compressed buffers** together with the **length of the uncompressed buffer** as a 64-bit From 99b7db5823c4ae1f66d2cb28d62a255c7e36e515 Mon Sep 17 00:00:00 2001 From: AlenkaF Date: Wed, 11 Sep 2024 12:59:25 +0200 Subject: [PATCH 13/18] Update body info for first 8 bytes --- docs/source/format/Columnar.rst | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst index af7125e005ccb..6968b0541063a 100644 --- a/docs/source/format/Columnar.rst +++ b/docs/source/format/Columnar.rst @@ -1392,10 +1392,12 @@ Compression There are three different options for compression of record batch body buffers: Buffers can be uncompressed, buffers can be -compressed with the ``lz4`` compression codec, or buffers can -be compressed with the ``zstd`` compression codec. Buffers in -the flat sequence of a message body must be either all -uncompressed or all compressed separately using the same codec. +compressed with the ``lz4`` compression codec, or buffers can be +compressed with the ``zstd`` compression codec. Buffers in the +flat sequence of a message body must be compressed separately using +the same codec. 
Specific buffer in the sequence of compressed +buffers can be left uncompressed in case compression does not yield +appreciable savings. The codec or the compression type used is defined in the ``data header``` of the :ref:`ipc-recordbatch-message` in the optional ``compression`` @@ -1420,16 +1422,15 @@ serialized form is as follows: - the ``body`` includes a flat sequence of **compressed buffers** together with the **length of the uncompressed buffer** as a 64-bit little-endian signed integer stored in the first 8 bytes for each - buffer in the sequence + buffer in the sequence. The first 8 bytes can be left empty or equal + to ``-1`` to indicate that that specific buffer is left uncompressed. * If the buffers in the :ref:`ipc-recordbatch-message` are **uncompressed** - the ``data header`` includes the length and memory offset of each **uncompressed buffer** in the record batch's body - - the ``body`` includes a flat sequence of **uncompressed buffers** - with the first 8 bytes empty or equal to ``-1`` to indicate that - the buffer is uncompressed + - the ``body`` includes a flat sequence of **uncompressed buffers**. .. note:: From b87a17c0aaacadd615d30697bd3b31c6e7401d33 Mon Sep 17 00:00:00 2001 From: AlenkaF Date: Thu, 12 Sep 2024 08:50:51 +0200 Subject: [PATCH 14/18] Remove reference to empty 8 bytes --- docs/source/format/Columnar.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst index 6968b0541063a..a19288474335c 100644 --- a/docs/source/format/Columnar.rst +++ b/docs/source/format/Columnar.rst @@ -1422,8 +1422,8 @@ serialized form is as follows: - the ``body`` includes a flat sequence of **compressed buffers** together with the **length of the uncompressed buffer** as a 64-bit little-endian signed integer stored in the first 8 bytes for each - buffer in the sequence. 
The first 8 bytes can be left empty or equal - to ``-1`` to indicate that that specific buffer is left uncompressed. + buffer in the sequence. The first 8 bytes can equal ``-1`` to indicate + that that specific buffer is left uncompressed. * If the buffers in the :ref:`ipc-recordbatch-message` are **uncompressed** From fb03ef0610e255705287cf279e148ae100d17d76 Mon Sep 17 00:00:00 2001 From: Alenka Frim Date: Thu, 12 Sep 2024 14:48:59 +0200 Subject: [PATCH 15/18] Apply suggestions from code review Co-authored-by: Joris Van den Bossche --- docs/source/format/Columnar.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst index a19288474335c..ece9e59d13c32 100644 --- a/docs/source/format/Columnar.rst +++ b/docs/source/format/Columnar.rst @@ -1399,7 +1399,7 @@ the same codec. Specific buffer in the sequence of compressed buffers can be left uncompressed in case compression does not yield appreciable savings. -The codec or the compression type used is defined in the ``data header``` +The compression type used is defined in the ``data header``` of the :ref:`ipc-recordbatch-message` in the optional ``compression`` field. @@ -1421,8 +1421,8 @@ serialized form is as follows: - the ``body`` includes a flat sequence of **compressed buffers** together with the **length of the uncompressed buffer** as a 64-bit - little-endian signed integer stored in the first 8 bytes for each - buffer in the sequence. The first 8 bytes can equal ``-1`` to indicate + little-endian signed integer stored in the first 8 bytes of each + buffer in the sequence. This uncompressed length can be set to ``-1`` to indicate that that specific buffer is left uncompressed. 
 
 * If the buffers in the :ref:`ipc-recordbatch-message` are **uncompressed**

From 739f6106f392118eeaab500806c2e65b9cb36cda Mon Sep 17 00:00:00 2001
From: AlenkaF
Date: Thu, 12 Sep 2024 14:50:33 +0200
Subject: [PATCH 16/18] Add compression default

---
 docs/source/format/Columnar.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst
index ece9e59d13c32..fa0e20a05190e 100644
--- a/docs/source/format/Columnar.rst
+++ b/docs/source/format/Columnar.rst
@@ -1401,7 +1401,7 @@ appreciable savings.
 
 The compression type used is defined in the ``data header```
 of the :ref:`ipc-recordbatch-message` in the optional ``compression``
-field.
+field with the default being uncompressed.
 
 .. note::
 

From e684f542d9a73b4880e8fc48e9549cfd753abafe Mon Sep 17 00:00:00 2001
From: Alenka Frim
Date: Fri, 13 Sep 2024 10:01:20 +0200
Subject: [PATCH 17/18] Update docs/source/format/Columnar.rst

Co-authored-by: Ian Cook
---
 docs/source/format/Columnar.rst | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst
index fa0e20a05190e..df0e05608660a 100644
--- a/docs/source/format/Columnar.rst
+++ b/docs/source/format/Columnar.rst
@@ -1395,9 +1395,9 @@ body buffers: Buffers can be uncompressed, buffers can be
 compressed with the ``lz4`` compression codec, or buffers can be
 compressed with the ``zstd`` compression codec. Buffers in the
 flat sequence of a message body must be compressed separately using
-the same codec. Specific buffer in the sequence of compressed
-buffers can be left uncompressed in case compression does not yield
-appreciable savings.
+the same codec. Specific buffers in the sequence of compressed
+buffers may be left uncompressed (for example if compressing those
+specific buffers would not appreciably reduce their size).
 
 The compression type used is defined in the ``data header```
 of the :ref:`ipc-recordbatch-message` in the optional ``compression``

From e4e9629768675a24c07bbfde415fe9182324d90e Mon Sep 17 00:00:00 2001
From: Ian Cook
Date: Tue, 17 Sep 2024 13:52:54 -0400
Subject: [PATCH 18/18] Remove errant backtick

---
 docs/source/format/Columnar.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst
index df0e05608660a..6a2de760cb85a 100644
--- a/docs/source/format/Columnar.rst
+++ b/docs/source/format/Columnar.rst
@@ -1399,7 +1399,7 @@ the same codec. Specific buffers in the sequence of compressed
 buffers may be left uncompressed (for example if compressing those
 specific buffers would not appreciably reduce their size).
 
-The compression type used is defined in the ``data header```
+The compression type used is defined in the ``data header``
 of the :ref:`ipc-recordbatch-message` in the optional ``compression``
 field with the default being uncompressed.
 
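
Reviewer sketch (not part of the patch series): the per-buffer framing that the final wording describes can be illustrated in a few lines of Python. This is a minimal sketch under stated assumptions, not Arrow's implementation. The function names ``encode_body_buffer`` and ``decode_body_buffer`` are hypothetical, the codec is passed in as a plain callable, and ``zlib`` is used only as a stand-in so the example runs with the standard library; real Arrow IPC data would use the LZ4 frame format or zstd. The rule "store uncompressed when compression does not shrink the buffer" is one possible policy, assumed here for illustration.

```python
import struct
import zlib  # stand-in codec for the demo only; Arrow IPC allows lz4 (frame format) or zstd

UNCOMPRESSED = -1  # sentinel from the text: this specific buffer is left uncompressed


def encode_body_buffer(raw, compress):
    """Frame one buffer: 8-byte little-endian signed length prefix, then the payload.

    `compress` is a hypothetical callable (bytes -> bytes) supplied by the caller.
    """
    packed = compress(raw)
    if len(packed) >= len(raw):
        # Assumed policy: compression would not reduce the size, so keep the raw bytes.
        return struct.pack("<q", UNCOMPRESSED) + raw
    # The prefix holds the length of the *uncompressed* buffer.
    return struct.pack("<q", len(raw)) + packed


def decode_body_buffer(buf, decompress):
    """Recover the original bytes of one framed buffer."""
    (length,) = struct.unpack_from("<q", buf, 0)  # 64-bit little-endian signed integer
    payload = bytes(buf[8:])
    if length == UNCOMPRESSED:
        return payload
    raw = decompress(payload)
    if len(raw) != length:
        raise ValueError("uncompressed length does not match the 8-byte prefix")
    return raw


# Usage: a repetitive buffer gets compressed, a tiny one is stored as-is.
framed = encode_body_buffer(b"arrow" * 100, zlib.compress)
restored = decode_body_buffer(framed, zlib.decompress)
```

The length and offset recorded in the ``data header`` would then refer to each framed buffer as it sits in the body, which is why a reader needs the prefix to size the output allocation before decompressing.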