Avro codec enhancements + Avro Writer Schema Generator #6965

Draft
wants to merge 17 commits into main
Conversation

jecsand838

Which issue does this PR close?

Part of #4886

Rationale for this change

The primary objective of this PR is to enhance the arrow-rs Avro implementation by introducing full support for Avro data types and Avro aliases, and by laying the foundations for an Avro writer. These enhancements are important for two reasons:

1. Enhanced Data Interoperability:
By supporting these additional types, the Avro reader becomes more compatible with a wider range of Avro schemas. This ensures that users can ingest and process diverse datasets without encountering type-related limitations.

2. Foundational Support for an Avro Writer:
This PR lays the groundwork for writing Arrow data structures to Avro format by introducing foundational support for an Avro writer. Previously, the focus was primarily on reading and converting Avro data into Arrow's in-memory format. Extending this functionality to include writing ensures bidirectional interoperability between Arrow and Avro.

What changes are included in this PR?

Avro Codec

  1. Support for new types:
    • Decimals
    • Maps
    • Enums
    • UUID (in progress)
  2. Extended namespace support to all supported types
  3. Added alias support (in progress)
  4. Added Arrow Schema -> Avro Schema mapping logic (see the sketch after this list)
  5. Extended support for local timestamps
  6. Added Unit tests
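For illustration, here is a minimal sketch of what an Arrow-to-Avro type mapping can look like for a handful of primitive types. The function names `arrow_to_avro_type` and `arrow_field_to_avro` are hypothetical, and the PR's actual mapping covers far more types (decimals, maps, enums, logical types, namespaces):

```rust
// Hypothetical sketch: map a few Arrow primitive types to Avro JSON schema
// fragments. Not this PR's public API; names are illustrative only.
use arrow_schema::{DataType, Field};
use serde_json::{json, Value};

fn arrow_to_avro_type(dt: &DataType) -> Value {
    match dt {
        DataType::Boolean => json!("boolean"),
        DataType::Int32 => json!("int"),
        DataType::Int64 => json!("long"),
        DataType::Float32 => json!("float"),
        DataType::Float64 => json!("double"),
        DataType::Utf8 => json!("string"),
        DataType::Binary => json!("bytes"),
        other => unimplemented!("not covered by this sketch: {other:?}"),
    }
}

// Nullable Arrow fields become Avro unions with "null" as one branch.
fn arrow_field_to_avro(field: &Field) -> Value {
    let ty = arrow_to_avro_type(field.data_type());
    let ty = if field.is_nullable() { json!(["null", ty]) } else { ty };
    json!({ "name": field.name(), "type": ty })
}
```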

Avro Record Decoder

  1. Support for new types:
    • Lists
    • Fixed
    • Interval
    • Decimal
    • Map
    • Enum
    • UUID (in progress)
  2. Expanded nullability support (see the union-decoding sketch below)
  3. Added Unit tests
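As background for the nullability work: per the Avro binary encoding, a nullable field is a union, and the writer prefixes each value with the zero-based branch index encoded as a zig-zag varint long. A minimal sketch, with hypothetical helper names rather than this PR's actual decoder:

```rust
// Hypothetical sketch of how a decoder can handle nullable (union) values.

/// Read a zig-zag varint-encoded long from the front of `buf`.
fn read_long(buf: &mut &[u8]) -> i64 {
    let mut z: u64 = 0;
    let mut shift = 0;
    loop {
        let byte = buf[0];
        *buf = &buf[1..];
        z |= u64::from(byte & 0x7F) << shift;
        if byte & 0x80 == 0 {
            break;
        }
        shift += 7;
    }
    // Undo zig-zag: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ...
    ((z >> 1) as i64) ^ -((z & 1) as i64)
}

/// Decode a value of a ["null", "long"]-style union: the branch index
/// read first selects the null branch or the value branch.
fn read_nullable_long(buf: &mut &[u8], null_branch: i64) -> Option<i64> {
    if read_long(buf) == null_branch {
        None
    } else {
        Some(read_long(buf))
    }
}
```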

Avro Writer Foundations

  1. Set up the initial foundation for the writer:
    • mod file
    • VLQ Encoder (see the sketch after this list)
  2. Added initial support for Arrow -> Avro JSON Schema builder
  3. Added Unit tests
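For context, Avro's binary encoding writes `int` and `long` values as zig-zag-encoded variable-length quantities. A minimal sketch of such an encoder, with a hypothetical function name and not necessarily this PR's implementation:

```rust
/// Zig-zag + variable-length encoding of an i64, as used by Avro's
/// binary encoding for `int` and `long` values.
fn encode_vlq(value: i64, buf: &mut Vec<u8>) {
    // Zig-zag: map signed values to unsigned so small magnitudes stay short.
    let mut z = ((value << 1) ^ (value >> 63)) as u64;
    // Emit 7 bits at a time, setting the high bit on all but the last byte.
    loop {
        let byte = (z & 0x7F) as u8;
        z >>= 7;
        if z == 0 {
            buf.push(byte);
            break;
        }
        buf.push(byte | 0x80);
    }
}

fn main() {
    let mut buf = Vec::new();
    encode_vlq(1, &mut buf); // zig-zag(1) = 2 -> [0x02]
    encode_vlq(-1, &mut buf); // zig-zag(-1) = 1 -> [0x01]
    encode_vlq(64, &mut buf); // zig-zag(64) = 128 -> [0x80, 0x01]
    assert_eq!(buf, vec![0x02, 0x01, 0x80, 0x01]);
}
```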

Are there any user-facing changes?

No, none of these types are public yet.

jecsand838 and others added 17 commits December 28, 2024 12:44
* Added Avro codec support for namespaces, enums, maps, and decimals
* Implemented reader decoder for Avro lists; cleaned up reader/record.rs and added comments for readability
* Added reader decoder support for Fixed and Interval
* Added Avro codec + decoder support for new types
github-actions bot added the arrow (Changes to the arrow crate) label on Jan 10, 2025
tustvold (Contributor) commented Jan 11, 2025

👋 thank you for working on this, it is very exciting to see arrow-avro getting some love. I think what would probably help is to break this up into smaller pieces that can be delivered separately. Whilst I accept this is more work for you, it makes reviewing the code much more practical, especially given our relatively limited review bandwidth. Many of the bullets in your PR description would warrant a separate PR IMO.

alamb (Contributor) left a comment

Thank you @jecsand838, I triggered CI for this PR.

I haven't had a chance to review the code yet, but I did start to look at the test coverage. What would you think about adding tests using the existing avro testing data in https://github.com/apache/arrow-testing/tree/master/data/avro (already a submodule in this repo)

Key tests in my mind would be:

  1. Read the avro testing files and verify the schema and data read (and leave comments for tests that don't pass)
  2. For the writer, implement round-trip tests: create one or more RecordBatches, write them to an .avro file, then read them back in and ensure the round-tripped batches are equal (see the sketch below)
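To illustrate the round-trip idea, here is a sketch of such a test, assuming hypothetical `write_avro` / `read_avro` helpers in place of whatever API the writer eventually exposes:

```rust
// Sketch of a round-trip test; `write_avro` / `read_avro` are hypothetical
// stand-ins for a future arrow-avro writer/reader API.
use std::sync::Arc;
use arrow_array::{ArrayRef, Int64Array, RecordBatch, StringArray};
use arrow_schema::{DataType, Field, Schema};

fn write_avro(_path: &str, _batch: &RecordBatch) {
    unimplemented!("hypothetical writer entry point")
}

fn read_avro(_path: &str) -> RecordBatch {
    unimplemented!("hypothetical reader entry point")
}

#[test]
fn round_trip_basic_types() {
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("name", DataType::Utf8, true),
    ]));
    let batch = RecordBatch::try_new(
        schema,
        vec![
            Arc::new(Int64Array::from(vec![1, 2, 3])) as ArrayRef,
            Arc::new(StringArray::from(vec![Some("a"), None, Some("c")])),
        ],
    )
    .unwrap();

    write_avro("/tmp/round_trip.avro", &batch);
    let read_back = read_avro("/tmp/round_trip.avro");
    // The round-tripped batch must equal the original.
    assert_eq!(batch, read_back);
}
```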

You might be able to find code / testing that you can reuse in the datafusion copy: https://github.com/apache/datafusion/blob/main/datafusion/core/src/datasource/avro_to_arrow/arrow_array_reader.rs

I also wonder how/if this code is related to the avro rust reader/decoder in https://github.com/apache/avro-rs?

FYI @Jefffrey

alamb (Contributor) left a comment

After some review of this PR, I think I could likely find some additional time to review / help it along if/when it has end-to-end tests of reading existing avro files (that would give me confidence that the code being reviewed did the right thing functionally).

To make a specific proposal for splitting up this PR's functionality as suggested by @tustvold -- #6965 (comment), one way to do so would be:

  1. First PR: Reader improvements / tests showing reading existing .avro files
  2. Second PR: Minimal Writer support (with tests showing round tripping for a few basic data types)
  3. Subsequent PRs: Additional PRs to support writing additional data types

The rationale for breaking it up this way is to first establish confidence in the read path, which the writer's round-trip tests then depend on

Let me know what you think

And thanks again
