Avro codec enhancements + Avro Writer Schema Generator #6965

Draft
wants to merge 17 commits into main
Conversation

jecsand838

Which issue does this PR close?

Part of #4886

Rationale for this change

The primary objective of this PR is to enhance the arrow-rs Avro implementation by introducing full support for Avro data types and Avro aliases, and by laying the foundations for an Avro writer. These enhancements are important for two reasons:

1. Enhanced Data Interoperability:
By supporting these additional types, the Avro reader becomes more compatible with a wider range of Avro schemas. This ensures that users can ingest and process diverse datasets without encountering type-related limitations.

2. Foundational Support for an Avro Writer:
This PR lays the groundwork for writing Arrow data structures to Avro format by introducing foundational support for an Avro writer. Previously, the focus was primarily on reading and converting Avro data into Arrow's in-memory format. Extending this functionality to include writing ensures bidirectional interoperability between Arrow and Avro.

What changes are included in this PR?

Avro Codec

  1. Support for new types:
    • Decimals
    • Maps
    • Enums
    • UUID (in progress)
  2. Extended namespace support to all supported types
  3. Added alias support (in progress)
  4. Added Arrow Schema -> Avro Schema mapping logic (see the sketch after this list)
  5. Extended support for local timestamps
  6. Added Unit tests
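For illustration, here is a minimal sketch of what an Arrow-to-Avro type mapping can look like for a handful of primitive types. The function names `arrow_to_avro_type` and `arrow_field_to_avro` are hypothetical, and the PR's actual mapping covers far more types (decimals, maps, enums, logical types, namespaces):

```rust
// Hypothetical sketch: map a few Arrow primitive types to Avro JSON schema
// fragments. Not this PR's public API; names are illustrative only.
use arrow_schema::{DataType, Field};
use serde_json::{json, Value};

fn arrow_to_avro_type(dt: &DataType) -> Value {
    match dt {
        DataType::Boolean => json!("boolean"),
        DataType::Int32 => json!("int"),
        DataType::Int64 => json!("long"),
        DataType::Float32 => json!("float"),
        DataType::Float64 => json!("double"),
        DataType::Utf8 => json!("string"),
        DataType::Binary => json!("bytes"),
        other => unimplemented!("not covered by this sketch: {other:?}"),
    }
}

// Nullable Arrow fields become Avro unions with "null" as one branch.
fn arrow_field_to_avro(field: &Field) -> Value {
    let ty = arrow_to_avro_type(field.data_type());
    let ty = if field.is_nullable() { json!(["null", ty]) } else { ty };
    json!({ "name": field.name(), "type": ty })
}
```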

Avro Record Decoder

  1. Support for new types:
    • Lists
    • Fixed
    • Interval
    • Decimal
    • Map
    • Enum
    • UUID (in progress)
  2. Expanded nullability support (see the union-decoding sketch below)
  3. Added Unit tests
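As background for the nullability work: per the Avro binary encoding, a nullable field is a union, and the writer prefixes each value with the zero-based branch index encoded as a zig-zag varint long. A minimal sketch, with hypothetical helper names rather than this PR's actual decoder:

```rust
// Hypothetical sketch of how a decoder can handle nullable (union) values.

/// Read a zig-zag varint-encoded long from the front of `buf`.
fn read_long(buf: &mut &[u8]) -> i64 {
    let mut z: u64 = 0;
    let mut shift = 0;
    loop {
        let byte = buf[0];
        *buf = &buf[1..];
        z |= u64::from(byte & 0x7F) << shift;
        if byte & 0x80 == 0 {
            break;
        }
        shift += 7;
    }
    // Undo zig-zag: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ...
    ((z >> 1) as i64) ^ -((z & 1) as i64)
}

/// Decode a value of a ["null", "long"]-style union: the branch index
/// read first selects the null branch or the value branch.
fn read_nullable_long(buf: &mut &[u8], null_branch: i64) -> Option<i64> {
    if read_long(buf) == null_branch {
        None
    } else {
        Some(read_long(buf))
    }
}
```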

Avro Writer Foundations

  1. Set up the initial foundation for the writer:
    • mod file
    • VLQ Encoder (see the sketch after this list)
  2. Added initial support for Arrow -> Avro JSON Schema builder
  3. Added Unit tests
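For context, Avro's binary encoding writes `int` and `long` values as zig-zag-encoded variable-length quantities. A minimal sketch of such an encoder, with a hypothetical function name and not necessarily this PR's implementation:

```rust
/// Zig-zag + variable-length encoding of an i64, as used by Avro's
/// binary encoding for `int` and `long` values.
fn encode_vlq(value: i64, buf: &mut Vec<u8>) {
    // Zig-zag: map signed values to unsigned so small magnitudes stay short.
    let mut z = ((value << 1) ^ (value >> 63)) as u64;
    // Emit 7 bits at a time, setting the high bit on all but the last byte.
    loop {
        let byte = (z & 0x7F) as u8;
        z >>= 7;
        if z == 0 {
            buf.push(byte);
            break;
        }
        buf.push(byte | 0x80);
    }
}

fn main() {
    let mut buf = Vec::new();
    encode_vlq(1, &mut buf); // zig-zag(1) = 2 -> [0x02]
    encode_vlq(-1, &mut buf); // zig-zag(-1) = 1 -> [0x01]
    encode_vlq(64, &mut buf); // zig-zag(64) = 128 -> [0x80, 0x01]
    assert_eq!(buf, vec![0x02, 0x01, 0x80, 0x01]);
}
```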

Are there any user-facing changes?

No, none of these types are public yet.

jecsand838 and others added 17 commits December 28, 2024 12:44
* Added Avro codec support for namespaces, enums, maps, and decimals
* Implemented reader decoder for Avro lists; cleaned up reader/record.rs and added comments for readability
* Added reader decoder support for Fixed and Interval
* Added Avro codec + decoder support for new types
github-actions bot added the arrow (Changes to the arrow crate) label on Jan 10, 2025
tustvold (Contributor) commented Jan 11, 2025

👋 thank you for working on this, it is very exciting to see arrow-avro getting some love. I think what would probably help is to break this up into smaller pieces that can be delivered separately. Whilst I accept this is more work for you, it makes reviewing the code much more practical, especially given our relatively limited review bandwidth. Many of the bullets in your PR description would warrant a separate PR IMO.

alamb (Contributor) left a comment

Thank you @jecsand838, I triggered CI for this PR.

I haven't had a chance to review the code yet, but I did start to look at the test coverage. What would you think about adding tests using the existing avro testing data in https://github.com/apache/arrow-testing/tree/master/data/avro (already a submodule in this repo)

Key tests in my mind would be:

  1. Read the avro testing files and verify the schema and data read (and leave comments for tests that don't pass)
  2. For the writer, implement round-trip tests: create one or more RecordBatches, write them to an .avro file, then read them back in and ensure the round-tripped batches are equal (see the sketch below)
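To illustrate the round-trip idea, here is a sketch of such a test, assuming hypothetical `write_avro` / `read_avro` helpers in place of whatever API the writer eventually exposes:

```rust
// Sketch of a round-trip test; `write_avro` / `read_avro` are hypothetical
// stand-ins for a future arrow-avro writer/reader API.
use std::sync::Arc;
use arrow_array::{ArrayRef, Int64Array, RecordBatch, StringArray};
use arrow_schema::{DataType, Field, Schema};

fn write_avro(_path: &str, _batch: &RecordBatch) {
    unimplemented!("hypothetical writer entry point")
}

fn read_avro(_path: &str) -> RecordBatch {
    unimplemented!("hypothetical reader entry point")
}

#[test]
fn round_trip_basic_types() {
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("name", DataType::Utf8, true),
    ]));
    let batch = RecordBatch::try_new(
        schema,
        vec![
            Arc::new(Int64Array::from(vec![1, 2, 3])) as ArrayRef,
            Arc::new(StringArray::from(vec![Some("a"), None, Some("c")])),
        ],
    )
    .unwrap();

    write_avro("/tmp/round_trip.avro", &batch);
    let read_back = read_avro("/tmp/round_trip.avro");
    // The round-tripped batch must equal the original.
    assert_eq!(batch, read_back);
}
```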

You might be able to find code / testing that you can reuse in the datafusion copy: https://github.com/apache/datafusion/blob/main/datafusion/core/src/datasource/avro_to_arrow/arrow_array_reader.rs

I also wonder how/if this code is related to the avro rust reader/decoder in https://github.com/apache/avro-rs?

FYI @Jefffrey

alamb (Contributor) left a comment

After some review of this PR, I think I could likely find some additional time to review / help it along if/when it has end-to-end tests of reading existing avro files (that would give me confidence that the code being reviewed did the right thing functionally).

To make a specific proposal for splitting up this PR's functionality as suggested by @tustvold -- #6965 (comment), one way to do so would be:

  1. First PR: Reader improvements / tests showing reading existing .avro files
  2. Second PR: Minimal Writer support (with tests showing round tripping for a few basic data types)
  3. Subsequent PRs: Additional PRs to support writing additional data types

The rationale for breaking it up this way is to first establish confidence in the read path, which the writer's round-trip tests then depend on

Let me know what you think

And thanks again
