Improve Arrow-IPC performance by avoiding Unsafe Unchecked IPC Read RecordBatch #3287

Open
tustvold opened this issue Dec 7, 2022 · 3 comments · May be fixed by #6938
Labels
arrow (Changes to the arrow crate), enhancement (Any new improvement worthy of a entry in the changelog), help wanted, performance

Comments

@tustvold
Contributor

tustvold commented Dec 7, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

When transferring flatbuffers between trusted sources I would like a mechanism to elide costly verification of ArrayData contents.

Describe the solution you'd like

I would like a variant of read_record_batch, e.g. read_record_batch_unchecked, that does the same work as read_record_batch but skips validation of the ArrayData.
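To make the checked/unchecked split concrete, here is a minimal sketch of the pattern using only the standard library. `StringColumn`, `ColumnError`, `try_new` and `new_unchecked` are hypothetical toy names for illustration and are not arrow's `ArrayData` API; the point is only that the checked constructor walks every offset and byte, while the unchecked one trusts the caller.

```rust
/// Toy stand-in for a variable-length string column (NOT arrow's `ArrayData`):
/// value `i` is the byte range `offsets[i]..offsets[i + 1]` of `values`.
#[derive(Debug)]
pub struct StringColumn {
    offsets: Vec<usize>,
    values: Vec<u8>,
}

#[derive(Debug, PartialEq)]
pub enum ColumnError {
    EmptyOffsets,
    OffsetsNotMonotonic,
    OffsetOutOfBounds,
    InvalidUtf8,
}

impl StringColumn {
    /// Checked constructor: cost grows with the number of offsets and bytes.
    pub fn try_new(offsets: Vec<usize>, values: Vec<u8>) -> Result<Self, ColumnError> {
        if offsets.is_empty() {
            return Err(ColumnError::EmptyOffsets);
        }
        // Offsets must never decrease.
        if offsets.windows(2).any(|w| w[0] > w[1]) {
            return Err(ColumnError::OffsetsNotMonotonic);
        }
        // The last (largest) offset must stay inside the value buffer.
        if *offsets.last().unwrap() > values.len() {
            return Err(ColumnError::OffsetOutOfBounds);
        }
        // Every individual value must be valid UTF-8.
        for w in offsets.windows(2) {
            if std::str::from_utf8(&values[w[0]..w[1]]).is_err() {
                return Err(ColumnError::InvalidUtf8);
            }
        }
        Ok(Self { offsets, values })
    }

    /// Unchecked constructor: constant time, no pass over the data.
    ///
    /// # Safety
    /// The caller must guarantee every invariant that `try_new` checks.
    pub unsafe fn new_unchecked(offsets: Vec<usize>, values: Vec<u8>) -> Self {
        Self { offsets, values }
    }

    /// Number of values in the column.
    pub fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Returns value `i`; panics if `i` is out of range.
    pub fn value(&self, i: usize) -> &str {
        let bytes = &self.values[self.offsets[i]..self.offsets[i + 1]];
        // SAFETY: both constructors promise that each value is valid UTF-8.
        unsafe { std::str::from_utf8_unchecked(bytes) }
    }
}
```

With trusted input, `unsafe { StringColumn::new_unchecked(offsets, values) }` yields the same column as `try_new` without scanning it; with untrusted input it would be undefined behaviour, which is why the unchecked path is marked `unsafe`.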

Describe alternatives you've considered

Additional context

tustvold added the enhancement label on Dec 7, 2022
tustvold changed the title from "Unsafe Unchecked IPC Reader" to "Unsafe Unchecked IPC Read RecordBatch" on Dec 7, 2022
@totoroyyb

Coming from #6933.

Do you think a separate API, as you suggested, would be good, or would it be better to provide an option on FileReader (or FileDecoder)?

alamb changed the title from "Unsafe Unchecked IPC Read RecordBatch" to "Improve Arrow-IPC performance by avoiding Unsafe Unchecked IPC Read RecordBatch" on Jan 10, 2025
alamb added the arrow and performance labels on Jan 10, 2025
@alamb
Contributor

alamb commented Jan 10, 2025

> Coming from #6933.
>
> Do you think a separate API, as you suggested, would be good, or would it be better to provide an option on FileReader (or FileDecoder)?

I think an option would be good
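For illustration, an option-style API could look roughly like the following std-only sketch. `ToyReader`, `ReadError`, `with_skip_validation` and the length-prefixed record format are all hypothetical and are not the arrow-ipc `FileReader`/`FileDecoder` API. The setter itself is `unsafe`, so the caller has to opt in explicitly, and structural framing is still checked even when content validation is skipped.

```rust
#[derive(Debug, PartialEq)]
pub enum ReadError {
    Truncated,
    InvalidUtf8,
}

/// Toy reader over records encoded as `[len: u8][len payload bytes]`,
/// where each payload is expected to be UTF-8 text.
pub struct ToyReader<'a> {
    buf: &'a [u8],
    skip_validation: bool,
}

impl<'a> ToyReader<'a> {
    /// Validation is on by default.
    pub fn new(buf: &'a [u8]) -> Self {
        Self { buf, skip_validation: false }
    }

    /// Opt out of payload validation.
    ///
    /// # Safety
    /// With `skip = true` the caller promises every payload is valid UTF-8.
    pub unsafe fn with_skip_validation(mut self, skip: bool) -> Self {
        self.skip_validation = skip;
        self
    }

    /// Decodes all records, borrowing the payloads from the input buffer.
    pub fn read_all(&self) -> Result<Vec<&'a str>, ReadError> {
        let mut out = Vec::new();
        let mut rest = self.buf;
        while let Some((&len, tail)) = rest.split_first() {
            let len = len as usize;
            // Framing is always checked, even when validation is skipped.
            if tail.len() < len {
                return Err(ReadError::Truncated);
            }
            let (payload, next) = tail.split_at(len);
            let text = if self.skip_validation {
                // SAFETY: the caller vouched for the data via `with_skip_validation`.
                unsafe { std::str::from_utf8_unchecked(payload) }
            } else {
                std::str::from_utf8(payload).map_err(|_| ReadError::InvalidUtf8)?
            };
            out.push(text);
            rest = next;
        }
        Ok(out)
    }
}
```

A trusted caller would write `unsafe { ToyReader::new(&buf).with_skip_validation(true) }.read_all()`, while everyone else keeps the validating default. One advantage of an option over a separate function is that existing call sites need no changes.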

@alamb
Contributor

alamb commented Jan 10, 2025

Copying @totoroyyb 's high level usecase description from #6933

They report a roughly 76x speedup (290 ms down to 3.8 ms), i.e. close to two orders of magnitude, when disabling data validation:

Describe your question
I am using the high-level API (FileReader and FileDecoder) to read IPC files via mmap. I have noticed that validate_data() in the array-building process adds significant overhead.

I am targeting an ultra-low-latency scenario. With validate_data I measured 290 ms to read a 2.2 GB IPC file (via mmap), versus 3.8 ms without it, which I tested locally by commenting the call out. The 3.8 ms latency is nearly identical to that of the C++ Arrow implementation I tested, and I suspect the C++ codebase does not perform this sanity check (not entirely sure).

The functions for "unchecked" building exist in the codebase, but they are not reachable from the high-level API, so I cannot disable validation without constructing my own arrays and everything built on top of them.

I wonder if there is any better way to achieve that?

Additional context
Low latency is critical in my case, so I am trying to avoid any additional overhead (perhaps with the C++ codebase as the baseline).
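As a back-of-the-envelope check on the numbers quoted above, the small sketch below converts them into a speedup and an effective throughput. The helper names are made up for this example, and it assumes "2.2GB" means 2.2 × 10^9 bytes. The validated path works out to about 7.6 GB/s, which is plausible for a full pass over the data, while the unvalidated path works out to about 579 GB/s, far more than a typical machine can stream from memory. That suggests the unvalidated mmap read never touches most of the payload, so validation would account for nearly all of the 290 ms.

```rust
/// Ratio of the slow measurement to the fast one.
pub fn speedup(slow_ms: f64, fast_ms: f64) -> f64 {
    slow_ms / fast_ms
}

/// Effective throughput in GB/s for `gigabytes` handled in `millis` milliseconds.
pub fn throughput_gb_per_s(gigabytes: f64, millis: f64) -> f64 {
    gigabytes / (millis / 1000.0)
}

/// Prints the figures derived from the measurements reported in the issue.
pub fn print_summary() {
    let (size_gb, checked_ms, unchecked_ms) = (2.2, 290.0, 3.8);
    println!("speedup: {:.1}x", speedup(checked_ms, unchecked_ms));
    println!("validated: {:.1} GB/s", throughput_gb_per_s(size_gb, checked_ms));
    println!("unvalidated: {:.1} GB/s", throughput_gb_per_s(size_gb, unchecked_ms));
}
```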
