Optimize select_bit #52
base: master
Conversation
ARM NEON has a byte-wise popcount instruction, which helps to optimize `select_bit` and `PopCount::count`. Use it for AArch64 (64-bit ARM). 15% speedup for `Rank1`, 4% for `Select0` and 3% for `Select1`. (60% for `PopCount::count` itself.)
This gives a 9% speedup on `select0` and 7% on `select1`. (Tested on Pixel 3 in armeabi-v7a mode.) This is likely because the branches of this unrolled linear search are more predictable than the binary search that was used previously.
Instead of computing `(counts | MASK_80) - ((i + 1) * MASK_01)`, we pre-compute a lookup table

```
PREFIX_SUM_OVERFLOW[i] = (0x80 - (i + 1)) * MASK_01 = (0x7F - i) * MASK_01
```

then use `counts + PREFIX_SUM_OVERFLOW[i]`. This uses a `UInt64[64]`, i.e. 0.5 KiB, lookup table. The trick is from:

Gog, Simon and Matthias Petri. “Optimized succinct data structures for massive data.” Software: Practice and Experience 44 (2014): 1287-1314. https://www.semanticscholar.org/paper/Optimized-succinct-data-structures-for-massive-data-Gog-Petri/c7e7f02f441ebcc0aeffdcad2964185926551ec3

This gives a 2-3% speedup for `BitVector::select0`/`select1`.
FYI: C++20 provides `std::popcount` (in the `<bit>` header), which performs this operation. It compiles down to the dedicated CPU instructions where available. In my testing, separate from this library, it was faster and far more portable.
https://godbolt.org/z/Tj6TvxaeW How did you measure it, and what results are you seeing? This repo tries to remain compatible with C++98, so it can't depend (unconditionally) on C++20. The project seems to be abandoned anyway.
Yes, adding a C++ version check is fine; I wasn't suggesting making it unconditional. This code has CPU-dependent alternate implementations that leverage SSE but don't use the dedicated popcount CPU instructions. From what I've seen, `std::popcount` compiles to the underlying CPU instruction where one exists, which gives the most efficient implementation while staying portable across CPUs, and it doesn't use a table (fewer memory loads). Yes, it's unfortunate that this project seems abandoned. I rather wish a new official, maintained fork came into existence.