Optimize select_bit #52
base: master
Conversation
ARM NEON has a byte-wise popcount instruction, which helps to optimize `select_bit` and `PopCount::count`. Use it for AArch64 (64-bit ARM). 15% speedup for `Rank1`, 4% for `Select0` and 3% for `Select1`. (60% for `PopCount::count` itself.)
This gives a 9% speedup on `select0` and 7% on `select1`. (Tested on Pixel 3 in armeabi-v7a mode.) This is likely because the branches of this unrolled linear search are more predictable than the binary search that was used previously.
Instead of computing `(counts | MASK_80) - ((i + 1) * MASK_01)`, we pre-compute a lookup table

```
PREFIX_SUM_OVERFLOW[i] = (0x80 - (i + 1)) * MASK_01 = (0x7F - i) * MASK_01
```

then use `counts + PREFIX_SUM_OVERFLOW[i]`. This uses a `UInt64[64]`, i.e. 0.5 KiB, lookup table. The trick is from:

Gog, Simon and Matthias Petri. “Optimized succinct data structures for massive data.” Software: Practice and Experience 44 (2014): 1287-1314. https://www.semanticscholar.org/paper/Optimized-succinct-data-structures-for-massive-data-Gog-Petri/c7e7f02f441ebcc0aeffdcad2964185926551ec3

This gives a 2-3% speedup for `BitVector::select0`/`select1`.
FYI: C++20 provides `std::popcount` (in the `<bit>` header), which performs this operation. It compiles down to the dedicated CPU instructions where available. In my testing, separate from this library, it was faster and far more portable.
https://godbolt.org/z/Tj6TvxaeW How did you measure it, and what results are you seeing? This repo tries to remain compatible with C++98, so it can't depend (unconditionally) on C++20. The project seems to be abandoned anyway.
Yes, adding a C++ version check is fine; I wasn't suggesting making it unconditional. This code has CPU-dependent alternate implementations that leverage SSE but don't use the dedicated popcount CPU instructions. From what I've seen, `std::popcount` compiles to the underlying CPU instruction where one exists, which gives the most efficient implementation while staying portable across CPUs, and it doesn't use a table (fewer memory loads). Yes, it's unfortunate that this project seems abandoned. I rather wish a new official, maintained fork came into existence.