Vectorize `find_first_not_of`/`find_last_not_of` member functions (multiple characters overloads) #5206
These are the two remaining functions in the `find_meow_of` family. Together with #5102, this should complete `basic_string` vectorization coverage.

This is a surprisingly non-trivial change. The "not" flavor has no early return from the inner (needle) loop, which severely impacts the paths that have this inner loop.
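To illustrate the missing early return, here is a minimal scalar sketch. It is not the STL code: `first_of_scalar`, `first_not_of_scalar`, and `npos` are hypothetical names, and the real implementation processes whole vectors of haystack elements per step rather than single characters.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical names for illustration only; not the STL's internal helpers.
constexpr std::size_t npos = static_cast<std::size_t>(-1);

// Positive flavor: any single needle match settles the answer immediately.
std::size_t first_of_scalar(std::string_view hay, std::string_view needle) {
    for (std::size_t i = 0; i < hay.size(); ++i) {
        for (char n : needle) {
            if (hay[i] == n) {
                return i; // early return from inside the inner (needle) loop
            }
        }
    }
    return npos;
}

// "not" flavor: a position qualifies only after the whole needle was checked,
// so the per-needle results are accumulated and tested after the inner loop.
std::size_t first_not_of_scalar(std::string_view hay, std::string_view needle) {
    for (std::size_t i = 0; i < hay.size(); ++i) {
        bool matched = false;
        for (char n : needle) {
            matched |= (hay[i] == n); // combine results, no early return
        }
        if (!matched) {
            return i;
        }
    }
    return npos;
}
```

In a vectorized version `matched` would be a mask covering many haystack positions at once, so the inner loop generally has to run over the entire needle before any position can be reported.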
⚙️ Product code changes
Added the implementation of `find_meow_not_of` for 8-bit and 16-bit characters.

32-bit and 64-bit characters are not vectorized. We happen to support them in `find_first_of`, because it exists as a free function callable with integers or pointers, but supporting them in `find_first_not_of` would require severely altering the specific AVX2 algorithm, which doesn't need to be altered otherwise.

The implementation is added to the existing functions via a template parameter, as in #5102. For the bitmap algorithms and the small needle path, it is only a matter of negating results or inverting a bit mask, which is done: see `find_first_not_of`/`find_last_not_of` member functions (single character) #5102 for the vector bitmap.

The fallback nested loop has a separate compile-time branch without early return.
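As a rough scalar illustration of the "template parameter plus negation" idea for a bitmap lookup, consider the sketch below. The names `find_first_bitmap`, `Not`, and `no_match` are assumptions made for this example and do not come from the PR; the actual code works on vector registers.

```cpp
#include <bitset>
#include <cstddef>
#include <string_view>

// Hypothetical sentinel for "nothing found"; illustration only.
constexpr std::size_t no_match = static_cast<std::size_t>(-1);

// Not == false behaves like find_first_of, Not == true like find_first_not_of.
template <bool Not>
std::size_t find_first_bitmap(std::string_view hay, std::string_view needle) {
    // Build a 256-entry membership table for the needle characters.
    std::bitset<256> table;
    for (unsigned char c : needle) {
        table.set(c);
    }

    for (std::size_t i = 0; i < hay.size(); ++i) {
        const bool in_needle = table.test(static_cast<unsigned char>(hay[i]));
        // The only difference between the two flavors is this negation,
        // which is a compile-time constant for each instantiation.
        if (in_needle != Not) {
            return i;
        }
    }
    return no_match;
}
```

Because `Not` is a compile-time constant, each instantiation compiles down to a plain test or its inverse, so the positive flavor should not pay for the added "not" support in a sketch like this.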
For the SSE4.2 large needle branch, in addition to the negation in the intrinsic parameter, the code also needs to switch to an inner loop without early return and to combine the results. The `_Test_whole_needle` lambda now has a different loop depending on the template parameter. It was also changed to return the position and to have `_Step` as an inner lambda, instead of having them both as separate lambdas. The lambda change can potentially affect codegen in the non-not control path, but I don't expect much of an impact, if any at all.

🏁 Benchmark code changes
The fill strategy was altered to:
So the `iota` was dropped. Incremental values are still used to fill the needle, because it is boring to just ~~`memset`~~ `std::fill` it.

💹 Performance expectations
The "not" functions are expected to perform almost the same as their positive counterparts. But sure, we can't have supersymmetry here.

The noticeably distinct thing is the SSE4.2 path, which uses different instructions. It has less control flow, but it uses `PCMPESTRM` instead of `PCMPESTRI`. Their performance is the same overall, but there are small differences on some CPUs: decent Intels tend to like `PCMPESTRI`, decent AMDs tend to show no difference, and older AMDs and power-saving Intels tend to like `PCMPESTRM`. See the comparison on uops.info.

Apparently we're good on the big scale, and fine-tuning cannot be addressed anyway, so I didn't attempt to look for new thresholds for the "not" functions.
⏱️ Benchmark results
i5 1235U