Reduce branching in __brick_shift_left implementation in SYCL backend #2021

Open
mmichel11 opened this issue Jan 24, 2025 · 0 comments

This issue is being filed based on the following review comment: #1976 (comment)

Potential Performance Issue
The current implementation of __brick_shift_left in the SYCL backend performs strided accesses in a loop, with a conditional bounds check at each iteration:

const _DiffType __i = __idx - __n; //loop invariant
for (_DiffType __k = __n; __k < __size; __k += __n)
{
    if (__k + __idx < __size)
        __rng[__k + __i] = ::std::move(__rng[__k + __idx]);
}

The proposed vectorization path in https://github.com/uxlfoundation/oneDPL/pull/1976 follows essentially the same implementation, with the same branching. This likely costs some performance, particularly on GPU architectures, as they lack branch prediction. Instead, we should precompute the number of full iterations outside the loop, run those without a bounds check, and hoist the last iteration out of the loop, where it keeps its bounds check because it may be only partially in range.
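To make the proposal concrete, here is a minimal host-side sketch of the scalar case only. It is not the oneDPL implementation: `brick_shift_left_guarded`, `brick_shift_left_peeled`, `run_all_items`, and `full_end` are hypothetical names, a `std::vector<int>` stands in for the SYCL range, and a plain loop over `idx` stands in for the work-items. It assumes `0 < n < size` and `0 <= idx < n`.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using diff_t = std::ptrdiff_t;

// Reference: the guarded loop quoted in this issue, for one work-item `idx`.
void brick_shift_left_guarded(std::vector<int>& rng, diff_t idx, diff_t n)
{
    const diff_t size = static_cast<diff_t>(rng.size());
    const diff_t i = idx - n; // loop invariant
    for (diff_t k = n; k < size; k += n)
    {
        if (k + idx < size)
            rng[k + i] = std::move(rng[k + idx]);
    }
}

// Sketch of the proposal: unguarded full iterations, then one guarded tail.
// Assumption (not checked by the original code): 0 < n < size, 0 <= idx < n.
void brick_shift_left_peeled(std::vector<int>& rng, diff_t idx, diff_t n)
{
    const diff_t size = static_cast<diff_t>(rng.size());
    assert(0 < n && n < size && 0 <= idx && idx < n);

    const diff_t i = idx - n; // loop invariant
    // Largest multiple of n that is <= size. For every k < full_end,
    // k + idx <= full_end - 1 < size, so no bounds check is needed.
    const diff_t full_end = (size / n) * n;
    for (diff_t k = n; k < full_end; k += n)
        rng[k + i] = std::move(rng[k + idx]);

    // Peeled last iteration: only some work-items are in range here.
    // When size is a multiple of n, this check fails for every idx.
    if (full_end + idx < size)
        rng[full_end + i] = std::move(rng[full_end + idx]);
}

// Host stand-in for the kernel launch: runs the brick for idx = 0 .. n-1.
// Sequential execution is safe because each idx touches only positions
// congruent to idx modulo n.
std::vector<int> run_all_items(std::vector<int> data, diff_t n,
                               void (*brick)(std::vector<int>&, diff_t, diff_t))
{
    for (diff_t idx = 0; idx < n; ++idx)
        brick(data, idx, n);
    return data;
}
```

Calling `run_all_items(v, n, brick_shift_left_peeled)` on a copy of the input should give the same result as the guarded version for any `0 < n < size`. The per-iteration branch is replaced by a single branch per work-item. Whether that is measurably faster on a given device would still need benchmarking, and the vector path would need its own variant.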

This optimization should be a follow-up to the mentioned PR and should adjust both scalar and vector implementations.
