Split batched solver compilation #1629

Merged: MarcelKoch merged 15 commits into develop from split-batched-solver-compilation on Nov 19, 2024

MarcelKoch
Member

This PR splits up the compilation of the batched solvers in order to reduce the compilation times. It splits up the instantiations of the kernel launches depending on the number of vectors in shared memory. This is based on the same CMake mechanism as for the csr and fbcsr kernels.
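
As a rough illustration of the idea (not Ginkgo's actual code), the C++ sketch below makes the number of vectors in shared memory a template parameter, so that each value is its own instantiation. The names `launch_apply` and `dispatch_apply` are hypothetical, and the count of three variants is arbitrary. In a split build, each explicit instantiation would presumably live in its own generated source file so the compiler can process them in parallel; here they share one file only to keep the sketch self-contained.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a kernel launch: the number of vectors kept in
// shared memory is a compile-time parameter, so every value yields a
// separate instantiation.
template <int n_shared>
std::string launch_apply()
{
    return "apply<" + std::to_string(n_shared) + ">";
}

// Explicit instantiations. In a split build, each line would sit in its own
// source file (assumed to be generated by the build system).
template std::string launch_apply<0>();
template std::string launch_apply<1>();
template std::string launch_apply<2>();

// Runtime dispatch from the configured count onto the compile-time variants.
std::string dispatch_apply(int n_shared)
{
    assert(n_shared >= 0);
    switch (n_shared) {
    case 0:
        return launch_apply<0>();
    case 1:
        return launch_apply<1>();
    case 2:
        return launch_apply<2>();
    default:
        throw std::invalid_argument("unsupported number of shared vectors");
    }
}
```

Calling `dispatch_apply(2)` selects the `launch_apply<2>` instantiation; only the dispatcher needs to see all variants, while each variant can be compiled independently.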

@MarcelKoch MarcelKoch self-assigned this Jun 24, 2024
@ginkgo-bot ginkgo-bot added reg:build This is related to the build system. mod:core This is related to the core module. mod:cuda This is related to the CUDA module. type:solver This is related to the solvers type:matrix-format This is related to the Matrix formats mod:hip This is related to the HIP module. labels Jun 24, 2024
@MarcelKoch MarcelKoch force-pushed the split-batched-solver-compilation branch from 259f2c1 to 8c25a83 Compare June 24, 2024 11:24
@upsj
Member

upsj commented Jun 27, 2024

This should have a huge impact. Excerpt from the HIP 5.14 debug build log:

6534.89 hip/CMakeFiles/ginkgo_hip.dir/solver/batch_bicgstab_kernels.hip.cpp.o

#define GKO_BATCH_INSTANTIATE_STOP(macro, ...) \
    macro(__VA_ARGS__, \
          ::gko::batch::solver::device::batch_stop::SimpleAbsResidual); \
    template macro( \
Member Author

The `template` here (and in the other macros below) could be removed if the value/index type instantiation macros accepted a variable number of arguments.

Member Author

That doesn't work until C++20. Before C++20, a macro declared with (arg, ...) must be invoked with at least two arguments, i.e. at least one argument has to match the `...`.
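
A minimal sketch of that rule, using a hypothetical macro `STRINGIFY_REST` that is not part of Ginkgo: it ignores its first argument and stringifies whatever matches the `...`. The calls with two or three arguments are fine in any standard from C++11 on, while the commented-out single-argument call is the case that only becomes well-formed with C++20.

```cpp
#include <cassert>
#include <string>

// Hypothetical macro for illustration only (not a Ginkgo macro): it drops
// its first argument and stringifies everything that follows.
#define STRINGIFY_REST(first, ...) #__VA_ARGS__

// Valid from C++11 on: at least one argument follows `first`.
inline std::string rest_of_two() { return STRINGIFY_REST(a, b); }
inline std::string rest_of_three() { return STRINGIFY_REST(a, b, c); }

// Before C++20 the call below is ill-formed because nothing matches the
// `...`; some compilers accept it as an extension, often with a warning.
// inline std::string rest_of_one() { return STRINGIFY_REST(a); }

// Small self-check of the two portable calls.
inline void check_examples()
{
    assert(rest_of_two() == "b");
    assert(rest_of_three() == "b, c");
}
```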

Member

@pratikvn pratikvn left a comment

In general, the idea looks good, but the pipelines are failing.

One argument against this approach is that readability and maintainability are seriously affected. The already complex batched code is now even more complex and annoying to read. Maybe we should drop the split approach and instead do what Jacobi does: have fewer cases by default, and only have full instantiations when necessary.

cuda/solver/batch_bicgstab_kernels.cuh (outdated, resolved)
@MarcelKoch MarcelKoch force-pushed the split-batched-solver-compilation branch from 8c25a83 to 870ad69 Compare July 5, 2024 12:47
@MarcelKoch
Member Author

IMO the Jacobi instantiation is more complex than what is here. Here, the kernel and its instantiations sit directly together instead of being generated by CMake, which makes it easier for me to follow.
I also merged the two .cpp files per solver; perhaps that simplifies things a bit again.

But I agree that the batch system needs an overhaul in general.

@MarcelKoch MarcelKoch force-pushed the split-batched-solver-compilation branch 4 times, most recently from d04f06c to fa6d091 Compare July 9, 2024 07:42
@MarcelKoch MarcelKoch requested a review from pratikvn July 9, 2024 07:42
@MarcelKoch MarcelKoch force-pushed the split-batched-solver-compilation branch from fa6d091 to e59ab55 Compare July 10, 2024 07:36
@pratikvn
Member

An alternative approach: https://github.com/ginkgo-project/ginkgo/tree/batch-optim

@MarcelKoch
Member Author

> An alternative approach: https://github.com/ginkgo-project/ginkgo/tree/batch-optim

This seems to be quite orthogonal to this PR. With full optimizations enabled, there would be the same issue as before, so the fix from this PR is still needed. I don't see why we should burden people who want the full optimizations enabled with those long compile times when we already have a fix available.
But we could add this to this PR.

@MarcelKoch MarcelKoch marked this pull request as draft August 9, 2024 09:31
@MarcelKoch MarcelKoch added the 1:ST:WIP This PR is a work in progress. Not ready for review. label Aug 15, 2024
@MarcelKoch MarcelKoch added this to the Ginkgo 1.9.0 milestone Aug 26, 2024
@pratikvn
Member

@MarcelKoch, can you please rebase this when you have some time, so we can try to get it merged?

@MarcelKoch MarcelKoch force-pushed the split-batched-solver-compilation branch 2 times, most recently from 48fe94b to 045ad1c Compare September 17, 2024 14:43
@MarcelKoch MarcelKoch marked this pull request as ready for review September 17, 2024 14:45
@MarcelKoch MarcelKoch added 1:ST:ready-for-review This PR is ready for review 1:ST:run-full-test and removed 1:ST:ready-to-merge This PR is ready to merge. labels Nov 12, 2024
Member

@pratikvn pratikvn left a comment

I would wait for the CI to finish before merging this (maybe also the Intel SYCL pipelines), but it looks good to me otherwise.

Comment on lines +37 to +44
    get_num_regs(
        batch_single_kernels::apply_kernel<StopType, 9, true, PrecType,
                                           LogType, BatchMatrixType,
                                           ValueType>),
    get_num_regs(
        batch_single_kernels::apply_kernel<StopType, 0, false, PrecType,
                                           LogType, BatchMatrixType,
                                           ValueType>));
Member

I think the first one has everything in shared memory, and the second one has nothing in shared memory.

Comment on lines +48 to +52
    const int max_threads_regs =
        ((max_regs_blk / static_cast<int>(num_regs_used)) / warp_sz) * warp_sz;
    int max_threads = std::min(max_threads_regs, device_max_threads);
    max_threads = max_threads <= max_bicgstab_threads ? max_threads
                                                      : max_bicgstab_threads;
Member

Just a comment and something for me to do in the future: I think this whole logic needs to be simplified. It seems that with CUDA it is now also possible to set the maximum number of registers per thread, similar to launch_bounds: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#maximum-number-of-registers-per-thread

Of course, that might mean we can no longer unify HIP and CUDA, but it is something we need to investigate.
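
For reference, the thread-count logic in the excerpt above can be sketched as a standalone function. This is only an illustration under stated assumptions: the function name `compute_max_threads` and its parameter names are hypothetical, and the numbers in the usage note are made up rather than taken from a real device.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper mirroring the excerpt: derive the largest thread count
// that fits the per-block register budget, rounded down to a whole number of
// warps, then cap it by the device limit and a solver-specific limit.
int compute_max_threads(int max_regs_per_block, int regs_per_thread,
                        int warp_size, int device_max_threads,
                        int solver_max_threads)
{
    assert(regs_per_thread > 0 && warp_size > 0);
    // Integer division twice: first threads per block, then full warps.
    const int max_threads_regs =
        ((max_regs_per_block / regs_per_thread) / warp_size) * warp_size;
    // The ternary in the excerpt is equivalent to taking one more minimum.
    return std::min({max_threads_regs, device_max_threads, solver_max_threads});
}
```

With a budget of 65536 registers per block, 100 registers per thread, and a warp size of 32, the register limit allows 655 threads, which rounds down to 640 (20 full warps); with caps of 1024 that is also the final result.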

Member

@yhmtsai yhmtsai left a comment

LGTM. It is a bit hard to understand now though.

@MarcelKoch MarcelKoch added 1:ST:ready-to-merge This PR is ready to merge. and removed 1:ST:ready-for-review This PR is ready for review labels Nov 13, 2024
@MarcelKoch MarcelKoch force-pushed the split-batched-solver-compilation branch from 02b4f27 to bdf51dc Compare November 19, 2024 07:44
Quality Gate failed

Failed conditions
27.3% Duplication on New Code (required ≤ 20%)

See analysis details on SonarQube Cloud

@MarcelKoch MarcelKoch merged commit 53bbc1d into develop Nov 19, 2024
12 of 14 checks passed
@MarcelKoch MarcelKoch deleted the split-batched-solver-compilation branch November 19, 2024 20:25
MarcelKoch added a commit to MarcelKoch/ginkgo that referenced this pull request Dec 2, 2024
Related PR: ginkgo-project#1629
Labels: 1:ST:ready-to-merge, 1:ST:run-full-test, mod:core, mod:cuda, mod:hip, reg:build, type:matrix-format, type:solver
Linked issue: Enable batched optimizations and split solver instantiations.
5 participants