Optimize backward propagation kernel (41% end-to-end speedup on the example) #53

interestingLSY · 2024-06-11T03:33:28Z

Motivation

The original backward propagation kernel, renderCUDA, uses atomicAdd to accumulate gradients from render objects to Gaussians. This works well for most Gaussians, since they cover only a few tiles. However, for large Gaussians that span many tiles, the many atomicAdd calls targeting the same addresses contend with each other and degrade performance.

Modification

The new kernel selects the gradient accumulation strategy per Gaussian. We define a hyperparameter, $\alpha$. If a Gaussian touches no more than $\alpha$ tiles, we keep the original approach and accumulate directly with atomicAdd. Otherwise, we first perform a block-level reduction and then add the reduced gradient to the global gradient arrays (dL_dcolors and so on) with atomicAdd, once per block rather than once per thread. For these Gaussians, this cuts the number of atomicAdds to $1/256$ of the original count. Furthermore, the block-level reduction is performed in batches to avoid expensive block-level synchronization operations.
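The selection rule can be illustrated with a minimal CUDA sketch. This is not the kernel from this PR: the names `accumulate_grads`, `block_reduce_sum`, `tiles_touched`, `dL_dparam`, and `run_demo`, the memory layout, and the scalar gradient per Gaussian are all hypothetical, and the block size of 256 is an assumption inferred from the $1/256$ figure. The sketch also synchronizes on every reduction, so it does not reproduce the batching described above.

```cuda
#include <cassert>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

constexpr int BLOCK_SIZE = 256;  // assumed threads per block
constexpr int WARP_SIZE = 32;

// Sums `val` over the whole block; the total is valid in thread 0 only.
__device__ float block_reduce_sum(float val) {
    __shared__ float warp_sums[BLOCK_SIZE / WARP_SIZE];
    const int lane = threadIdx.x % WARP_SIZE;
    const int warp = threadIdx.x / WARP_SIZE;
    // Stage 1: reduce within each warp using shuffles.
    for (int offset = WARP_SIZE / 2; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffffu, val, offset);
    if (lane == 0) warp_sums[warp] = val;
    __syncthreads();
    // Stage 2: warp 0 picks up the per-warp partial sums.
    if (warp == 0)
        val = (lane < BLOCK_SIZE / WARP_SIZE) ? warp_sums[lane] : 0.0f;
    __syncthreads();  // all reads done before warp_sums is reused by the next call
    if (warp == 0)
        for (int offset = WARP_SIZE / 2; offset > 0; offset >>= 1)
            val += __shfl_down_sync(0xffffffffu, val, offset);
    return val;
}

// Each block stands in for one tile. Hypothetical layout:
// contrib[block][gaussian][thread], dL_dparam[gaussian].
__global__ void accumulate_grads(const float* contrib, const int* tiles_touched,
                                 int num_gaussians, int alpha, float* dL_dparam) {
    for (int g = 0; g < num_gaussians; ++g) {
        const size_t idx =
            ((size_t)blockIdx.x * num_gaussians + g) * BLOCK_SIZE + threadIdx.x;
        const float v = contrib[idx];
        // The branch depends only on g, so it is uniform across the block.
        if (tiles_touched[g] <= alpha) {
            atomicAdd(&dL_dparam[g], v);  // small Gaussian: one atomicAdd per thread
        } else {
            const float sum = block_reduce_sum(v);
            if (threadIdx.x == 0)
                atomicAdd(&dL_dparam[g], sum);  // large Gaussian: one per block
        }
    }
}

// Host helper for the sketch: every contribution is 1.0, so each Gaussian
// should end up with num_blocks * BLOCK_SIZE. Error checking is omitted.
std::vector<float> run_demo(int num_blocks, const std::vector<int>& tiles_touched,
                            int alpha) {
    const int n = (int)tiles_touched.size();
    assert(n > 0 && num_blocks > 0);
    const size_t count = (size_t)num_blocks * n * BLOCK_SIZE;
    std::vector<float> h_contrib(count, 1.0f), h_out(n, 0.0f);
    float *d_contrib = nullptr, *d_out = nullptr;
    int* d_tiles = nullptr;
    cudaMalloc(&d_contrib, count * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMalloc(&d_tiles, n * sizeof(int));
    cudaMemcpy(d_contrib, h_contrib.data(), count * sizeof(float),
               cudaMemcpyHostToDevice);
    cudaMemcpy(d_tiles, tiles_touched.data(), n * sizeof(int),
               cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, n * sizeof(float));
    accumulate_grads<<<num_blocks, BLOCK_SIZE>>>(d_contrib, d_tiles, n, alpha, d_out);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_contrib);
    cudaFree(d_out);
    cudaFree(d_tiles);
    return h_out;
}
```

In this sketch, calling `run_demo(4, {1, 100}, 8)` sends the first Gaussian down the direct atomicAdd path and the second down the reduction path; both should produce the same total, while the second issues one atomicAdd per block instead of 256. The `tiles_touched` values here only drive the branch and are not tied to the number of blocks launched.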

Evaluation

We use the tandt/truck dataset with --iterations=5000 to evaluate our optimization.

The following figure illustrates the end-to-end absolute time usage and relative speedup.

The following figure shows that the modification does not affect correctness.

interestingLSY added 7 commits June 6, 2024 17:46

Update .gitignore

a1e74ad

Use rsqrt() instead of 1/sqrt for speed and precision

3ec0567

Optimize backward propagation

53f1e86

Further optimization

fbaec5c

Further optimization

83ecedd

Don't use a separate gather kernel

6565525

Remove unused functions

c096556
