
Optimize backward propagation kernel (41% end-to-end speedup on the example) #53

Open · wants to merge 7 commits into main
Conversation

interestingLSY

Motivation

The original backward propagation kernel, renderCUDA, uses atomicAdd to accumulate gradients from rendered pixels back into the Gaussians. This is fine for most Gaussians, since they only cover a few tiles. However, for large Gaussians that span many tiles, contention among the atomicAdd calls leads to performance degradation.

Modification

The new kernel automatically selects the best approach for gradient accumulation. We define a hyperparameter, $\alpha$. If a Gaussian touches no more than $\alpha$ tiles, we use the original atomicAdd-based approach. Otherwise, we first perform a block-level reduction and then add the block's partial gradient to the global gradient arrays (dL_dcolors and so on) with a single atomicAdd, which reduces the number of atomicAdd calls to $1/256$ of the original (one per 256-thread block). Furthermore, the block-level reduction is performed in batches to amortize the expensive block-level synchronization.
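Below is a minimal, self-contained sketch of the two accumulation paths, not the PR's actual kernel: the kernel names, the single scalar output, and `BLOCK_SIZE = 256` are illustrative assumptions used only to contrast per-thread atomicAdd with reduce-then-add.

```cuda
// Sketch: per-thread atomicAdd vs. block-level reduction + one atomicAdd per block.
// All names here are illustrative; the real kernel accumulates per-Gaussian gradients.
#include <cuda_runtime.h>
#include <cstdio>

constexpr int BLOCK_SIZE = 256;

// Path 1 (original approach): every thread issues its own atomicAdd.
__global__ void accumulate_atomic(const float* __restrict__ grads,
                                  float* __restrict__ out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        atomicAdd(out, grads[idx]);
}

// Path 2 (reduce-then-add): reduce within the block in shared memory,
// then let one thread per block issue the atomicAdd (~1/256 as many atomics).
__global__ void accumulate_reduced(const float* __restrict__ grads,
                                   float* __restrict__ out, int n)
{
    __shared__ float buf[BLOCK_SIZE];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (idx < n) ? grads[idx] : 0.0f;
    __syncthreads();

    // Standard tree reduction in shared memory.
    for (int stride = BLOCK_SIZE / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            buf[threadIdx.x] += buf[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        atomicAdd(out, buf[0]);
}

int main()
{
    const int n = 1 << 20;
    float *grads, *out;
    cudaMallocManaged(&grads, n * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < n; ++i) grads[i] = 1.0f;

    const int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;

    *out = 0.0f;
    accumulate_atomic<<<blocks, BLOCK_SIZE>>>(grads, out, n);
    cudaDeviceSynchronize();
    printf("atomic-only sum:  %.0f\n", *out);   // expect 1048576

    *out = 0.0f;
    accumulate_reduced<<<blocks, BLOCK_SIZE>>>(grads, out, n);
    cudaDeviceSynchronize();
    printf("reduce-then-add:  %.0f\n", *out);   // expect 1048576

    cudaFree(grads);
    cudaFree(out);
    return 0;
}
```

Both kernels produce the same sum; the second simply trades a shared-memory reduction (and its `__syncthreads()` cost, which the PR further amortizes by batching) for far fewer contended atomic operations.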

Evaluation

We use the tandt/truck dataset with --iterations=5000 to evaluate our optimization.

The following figure illustrates the end-to-end absolute time usage and relative speedup.

[Figure: end-to-end training time (absolute) and relative speedup]

And the following figure shows that the modification does not affect correctness.

[Figure: correctness comparison against the original kernel]
