Optimized atan2, _softmax, cat, clamp, full, relu, remainder, permute_copy_out ops and updates to use memory_allocator #7567

Merged

27 commits merged into pytorch:main on Jan 24, 2025

Conversation

cad-audio (Contributor)

Summary

Optimized atan2, _softmax, cat, clamp, full, relu, remainder, permute_copy_out ops and updates to use memory_allocator

Test plan

Unit tested kernels
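
For reference, a minimal sketch of the malloc-to-temporary-allocator change the PR title refers to, assuming the standard ExecuTorch KernelRuntimeContext::allocate_temp API; the buffer size, function name, and error handling below are illustrative assumptions, not the exact code in this PR:

```cpp
#include <cstddef>
#include <executorch/runtime/kernel/kernel_includes.h>

using ::executorch::runtime::KernelRuntimeContext;
using ::executorch::runtime::Result;

// Hedged sketch: obtain scratch memory from the runtime's temporary
// allocator instead of calling malloc()/free() inside a kernel.
void* get_scratch(KernelRuntimeContext& ctx, size_t scratch_bytes) {
  Result<void*> temp = ctx.allocate_temp(scratch_bytes);
  if (!temp.ok()) {
    return nullptr;  // real kernels bail out via ET_KERNEL_CHECK instead
  }
  // No free() is needed; the temp allocator's memory is managed by the
  // runtime across method executions.
  return temp.get();
}
```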

dijopaul and others added 17 commits October 23, 2024 06:51
Adding mean and where ops optimized on HiFi
* adding pow, remainder, minimum, maximum operators

* adding pow, remainder, minimum, maximum operators
Adding quantized linear optimized versions for int8 and uint8
* Adding cat, full, permute_copy and relu ops (pytorch#34)

* Adding cat, full, permute_copy

* updating relu wrt new ref (pytorch#36)

* Temporary memory allocation, replacing mallocs (pytorch#38)

* Integrated temporary mem alloc functionality in place of malloc

* Namespace related changes

* Cleanup the main application

* Adding atan2, softmax, clamp and remainder ops (pytorch#37)

* Replaced malloc with temp_memory_allocator

---------

Co-authored-by: nishpoonia <[email protected]>
Co-authored-by: Rushi-cad <[email protected]>
* adding ET_KERNEL_CHECK for allocate_temp_memory

* solving lint error

* Removing redundant check
Adding _softmax, relu, permute etc

pytorch-bot bot commented Jan 9, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/7567

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit d62648a with merge base 9a0b51c:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label (This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.) Jan 9, 2025
dijopaul (Collaborator) commented Jan 9, 2025

@pytorchbot label "topic: not user facing"

dijopaul and others added 2 commits January 9, 2025 13:57
- fixing build issue on previous commit
Update functions_hifi.yaml
hsharma35 self-requested a review January 9, 2025 17:07
nishpoonia and others added 3 commits January 10, 2025 12:01
Incorporating review comments: removing nesting to check data type an…
kimishpatel added the module: cadence label (Issues related to the Cadence/Xtensa backend) Jan 14, 2025
zonglinpeng (Contributor) left a comment


Looks good from eyeballing; will link internally and solve any issues in follow-up diffs

@@ -172,6 +179,7 @@ int main(int argc, char** argv) {

// Run the model.
Error status = method->execute();

Contributor:

please avoid empty line changes

Collaborator:

Fixed

"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_clone.cpp"
"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_embedding.cpp"
"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_full.cpp"
"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_gt.cpp"
Contributor:

which op requires gt?

Collaborator:

It is not part of any model as such, but it was part of the ops list provided. This change is not necessary for this PR, but we will be including all logical ops in optimized versions in the next PR. We will remove this and add it from cadence/hifi/operators.

Contributor:

no particular issues with gt, but this is removing full :) so maybe we should change it

Collaborator:

full is not removed, but moved to cadence/hifi/operators/. Hope this is good.

Contributor:

ah you're right, my bad! The alphabetical order threw me off :)

facebook-github-bot (Contributor)

@zonglinpeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

mcremon-meta (Contributor) left a comment


LGTM with just one comment about full op!

#include <inttypes.h>
#include <stddef.h>
#include <xa_type_def.h>
/* For NNLIB APIs */
#include "xa_nnlib_kernels_api.h"

using executorch::runtime::KernelRuntimeContext;
using executorch::runtime::Result;
Contributor:

@hsharma35 this is the right format for `using`, right?

Contributor:

No, actually; please use the leading `::executorch::` prefix, i.e. `using ::executorch::runtime::KernelRuntimeContext;`, and the same for all other namespaces @cad-audio

Collaborator:

As this is a generic comment, we will capture it in our pending items. We will apply it to all ops (including already-merged ones) through a separate PR.

Contributor:

Fully qualified is preferred per the ET style guide, but this one works too.
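
For illustration, a minimal sketch of the two spellings being discussed, assuming the ExecuTorch runtime headers are available; this is not code from the PR itself:

```cpp
#include <executorch/runtime/kernel/kernel_includes.h>

// Preferred per the ET style guide: fully qualify at the point of use, e.g.
//   ::executorch::runtime::KernelRuntimeContext& ctx = ...;
//
// Also acceptable: using-declarations rooted at the global namespace,
// rather than the unqualified "using executorch::runtime::...".
using ::executorch::runtime::KernelRuntimeContext;
using ::executorch::runtime::Result;
```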

"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_clone.cpp"
"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_embedding.cpp"
"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_full.cpp"
"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_gt.cpp"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no particular issues with gt, but this is removing full :) so maybe we should change it

@@ -19,12 +19,12 @@
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

******************************************************************************/
#include "xa_type_def.h"
#include "xa_nnlib_common_fpu.h"
#include "xa_nn_common.h"
Contributor:

Can these be kept as before? Our internal build system won't recognize the full path.

Collaborator:

Done

Comment on lines 183 to 198
ET_SWITCH_REALHB_TYPES(a_type, ctx, name, CTYPE_A, [&]() {
ET_SWITCH_REALHB_TYPES(b_type, ctx, name, CTYPE_B, [&]() {
ET_SWITCH_FLOATH_TYPES(out_type, ctx, name, CTYPE_OUT, [&]() {
torch::executor::
apply_binary_elementwise_fn<CTYPE_A, CTYPE_B, CTYPE_OUT>(
[](const CTYPE_A val_a, const CTYPE_B val_b) {
CTYPE_OUT casted_a = static_cast<CTYPE_OUT>(val_a);
CTYPE_OUT casted_b = static_cast<CTYPE_OUT>(val_b);
return static_cast<CTYPE_OUT>(std::atan2(casted_a, casted_b));
},
a,
b,
out);
});
});
});
Contributor:

Please refer to the updated portable code to reduce code size.

Collaborator:

Fixed

#include <executorch/kernels/portable/cpu/util/broadcast_util.h>
#include <executorch/kernels/portable/cpu/util/elementwise_util.h>
#include <executorch/runtime/kernel/kernel_includes.h>
#include <cmath>
Contributor:

Nit: put this header above the executorch headers.

dijopaul (Collaborator), Jan 23, 2025:

Done, but this is causing a lint issue


bool optimized = true;

if (out.scalar_type() != ScalarType::Float)
Contributor:

Out of curiosity: why does dtype matter for a concat op?

Collaborator:

Since a 32-bit word is 4-byte aligned, we do get some optimization advantage (on HiFi4).
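
To make the dispatch concrete, a hedged sketch of the dtype gate under discussion; the helper name and the fallback behavior described in the comments are illustrative assumptions, not the exact PR code:

```cpp
#include <executorch/runtime/kernel/kernel_includes.h>

// Hedged sketch: only float tensors (32-bit, naturally 4-byte-aligned
// elements) take the HiFi4 NNLIB fast path; every other dtype falls back
// to the portable cat implementation.
bool use_hifi_fast_path(const exec_aten::Tensor& out) {
  return out.scalar_type() == exec_aten::ScalarType::Float;
}
```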


} // namespace

Tensor& clamp_out(
Contributor:

This one doesn't have xa_nn calls, can we remove it?

Collaborator:

Done

inp_shape,
min_data,
min_shape,
max_data,
Contributor:

Nit: let's use XT_* macros defined under executorch/backends/cadence/fusion_g3/operators/xt_macros.h
We can move the macros to a common directory.

Collaborator:

Will work on this through a separate PR (Captured in pending comments list)

Contributor:

> Will work on this through a separate PR (Captured in pending comments list)

No worries, I have a PR to change that. @dijopaul
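
For context, a hedged sketch of the status-checking pattern that the XT_* macros under executorch/backends/cadence/fusion_g3/operators/xt_macros.h wrap; `xa_nn_clamp_kernel` below is a stand-in name for whichever NNLIB clamp variant is actually called, and the exact macro signature is not reproduced here:

```cpp
// Hedged sketch only: call the NNLIB kernel, then turn a nonzero status
// into a kernel failure instead of silently ignoring it.
WORD32 ret = xa_nn_clamp_kernel(  // stand-in for the real NNLIB call
    out_data, inp_data, inp_shape, min_data, min_shape, max_data, max_shape);
ET_KERNEL_CHECK(ctx, ret == 0, Internal, out);
```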

WORD32 p_permute_vec[kNnlibMaxDim];

for (int i = 0; i < num_inp_dims; i++) {
p_inp_shape[i] = in.size(i);
Contributor:

This is common for char/float. Let's move this outside the dtype if/else

Collaborator:

Done
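
A hedged sketch of the refactor being agreed on here: fill the shape and permutation arrays once, outside the dtype branch, and keep only the dtype-specific NNLIB call inside it. The array and variable names follow the surrounding diff; `dims`, the dim counts, and the kernel call sites are assumptions for illustration:

```cpp
WORD32 p_inp_shape[kNnlibMaxDim];
WORD32 p_out_shape[kNnlibMaxDim];
WORD32 p_permute_vec[kNnlibMaxDim];

// Common setup, hoisted out of the char/float branches.
for (int i = 0; i < num_inp_dims; i++) {
  p_inp_shape[i] = in.size(i);
  p_permute_vec[i] = dims[i];
}
for (int i = 0; i < num_out_dims; i++) {
  p_out_shape[i] = out.size(i);
}

if (in.scalar_type() == ScalarType::Float) {
  // float path: 32-bit NNLIB transpose (assumed call site)
} else {
  // char path: 8-bit NNLIB transpose (assumed call site)
}
```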

#include "nnlib-hifi4/xa_nnlib/algo/common/include/xa_nnlib_common_fpu.h"
#include "nnlib-hifi4/xa_nnlib/algo/common/include/xa_nn_common.h"
#include "nnlib-hifi4/xa_nnlib/algo/common/include/xa_nnlib_err_chk.h"
#include "nnlib-hifi4/xa_nnlib/algo/kernels/basic/hifi4/xa_nn_basic_state.h"
Contributor:

Where is this header used? All kernels seem to work fine with all of these headers commented out. @cad-audio

Collaborator:

Removed unwanted headers

facebook-github-bot (Contributor)

@zonglinpeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot (Contributor)

@zonglinpeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

zonglinpeng pushed a commit to zonglinpeng/executorch that referenced this pull request Jan 23, 2025
…_copy_out ops and updates to use memory_allocator (pytorch#7567)

Summary:
Optimized atan2, _softmax, cat, clamp, full, relu, remainder, permute_copy_out ops and updates to use memory_allocator

Pull Request resolved: pytorch#7567

Test Plan: Unit tested kernels

Differential Revision: D68446171

Pulled By: zonglinpeng
zonglinpeng pushed a commit to zonglinpeng/executorch that referenced this pull request Jan 23, 2025
…_copy_out ops and updates to use memory_allocator (pytorch#7567)

Summary:
Optimized atan2, _softmax, cat, clamp, full, relu, remainder, permute_copy_out ops and updates to use memory_allocator

Pull Request resolved: pytorch#7567

Test Plan: Unit tested kernels

Reviewed By: hsharma35

Differential Revision: D68446171

Pulled By: zonglinpeng
facebook-github-bot (Contributor)

@zonglinpeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

zonglinpeng pushed a commit to zonglinpeng/executorch that referenced this pull request Jan 24, 2025
…_copy_out ops and updates to use memory_allocator (pytorch#7567)

Summary:
Optimized atan2, _softmax, cat, clamp, full, relu, remainder, permute_copy_out ops and updates to use memory_allocator

Pull Request resolved: pytorch#7567

Test Plan: Unit tested kernels

Reviewed By: hsharma35

Differential Revision: D68446171

Pulled By: zonglinpeng
facebook-github-bot (Contributor)

@zonglinpeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

zonglinpeng pushed a commit to zonglinpeng/executorch that referenced this pull request Jan 24, 2025
…_copy_out ops and updates to use memory_allocator (pytorch#7567)

Summary:
Optimized atan2, _softmax, cat, clamp, full, relu, remainder, permute_copy_out ops and updates to use memory_allocator

Pull Request resolved: pytorch#7567

Test Plan: Unit tested kernels

Reviewed By: hsharma35

Differential Revision: D68446171

Pulled By: zonglinpeng
facebook-github-bot merged commit ff1d6af into pytorch:main Jan 24, 2025
5 of 7 checks passed
Labels
CLA Signed (This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.), module: cadence (Issues related to the Cadence/Xtensa backend), topic: not user facing
8 participants