Optimized atan2, _softmax, cat, clamp, full, relu, remainder, permute_copy_out ops and updates to use memory_allocator #7567

Merged

27 commits merged into pytorch:main on Jan 24, 2025

Conversation

cad-audio (Contributor)

Summary

Optimized atan2, _softmax, cat, clamp, full, relu, remainder, permute_copy_out ops and updates to use memory_allocator

Test plan

Unit tested kernels
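
For reference, a minimal sketch of the malloc-to-temporary-allocator change the PR title refers to, assuming the standard ExecuTorch KernelRuntimeContext::allocate_temp API; the buffer size, function name, and error handling below are illustrative assumptions, not the exact code in this PR:

```cpp
#include <cstddef>
#include <executorch/runtime/kernel/kernel_includes.h>

using ::executorch::runtime::KernelRuntimeContext;
using ::executorch::runtime::Result;

// Hedged sketch: obtain scratch memory from the runtime's temporary
// allocator instead of calling malloc()/free() inside a kernel.
void* get_scratch(KernelRuntimeContext& ctx, size_t scratch_bytes) {
  Result<void*> temp = ctx.allocate_temp(scratch_bytes);
  if (!temp.ok()) {
    return nullptr;  // real kernels bail out via ET_KERNEL_CHECK instead
  }
  // No free() is needed; the temp allocator's memory is managed by the
  // runtime across method executions.
  return temp.get();
}
```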

dijopaul and others added 17 commits October 23, 2024 06:51
Adding mean and where ops optimized on HiFi
* adding pow, remainder, minimum, maximum operators

* adding pow, remainder, minimum, maximum operators
Adding quantized linear optimized versions for int8 and uint8
* Adding cat, full, permute_copy and relu ops (pytorch#34)

* Adding cat, full, permute_copy

* updating relu wrt new ref (pytorch#36)

* Temporary memory allocation, replacing mallocs (pytorch#38)

* Integrated temporary mem alloc functionality in place of malloc

* Namespace related changes

* Cleanup the main application

* Adding atan2, softmax, clamp and remainder ops (pytorch#37)

* Replaced malloc with temp_memory_allocator

---------

Co-authored-by: nishpoonia <[email protected]>
Co-authored-by: Rushi-cad <[email protected]>
* adding ET_KERNEL_CHECK for allocate_temp_memory

* solving lint error

* Removing redundant check
Adding _softmax, relu, permute etc

pytorch-bot bot commented Jan 9, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/7567

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit d62648a with merge base 9a0b51c:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label (This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.) Jan 9, 2025
dijopaul (Collaborator) commented Jan 9, 2025

@pytorchbot label "topic: not user facing"

dijopaul and others added 2 commits January 9, 2025 13:57
- fixing build issue on previous commit
Update functions_hifi.yaml
hsharma35 self-requested a review January 9, 2025 17:07
nishpoonia and others added 3 commits January 10, 2025 12:01
Incorporating review comments: removing nesting to check data type an…
kimishpatel added the module: cadence label (Issues related to the Cadence/Xtensa backend) Jan 14, 2025
zonglinpeng (Contributor) left a comment


Looks good from eyeballing; will link internally and solve any issues in follow-up diffs

@@ -172,6 +179,7 @@ int main(int argc, char** argv) {

// Run the model.
Error status = method->execute();

Contributor:

please avoid empty line changes

Collaborator:

Fixed

"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_clone.cpp"
"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_embedding.cpp"
"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_full.cpp"
"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_gt.cpp"
Contributor:

which op requires gt?

Collaborator:

It is not part of any model as such, but it was part of the ops list provided. This change is not necessary for this PR, but we will be including all logical ops in optimized versions in the next PR. We will remove this and add it from cadence/hifi/operators.

Contributor:

no particular issues with gt, but this is removing full :) so maybe we should change it

Collaborator:

full is not removed, but moved to cadence/hifi/operators/. Hope this is good.

Contributor:

ah you're right, my bad! The alphabetical order threw me off :)

facebook-github-bot (Contributor)

@zonglinpeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

mcremon-meta (Contributor) left a comment


LGTM with just one comment about full op!

#include <inttypes.h>
#include <stddef.h>
#include <xa_type_def.h>
/* For NNLIB APIs */
#include "xa_nnlib_kernels_api.h"

using executorch::runtime::KernelRuntimeContext;
using executorch::runtime::Result;
Contributor:

@hsharma35 this is the right format for `using`, right?

Contributor:

No, actually; please use the leading `::executorch::` prefix, i.e. `using ::executorch::runtime::KernelRuntimeContext;`, and the same for all other namespaces @cad-audio

Collaborator:

As this is a generic comment, we will capture it in our pending items. We will apply it to all ops (including already-merged ones) through a separate PR.

Contributor:

Fully qualified is preferred per the ET style guide, but this one works too.
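
For illustration, a minimal sketch of the two spellings being discussed, assuming the ExecuTorch runtime headers are available; this is not code from the PR itself:

```cpp
#include <executorch/runtime/kernel/kernel_includes.h>

// Preferred per the ET style guide: fully qualify at the point of use, e.g.
//   ::executorch::runtime::KernelRuntimeContext& ctx = ...;
//
// Also acceptable: using-declarations rooted at the global namespace,
// rather than the unqualified "using executorch::runtime::...".
using ::executorch::runtime::KernelRuntimeContext;
using ::executorch::runtime::Result;
```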

"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_clone.cpp"
"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_embedding.cpp"
"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_full.cpp"
"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_gt.cpp"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no particular issues with gt, but this is removing full :) so maybe we should change it

@@ -19,12 +19,12 @@
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

******************************************************************************/
#include "xa_type_def.h"
#include "xa_nnlib_common_fpu.h"
#include "xa_nn_common.h"
Contributor:

Can these be kept as before? Our internal build system won't recognize the full path.

Collaborator:

Done

Comment on lines 183 to 198
ET_SWITCH_REALHB_TYPES(a_type, ctx, name, CTYPE_A, [&]() {
ET_SWITCH_REALHB_TYPES(b_type, ctx, name, CTYPE_B, [&]() {
ET_SWITCH_FLOATH_TYPES(out_type, ctx, name, CTYPE_OUT, [&]() {
torch::executor::
apply_binary_elementwise_fn<CTYPE_A, CTYPE_B, CTYPE_OUT>(
[](const CTYPE_A val_a, const CTYPE_B val_b) {
CTYPE_OUT casted_a = static_cast<CTYPE_OUT>(val_a);
CTYPE_OUT casted_b = static_cast<CTYPE_OUT>(val_b);
return static_cast<CTYPE_OUT>(std::atan2(casted_a, casted_b));
},
a,
b,
out);
});
});
});
Contributor:

Please refer to the updated portable code to reduce code size.

Collaborator:

Fixed

#include <executorch/kernels/portable/cpu/util/broadcast_util.h>
#include <executorch/kernels/portable/cpu/util/elementwise_util.h>
#include <executorch/runtime/kernel/kernel_includes.h>
#include <cmath>
Contributor:

Nit: put this header above the executorch headers.

dijopaul (Collaborator), Jan 23, 2025:

Done, but this is causing a lint issue


bool optimized = true;

if (out.scalar_type() != ScalarType::Float)
Contributor:

Out of curiosity: why does dtype matter for a concat op?

Collaborator:

Since a 32-bit word is 4-byte aligned, we do get some optimization advantage (on HiFi4).
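
To make the dispatch concrete, a hedged sketch of the dtype gate under discussion; the helper name and the fallback behavior described in the comments are illustrative assumptions, not the exact PR code:

```cpp
#include <executorch/runtime/kernel/kernel_includes.h>

// Hedged sketch: only float tensors (32-bit, naturally 4-byte-aligned
// elements) take the HiFi4 NNLIB fast path; every other dtype falls back
// to the portable cat implementation.
bool use_hifi_fast_path(const exec_aten::Tensor& out) {
  return out.scalar_type() == exec_aten::ScalarType::Float;
}
```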


} // namespace

Tensor& clamp_out(
Contributor:

This one doesn't have xa_nn calls, can we remove it?

Collaborator:

Done

inp_shape,
min_data,
min_shape,
max_data,
Contributor:

Nit: let's use XT_* macros defined under executorch/backends/cadence/fusion_g3/operators/xt_macros.h
We can move the macros to a common directory.

Collaborator:

Will work on this through a separate PR (Captured in pending comments list)

Contributor:

> Will work on this through a separate PR (Captured in pending comments list)

No worries, I have a PR to change that. @dijopaul
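
For context, a hedged sketch of the status-checking pattern that the XT_* macros under executorch/backends/cadence/fusion_g3/operators/xt_macros.h wrap; `xa_nn_clamp_kernel` below is a stand-in name for whichever NNLIB clamp variant is actually called, and the exact macro signature is not reproduced here:

```cpp
// Hedged sketch only: call the NNLIB kernel, then turn a nonzero status
// into a kernel failure instead of silently ignoring it.
WORD32 ret = xa_nn_clamp_kernel(  // stand-in for the real NNLIB call
    out_data, inp_data, inp_shape, min_data, min_shape, max_data, max_shape);
ET_KERNEL_CHECK(ctx, ret == 0, Internal, out);
```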

WORD32 p_permute_vec[kNnlibMaxDim];

for (int i = 0; i < num_inp_dims; i++) {
p_inp_shape[i] = in.size(i);
Contributor:

This is common for char/float. Let's move this outside the dtype if/else

Collaborator:

Done
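
A hedged sketch of the refactor being agreed on here: fill the shape and permutation arrays once, outside the dtype branch, and keep only the dtype-specific NNLIB call inside it. The array and variable names follow the surrounding diff; `dims`, the dim counts, and the kernel call sites are assumptions for illustration:

```cpp
WORD32 p_inp_shape[kNnlibMaxDim];
WORD32 p_out_shape[kNnlibMaxDim];
WORD32 p_permute_vec[kNnlibMaxDim];

// Common setup, hoisted out of the char/float branches.
for (int i = 0; i < num_inp_dims; i++) {
  p_inp_shape[i] = in.size(i);
  p_permute_vec[i] = dims[i];
}
for (int i = 0; i < num_out_dims; i++) {
  p_out_shape[i] = out.size(i);
}

if (in.scalar_type() == ScalarType::Float) {
  // float path: 32-bit NNLIB transpose (assumed call site)
} else {
  // char path: 8-bit NNLIB transpose (assumed call site)
}
```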

#include "nnlib-hifi4/xa_nnlib/algo/common/include/xa_nnlib_common_fpu.h"
#include "nnlib-hifi4/xa_nnlib/algo/common/include/xa_nn_common.h"
#include "nnlib-hifi4/xa_nnlib/algo/common/include/xa_nnlib_err_chk.h"
#include "nnlib-hifi4/xa_nnlib/algo/kernels/basic/hifi4/xa_nn_basic_state.h"
Contributor:

Where is this header used? All kernels seem to work fine with all of these headers commented out. @cad-audio

Collaborator:

Removed unwanted headers

facebook-github-bot (Contributor)

@zonglinpeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot (Contributor)

@zonglinpeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

zonglinpeng pushed a commit to zonglinpeng/executorch that referenced this pull request Jan 23, 2025
…_copy_out ops and updates to use memory_allocator (pytorch#7567)

Summary:
Optimized atan2, _softmax, cat, clamp, full, relu, remainder, permute_copy_out ops and updates to use memory_allocator

Pull Request resolved: pytorch#7567

Test Plan: Unit tested kernels

Differential Revision: D68446171

Pulled By: zonglinpeng
zonglinpeng pushed a commit to zonglinpeng/executorch that referenced this pull request Jan 23, 2025
…_copy_out ops and updates to use memory_allocator (pytorch#7567)

Summary:
Optimized atan2, _softmax, cat, clamp, full, relu, remainder, permute_copy_out ops and updates to use memory_allocator

Pull Request resolved: pytorch#7567

Test Plan: Unit tested kernels

Reviewed By: hsharma35

Differential Revision: D68446171

Pulled By: zonglinpeng
facebook-github-bot (Contributor)

@zonglinpeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

zonglinpeng pushed a commit to zonglinpeng/executorch that referenced this pull request Jan 24, 2025
…_copy_out ops and updates to use memory_allocator (pytorch#7567)

Summary:
Optimized atan2, _softmax, cat, clamp, full, relu, remainder, permute_copy_out ops and updates to use memory_allocator

Pull Request resolved: pytorch#7567

Test Plan: Unit tested kernels

Reviewed By: hsharma35

Differential Revision: D68446171

Pulled By: zonglinpeng
facebook-github-bot (Contributor)

@zonglinpeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

zonglinpeng pushed a commit to zonglinpeng/executorch that referenced this pull request Jan 24, 2025
…_copy_out ops and updates to use memory_allocator (pytorch#7567)

Summary:
Optimized atan2, _softmax, cat, clamp, full, relu, remainder, permute_copy_out ops and updates to use memory_allocator

Pull Request resolved: pytorch#7567

Test Plan: Unit tested kernels

Reviewed By: hsharma35

Differential Revision: D68446171

Pulled By: zonglinpeng
facebook-github-bot merged commit ff1d6af into pytorch:main Jan 24, 2025
5 of 7 checks passed
Labels
CLA Signed (This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.), module: cadence (Issues related to the Cadence/Xtensa backend), topic: not user facing
8 participants