Optimized atan2, _softmax, cat, clamp, full, relu, remainder, permute_copy_out ops and updates to use memory_allocator #7567

Merged · 27 commits · Jan 24, 2025 (changes shown from 23 commits)

Commits:
216389c
Adding mean and where ops optimized on HiFi
dijopaul Oct 23, 2024
3d849bb
Merge pull request #14 from dijopaul/main
cad-audio Oct 24, 2024
9b71aed
Adding quantized linear optimized versions for int8 and uint8
dijopaul Nov 6, 2024
07743ab
adding pow, remainder, minimum, maximum operators (#33)
nishpoonia Nov 7, 2024
edc1b3d
Fix for build issue faced in div_mod on old tools
dijopaul Nov 13, 2024
222beee
Merge pull request #15 from dijopaul/main
cad-audio Nov 14, 2024
6e074ec
Merge branch 'main' into main
cad-audio Nov 14, 2024
afca3db
Fix build failure due to merge issue
dijopaul Nov 19, 2024
10a0ee0
Merge branch 'main' into main
mcremon-meta Nov 21, 2024
f1f0bb3
Fixing review comments on PR 6867
dijopaul Nov 22, 2024
f8cf408
Malloc fix (#39)
dijopaul Nov 28, 2024
911021f
Cleaning cmakelist to avoid duplications
dijopaul Dec 2, 2024
18cf518
Fixing lint issues and removing free statements
dijopaul Dec 3, 2024
5e471f2
adding ET_KERNEL_CHECK for allocate_temp_memory (#41)
nishpoonia Dec 23, 2024
6928f95
Merge branch 'main' into main_PR18
dijopaul Jan 9, 2025
991961b
Fixing lint error due to merge
dijopaul Jan 9, 2025
7585ee0
Merge pull request #18 from dijopaul/main_PR18
cad-audio Jan 9, 2025
540243a
Update functions_hifi.yaml
dijopaul Jan 9, 2025
85e7c59
Merge pull request #19 from dijopaul/patch-1
cad-audio Jan 9, 2025
1f681c7
Incorporating review comments: removing nesting to check data type an…
nishpoonia Jan 10, 2025
3539f52
clean up
nishpoonia Jan 13, 2025
fe5e7d7
Merge pull request #20 from dijopaul/main_PR18
cad-audio Jan 13, 2025
4923b83
Fixing review comment on PR 7567
dijopaul Jan 21, 2025
224aaf4
Fixing review comments in PR 7567
dijopaul Jan 23, 2025
7f9a78f
Merge branch 'main' into main
zonglinpeng Jan 24, 2025
6409958
Fixing lint error in PR7567
dijopaul Jan 24, 2025
d62648a
Updating cat to support Int variant
dijopaul Jan 24, 2025
35 changes: 30 additions & 5 deletions backends/cadence/aot/functions_hifi.yaml
@@ -20,7 +20,12 @@
- op: _softmax.out
kernels:
- arg_meta: null
kernel_name: torch::executor::softmax_out
kernel_name: cadence::impl::HiFi::softmax_out

- op: atan2.out
kernels:
- arg_meta: null
kernel_name: cadence::impl::HiFi::atan2_out

- op: add.out
kernels:
@@ -35,7 +40,12 @@
- op: cat.out
kernels:
- arg_meta: null
kernel_name: torch::executor::cat_out
kernel_name: cadence::impl::HiFi::cat_out

- op: clamp.Tensor_out
kernels:
- arg_meta: null
kernel_name: cadence::impl::HiFi::clamp_tensor_out

- op: clone.out
kernels:
@@ -60,7 +70,12 @@
- op: full.out
kernels:
- arg_meta: null
kernel_name: torch::executor::full_out
kernel_name: cadence::impl::HiFi::full_out

- op: gt.Scalar_out
kernels:
- arg_meta: null
kernel_name: torch::executor::gt_scalar_out

- op: gelu.out
kernels:
@@ -100,7 +115,7 @@
- op: permute_copy.out
kernels:
- arg_meta: null
kernel_name: torch::executor::permute_copy_out
kernel_name: cadence::impl::HiFi::permute_copy_out

- op: pow.Scalar_out
kernels:
@@ -117,6 +132,11 @@
- arg_meta: null
kernel_name: cadence::impl::HiFi::pow_Tensor_Tensor_out

- op: remainder.Tensor_out
kernels:
- arg_meta: null
kernel_name: cadence::impl::HiFi::remainder_Tensor_out

- op: rsqrt.out
kernels:
- arg_meta: null
@@ -170,7 +190,6 @@
- arg_meta: null
kernel_name: cadence::impl::HiFi::dequantize_per_tensor_out


- func: cadence::quantized_layer_norm.out(Tensor input, Tensor in_scale, Tensor in_zero_point, int[] normalized_shape, Tensor weight, Tensor bias, float eps, float output_scale, int output_zero_point, *, Tensor(a!) out) -> Tensor(a!)
kernels:
- arg_meta: null
@@ -184,6 +203,12 @@
kernels:
- arg_meta: null
kernel_name: cadence::impl::HiFi::quantized_linear_out

- func: cadence::quantized_relu.out(Tensor X, Tensor X_zero_point, int out_zero_point, Tensor out_multiplier, Tensor out_shift, *, Tensor(a!) out) -> Tensor(a!)
kernels:
- arg_meta: null
kernel_name: cadence::impl::HiFi::quantized_relu_out

- func: cadence::quantized_linear.per_tensor_out(Tensor src, Tensor weight, Tensor bias, SymInt src_zero_point, SymInt weight_zero_point, SymInt out_multiplier, SymInt out_shift, SymInt out_zero_point, Tensor? offset, *, Tensor(a!) out) -> Tensor(a!)
kernels:
- arg_meta: null
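Note: each HiFi kernel_name above must resolve to a C++ function in the cadence::impl::HiFi namespace. A minimal sketch of the expected shape, assuming the usual ExecuTorch context-first, out-last kernel convention (the real definitions live under backends/cadence/hifi/operators/ and may differ; parameter names are illustrative):

#include <executorch/runtime/kernel/kernel_includes.h>

// Sketch only: assumed declaration matching the atan2.out registration above.
namespace cadence {
namespace impl {
namespace HiFi {

::executorch::aten::Tensor& atan2_out(
    ::executorch::runtime::KernelRuntimeContext& ctx,
    const ::executorch::aten::Tensor& self,
    const ::executorch::aten::Tensor& other,
    ::executorch::aten::Tensor& out);

} // namespace HiFi
} // namespace impl
} // namespace cadence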
5 changes: 5 additions & 0 deletions backends/cadence/hifi/kernels/CMakeLists.txt
@@ -10,14 +10,19 @@ add_library(
kernels.cpp
${EXECUTORCH_ROOT}/backends/cadence/hifi/third-party/nnlib/matmul_asym8uxasym8u_asym8u.cpp
${EXECUTORCH_ROOT}/backends/cadence/hifi/third-party/nnlib/xa_nn_broadcast_32.c
${EXECUTORCH_ROOT}/backends/cadence/hifi/third-party/nnlib/xa_nn_concat_32.c
${EXECUTORCH_ROOT}/backends/cadence/hifi/third-party/nnlib/xa_nn_elm_add_f32_broadcast.c
${EXECUTORCH_ROOT}/backends/cadence/hifi/third-party/nnlib/xa_nn_elm_atan2_f32.c
${EXECUTORCH_ROOT}/backends/cadence/hifi/third-party/nnlib/xa_nn_elm_clamp_f32_broadcast.c
${EXECUTORCH_ROOT}/backends/cadence/hifi/third-party/nnlib/xa_nn_elm_div_f32_broadcast.c
${EXECUTORCH_ROOT}/backends/cadence/hifi/third-party/nnlib/xa_nn_elm_div_mode_f32_broadcast.c
${EXECUTORCH_ROOT}/backends/cadence/hifi/third-party/nnlib/xa_nn_elm_minimum_maximum_f32.c
${EXECUTORCH_ROOT}/backends/cadence/hifi/third-party/nnlib/xa_nn_elm_mul_f32_broadcast.c
${EXECUTORCH_ROOT}/backends/cadence/hifi/third-party/nnlib/xa_nn_elm_pow_f32.c
${EXECUTORCH_ROOT}/backends/cadence/hifi/third-party/nnlib/xa_nn_elm_remainder_broadcast_f32.c
${EXECUTORCH_ROOT}/backends/cadence/hifi/third-party/nnlib/xa_nn_elm_where_f32xf32_f32.c
${EXECUTORCH_ROOT}/backends/cadence/hifi/third-party/nnlib/xa_nn_reduce_32_32.c
${EXECUTORCH_ROOT}/backends/cadence/hifi/third-party/nnlib/xa_nn_transpose_32.c
)
# Let files say "include <executorch/path/to/header.h>".
set(_common_include_directories ${EXECUTORCH_ROOT}/..)
5 changes: 5 additions & 0 deletions backends/cadence/hifi/kernels/kernels.cpp
@@ -20,6 +20,11 @@ memcpy(void* dst, const void* src, size_t num_bytes) {
MEMCPY_8b(dst, src, num_bytes);
}

void* allocate_temp_memory(KernelRuntimeContext& ctx, size_t size) {
Result<void*> temp_mem_res = ctx.allocate_temp(size);
return temp_mem_res.ok() ? temp_mem_res.get() : nullptr;
}

// Quantize a fp32 value to an int8_t/uint8_t value
template <typename T>
__attribute__((always_inline)) T
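The new helper wraps ctx.allocate_temp() and flattens the Result into a nullable pointer. A minimal sketch of the call-site pattern this enables, assuming the ET_KERNEL_CHECK usage added in commit 5e471f2 (illustrative only; example_op_out and the workspace size are made up, not code from this PR):

#include <executorch/backends/cadence/hifi/kernels/kernels.h>
#include <executorch/runtime/kernel/kernel_includes.h>

using ::executorch::aten::Tensor;
using ::executorch::runtime::KernelRuntimeContext;

// Illustrative operator skeleton: request scratch space from the runtime's
// temp allocator instead of calling malloc, and fail the kernel cleanly
// when the allocator cannot serve the request.
Tensor& example_op_out(KernelRuntimeContext& ctx, const Tensor& in, Tensor& out) {
  const size_t scratch_bytes = in.numel() * sizeof(float);  // assumed workspace
  void* scratch =
      ::cadence::impl::HiFi::kernels::allocate_temp_memory(ctx, scratch_bytes);
  ET_KERNEL_CHECK(ctx, scratch != nullptr, MemoryAllocationFailed, out);
  // ... run the NNLIB kernel with `scratch` as workspace ...
  return out;  // temp memory is owned by the runtime allocator; no free() needed
}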
60 changes: 59 additions & 1 deletion backends/cadence/hifi/kernels/kernels.h
@@ -7,13 +7,16 @@
*/

#pragma once

#include <executorch/runtime/kernel/kernel_includes.h>
#include <inttypes.h>
#include <stddef.h>
#include <xa_type_def.h>
/* For NNLIB APIs */
#include "xa_nnlib_kernels_api.h"

using executorch::runtime::KernelRuntimeContext;
using executorch::runtime::Result;
Review thread on these using declarations:

Contributor: @hsharma35 this is the right format for using, right?

Contributor: No, actually: please use the ::executorch:: prefix, i.e. ::executorch::runtime::KernelRuntimeContext, and do the same for all other namespaces. @cad-audio

Collaborator: As this is a generic comment, we will capture it in our pending items and apply it to all ops (including those already merged) through a separate PR.

Contributor: Fully qualified is preferred per the ET style guide, but this one works too.
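For reference, the fully qualified form being requested would look like this (a sketch of the planned follow-up change, not part of this diff):

using ::executorch::runtime::KernelRuntimeContext;
using ::executorch::runtime::Result;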


/* Potential NNLIB function/APIs */

extern "C" WORD32 xa_nn_broadcast_32_32(
@@ -23,6 +26,16 @@
const int* const in_shape,
int num_dims);

extern "C" WORD32 xa_nn_concat_32_32(
WORD32* __restrict__ p_out,
const WORD32* const p_out_shape,
const WORD32** pp_inps,
const WORD32* const* pp_inps_shape,
WORD32 num_out_dims,
WORD32 num_inp,
WORD32 num_inp_dims,
WORD32 axis);

extern "C" WORD32 xa_nn_elm_add_broadcast_4D_f32xf32_f32(
FLOAT32* __restrict__ p_out,
const WORD32* const p_out_shape,
@@ -31,6 +44,26 @@
const FLOAT32* __restrict__ p_inp2,
const WORD32* const p_inp2_shape);

extern "C" void
xa_nn_elm_atan2_f32(FLOAT32* z, const FLOAT32* y, const FLOAT32* x, WORD32 N);

extern "C" WORD32 xa_nn_elm_clamp_f32xf32xf32_f32(
FLOAT32* __restrict__ p_out,
const FLOAT32* __restrict__ p_inp,
const FLOAT32* __restrict__ p_min,
const FLOAT32* __restrict__ p_max,
WORD32 num_elm);

extern "C" WORD32 xa_nn_elm_clamp_broadcast_4D_f32Xf32xf32_f32(
FLOAT32* __restrict__ p_out,
const WORD32* const p_out_shape,
const FLOAT32* __restrict__ p_inp,
const WORD32* const p_inp_shape,
const FLOAT32* __restrict__ p_min,
const WORD32* const p_min_shape,
const FLOAT32* __restrict__ p_max,
const WORD32* const p_max_shape);

extern "C" WORD32 xa_nn_elm_div_broadcast_4D_f32xf32_f32(
FLOAT32* __restrict__ p_out,
const WORD32* const p_out_shape,
@@ -97,6 +130,20 @@ extern "C" void xa_nn_elm_pow_f32(
const FLOAT32* __restrict__ y,
WORD32 N);

extern "C" WORD32 xa_nn_elm_remainder_f32xf32_f32(
FLOAT32* __restrict__ p_out,
const FLOAT32* __restrict__ p_inp1,
const FLOAT32* __restrict__ p_inp2,
WORD32 num_elm);

extern "C" WORD32 xa_nn_elm_remainder_broadcast_4D_f32xf32_f32(
FLOAT32* __restrict__ p_out,
const WORD32* const p_out_shape,
const FLOAT32* __restrict__ p_inp1,
const WORD32* const p_inp1_shape,
const FLOAT32* __restrict__ p_inp2,
const WORD32* const p_inp2_shape);

extern "C" WORD32 xa_nn_elm_where_f32xf32_f32(
FLOAT32* __restrict__ p_out,
const FLOAT32* __restrict__ p_inp1,
@@ -125,11 +172,22 @@ extern "C" WORD32 xa_nn_reduce_mean_4D_f32_f32(
WORD32 num_axis_dims,
void* __restrict__ p_scratch_in);

extern "C" WORD32 xa_nn_transpose_32_32(
WORD32* __restrict__ p_out,
const WORD32* const p_out_shape,
const WORD32* __restrict__ p_inp,
const WORD32* const p_inp_shape,
const WORD32* __restrict__ p_permute_vec,
WORD32 num_out_dims,
WORD32 num_inp_dims);

namespace cadence {
namespace impl {
namespace HiFi {
namespace kernels {

void* allocate_temp_memory(KernelRuntimeContext& ctx, size_t size);

void memcpy(void* dst, const void* src, size_t num_bytes);

WORD32 matmul_asym8uxasym8u_asym8u(
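The new xa_nn_transpose_32_32 declaration pairs with the permute_copy_out change above: since float32 elements are 4 bytes, they can plausibly be moved as opaque WORD32 words. A hypothetical wrapper showing that idea (illustrative only, not the actual op_permute_copy.cpp):

#include <executorch/backends/cadence/hifi/kernels/kernels.h>

// Hypothetical helper: permute a contiguous float32 buffer by treating each
// 4-byte element as a WORD32, matching the declaration above.
WORD32 permute_f32_via_32bit_transpose(
    float* out, const WORD32* out_shape,
    const float* in, const WORD32* in_shape,
    const WORD32* permute_vec, WORD32 num_dims) {
  return xa_nn_transpose_32_32(
      reinterpret_cast<WORD32*>(out), out_shape,
      reinterpret_cast<const WORD32*>(in), in_shape,
      permute_vec,
      /*num_out_dims=*/num_dims,
      /*num_inp_dims=*/num_dims);
}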
15 changes: 9 additions & 6 deletions backends/cadence/hifi/operators/CMakeLists.txt
@@ -21,32 +21,35 @@ endif()
# ATen compliant ops that are needed to run this model.
set(_aten_ops__srcs
"${EXECUTORCH_ROOT}/backends/cadence/hifi/operators/op_add.cpp"
"${EXECUTORCH_ROOT}/backends/cadence/hifi/operators/op_atan2.cpp"
"${EXECUTORCH_ROOT}/backends/cadence/hifi/operators/op_cat.cpp"
"${EXECUTORCH_ROOT}/backends/cadence/hifi/operators/op_clamp.cpp"
"${EXECUTORCH_ROOT}/backends/cadence/hifi/operators/op_div.cpp"
"${EXECUTORCH_ROOT}/backends/cadence/hifi/operators/op_full.cpp"
"${EXECUTORCH_ROOT}/backends/cadence/hifi/operators/op_maximum.cpp"
"${EXECUTORCH_ROOT}/backends/cadence/hifi/operators/op_mean.cpp"
"${EXECUTORCH_ROOT}/backends/cadence/hifi/operators/op_minimum.cpp"
"${EXECUTORCH_ROOT}/backends/cadence/hifi/operators/op_mul.cpp"
"${EXECUTORCH_ROOT}/backends/cadence/hifi/operators/op_permute_copy.cpp"
"${EXECUTORCH_ROOT}/backends/cadence/hifi/operators/op_pow.cpp"
"${EXECUTORCH_ROOT}/backends/cadence/hifi/operators/op_remainder.cpp"
"${EXECUTORCH_ROOT}/backends/cadence/hifi/operators/op_rsqrt.cpp"
"${EXECUTORCH_ROOT}/backends/cadence/hifi/operators/op_softmax.cpp"
"${EXECUTORCH_ROOT}/backends/cadence/hifi/operators/op_sigmoid.cpp"
"${EXECUTORCH_ROOT}/backends/cadence/hifi/operators/op_sub.cpp"
"${EXECUTORCH_ROOT}/backends/cadence/hifi/operators/op_tanh.cpp"
"${EXECUTORCH_ROOT}/backends/cadence/hifi/operators/op_where.cpp"
"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_bmm.cpp"
"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_cat.cpp"
"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_clone.cpp"
"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_embedding.cpp"
"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_full.cpp"
"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_gt.cpp"
Review thread on the op_gt.cpp line:

Contributor: Which op requires gt?

Collaborator: It is not part of any model as such, but it was on the ops list provided. The change is not strictly necessary for this PR; we will be including all logical ops in the optimized versions in the next PR, and will remove this and add it from cadence/hifi/operators there.

Contributor: No particular issues with gt, but this is removing full :) so maybe we should change it.

Collaborator: full is not removed, just moved to cadence/hifi/operators/. Hope this is good.

Contributor: Ah, you're right, my bad! The alphabetical order threw me off :)

"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_gelu.cpp"
"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_hardtanh.cpp"
"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_max_pool2d_with_indices.cpp"
"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_permute_copy.cpp"
"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_slice_copy.cpp"
"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_softmax.cpp"
"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_split_with_sizes_copy.cpp"
"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_to_copy.cpp"
"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_view_copy.cpp"
"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_where.cpp"
"${EXECUTORCH_ROOT}/kernels/portable/cpu/pattern/unary_ufunc_realhbbf16_to_floathbf16.cpp"
"${EXECUTORCH_ROOT}/kernels/portable/cpu/util/activation_ops_util.cpp"
"${EXECUTORCH_ROOT}/kernels/portable/cpu/util/broadcast_util.cpp"
@@ -74,7 +77,7 @@ target_include_directories(
# Custom ops that are needed to run the test model.
add_library(
custom_ops "quantized_linear_out.cpp" "quantized_layer_norm.cpp"
"quantize_per_tensor.cpp" "dequantize_per_tensor.cpp"
"quantize_per_tensor.cpp" "quantized_relu_out.cpp" "dequantize_per_tensor.cpp"
)
target_include_directories(
custom_ops PUBLIC ${ROOT_DIR}/.. ${CMAKE_BINARY_DIR}