Optimized atan2, _softmax, cat, clamp, full, relu, remainder, permute_copy_out ops and updates to use memory_allocator #7567
Conversation
Adding mean and where ops optimized on HiFi
* adding pow, remainder, minimum, maximum operators
Adding quantized linear optimized versions for int8 and uint8
* Adding cat, full, permute_copy and relu ops (pytorch#34)
* Adding cat, full, permute_copy
* updating relu wrt new ref (pytorch#36)
* Temporary memory allocation, replacing mallocs (pytorch#38)
* Integrated temporary mem alloc functionality in place of malloc
* Namespace related changes
* Cleanup the main application
* Adding atan2, softmax, clamp and remainder ops (pytorch#37)
* Replaced malloc with temp_memory_allocator

Co-authored-by: nishpoonia <[email protected]>
Co-authored-by: Rushi-cad <[email protected]>
* adding ET_KERNEL_CHECK for allocate_temp_memory (see the sketch below)
* solving lint error
* Removing redundant check
Adding _softmax, relu, permute etc
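A minimal sketch of the ET_KERNEL_CHECK-around-allocate_temp_memory pattern named in the commit above; the helper's signature, the error code, and the surrounding op are assumptions for illustration, not code taken from this diff:

// Sketch only: guard the scratch allocation so a failed temp allocation aborts
// the kernel cleanly instead of dereferencing a null pointer later.
void* scratch = allocate_temp_memory(ctx, in.nbytes()); // signature assumed
ET_KERNEL_CHECK(ctx, scratch != nullptr, MemoryAllocationFailed, out);
// ... use `scratch` as NNLIB workspace, then write the result into `out` ...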
🔗 Helpful Links 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/7567
Note: Links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit d62648a with merge base 9a0b51c. This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "topic: not user facing"
- fixing build issue on previous commit
Update functions_hifi.yaml
Incorporating review comments: removing nesting to check data type and removing exec_ten uses
Looks good from eyeballing, will link internally and solve any issues in follow-up diffs.
@@ -172,6 +179,7 @@ int main(int argc, char** argv) {

  // Run the model.
  Error status = method->execute();
please avoid empty line changes
Fixed
"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_clone.cpp" | ||
"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_embedding.cpp" | ||
"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_full.cpp" | ||
"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_gt.cpp" |
which op requires gt?
Not part of any model as such, but it was in the ops list provided. The change isn't strictly necessary for this PR, but we will be including all logical ops in the optimized versions in the next PR. We will remove this and add it from cadence/hifi/operators.
No particular issues with gt, but this is removing full :) so maybe we should change it
full is not removed, but moved to cadence/hifi/operators/. Hope this is good.
ah you're right, my bad! The alphabetical order threw me off :)
@zonglinpeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
LGTM with just one comment about the full op!
#include <inttypes.h>
#include <stddef.h>
#include <xa_type_def.h>
/* For NNLIB APIs */
#include "xa_nnlib_kernels_api.h"

using executorch::runtime::KernelRuntimeContext;
using executorch::runtime::Result;
@hsharma35 this is the right format for using, right?
No actually, please use the leading ::executorch:: form, i.e. using ::executorch::runtime::KernelRuntimeContext;, and likewise for all other namespaces @cad-audio
As this is a generic comment, we will capture it in our pending items. We will apply it to all ops (including the already merged ones) through a separate PR.
Fully qualified is preferred per ET style guide, but this one works too.
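To illustrate the convention settled on in this thread, using only the names already shown above:

// Leading "::" anchors the names at the global namespace, as requested:
using ::executorch::runtime::KernelRuntimeContext;
using ::executorch::runtime::Result;

// Alternatively, fully qualify at the use site (preferred per the ET style guide):
void run_kernel(::executorch::runtime::KernelRuntimeContext& ctx);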
"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_clone.cpp" | ||
"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_embedding.cpp" | ||
"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_full.cpp" | ||
"${EXECUTORCH_ROOT}/kernels/portable/cpu/op_gt.cpp" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no particular issues with gt
, but this is removing full
:) so maybe we should change it
@@ -19,12 +19,12 @@
 * SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 *
 ******************************************************************************/
#include "xa_type_def.h"
#include "xa_nnlib_common_fpu.h"
#include "xa_nn_common.h"
Can these be kept as before? Our internal build system won't recognize the full path.
Done
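For reference, the two include styles under discussion look like this (paths taken from elsewhere in this diff); the short form, resolved through the library's include directories, is the one the internal build system expects:

// Full repository path (what the diff originally switched to):
#include "nnlib-hifi4/xa_nnlib/algo/common/include/xa_nnlib_common_fpu.h"

// Short form kept as before:
#include "xa_nnlib_common_fpu.h"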
ET_SWITCH_REALHB_TYPES(a_type, ctx, name, CTYPE_A, [&]() {
  ET_SWITCH_REALHB_TYPES(b_type, ctx, name, CTYPE_B, [&]() {
    ET_SWITCH_FLOATH_TYPES(out_type, ctx, name, CTYPE_OUT, [&]() {
      torch::executor::
          apply_binary_elementwise_fn<CTYPE_A, CTYPE_B, CTYPE_OUT>(
              [](const CTYPE_A val_a, const CTYPE_B val_b) {
                CTYPE_OUT casted_a = static_cast<CTYPE_OUT>(val_a);
                CTYPE_OUT casted_b = static_cast<CTYPE_OUT>(val_b);
                return static_cast<CTYPE_OUT>(std::atan2(casted_a, casted_b));
              },
              a,
              b,
              out);
    });
  });
});
Please refer to the updated portable code to reduce code size.
Fixed
#include <executorch/kernels/portable/cpu/util/broadcast_util.h>
#include <executorch/kernels/portable/cpu/util/elementwise_util.h>
#include <executorch/runtime/kernel/kernel_includes.h>
#include <cmath>
Nit: put this header above executorch headers.
Done, but this is causing a lint issue
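A sketch of the reordering asked for in the nit; the guess that the lint complaint comes from the repository formatter wanting the system header grouped and sorted differently is an assumption:

#include <cmath>

#include <executorch/kernels/portable/cpu/util/broadcast_util.h>
#include <executorch/kernels/portable/cpu/util/elementwise_util.h>
#include <executorch/runtime/kernel/kernel_includes.h>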
bool optimized = true;

if (out.scalar_type() != ScalarType::Float)
Out of curiosity: why does dtype matter for a concat op?
Since a 32-bit word is 4-byte aligned, we do get some advantage in the optimization (on HiFi4).
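A minimal sketch of the gating this answer describes, assuming the op falls back to the portable implementation for non-float dtypes; the NNLIB call itself is elided since its exact name is not shown here:

bool optimized = true;

// The HiFi4 fast path works on 4-byte-aligned 32-bit words, so only
// float tensors take it; other dtypes fall back.
if (out.scalar_type() != ScalarType::Float) {
  optimized = false;
}

if (optimized) {
  // ... call the HiFi NNLIB concat kernel on the 32-bit data ...
} else {
  // ... fall back to the portable cat implementation ...
}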
} // namespace

Tensor& clamp_out(
This one doesn't have xa_nn calls, can we remove it?
Done
inp_shape,
min_data,
min_shape,
max_data,
Nit: let's use XT_* macros defined under executorch/backends/cadence/fusion_g3/operators/xt_macros.h
We can move the macros to a common directory.
Will work on this through a separate PR (Captured in pending comments list)
No worries, I have a PR to change that. @dijopaul
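For context, a hedged sketch of what the XT_* macro usage might look like; the argument order (ctx, out, kernel, kernel args...) and the kernel name below are assumptions based on the xt_macros.h file mentioned above, not code from this PR:

// Hypothetical usage: wrap the NNLIB call so a non-zero return code becomes a
// kernel failure instead of being silently ignored.
XT_KERNEL_CHECK(
    ctx,
    out,
    xa_nn_clamp_kernel, // placeholder name, not an actual NNLIB symbol
    out_data,
    inp_data,
    inp_shape,
    min_data,
    min_shape,
    max_data,
    max_shape);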
WORD32 p_permute_vec[kNnlibMaxDim];

for (int i = 0; i < num_inp_dims; i++) {
  p_inp_shape[i] = in.size(i);
This is common for char/float. Let's move this outside the dtype if/else
Done
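A short sketch of the hoisting agreed on here: the shape and permutation setup depends only on the input dims, so it runs once before the dtype branch; the surrounding structure is assumed, only the array and loop names mirror the snippet above:

WORD32 p_inp_shape[kNnlibMaxDim];
WORD32 p_permute_vec[kNnlibMaxDim];

// Dtype-independent setup, done once before the char/float branch.
for (int i = 0; i < num_inp_dims; i++) {
  p_inp_shape[i] = in.size(i);
}

if (in.scalar_type() == ScalarType::Float) {
  // ... call the float permute NNLIB kernel ...
} else {
  // ... call the 8-bit/char permute NNLIB kernel ...
}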
#include "nnlib-hifi4/xa_nnlib/algo/common/include/xa_nnlib_common_fpu.h" | ||
#include "nnlib-hifi4/xa_nnlib/algo/common/include/xa_nn_common.h" | ||
#include "nnlib-hifi4/xa_nnlib/algo/common/include/xa_nnlib_err_chk.h" | ||
#include "nnlib-hifi4/xa_nnlib/algo/kernels/basic/hifi4/xa_nn_basic_state.h" |
Where is this header used? All kernels seem to work fine with all these headers commented out. @cad-audio
Removed unwanted headers
Optimized atan2, _softmax, cat, clamp, full, relu, remainder, permute_copy_out ops and updates to use memory_allocator (pytorch#7567)

Summary: Optimized atan2, _softmax, cat, clamp, full, relu, remainder, permute_copy_out ops and updates to use memory_allocator
Pull Request resolved: pytorch#7567
Test Plan: Unit tested kernels
Reviewed By: hsharma35
Differential Revision: D68446171
Pulled By: zonglinpeng
Summary
Optimized atan2, _softmax, cat, clamp, full, relu, remainder, permute_copy_out ops and updates to use memory_allocator
Test plan
Unit tested kernels