Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CPU] remove extra convert for fp16 #26755

Open
wants to merge 18 commits into
base: master
Choose a base branch
from

Conversation

xczhai
Copy link
Contributor

@xczhai xczhai commented Sep 24, 2024

Details:

In order to remove the extra convert node imported by ConvertPrecision pass, take some optimizations on the specific patterns. The detailed passes are listed as below.

  • remove extra convert to meet sdpa fusion
  • fuse rms with following convert
  • fuse fc with following convert
  • fuse llmmlp with following convert
    After that, sdpa/rms/fc/llmmlp can fuse the following convert node and meet the fusion pattern. As a result, the preformance gains.
index optimization pathway 1st token/ms 2nd token/ms convert num convert cost/1st token convert cost/2nd token
-1 target bf16 perf 699.14 70.63 9 0.136 0.066
0 before enabling f16 for sdpa 1369.85 78.05 147 109.484 1.650
1 enable f16 for sdpa 1035.92 76.44 147 111.867 1.632
2 fix sdpa fusion matching in f16 801.38 72.33 146 120.303 1.679
3 fuse RMS with convert in f16 848.72 72.38 81 117.415 1.136
4 fuse FC with convert in f16 780.73 71.69 48 53.783 0.700
5 fuse LLMMLP with convert in f16 716.79 72.01 16 2.569 0.182

opt

fp16 performance is very close to bf16.

Tickets:

  • 152405

@xczhai xczhai requested review from a team as code owners September 24, 2024 10:10
@xczhai xczhai marked this pull request as draft September 24, 2024 10:10
@github-actions github-actions bot added the category: CPU OpenVINO CPU plugin label Sep 24, 2024
@xczhai xczhai force-pushed the xc/fix_sdpa_fusion_for_f16 branch from 2643652 to cd899d7 Compare September 24, 2024 10:16
- remove Convert for ReadValue node
@xczhai xczhai force-pushed the xc/fix_sdpa_fusion_for_f16 branch from cd899d7 to 36578f0 Compare October 31, 2024 03:28
@xczhai xczhai marked this pull request as ready for review October 31, 2024 09:09
@xczhai xczhai requested a review from liubo-intel October 31, 2024 09:10
Copy link
Contributor

This PR will be closed in a week because of 2 weeks of no activity.

@github-actions github-actions bot added the Stale label Nov 16, 2024
@xczhai xczhai requested review from a team as code owners November 20, 2024 05:58
@xczhai xczhai requested review from itikhono and removed request for a team November 20, 2024 05:58
@github-actions github-actions bot added category: GPU OpenVINO GPU plugin category: transformations OpenVINO Runtime library - Transformations and removed category: GPU OpenVINO GPU plugin category: transformations OpenVINO Runtime library - Transformations labels Nov 20, 2024
@xczhai xczhai force-pushed the xc/fix_sdpa_fusion_for_f16 branch from 0dfa834 to 3491756 Compare November 21, 2024 01:59
@xczhai xczhai force-pushed the xc/fix_sdpa_fusion_for_f16 branch 2 times, most recently from c7601f3 to 6d362a0 Compare November 21, 2024 02:56
@xczhai xczhai force-pushed the xc/fix_sdpa_fusion_for_f16 branch from 6d362a0 to 5901ec9 Compare November 21, 2024 05:22
Comment on lines 416 to 420
false);
need_convert_input_output_precision,
save_original_precision_attribute);
Copy link
Contributor

@liubo-intel liubo-intel Nov 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @xczhai : not a review comments, just a question: from the description of this 'save_original_precision_attribute' param, it seems used to save the original precision of this node. but I didn't find any change of such original precisions process in this pr, so which part of logic will be effected based on this change?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question here, where it is actually being used and what logic does it affect?

Copy link
Contributor

@EgorDuplensky EgorDuplensky Nov 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, an original name store_original_precision_as_rt_attribute better explains the idea behind this flag.

Copy link
Contributor Author

@xczhai xczhai Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liubo-intel @EgorDuplensky
answer your question.

  1. store_original_precision_as_rt_attribute will work in the line https://github.com/openvinotoolkit/openvino/blob/master/src/common/transformations/src/transformations/convert_precision.cpp#L426. It means that such pass will insert a Convert node after ReadValue. But it is not necessary in f16 hint. As a result, the following pass sdpafusion cannot match because such extra Convert.
  2. Both CPU and GPU call ConvertPrecision to impl f16 conversion. In most cases, the output graph should be similar.
    manager.register_pass<ov::pass::ConvertPrecision>(fp_convert_precision_map,
    It is aligned with GPU. I think it is good f16 reference since gpu starts f16 earlier than cpu.

@yuxu42 yuxu42 requested a review from usstq November 21, 2024 07:03
@github-actions github-actions bot removed the Stale label Nov 22, 2024
if (m_to_f16) {
vcvtps2ph(ptr[dst + loop_i * 2], zmm0, 0x4);
vcvtps2ph(ptr[dst + loop_i * 2 + 32], zmm2, 0x4);
if (m_out_f32 && m_to_f16) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems that m_out_f32 should override m_to_f16 flag? I mean, once a convert to fp32 is fused, it should always store f32 result, right? so if (m_out_f32) should be enough, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems that m_out_f32 should override m_to_f16 flag? I mean, once a convert to fp32 is fused, it should always store f32 result, right? so if (m_out_f32) should be enough, right?

@usstq unify this logic. replace m_to_16 with m_output_type to mark possible output precision.

if (m_to_f16) {
vcvtps2ph(ptr[dst + loop_i * 2], zmm0, 0x4);
vcvtps2ph(ptr[dst + loop_i * 2 + 32], zmm2, 0x4);
if (m_out_f32 && m_to_f16) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

@usstq update as above.

if (!mlp_node) {
return false;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a check here to make sure convert is the only child of mlp node.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a check here to make sure convert is the only child of mlp node.

@usstq add has_only_child check

const auto& m_fc = pattern_map.at(fc).get_node_shared_ptr();
const auto& m_convert = pattern_map.at(convert).get_node_shared_ptr();
auto output_type = m_convert->get_output_element_type(0);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a check here to make sure convert is the only child of fc node.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a check here to make sure convert is the only child of fc node.

@usstq update as above.

@@ -25,6 +25,7 @@ class LLMMLPNode : public ov::op::Op {
int hidden_size;
int up_size;
bool gate_up_combined;
bool tail_f32 = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can add ov::element::Type output_type = ov::element::undefined; instead, to be consistent with FullyConnectedNode

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can add ov::element::Type output_type = ov::element::undefined; instead, to be consistent with FullyConnectedNode

@usstq I see. update the spec

@xczhai xczhai force-pushed the xc/fix_sdpa_fusion_for_f16 branch 2 times, most recently from db91b52 to 050df44 Compare November 25, 2024 09:03
@xczhai xczhai force-pushed the xc/fix_sdpa_fusion_for_f16 branch from 050df44 to 688d26b Compare November 25, 2024 09:36
@dmitry-gorokhov
Copy link
Contributor

@xczhai @usstq I have couple of concerns regarding current deisgn:

  1. Why do we need to have separate Convert fusion transformation based on target operation type? Sounds like such fusion should be universal transformation (same as ConvertPrecision) and the plugin should only specify target operations types and ports where Convert should be fused.
  2. Fusion on ngraph model level representation doesn't cover "post-ops" case. Common behavior of ConvertPrecision transformation is to keep some activation functions in fp32 to preserve accuracy. It basically means the following pattern after ConvertPrecision pass: [fp16] -> FC [fp32] -> Activation [fp32] -> Convert [fp16]. So in order to fuse such Convert into FC we need to apply graph_optimizer passes first.
    I expect some day we will fully rewrite all graph_optimizer passes to ngraph rails, but for now we have to deal with that. So my proposal is to have universal pass that will allow to fuse Convert operation into other (and replace newly implemented passes with it).

What is you thoughts on that?

@xczhai
Copy link
Contributor Author

xczhai commented Nov 25, 2024

@xczhai @usstq I have couple of concerns regarding current deisgn:

  1. Why do we need to have separate Convert fusion transformation based on target operation type? Sounds like such fusion should be universal transformation (same as ConvertPrecision) and the plugin should only specify target operations types and ports where Convert should be fused.
  2. Fusion on ngraph model level representation doesn't cover "post-ops" case. Common behavior of ConvertPrecision transformation is to keep some activation functions in fp32 to preserve accuracy. It basically means the following pattern after ConvertPrecision pass: [fp16] -> FC [fp32] -> Activation [fp32] -> Convert [fp16]. So in order to fuse such Convert into FC we need to apply graph_optimizer passes first.
    I expect some day we will fully rewrite all graph_optimizer passes to ngraph rails, but for now we have to deal with that. So my proposal is to have universal pass that will allow to fuse Convert operation into other (and replace newly implemented passes with it).

What is you thoughts on that?

@dmitry-gorokhov
I get your concern on such implementation. I can explain something during the development.

  1. For LLM model, the Convert nodes are mainly inserted after FC and LLMMLP. These Convert nodes block the perfermance heavily. I find it that there's an opportunity for fusing Convert into FC or LLMMLP. Optimize the f16 perf step by step so create two similar fusion pass. You mean these two fusion pass can be unified into one fusion pass. Please correct me if I understand it incorrectly.
  2. I have some different comment on So in order to fuse such Convert into FC we need to apply graph_optimizer passes first.. During above optimization, I refer to some GPU implementation. GPU team implement a similar FC + Convert fusion pass
    manager.register_pass<ov::intel_gpu::FullyConnectedConvertFusion>();
    . Looks such fusion pass is simple to maintain.
  3. Last but not least, we try to fuse FC / LLMMLP with Convert(f16=>f32) instead of Convert(f32->f16). So we can save much time on these inserted Convert(f16=>f32) nodes .

@wenjiew wenjiew added this to the 2025.0 milestone Dec 9, 2024
Copy link
Contributor

This PR will be closed in a week because of 2 weeks of no activity.

@github-actions github-actions bot added the Stale label Dec 24, 2024
@yuxu42 yuxu42 added no_stale Do not mark as stale and removed Stale labels Dec 24, 2024
@xczhai xczhai force-pushed the xc/fix_sdpa_fusion_for_f16 branch from 9c02940 to 0dba6ff Compare January 6, 2025 08:00
@ilya-lavrenov ilya-lavrenov removed this from the 2025.0 milestone Jan 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category: CPU OpenVINO CPU plugin no_stale Do not mark as stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants