
[GPU] rope optimization #27907

Merged

Conversation

@riverlijunjie (Contributor) commented Dec 4, 2024

Details:

  • Optimize the RoPE OpenCL kernel to improve its performance
  • Test results show it improves RoPE performance by about 50% on average.
| Kernel (batch=128, seq_length=7) | Base latency (ns) | Optimized latency (ns) | Latency decrease | Test | Precision |
| --- | --- | --- | --- | --- | --- |
| rope_ref_5266667119713786613_0_0__sa | 921352 | 872395 | 5.31% | RoPETestQwen7b | f32 |
| rope_ref_2672092794364911740_0_0__sa | 1724374 | 514790 | 70.15% | RoPETestChatGLM | f32 |
| rope_ref_8061762790816124098_0_0__sa | 633019 | 127186 | 79.91% | RoPETestQwen7b | f16 |
| rope_ref_4392014836945391706_0_0__sa | 629791 | 518749 | 17.63% | RoPETestLlama2 | f32 |
| rope_ref_13829176589243505378_0_0__sa | 870312 | 259583 | 70.17% | RoPETestChatGLM | f32 |
| rope_ref_6813544162411765619_0_0__sa | 749895 | 421875 | 43.74% | RoPETestChatGLM | f16 |
| rope_ref_15054358246334082928_0_0__sa | 637708 | 45208 | 92.91% | RoPETestFlux | f32 |
| rope_ref_3898891400599565440_0_0__sa | 378333 | 335937 | 11.21% | RoPETestRotateHalfWithoutTranspose | f32 |
| rope_ref_18119704851383556529_0_0__sa | 371250 | 208645 | 43.80% | RoPETestChatGLM | f16 |
| rope_ref_17460680473512025171_0_0__sa | 299166 | 98958 | 66.92% | RoPETestFlux | f16 |

![image](https://github.com/user-attachments/assets/4328b1a7-18ec-485f-abd0-b0fe16785854)

Tickets:

  • CVS-157438
@riverlijunjie riverlijunjie requested review from a team as code owners December 4, 2024 12:31
@github-actions bot added labels `category: IE Tests` (OpenVINO Test: plugins and common), `category: GPU` (OpenVINO GPU plugin), and `category: CPU` (OpenVINO CPU plugin) Dec 4, 2024
Comment on lines 59 to 69
switch (impl_param.get_input_layout(0).data_type) {
case data_types::f16:
    params.vec_size = 16;
    break;
case data_types::f32:
    params.vec_size = 8;
    break;
default:
    params.vec_size = 1;
    break;
}
Contributor:
This vec_size is a parameter of a specific kernel, while an OCL primitive_impl can be used for multiple kernels. So the suggestion is to move it to rope_kernel_ref.cpp.

Contributor Author:

done

} else if (params.is_chatglm) {
    if (params.support_2d_rope) {
        // input  [batch_size, seq_length]
        // output [batch_size, head_count, seq_length, half_rotary_ndims]
        dispatchData.gws = {input.Batch().v * params.head_cnt,
                            input.Feature().v,
-                           params.rotary_ndims / 2ul};
+                           params.rotary_ndims / 2ul / params.vec_size};
Contributor:
What if half of rotary_ndims is not divisible by vec_size? I think you should either add a fallback to vec_size = 1 for that case or add tail processing.

Contributor Author:
Added a fallback to vec_size = 1 when half of rotary_ndims is not divisible by vec_size.

@riverlijunjie riverlijunjie changed the title [TEST][GPU] rope optimization [GPU] rope optimization Dec 6, 2024
JitConstants GetJitConstants(const rope_params& params, DispatchData dispatchData) const override;
DispatchData SetDefault(const rope_params& params) const override;

private:
    mutable size_t vec_size;
Contributor:
We have a single object instance of this class used for all the RoPE layers in the model. To prevent issues with RoPE layers having different data types or rotary_ndims (and therefore different vec_size), it's better to introduce a function like `size_t get_vec_size(...)` and call it directly from GetJitConstants() and SetDefault().

Contributor Author:
Makes sense! Updated it!

#include <string>

namespace kernel_selector {
ParamsKey RoPEKernelOpt::GetSupportedKey() const {
Contributor:
Do we really need a separate opt kernel instance? I think it's a lot of code duplication, as the opt kernel with vec_size = 1 is identical to the ref one. I suggest keeping a single kernel.

Contributor Author:
I thought a simple reference kernel implementation would be helpful to consult when a new RoPE type needs to be added.
Anyhow, keeping one RoPE kernel implementation also makes sense!

@vladimir-paramuzov vladimir-paramuzov added this pull request to the merge queue Dec 16, 2024
@vladimir-paramuzov vladimir-paramuzov added this to the 2025.0 milestone Dec 16, 2024
Merged via the queue into openvinotoolkit:master with commit 9651768 Dec 16, 2024
181 checks passed
11happy pushed a commit to 11happy/openvino that referenced this pull request Dec 23, 2024
### Details:
 - Optimize rope opencl kernel to improve its performance
 - Test result shows it can improve RoPE performance about 50% in average.

### Tickets:
 - *CVS-157438*