Stable Diffusion 3.x and Flux Optimization #22986

Merged: 28 commits merged into main from tlwu/sd3_optimum on Jan 14, 2025
Conversation

tianleiwu (Contributor) commented on Dec 2, 2024

Description

This PR depends on other PRs.

Optimize the ONNX pipeline for Stable Diffusion 3.x and Flux 1.0 models (fp32 or fp16).

  • Update the optimize_pipeline script
  • Update the benchmark script
  • Update documentation for Stable Diffusion 3.x and Flux 1.0 models
  • Add graph optimizations for the MMDiT model (reference math for these fusions is sketched after this list)
    • FastGelu fusion
    • RMSNorm fusion
    • MultiHeadAttention fusion
  • Add graph optimizations for Flux transformer models
    • MultiHeadAttention fusion
  • Update graph optimizations for T5
  • Add tests
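
For reference, here is a minimal NumPy sketch (not part of this PR) of the math that the RMSNorm and FastGelu fusions look for; the function and argument names are illustrative, but the formulas match what the SimplifiedLayerNormalization and FastGelu contrib ops compute.

```python
# Reference math for the fused patterns (illustrative sketch; names are not from this PR).
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm / SimplifiedLayerNormalization: scale by the reciprocal RMS over the last axis.
    mean_square = np.mean(np.square(x), axis=-1, keepdims=True)
    return x / np.sqrt(mean_square + eps) * weight

def fast_gelu(x):
    # FastGelu: tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * np.power(x, 3))))
```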

Example: optimize the ONNX pipeline for a Flux 1.0 Schnell model and convert it to fp16:

python optimize_pipeline.py -i ./flux1_schnell_onnx/fp32 -o ./flux1_schnell_onnx/fp16 --float16

  Optimize flux1_schnell_onnx/fp32/transformer/model.onnx ...
  Fused LayerNormalization: 115
  Fused SimplifiedLayerNormalization: 152
  Fused FastGelu: 76
  Fused MultiHeadAttention: 57
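
As a quick sanity check (a sketch, not part of this PR), the fused operator counts above can be reproduced by counting operator types in the optimized graph; the model path below assumes the output directory from the command above.

```python
# Count fused contrib ops in the optimized model (path assumed from the command above).
from collections import Counter
import onnx

model = onnx.load("./flux1_schnell_onnx/fp16/transformer/model.onnx")
counts = Counter(node.op_type for node in model.graph.node)
for op_type in ("LayerNormalization", "SimplifiedLayerNormalization", "FastGelu", "MultiHeadAttention"):
    print(f"Fused {op_type}: {counts[op_type]}")
```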

H100 Benchmark Results

  • GPU: NVIDIA H100 80GB HBM3
  • Image Size: 1024x1024
  • Batch Size: 1
| Model | Steps | Precision | Engine | Latency (seconds) | GPU Memory (MB) |
|---|---|---|---|---|---|
| Flux 1.0 Dev | 50 | BF16 | Torch 2.5.1 (compile) | 8.198 | 37,603 |
| Flux 1.0 Dev | 50 | FP16+BF16 | Optimum (ORT) | 10.762 | 41,469 |
| Flux 1.0 Dev | 50 | FP16+FP32 | Optimum (ORT) | 10.891 | 43,545 |
| Flux 1.0 Dev | 50 | BF16 | Torch 2.5.1 (eager) | 12.339 | 36,651 |
| Flux 1.0 Schnell | 4 | BF16 | Torch 2.5.1 (compile) | 0.775 | 37,857 |
| Flux 1.0 Schnell | 4 | FP16+BF16 | Optimum (ORT) | 0.931 | 41,433 |
| Flux 1.0 Schnell | 4 | FP16+FP32 | Optimum (ORT) | 0.939 | 43,809 |
| Flux 1.0 Schnell | 4 | BF16 | Torch 2.5.1 (eager) | 1.120 | 36,629 |
| SD 3.5 Large | 50 | BF16 | Torch 2.5.1 (compile) | 7.466 | 32,217 |
| SD 3.5 Large | 50 | FP16+BF16 | Optimum (ORT) | 10.275 | 36,609 |
| SD 3.5 Large | 50 | FP16+FP32 | Optimum (ORT) | 10.283 | 36,729 |
| SD 3.5 Large | 50 | BF16 | Torch 2.5.1 (eager) | 11.615 | 31,517 |
| SD 3.5 Medium | 50 | BF16 | Torch 2.5.1 (compile) | 3.240 | 21,143 |
| SD 3.5 Medium | 50 | FP16+BF16 | Optimum (ORT) | 4.799 | 25,097 |
| SD 3.5 Medium | 50 | FP16+FP32 | Optimum (ORT) | 4.838 | 25,109 |
| SD 3.5 Medium | 50 | BF16 | Torch 2.5.1 (eager) | 5.582 | 20,489 |

A100 Benchmark Results

  • GPU: A100-SXM4-80GB
  • Image Size: 1024x1024
  • Batch Size: 1
| Model | Steps | Precision | Engine | Latency (seconds) | GPU Memory (MB) |
|---|---|---|---|---|---|
| Flux 1.0 Dev | 50 | BF16 | Torch 2.5.1 (compile) | 17.593 | 37,723 |
| Flux 1.0 Dev | 50 | FP16+BF16 | Optimum (ORT) | 21.918 | 41,348 |
| Flux 1.0 Dev | 50 | FP16+FP32 | Optimum (ORT) | 22.060 | 44,860 |
| Flux 1.0 Dev | 50 | BF16 | Torch 2.5.1 (eager) | 24.267 | 36,847 |
| Flux 1.0 Schnell | 4 | BF16 | Torch 2.5.1 (compile) | 1.627 | 37,881 |
| Flux 1.0 Schnell | 4 | FP16+BF16 | Optimum (ORT) | 1.884 | 41,537 |
| Flux 1.0 Schnell | 4 | FP16+FP32 | Optimum (ORT) | 1.902 | 44,858 |
| Flux 1.0 Schnell | 4 | BF16 | Torch 2.5.1 (eager) | 2.162 | 36,831 |
| SD 3.5 Large | 50 | BF16 | Torch 2.5.1 (compile) | 15.881 | 32,307 |
| SD 3.5 Large | 50 | FP16+FP32 | Optimum (ORT) | 19.837 | 36,451 |
| SD 3.5 Large | 50 | FP16+BF16 | Optimum (ORT) | 19.964 | 36,461 |
| SD 3.5 Large | 50 | BF16 | Torch 2.5.1 (eager) | 22.477 | 31,513 |
| SD 3.5 Medium | 50 | BF16 | Torch 2.5.1 (compile) | 6.476 | 21,341 |
| SD 3.5 Medium | 50 | FP16+FP32 | Optimum (ORT) | 8.775 | 25,183 |
| SD 3.5 Medium | 50 | BF16 | Torch 2.5.1 (eager) | 10.057 | 20,433 |

Future Work

  • Triton kernels for matrix multiplication and auto-tuning
  • FP8/INT8 quantization

Motivation and Context

SD 3.5 Architecture:
https://huggingface.co/stabilityai/stable-diffusion-3.5-medium/resolve/main/mmdit-x.png

tianleiwu marked this pull request as draft on December 3, 2024 19:19
github-actions bot left a comment: You can commit the suggested changes from lintrunner.

tianleiwu changed the title from "[WIP] Stable Diffusion 3.x and Flux Optimization" to "Stable Diffusion 3.x and Flux Optimization" on Jan 12, 2025
tianleiwu marked this pull request as ready for review on January 12, 2025 04:12
tianleiwu merged commit 6550f4b into main on Jan 14, 2025 (96 of 98 checks passed)
tianleiwu deleted the tlwu/sd3_optimum branch on January 14, 2025 21:38
tianleiwu added a commit that referenced this pull request Jan 16, 2025
Add a tool to generate node_block_list used in [float16 conversion tool](https://github.com/microsoft/onnxruntime/blob/04030f64be10e020d3ac9aa5ba7d0f2917cbd14e/onnxruntime/python/tools/transformers/float16.py#L175).

We already have a feature to dump statistics (such as min and max) of each node's inputs and outputs. However, it is time-consuming to use that to build the list of nodes that need to be kept in float32 when the model is large.

This tool speeds up the process by directly outputting a list of nodes that may overflow during float-to-half conversion.

Usage: build onnxruntime from source with `--cmake_extra_defines onnxruntime_DEBUG_NODE_INPUTS_OUTPUTS=1`, then set the following environment variables before running the float32 optimized ONNX model, for example:
```
export ORT_DEBUG_NODE_IO_DUMP_HALF_CONVERSION_OVERFLOW=1
export ORT_DEBUG_NODE_IO_HALF_OVERFLOW_THRESHOLD=50000

python benchmark.py -e optimum --height 1024 --width 1024 --steps 3 -b 1 -v Flux.1D -p flux1_dev_onnx/fp32_opt --skip_warmup
```

The threshold `ORT_DEBUG_NODE_IO_HALF_OVERFLOW_THRESHOLD` shall be <= 65504 (the largest finite float16 value). The default value is 50000 if the environment variable is not set. It is better to leave some margin when the number of samples in the test is not large.

As a demo, we add a --skip_warmup option to benchmark.py for Flux so that we can reduce the time spent dumping warm-up runs.

Example snippet of stdout (each inference session prints such a summary when the session ends):
```
Total counter in node dumping: 141
Found 2 nodes cannot be converted to half precision due to potential input/output overflow.
Operator frequencies for these nodes:
Softmax : 1
MatMul : 1
# -------
# Example python script for float16 conversion
# For details, search `node_block_list` in https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/float16.py
# -------
import onnx
from onnxruntime.transformers.onnx_model import OnnxModel
m = OnnxModel(onnx.load('flux1_dev_onnx/fp32_opt/vae_decoder/model.onnx'))
node_block_list = [
  '/decoder/mid_block/attentions.0/Softmax',
  '/decoder/mid_block/attentions.0/MatMul',
]
m.convert_float_to_float16(keep_io_types=False, node_block_list=node_block_list)
m.save_model_to_file('fp16/optimized.onnx', use_external_data_format=False)
```
Then you can use the generated Python script to convert the corresponding model to float16.
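
A minimal sketch (not from this PR) of smoke-testing the converted float16 model with ONNX Runtime; input names, shapes, and types are read from the session itself, and symbolic dimensions are filled with small placeholder values.

```python
# Smoke-test the converted float16 model with dummy inputs (illustrative only).
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("fp16/optimized.onnx",
                               providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
feeds = {}
for inp in session.get_inputs():
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]  # replace symbolic dims with 1
    dtype = np.float16 if inp.type == "tensor(float16)" else np.float32
    feeds[inp.name] = np.random.randn(*shape).astype(dtype)
outputs = session.run(None, feeds)
print([o.shape for o in outputs])
```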

### Motivation and Context

This tool generates the node_block_list used in float16 conversion of Stable Diffusion 3.x and Flux models in #22986.

In a Stable Diffusion or Flux pipeline there are multiple models, and there can be multiple session runs for each model. Without a proper tool, it is time-consuming to get a node_block_list for each model.
carzh pushed a commit that referenced this pull request Jan 16, 2025