Stable Diffusion 3.x and Flux Optimization #22986

Merged: 28 commits merged into main from tlwu/sd3_optimum on Jan 14, 2025
Conversation

tianleiwu (Contributor) commented on Dec 2, 2024

Description

This PR depends on other PRs.

Optimize the ONNX pipeline for Stable Diffusion 3.x and Flux 1.0 models (fp32 or fp16).

  • Update the optimize_pipeline script
  • Update the benchmark script
  • Update documentation for Stable Diffusion 3.x and Flux 1.0 models
  • Add graph optimizations for the MMDiT model (reference math for these fusions is sketched after this list)
    • FastGelu fusion
    • RMSNorm fusion
    • MultiHeadAttention fusion
  • Add graph optimizations for Flux transformer models
    • MultiHeadAttention fusion
  • Update graph optimizations for T5
  • Add tests
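
For reference, here is a minimal NumPy sketch (not part of this PR) of the math that the RMSNorm and FastGelu fusions look for; the function and argument names are illustrative, but the formulas match what the SimplifiedLayerNormalization and FastGelu contrib ops compute.

```python
# Reference math for the fused patterns (illustrative sketch; names are not from this PR).
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm / SimplifiedLayerNormalization: scale by the reciprocal RMS over the last axis.
    mean_square = np.mean(np.square(x), axis=-1, keepdims=True)
    return x / np.sqrt(mean_square + eps) * weight

def fast_gelu(x):
    # FastGelu: tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * np.power(x, 3))))
```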

Example: optimize the ONNX pipeline for a Flux 1.0 Schnell model and convert it to fp16:

python optimize_pipeline.py -i ./flux1_schnell_onnx/fp32 -o ./flux1_schnell_onnx/fp16 --float16

  Optimize flux1_schnell_onnx/fp32/transformer/model.onnx ...
  Fused LayerNormalization: 115
  Fused SimplifiedLayerNormalization: 152
  Fused FastGelu: 76
  Fused MultiHeadAttention: 57
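
As a quick sanity check (a sketch, not part of this PR), the fused operator counts above can be reproduced by counting operator types in the optimized graph; the model path below assumes the output directory from the command above.

```python
# Count fused contrib ops in the optimized model (path assumed from the command above).
from collections import Counter
import onnx

model = onnx.load("./flux1_schnell_onnx/fp16/transformer/model.onnx")
counts = Counter(node.op_type for node in model.graph.node)
for op_type in ("LayerNormalization", "SimplifiedLayerNormalization", "FastGelu", "MultiHeadAttention"):
    print(f"Fused {op_type}: {counts[op_type]}")
```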

H100 Benchmark Results

  • GPU: NVIDIA H100 80GB HBM3
  • Image Size: 1024x1024
  • Batch Size: 1
| Model | Steps | Precision | Engine | Latency (seconds) | GPU Memory (MB) |
|---|---|---|---|---|---|
| Flux 1.0 Dev | 50 | BF16 | Torch 2.5.1 (compile) | 8.198 | 37,603 |
| Flux 1.0 Dev | 50 | FP16+BF16 | Optimum (ORT) | 10.762 | 41,469 |
| Flux 1.0 Dev | 50 | FP16+FP32 | Optimum (ORT) | 10.891 | 43,545 |
| Flux 1.0 Dev | 50 | BF16 | Torch 2.5.1 (eager) | 12.339 | 36,651 |
| Flux 1.0 Schnell | 4 | BF16 | Torch 2.5.1 (compile) | 0.775 | 37,857 |
| Flux 1.0 Schnell | 4 | FP16+BF16 | Optimum (ORT) | 0.931 | 41,433 |
| Flux 1.0 Schnell | 4 | FP16+FP32 | Optimum (ORT) | 0.939 | 43,809 |
| Flux 1.0 Schnell | 4 | BF16 | Torch 2.5.1 (eager) | 1.120 | 36,629 |
| SD 3.5 Large | 50 | BF16 | Torch 2.5.1 (compile) | 7.466 | 32,217 |
| SD 3.5 Large | 50 | FP16+BF16 | Optimum (ORT) | 10.275 | 36,609 |
| SD 3.5 Large | 50 | FP16+FP32 | Optimum (ORT) | 10.283 | 36,729 |
| SD 3.5 Large | 50 | BF16 | Torch 2.5.1 (eager) | 11.615 | 31,517 |
| SD 3.5 Medium | 50 | BF16 | Torch 2.5.1 (compile) | 3.240 | 21,143 |
| SD 3.5 Medium | 50 | FP16+BF16 | Optimum (ORT) | 4.799 | 25,097 |
| SD 3.5 Medium | 50 | FP16+FP32 | Optimum (ORT) | 4.838 | 25,109 |
| SD 3.5 Medium | 50 | BF16 | Torch 2.5.1 (eager) | 5.582 | 20,489 |

A100 Benchmark Results

  • GPU: A100-SXM4-80GB
  • Image Size: 1024x1024
  • Batch Size: 1
| Model | Steps | Precision | Engine | Latency (seconds) | GPU Memory (MB) |
|---|---|---|---|---|---|
| Flux 1.0 Dev | 50 | BF16 | Torch 2.5.1 (compile) | 17.593 | 37,723 |
| Flux 1.0 Dev | 50 | FP16+BF16 | Optimum (ORT) | 21.918 | 41,348 |
| Flux 1.0 Dev | 50 | FP16+FP32 | Optimum (ORT) | 22.060 | 44,860 |
| Flux 1.0 Dev | 50 | BF16 | Torch 2.5.1 (eager) | 24.267 | 36,847 |
| Flux 1.0 Schnell | 4 | BF16 | Torch 2.5.1 (compile) | 1.627 | 37,881 |
| Flux 1.0 Schnell | 4 | FP16+BF16 | Optimum (ORT) | 1.884 | 41,537 |
| Flux 1.0 Schnell | 4 | FP16+FP32 | Optimum (ORT) | 1.902 | 44,858 |
| Flux 1.0 Schnell | 4 | BF16 | Torch 2.5.1 (eager) | 2.162 | 36,831 |
| SD 3.5 Large | 50 | BF16 | Torch 2.5.1 (compile) | 15.881 | 32,307 |
| SD 3.5 Large | 50 | FP16+FP32 | Optimum (ORT) | 19.837 | 36,451 |
| SD 3.5 Large | 50 | FP16+BF16 | Optimum (ORT) | 19.964 | 36,461 |
| SD 3.5 Large | 50 | BF16 | Torch 2.5.1 (eager) | 22.477 | 31,513 |
| SD 3.5 Medium | 50 | BF16 | Torch 2.5.1 (compile) | 6.476 | 21,341 |
| SD 3.5 Medium | 50 | FP16+FP32 | Optimum (ORT) | 8.775 | 25,183 |
| SD 3.5 Medium | 50 | BF16 | Torch 2.5.1 (eager) | 10.057 | 20,433 |

Future Work

  • Triton kernels for matrix multiplication and auto-tuning
  • FP8/INT8 quantization

Motivation and Context

SD 3.5 Architecture:
https://huggingface.co/stabilityai/stable-diffusion-3.5-medium/resolve/main/mmdit-x.png

tianleiwu marked this pull request as draft on December 3, 2024 19:19
github-actions bot left a comment: You can commit the suggested changes from lintrunner.

tianleiwu changed the title from "[WIP] Stable Diffusion 3.x and Flux Optimization" to "Stable Diffusion 3.x and Flux Optimization" on Jan 12, 2025
tianleiwu marked this pull request as ready for review on January 12, 2025 04:12
tianleiwu merged commit 6550f4b into main on Jan 14, 2025 (96 of 98 checks passed)
tianleiwu deleted the tlwu/sd3_optimum branch on January 14, 2025 21:38
tianleiwu added a commit that referenced this pull request Jan 16, 2025
Add a tool to generate node_block_list used in [float16 conversion tool](https://github.com/microsoft/onnxruntime/blob/04030f64be10e020d3ac9aa5ba7d0f2917cbd14e/onnxruntime/python/tools/transformers/float16.py#L175).

We already have a feature to dump statistics (such as min and max) of each node's inputs and outputs. However, it is time-consuming to use that to build the list of nodes that need to be kept in float32 when the model is large.

This tool speeds up the process by directly outputting a list of nodes that may overflow during float-to-half conversion.

Usage: build onnxruntime from source with `--cmake_extra_defines onnxruntime_DEBUG_NODE_INPUTS_OUTPUTS=1`, then set the following environment variables before running the float32 optimized ONNX model, for example:
```
export ORT_DEBUG_NODE_IO_DUMP_HALF_CONVERSION_OVERFLOW=1
export ORT_DEBUG_NODE_IO_HALF_OVERFLOW_THRESHOLD=50000

python benchmark.py -e optimum --height 1024 --width 1024 --steps 3 -b 1 -v Flux.1D -p flux1_dev_onnx/fp32_opt --skip_warmup
```

The threshold `ORT_DEBUG_NODE_IO_HALF_OVERFLOW_THRESHOLD` shall be <= 65504 (the largest finite float16 value). The default value is 50000 if the environment variable is not set. It is better to leave some margin when the number of samples in the test is not large.

As a demo, we add a --skip_warmup option to benchmark.py for Flux so that we can reduce the time spent dumping warm-up runs.

Example snippet of stdout (each inference session prints such a summary when the session ends):
```
Total counter in node dumping: 141
Found 2 nodes cannot be converted to half precision due to potential input/output overflow.
Operator frequencies for these nodes:
Softmax : 1
MatMul : 1
# -------
# Example python script for float16 conversion
# For details, search `node_block_list` in https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/float16.py
# -------
import onnx
from onnxruntime.transformers.onnx_model import OnnxModel
m = OnnxModel(onnx.load('flux1_dev_onnx/fp32_opt/vae_decoder/model.onnx'))
node_block_list = [
  '/decoder/mid_block/attentions.0/Softmax',
  '/decoder/mid_block/attentions.0/MatMul',
]
m.convert_float_to_float16(keep_io_types=False, node_block_list=node_block_list)
m.save_model_to_file('fp16/optimized.onnx', use_external_data_format=False)
```
Then you can use the generated Python script to convert the corresponding model to float16.
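
A minimal sketch (not from this PR) of smoke-testing the converted float16 model with ONNX Runtime; input names, shapes, and types are read from the session itself, and symbolic dimensions are filled with small placeholder values.

```python
# Smoke-test the converted float16 model with dummy inputs (illustrative only).
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("fp16/optimized.onnx",
                               providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
feeds = {}
for inp in session.get_inputs():
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]  # replace symbolic dims with 1
    dtype = np.float16 if inp.type == "tensor(float16)" else np.float32
    feeds[inp.name] = np.random.randn(*shape).astype(dtype)
outputs = session.run(None, feeds)
print([o.shape for o in outputs])
```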

### Motivation and Context

This tool generates the node_block_list used in float16 conversion of Stable Diffusion 3.x and Flux models in #22986.

In a Stable Diffusion or Flux pipeline there are multiple models, and there can be multiple session runs for each model. Without a proper tool, it is time-consuming to get a node_block_list for each model.
carzh pushed a commit that referenced this pull request Jan 16, 2025