# Stable-fast
## Introduction
### What is this?
`stable-fast` is an ultra-lightweight inference optimization framework for HuggingFace Diffusers on NVIDIA GPUs. `stable-fast` provides super fast inference optimization by utilizing several key techniques and features:
- **CUDNN Convolution Fusion**: `stable-fast` implements a series of fully-functional and fully-compatible CUDNN convolution fusion operators for all kinds of combinations of `Conv + Bias + Add + Act` computation patterns.
- **Low Precision & Fused GEMM**: `stable-fast` implements a series of fused GEMM operators that compute in `fp16` precision, which is faster than PyTorch's defaults (read & write in `fp16` while computing in `fp32`).
- **Fused Linear GEGLU**: `stable-fast` is able to fuse `GEGLU(x, W, V, b, c) = GELU(xW + b) ⊗ (xV + c)` into one CUDA kernel (see the reference sketch after this list).
- **NHWC & Fused GroupNorm**: `stable-fast` implements a highly optimized fused NHWC `GroupNorm + SiLU` operator with OpenAI's `Triton`, which eliminates the need for memory-format permutation operators.
- **Fully Traced Model**: `stable-fast` improves the `torch.jit.trace` interface to make it better suited for tracing complex models. Nearly every part of `StableDiffusionPipeline`/`StableVideoDiffusionPipeline` can be traced and converted to TorchScript. It is more stable than `torch.compile`, has significantly lower CPU overhead than `torch.compile`, and supports ControlNet and LoRA.
- **CUDA Graph**: `stable-fast` can capture the `UNet`, `VAE` and `TextEncoder` into CUDA Graph format, which reduces CPU overhead when the batch size is small. This implementation also supports dynamic shapes.
- **Fused Multihead Attention**: `stable-fast` simply uses xformers and makes it compatible with TorchScript.
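To make the fused GEGLU pattern concrete, here is an unfused reference of the computation above in plain PyTorch. The shapes and names are illustrative only, not `stable-fast` APIs; `stable-fast` replaces this multi-kernel sequence with a single CUDA kernel:

```python
import torch
import torch.nn.functional as F

def geglu_reference(x, W, V, b, c):
    # Unfused reference: two GEMMs, one GELU, one elementwise product.
    # stable-fast fuses this whole pattern into a single CUDA kernel.
    return F.gelu(x @ W + b) * (x @ V + c)

# Illustrative shapes; in diffusers this pattern appears in the
# transformer feed-forward (GEGLU) blocks.
x = torch.randn(2, 77, 320)
W = torch.randn(320, 1280)
V = torch.randn(320, 1280)
b = torch.randn(1280)
c = torch.randn(1280)
out = geglu_reference(x, W, V, b, c)  # shape: (2, 77, 1280)
```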
My next goal is to keep `stable-fast` one of the fastest inference optimization frameworks for `diffusers`, and also to provide both speedups and VRAM reduction for `transformers`. In fact, I already use `stable-fast` to optimize LLMs and achieve a significant speedup. But some work remains to make it more stable and easier to use, and to provide a stable user interface.
### Differences With Other Acceleration Libraries
- **Fast**: `stable-fast` is specially optimized for HuggingFace Diffusers and achieves high performance across many libraries. It also compiles very quickly, within only a few seconds: significantly faster than `torch.compile`, `TensorRT`, and `AITemplate` in compilation time.
- **Minimal**: `stable-fast` works as a plugin framework for `PyTorch`. It utilizes existing `PyTorch` functionality and infrastructure, and is compatible with other acceleration techniques, as well as popular fine-tuning techniques and deployment solutions.
- **Maximum Compatibility**: `stable-fast` is compatible with all kinds of `HuggingFace Diffusers` and `PyTorch` versions. It is also compatible with `ControlNet` and `LoRA`, and it even supports the latest `StableVideoDiffusionPipeline` out of the box!
## Installation
NOTE: `stable-fast` is currently only tested on Linux and on WSL2 under Windows.
You need to install PyTorch with CUDA support first (versions from 1.12 to 2.1 are suggested). I have only tested `stable-fast` with `torch>=2.1.0`, `xformers>=0.0.22`, and `triton>=2.1.0` on `CUDA 12.1` and `Python 3.10`. Other versions might build and run successfully, but that is not guaranteed.
### Install Prebuilt Wheels
Download the wheel corresponding to your system from the Releases Page and install it with `pip3 install <wheel file>`. Currently both Linux and Windows wheels are available.
```bash
# Change cu121 to your CUDA version and <wheel file> to the path of the wheel file.
# And make sure the wheel file is compatible with your PyTorch version.
pip3 install --index-url https://download.pytorch.org/whl/cu121 \
    'torch>=2.1.0' 'xformers>=0.0.22' 'triton>=2.1.0' 'diffusers>=0.19.3' \
    '<wheel file>'
```
### Install From Source
```bash
# Make sure you have CUDNN/CUBLAS installed.
# https://developer.nvidia.com/cudnn
# https://developer.nvidia.com/cublas

# Install PyTorch with CUDA and other packages first.
# Windows users: Triton might not be available; you can skip it.
# NOTE: 'wheel' is required, or you will hit a `No module named 'torch'` error when building.
pip3 install wheel 'torch>=2.1.0' 'xformers>=0.0.22' 'triton>=2.1.0' 'diffusers>=0.19.3'

# (Optional) Makes the build much faster.
pip3 install ninja

# Set TORCH_CUDA_ARCH_LIST if running and building on different GPU types.
# You can also install the latest stable release from PyPI:
# pip3 install -v -U stable-fast
pip3 install -v -U git+https://github.com/chengzeyi/stable-fast.git@main#egg=stable-fast
# NOTE: Building from source can take dozens of minutes.
```
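Whichever route you take, a minimal way to verify the installation is to import the package (assuming the import name `sfast`, which the notes below also reference):

```python
# Minimal post-install sanity check; `sfast` is the package's import name.
import sfast
print("stable-fast imported OK")
```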
NOTE: Any usage outside `sfast.compilers` is not guaranteed to be backward compatible.
NOTE: To get the best performance, `xformers` and OpenAI's `triton>=2.1.0` need to be installed and enabled. You might need to build `xformers` from source to make it compatible with your `PyTorch` version.
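For orientation, a typical compilation flow looks roughly like the sketch below. It follows the project's published examples, but treat the exact module path (`sfast.compilers.diffusion_pipeline_compiler`), the `CompilationConfig` fields, and the example model ID as assumptions to check against the README of your installed version:

```python
import torch
from diffusers import StableDiffusionPipeline
# Module path as in the project's examples; may differ across versions.
from sfast.compilers.diffusion_pipeline_compiler import (
    compile, CompilationConfig)

# Load a standard diffusers pipeline on the GPU (example model ID).
model = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
model.to("cuda")

# Enable the optional backends if they are installed.
config = CompilationConfig.Default()
config.enable_xformers = True
config.enable_triton = True
config.enable_cuda_graph = True

model = compile(model, config)

# The first call triggers tracing/compilation; subsequent calls are fast.
image = model(prompt="an astronaut riding a horse").images[0]
```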