# Stable-fast
## Introduction
### What is this?
`stable-fast` is an ultra-lightweight inference optimization framework for HuggingFace Diffusers on NVIDIA GPUs. `stable-fast` provides super fast inference optimization by utilizing several key techniques and features:
- **CUDNN Convolution Fusion**: `stable-fast` implements a series of fully-functional and fully-compatible CUDNN convolution fusion operators for all kinds of combinations of `Conv + Bias + Add + Act` computation patterns.
- **Low Precision & Fused GEMM**: `stable-fast` implements a series of fused GEMM operators that compute in `fp16` precision, which is faster than PyTorch's defaults (read & write in `fp16` while computing in `fp32`).
- **Fused Linear GEGLU**: `stable-fast` is able to fuse `GEGLU(x, W, V, b, c) = GELU(xW + b) ⊗ (xV + c)` into one CUDA kernel (see the reference sketch after this list).
- **NHWC & Fused GroupNorm**: `stable-fast` implements a highly optimized fused NHWC `GroupNorm + SiLU` operator with OpenAI's `Triton`, which eliminates the need for memory-format permutation operators.
- **Fully Traced Model**: `stable-fast` improves the `torch.jit.trace` interface to make it better suited for tracing complex models. Nearly every part of `StableDiffusionPipeline`/`StableVideoDiffusionPipeline` can be traced and converted to TorchScript. It is more stable than `torch.compile`, has significantly lower CPU overhead than `torch.compile`, and supports ControlNet and LoRA.
- **CUDA Graph**: `stable-fast` can capture the `UNet`, `VAE` and `TextEncoder` into CUDA Graph format, which reduces CPU overhead when the batch size is small. This implementation also supports dynamic shapes.
- **Fused Multihead Attention**: `stable-fast` simply uses xformers and makes it compatible with TorchScript.
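To make the fused GEGLU pattern concrete, here is an unfused reference of the computation above in plain PyTorch. The shapes and names are illustrative only, not `stable-fast` APIs; `stable-fast` replaces this multi-kernel sequence with a single CUDA kernel:

```python
import torch
import torch.nn.functional as F

def geglu_reference(x, W, V, b, c):
    # Unfused reference: two GEMMs, one GELU, one elementwise product.
    # stable-fast fuses this whole pattern into a single CUDA kernel.
    return F.gelu(x @ W + b) * (x @ V + c)

# Illustrative shapes; in diffusers this pattern appears in the
# transformer feed-forward (GEGLU) blocks.
x = torch.randn(2, 77, 320)
W = torch.randn(320, 1280)
V = torch.randn(320, 1280)
b = torch.randn(1280)
c = torch.randn(1280)
out = geglu_reference(x, W, V, b, c)  # shape: (2, 77, 1280)
```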
My next goal is to keep `stable-fast` one of the fastest inference optimization frameworks for `diffusers`, and also to provide both speedups and VRAM reduction for `transformers`. In fact, I already use `stable-fast` to optimize LLMs and achieve a significant speedup. But some work remains to make it more stable and easier to use, and to provide a stable user interface.
### Differences With Other Acceleration Libraries
- **Fast**: `stable-fast` is specially optimized for HuggingFace Diffusers and achieves high performance across many libraries. It also compiles very quickly, within only a few seconds: significantly faster than `torch.compile`, `TensorRT`, and `AITemplate` in compilation time.
- **Minimal**: `stable-fast` works as a plugin framework for `PyTorch`. It utilizes existing `PyTorch` functionality and infrastructure, and is compatible with other acceleration techniques, as well as popular fine-tuning techniques and deployment solutions.
- **Maximum Compatibility**: `stable-fast` is compatible with all kinds of `HuggingFace Diffusers` and `PyTorch` versions. It is also compatible with `ControlNet` and `LoRA`, and it even supports the latest `StableVideoDiffusionPipeline` out of the box!
## Installation
NOTE: `stable-fast` is currently only tested on Linux and on WSL2 under Windows.
You need to install PyTorch with CUDA support first (versions from 1.12 to 2.1 are suggested). I have only tested `stable-fast` with `torch>=2.1.0`, `xformers>=0.0.22`, and `triton>=2.1.0` on `CUDA 12.1` and `Python 3.10`. Other versions might build and run successfully, but that is not guaranteed.
### Install Prebuilt Wheels
Download the wheel corresponding to your system from the Releases Page and install it with `pip3 install <wheel file>`. Currently both Linux and Windows wheels are available.
```bash
# Change cu121 to your CUDA version and <wheel file> to the path of the wheel file.
# And make sure the wheel file is compatible with your PyTorch version.
pip3 install --index-url https://download.pytorch.org/whl/cu121 \
    'torch>=2.1.0' 'xformers>=0.0.22' 'triton>=2.1.0' 'diffusers>=0.19.3' \
    '<wheel file>'
```
### Install From Source
```bash
# Make sure you have CUDNN/CUBLAS installed.
# https://developer.nvidia.com/cudnn
# https://developer.nvidia.com/cublas

# Install PyTorch with CUDA and other packages first.
# Windows users: Triton might not be available; you can skip it.
# NOTE: 'wheel' is required, or you will hit a `No module named 'torch'` error when building.
pip3 install wheel 'torch>=2.1.0' 'xformers>=0.0.22' 'triton>=2.1.0' 'diffusers>=0.19.3'

# (Optional) Makes the build much faster.
pip3 install ninja

# Set TORCH_CUDA_ARCH_LIST if running and building on different GPU types.
# You can also install the latest stable release from PyPI:
# pip3 install -v -U stable-fast
pip3 install -v -U git+https://github.com/chengzeyi/stable-fast.git@main#egg=stable-fast
# NOTE: Building from source can take dozens of minutes.
```
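Whichever route you take, a minimal way to verify the installation is to import the package (assuming the import name `sfast`, which the notes below also reference):

```python
# Minimal post-install sanity check; `sfast` is the package's import name.
import sfast
print("stable-fast imported OK")
```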
NOTE: Any usage outside `sfast.compilers` is not guaranteed to be backward compatible.
NOTE: To get the best performance, `xformers` and OpenAI's `triton>=2.1.0` need to be installed and enabled. You might need to build `xformers` from source to make it compatible with your `PyTorch` version.
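For orientation, a typical compilation flow looks roughly like the sketch below. It follows the project's published examples, but treat the exact module path (`sfast.compilers.diffusion_pipeline_compiler`), the `CompilationConfig` fields, and the example model ID as assumptions to check against the README of your installed version:

```python
import torch
from diffusers import StableDiffusionPipeline
# Module path as in the project's examples; may differ across versions.
from sfast.compilers.diffusion_pipeline_compiler import (
    compile, CompilationConfig)

# Load a standard diffusers pipeline on the GPU (example model ID).
model = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
model.to("cuda")

# Enable the optional backends if they are installed.
config = CompilationConfig.Default()
config.enable_xformers = True
config.enable_triton = True
config.enable_cuda_graph = True

model = compile(model, config)

# The first call triggers tracing/compilation; subsequent calls are fast.
image = model(prompt="an astronaut riding a horse").images[0]
```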