Adapting stable-diffusion.cpp for MiniT2I

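Background

MiniT2I is a lightweight text-to-image model. Unlike the SD1.x / SDXL U-Net models commonly used with stable-diffusion.cpp, its core is not a multi-scale convolutional U-Net but a DiT / MM-DiT-style joint image-text Transformer.

The goal of this adaptation is to support MiniT2I inference in stable-diffusion.cpp, which involves:

Recognizing the MiniT2I model structure and weight prefixes;
Loading the MiniT2I diffusion transformer;
Consuming the output of the google/flan-t5-large text encoder;
Implementing the MiniT2I-specific sampling / CFG update flow;
Running on the Metal and CUDA backends;
Caching the position encodings and RoPE tables that do not change between steps;
Measuring the speedup that CUDA --diffusion-fa gives MiniT2I.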

MiniT2I model architecture

From the perspective of this adaptation, MiniT2I can be understood as:

Text Only
T5 text encoder
+ patchify image/noise input
+ text preamble transformer
+ text-image double-stream DiT blocks
+ final image-token projection / unpatchify

The default b16 configuration is roughly as follows:

Setting	Value
image size	512
patch size	16
image channels	3
T5 hidden size	1024
model hidden size	768
text hidden size	768
prompt length	256
text preamble blocks	2
double-stream blocks	17
heads	12
head dim	64
MLP ratio	2.6667
patch tokens	32 x 32 = 1024

The core configuration is defined in:

Text Only
src/model/diffusion/minit2i.hpp
MiniT2IConfig
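The table implies a few derived quantities that are useful when checking tensor shapes. The following is a minimal sketch, not the actual `MiniT2IConfig` struct: the class name `MiniT2IConfigSketch` and its property names are hypothetical, and only the numeric values come from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MiniT2IConfigSketch:
    # Values copied from the b16 table; the class itself is hypothetical.
    image_size: int = 512
    patch_size: int = 16
    image_channels: int = 3
    t5_hidden_size: int = 1024
    hidden_size: int = 768
    prompt_length: int = 256
    num_heads: int = 12
    head_dim: int = 64

    @property
    def patches_per_side(self) -> int:
        return self.image_size // self.patch_size

    @property
    def num_patch_tokens(self) -> int:
        return self.patches_per_side ** 2

    @property
    def joint_sequence_length(self) -> int:
        # Text tokens plus image tokens seen by joint attention.
        return self.prompt_length + self.num_patch_tokens

    @property
    def patch_dim(self) -> int:
        # Number of raw pixel values covered by one patch.
        return self.patch_size * self.patch_size * self.image_channels


cfg = MiniT2IConfigSketch()
print(cfg.num_patch_tokens)                             # 1024
print(cfg.joint_sequence_length)                        # 1280
print(cfg.num_heads * cfg.head_dim == cfg.hidden_size)  # True
print(cfg.patch_dim)                                    # 768
```

The joint sequence length of 1280 tokens is what the attention kernels discussed later operate on.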

End-to-end inference data flow

flowchart TD
    P["Prompt"] --> T5["google/flan-t5-large\nT5 Encoder"]
    T5 --> H["Text hidden states\n[L=256, D=1024]"]
    H --> M["Prompt mask\nvalid token mask"]

    N["Initial noise image tensor\n512x512x3"] --> XT["x_t = noise * 2"]
    XT --> LOOP["MiniT2I sampling loop"]

    LOOP --> CF["Cond forward\nmask = prompt mask"]
    LOOP --> UF["Uncond forward\nmask = zeros"]

    H --> CF
    M --> CF
    H --> UF
    ZM["zero mask"] --> UF

    CF --> CX0["cond_x0"]
    UF --> UX0["uncond_x0"]

    CX0 --> CFG["CFG velocity update\nv = uncond_v + cfg_scale * (cond_v - uncond_v)"]
    UX0 --> CFG
    CFG --> NEXT["x_t += v * dt"]
    NEXT --> LOOP

    LOOP --> OUT["Final denoised image tensor"]
    OUT --> FVAE["FakeVAE decode"]
    FVAE --> IMG["Output PNG"]

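MiniT2I currently uses FakeVAE in stable-diffusion.cpp. In other words, it does not recover the image through a VAE latent decode as traditional SD does; it works more or less directly on the image tensor / pseudo-latent tensor.

Corresponding source location:

Text Only
src/stable-diffusion.cpp
sd_version_is_minit2i(version) -> FakeVAE

Diffusion Backbone

The MiniT2I diffusion backbone is implemented by MMJiT. It first patchifies the image into tokens and projects the T5 hidden states to the model hidden size, then models the two jointly through the text preamble blocks and the text-image double-stream blocks.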

flowchart TD
    X["Image / noisy x_t\n[W,H,3,B]"] --> PE["BottleneckPatchEmbed\n16x16 conv stride 16 + 1x1 conv"]
    PE --> IMG["Image tokens\n[1024, 768]"]
    POS["2D sin/cos pos_embed\ncached"] --> IMGADD["Add image pos embed"]
    IMG --> IMGADD

    TXT0["T5 hidden states\n[256,1024]"] --> MASK["apply_text_mask\nprompt token or mask_token"]
    MASK --> TXTE["txt_embedder Linear\n1024 -> 768"]
    TXTE --> TXT["Text tokens\n[256,768]"]

    TXTR["text RoPE\ncached"] --> TP["2x PlainTextTransformerBlock"]
    TXT --> TP
    TP --> TXT2["Refined text tokens"]

    IMGADD --> DB["17x DoubleStreamDiTBlock"]
    TXT2 --> DB
    JROPE["joint text+vision RoPE\ncached"] --> DB

    DB --> IMG2["Updated image tokens"]
    DB --> TXT3["Updated text tokens"]

    TXT3 --> CAT["concat text + image tokens"]
    IMG2 --> CAT
    CAT --> FINAL["FinalLayer\nRMSNorm + Linear"]
    FINAL --> SLICE["slice image tokens"]
    SLICE --> UNPATCH["DiT unpatchify"]
    UNPATCH --> X0["predicted x0"]

DoubleStreamDiTBlock

The heart of MiniT2I is the DoubleStreamDiTBlock. Each block processes image tokens and text tokens separately, then concatenates their Q/K/V into one joint sequence for attention.

flowchart LR
    IMG["image tokens"] --> IN1["RMSNorm"]
    TXT["text tokens"] --> TN1["RMSNorm"]

    IN1 --> IQKV["img_qkv Linear"]
    TN1 --> TQKV["txt_qkv Linear"]

    IQKV --> IQ["img q/k/v"]
    TQKV --> TQ["txt q/k/v"]

    IQ --> CONCAT["concat text+image q/k/v"]
    TQ --> CONCAT

    CONCAT --> ROPE["RoPE"]
    ROPE --> ATTN["Self Attention over joint sequence\ntext + image tokens"]
    ATTN --> SPLIT["split attention output"]

    SPLIT --> IPROJ["img attn proj"]
    SPLIT --> TPROJ["txt attn proj"]

    IPROJ --> IRES["image residual"]
    TPROJ --> TRES["text residual"]

    IRES --> IMLP["RMSNorm + SwiGLU MLP"]
    TRES --> TMLP["RMSNorm + SwiGLU MLP"]

    IMLP --> IMGOUT["updated image tokens"]
    TMLP --> TXTOUT["updated text tokens"]
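The concat, attend, split pattern in the diagram can be illustrated with a minimal pure-Python sketch. It is a simplification under stated assumptions: a single head, no RoPE, no QKV or output projections, and text tokens placed before image tokens (as the "concat text+image" label suggests). The function names are hypothetical and do not mirror the C++ code.

```python
import math


def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v))
                    for d in range(len(v[0]))])
    return out


def joint_attention(txt_qkv, img_qkv):
    """Concatenate text and image q/k/v (text first), attend once, split back."""
    (tq, tk, tv), (iq, ik, iv) = txt_qkv, img_qkv
    joint = attention(tq + iq, tk + ik, tv + iv)
    n_txt = len(tq)
    return joint[:n_txt], joint[n_txt:]


# Toy tokens: 2 text tokens and 3 image tokens with 2 features each.
txt = [[1.0, 0.0], [0.0, 1.0]]
img = [[0.5, 0.5], [1.0, 1.0], [0.0, 2.0]]
txt_out, img_out = joint_attention((txt, txt, txt), (img, img, img))
print(len(txt_out), len(img_out))  # 2 3
```

Because every query attends over the full joint sequence, each text token can read from image tokens and vice versa, which is what distinguishes this from two independent self-attention streams.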

Profiling shows that DoubleStreamDiTBlock produces several typical kinds of kernels:

cutlass_80_tensorop_*: Linear / GEMM;
soft_max_f32 / scale_f32: attention on the non-flash-attention path;
cpy_scalar / concat_*: Q/K/V concat and layout conversion;
k_bin_bcast: elementwise / broadcast;
rms_norm_f32: RMSNorm;
silu: SwiGLU MLP gate.
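When post-processing a kernel trace, the mapping above can be written down as a small classifier. This is a hedged sketch: the patterns are derived from the kernel names listed above, the category labels are borrowed from the ncu tables later in this document, and the suffixes in the sample names marked "hypothetical" are placeholders rather than real kernel names.

```python
import re
from collections import Counter

# (category, pattern) pairs; first match wins.
CATEGORY_PATTERNS = [
    ("GEMM", re.compile(r"^cutlass_")),
    ("FlashAttention", re.compile(r"^flash_attn")),
    ("Softmax", re.compile(r"^soft_max")),
    ("Scale", re.compile(r"^scale_")),
    ("Copy/Layout", re.compile(r"^(cpy_|concat_)")),
    ("Elementwise/Broadcast", re.compile(r"^k_bin_bcast")),
    ("RMSNorm", re.compile(r"^rms_norm")),
    ("Activation", re.compile(r"^silu")),
]


def classify_kernel(name):
    """Map a kernel name to a coarse category, or "Other" if nothing matches."""
    for category, pattern in CATEGORY_PATTERNS:
        if pattern.search(name):
            return category
    return "Other"


sample_names = [
    "cutlass_80_tensorop_placeholder",  # hypothetical suffix
    "soft_max_f32",
    "scale_f32",
    "cpy_scalar",
    "concat_placeholder",               # hypothetical suffix
    "k_bin_bcast",
    "rms_norm_f32",
    "silu",
]
counts = Counter(classify_kernel(n) for n in sample_names)
print(counts["Copy/Layout"])  # 2
```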

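Core implementation of the adaptation

1. Add MiniT2I model detection and loading

MiniT2I model version detection was added, and the MiniT2I path is selected during model initialization. The relevant files are:

Text Only
src/model.h
src/model_loader.cpp
src/stable-diffusion.cpp

The MiniT2I diffusion runner lives in:

Text Only
src/model/diffusion/minit2i.hpp
MiniT2I::MiniT2IRunner

Several candidate weight prefixes are supported:

Text Only
model.net
model.diffusion_model.net
model.diffusion_model.model.net

This keeps the loader compatible with safetensors keys produced by different save formats.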
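The idea of probing several candidate prefixes can be sketched as follows. This is an illustration only, not the loader's actual logic: the function `detect_prefix` and the sample tensor keys are hypothetical, and only the three prefix strings come from the list above.

```python
CANDIDATE_PREFIXES = [
    "model.net",
    "model.diffusion_model.net",
    "model.diffusion_model.model.net",
]


def detect_prefix(tensor_names, candidates=CANDIDATE_PREFIXES):
    """Return the first candidate prefix that at least one tensor name starts with."""
    for prefix in candidates:
        # Require the trailing dot so that a prefix only matches whole key segments.
        if any(name.startswith(prefix + ".") for name in tensor_names):
            return prefix
    return None


# Hypothetical safetensors keys, for illustration only.
keys = [
    "model.diffusion_model.net.blocks.0.weight",
    "model.diffusion_model.net.final.weight",
]
print(detect_prefix(keys))  # model.diffusion_model.net
```

Matching on `prefix + "."` means none of the three candidates can shadow another, so the probing order does not affect the result.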

2. Hook up the flan-t5-large text encoder

MiniT2I uses T5 hidden states as its text conditioning. This adaptation uses:

Text Only
google/flan-t5-large

The text encoder outputs are mapped as follows:

Text Only
cond.c_crossattn -> T5 hidden states
cond.c_vector -> prompt mask

The MiniT2I sampling branch checks that both are present; the corresponding message is:

Text Only
MiniT2I requires T5 hidden states and prompt mask

3. Implement MiniT2I-specific sampling

MiniT2I does not go through the generic SD denoiser/sigma path. It uses its own x0 / velocity update:

Text Only

x_t = noise * 2

for step i:
    t_cur  = i / steps
    t_next = (i + 1) / steps

    cond_x0   = MiniT2I(x_t, t_cur, prompt_mask)
    uncond_x0 = MiniT2I(x_t, t_cur, zero_mask)

    cond_v   = (cond_x0 - x_t) / (1 - t_cur)
    uncond_v = (uncond_x0 - x_t) / (1 - t_cur)
    v        = uncond_v + (cond_v - uncond_v) * cfg_scale

    x_t += v * (t_next - t_cur)

This logic lives in:

Text Only
src/stable-diffusion.cpp
MiniT2I denoise loop

The current CFG path needs two diffusion forwards per step:

conditional forward;
unconditional forward.

This is also the most important source of bottlenecks in the performance analysis below.
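The pseudocode above translates almost line by line into runnable Python. The sketch below works on flat lists of floats and uses a toy stand-in for the network; `sample` and `toy_model` are hypothetical names, and the real loop in stable-diffusion.cpp operates on tensors rather than lists.

```python
def sample(model, noise, steps, cfg_scale, prompt_mask, zero_mask):
    """x0-prediction sampler with a CFG velocity update, two forwards per step."""
    x_t = [2.0 * n for n in noise]
    for i in range(steps):
        t_cur = i / steps
        t_next = (i + 1) / steps
        cond_x0 = model(x_t, t_cur, prompt_mask)    # conditional forward
        uncond_x0 = model(x_t, t_cur, zero_mask)    # unconditional forward
        inv = 1.0 / (1.0 - t_cur)                   # t_cur < 1 for every step
        dt = t_next - t_cur
        new_x = []
        for x, c, u in zip(x_t, cond_x0, uncond_x0):
            cond_v = (c - x) * inv
            uncond_v = (u - x) * inv
            v = uncond_v + (cond_v - uncond_v) * cfg_scale
            new_x.append(x + v * dt)
        x_t = new_x
    return x_t


def toy_model(x_t, t, mask):
    # Stand-in for the network: predicts x0 = 1 with a prompt, x0 = 0 without.
    target = 1.0 if any(mask) else 0.0
    return [target for _ in x_t]


result = sample(toy_model, noise=[0.3, -0.7], steps=10, cfg_scale=1.0,
                prompt_mask=[1, 1, 0], zero_mask=[0, 0, 0])
print([round(v, 6) for v in result])  # [1.0, 1.0]
```

With `cfg_scale=1.0` the update reduces to the conditional velocity, and the last step has `dt == 1 - t_cur`, so the sample lands on the conditional x0 prediction.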

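4. Cache position encodings and RoPE

The image size, text length, hidden size, and head dim normally stay the same across MiniT2I steps, so the following tensors should not be regenerated and uploaded on every denoise step:

image 2D sin/cos pos_embed;
text RoPE;
joint text+vision RoPE.

Optimization commit:

Text Only
8de8f95 Optimize MiniT2I position cache

Implementation entry point:

Text Only
MiniT2IRunner::ensure_position_cache(img_side, txt_len)

The cached tensors are allocated in a runner-level backend buffer:

Text Only
cached_pos_embed
cached_txt_pe
cached_joint_pe

On local Metal with 10 steps, sampling dropped from about 18.77s to about 14.23s. The gain on remote CUDA was not noticeable, which indicates that the main CUDA bottleneck is not this CPU-side generation and upload.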
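The caching idea is to rebuild the tables only when the shape key changes. The following sketch illustrates that pattern in plain Python; it is not the C++ implementation. The `PositionCache` class, the 1D RoPE table, and `theta=10000.0` are assumptions made for illustration (the real joint RoPE may use a different layout for image tokens).

```python
import math


def rope_table(num_positions, head_dim, theta=10000.0):
    """1D RoPE table: one (cos, sin) pair per position and frequency."""
    half = head_dim // 2
    freqs = [theta ** (-2.0 * i / head_dim) for i in range(half)]
    return [[(math.cos(p * f), math.sin(p * f)) for f in freqs]
            for p in range(num_positions)]


class PositionCache:
    """Rebuilds position tables only when (img_side, txt_len, head_dim) changes."""

    def __init__(self):
        self._key = None
        self._tables = None
        self.builds = 0  # counts how often the tables were regenerated

    def ensure(self, img_side, txt_len, head_dim=64):
        key = (img_side, txt_len, head_dim)
        if key != self._key:
            self._tables = {
                "txt_pe": rope_table(txt_len, head_dim),
                "joint_pe": rope_table(txt_len + img_side * img_side, head_dim),
            }
            self._key = key
            self.builds += 1
        return self._tables


cache = PositionCache()
for _ in range(10):                      # ten denoise steps, same shapes
    tables = cache.ensure(img_side=32, txt_len=256)
print(cache.builds)                      # 1
print(len(tables["joint_pe"]))           # 1280
```

In the C++ runner the equivalent of `self._tables` would live in a backend buffer, so a cache hit avoids both the CPU-side generation and the upload.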

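5. Remove the unused conditioning branch

The original graph contained:

Text Only
t_vec + pooled_text -> vec

However, in the current MiniT2I forward this branch is not consumed by any block or by the final layer. Cleanup commit:

Text Only
dfb6ca2 Remove unused MiniT2I conditioning branch

Local verification shows the output hash is identical to the one before the cleanup, so the removal is safe.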
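A bitwise regression check of this kind can be scripted in a few lines. The sketch below assumes the reported hashes are SHA-256 digests of the output file (their 64 hex characters are consistent with that, but the source does not state the algorithm); the helper names and file paths are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def outputs_match(path_before, path_after):
    """True if two output files are byte-for-byte identical."""
    return sha256_of_file(path_before) == sha256_of_file(path_after)


if __name__ == "__main__":
    import sys
    # Usage (hypothetical paths): python compare.py before.png after.png
    if len(sys.argv) == 3:
        print("identical" if outputs_match(sys.argv[1], sys.argv[2]) else "different")
```

An identical digest only proves the two runs produced the same bytes with the same seed and settings; it is a suitable check for refactors that should not change numerics at all.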

Reference: official Python pipeline

Python
import torch
from diffusers import DiffusionPipeline

HUB_MODEL_PATH = "/home/ken/tmp/MiniT2I/MiniT2I"

pipe = DiffusionPipeline.from_pretrained(
    HUB_MODEL_PATH,
    custom_pipeline=HUB_MODEL_PATH,
    trust_remote_code=True,
    local_files_only=True,
).to("cuda")

image = pipe(
    "A girl and a boy kiss",
    model_type="b16",
    guidance_scale=2.5,
    num_inference_steps=10,
    torch_dtype=torch.bfloat16,
).images[0]
image.save("minit2i-b16.png")

stable-diffusion.cpp CUDA test

Bash
cd stable-diffusion.cpp

./build-cuda/bin/sd-cli \
  --backend cuda \
  --model MiniT2I/MiniT2I/minit2i-b-16/transformer/diffusion_pytorch_model.safetensors \
  --t5xxl huggingface/hub/models--google--flan-t5-large/snapshots/0613663d0d48ea86ba8cb3d7a44f0f65dc596a2a/model.safetensors \
  --prompt "a cat" \
  --steps 10 \
  --cfg-scale 6 \
  --width 512 \
  --height 512 \
  --seed 42 \
  --sampling-method euler \
  --rng cpu \
  --output /tmp/minit2i_cuda_step10.png \
  --threads 8

Performance analysis

Baseline profiling

To quantify MiniT2I's bottlenecks, two profiling switches were added:

Bash
SD_PROFILE_BACKEND_IO=1  # backend upload/readback + per-step host timing
SD_PROFILE_NVTX=1        # NVTX ranges for nsys/ncu

NVTX ranges:

Text Only
MiniT2I sampling
MiniT2I step N
MiniT2I cond forward
MiniT2I uncond forward
MiniT2I CPU CFG update

Mac Metal results

Test configuration:

Text Only
prompt: a cat
steps: 10
cfg-scale: 6
size: 512x512
seed: 42
backend: metal

Results:

Text Only
sampling completed: 15.44s
generate_image completed: 17.92s
output hash:
73413ff571b005b2b786250d5cbab2a7660f7951e2c8f15521ab55811f1e0b77

Observations:

After step 2, a single forward mostly takes about 610-650ms;
Each step runs two forwards, cond and uncond;
cpu_update_ms is mostly about 2ms;
The latent is 3145728 bytes (512 x 512 x 3 values at 4 bytes each);
Backend upload/readback is mostly in the 0.05-0.2ms range.

Conclusion:

On Metal the main bottleneck is still the MiniT2I transformer forward. The CPU-side CFG / x_t update is not a major bottleneck.

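Remote CUDA results

GPU:

Text Only
NVIDIA GeForce RTX 4050 Laptop GPU
VRAM 6140 MiB

Results:

Text Only
sampling completed: 4.12s
generate_image completed: 5.52s
output hash:
76ed368fe1447d0de3027021889fb2255a3a924992039be5ba0ddc467046c416

Steady-state observations:

A single MiniT2I forward mostly takes about 122-135ms;
The two forwards in each step add up to about 245-270ms;
The CPU CFG update mostly takes about 7-17ms;
Latent upload/readback mostly takes about 0.5-2ms.

Conclusion:

On CUDA the CPU-side CFG / x_t update is more visible than on Metal, but it is still well below the combined time of the two diffusion forwards. Moving only the sampler update to the backend would gain at most something on the order of ten-odd milliseconds per step; the higher priority should be reducing the number of forwards or optimizing attention/GEMM inside the forward.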

Nsight Systems results

nsys + NVTX statistics:

Text Only

MiniT2I sampling: 4189.238ms

step 1: 1775.540ms
  cond forward:   1592.161ms
  uncond forward: 144.853ms
  CPU CFG update: 37.960ms

step 2-10:
  per step:        253-276ms
  cond forward:    123-131ms
  uncond forward:  122-131ms
  CPU CFG update:  6-21ms

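CUDA API summary:

Text Only
cudaStreamSynchronize: 747 calls, total 2.197s
cudaMemcpyAsync: 1464 calls, total 1.533s
cudaMalloc: 9 calls, total 0.429s
cudaLaunchKernel: 17872 calls, total 0.238s

These results show that:

The first step has a pronounced warmup cost;
Steady-state performance is dominated by the two diffusion forwards;
The host-side CFG update has room for optimization but is not the largest item;
The number of synchronizations and memcpy calls in the CUDA API is on the high side, which can be analyzed further together with a backend-resident sampler / graph cache.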
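To put a number on "not the largest item", the steady-state NVTX ranges above can be fed into a simple Amdahl-style bound: if the CPU CFG update cost nothing at all, how much faster could a step get? The helper name below is hypothetical, and pairing the extremes of the two ranges is an assumption that gives the widest plausible interval rather than a measured one.

```python
def max_speedup_if_removed(part_ms, total_ms):
    """Upper bound on per-step speedup if `part_ms` out of `total_ms` cost nothing."""
    if not 0 <= part_ms < total_ms:
        raise ValueError("part_ms must be in [0, total_ms)")
    return total_ms / (total_ms - part_ms)


# Steady-state ranges from the NVTX summary (steps 2-10):
#   per step 253-276ms, CPU CFG update 6-21ms.
low = max_speedup_if_removed(part_ms=6.0, total_ms=276.0)
high = max_speedup_if_removed(part_ms=21.0, total_ms=253.0)
print(f"{low:.3f}x to {high:.3f}x")  # 1.022x to 1.091x
```

Even in the most optimistic pairing, eliminating the host-side update entirely yields under 10% per step, which is why the forward itself remains the priority.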

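Nsight Compute: profile of the forward internals

ncu sampled the MiniT2I cond forward; the first 1000 kernels roughly cover the body of one steady-state forward. The total kernel duration across the two sampled segments is about:

Text Only
131.57ms

Breakdown by category:

Category	Kernels	Time	Share
CUTLASS GEMM	204	54.73 ms	41.6%
Softmax	18	36.04 ms	27.4%
Scale	19	12.73 ms	9.7%
Copy/Layout	231	12.15 ms	9.2%
Elementwise/Broadcast	355	9.23 ms	7.0%
RMSNorm	136	5.66 ms	4.3%

Key findings:

GEMM is the largest item, but average SM throughput is about 37% and average achieved occupancy about 10%, which looks like small-batch / skinny GEMMs / poorly filled tile shapes;
On the non-flash-attention path, soft_max_f32 + scale_f32 add up to about 48.8ms, a very clear optimization target;
cpy_scalar / concat_* take about 12ms, corresponding to the Q/K/V concat and layout conversion in the double-stream block;
Elementwise/broadcast kernels are numerous and individually small, but their cumulative cost is visible;
RMSNorm takes about 5.66ms, a medium-priority optimization target.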
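The shares in the table can be recomputed from the raw durations, which also gives the combined weight of the decomposed attention kernels. The snippet below uses only numbers from the table and the 131.57ms total; the variable names are illustrative.

```python
# Sampled kernel time per category (ms), copied from the no-FA table.
NO_FA_KERNEL_MS = {
    "CUTLASS GEMM": 54.73,
    "Softmax": 36.04,
    "Scale": 12.73,
    "Copy/Layout": 12.15,
    "Elementwise/Broadcast": 9.23,
    "RMSNorm": 5.66,
}
TOTAL_SAMPLED_MS = 131.57


def share_percent(ms, total_ms=TOTAL_SAMPLED_MS):
    """Share of the sampled kernel time, in percent."""
    return 100.0 * ms / total_ms


for name, ms in NO_FA_KERNEL_MS.items():
    print(f"{name:<24}{share_percent(ms):5.1f}%")

# Softmax and scale together form the decomposed (non-flash) attention cost.
attention_ms = NO_FA_KERNEL_MS["Softmax"] + NO_FA_KERNEL_MS["Scale"]
print(f"softmax + scale: {attention_ms:.2f} ms, {share_percent(attention_ms):.1f}%")
# softmax + scale: 48.77 ms, 37.1%
```

At roughly 37% of the sampled time, softmax plus scale is second only to GEMM, which is the motivation for trying flash attention next.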

Performance with Flash Attention enabled

The MiniT2I attention path is:

Text Only
DoubleStreamDiTBlock
-> Rope::attention
-> ggml_ext_attention_ext(..., ctx->flash_attn_enabled, ...)

So in principle --diffusion-fa can turn on flash attention inside the MiniT2I diffusion model.

Test command:

Bash
cd /home/ken/cc_workspace/stable-diffusion.cpp

./build-cuda/bin/sd-cli \
  --backend cuda \
  --diffusion-fa \
  --model /home/ken/tmp/MiniT2I/MiniT2I/minit2i-b-16/transformer/diffusion_pytorch_model.safetensors \
  --t5xxl /home/ken/.cache/huggingface/hub/models--google--flan-t5-large/snapshots/0613663d0d48ea86ba8cb3d7a44f0f65dc596a2a/model.safetensors \
  --prompt "a cat" \
  --steps 10 \
  --cfg-scale 6 \
  --width 512 \
  --height 512 \
  --seed 42 \
  --sampling-method euler \
  --rng cpu \
  --output /tmp/minit2i_cuda_diffusion_fa_step10.png \
  --threads 8

The run log confirms:

Text Only
Using flash attention in the diffusion model

ncu sampling confirms that flash attention kernels appear:

Text Only
flash_attn_ext_f16<64, 64, 64, ...>
flash_attn_stream_k_fixup_general<...>

In addition, soft_max_f32 disappears from the sampled forward, which shows that the flag did not silently fall back and the fused flash attention path was actually taken.

Performance comparison

Same configuration:

Text Only
prompt: a cat
steps: 10
cfg-scale: 6
size: 512x512
seed: 42
backend: cuda

Mode	Single forward (steady-state step)	Total sampling time	Output hash
no-FA	about `120-130ms`	`4.32s`	`76ed368fe1447d0de3027021889fb2255a3a924992039be5ba0ddc467046c416`
`--diffusion-fa`	about `61-73ms`	`2.54s`	`a165d4be3c97aca869c61d18062033680ec9f22f1e8244b88b2f67fc69923f23`

Note: the differing output hashes are expected. Flash attention takes a fused / F16 attention path and will not be bitwise identical to the original decomposed attention. In the validation run the output image was 512x512 and its pixel distribution looked normal.
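When hashes are expected to differ, a tolerance-based comparison is one possible complement to eyeballing the image. The sketch below computes PSNR over two equally sized sequences of 8-bit pixel values; it is a suggestion rather than something the validation above actually used, the function name is hypothetical, and decoding the PNG files into pixel sequences is left out.

```python
import math


def psnr(pixels_a, pixels_b, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(pixels_a) != len(pixels_b) or len(pixels_a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(pixels_a, pixels_b)) / len(pixels_a)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy data standing in for decoded image bytes.
reference = bytes([10, 20, 30, 40])
candidate = bytes([10, 21, 30, 38])
print(psnr(reference, reference))  # inf
print(psnr(reference, candidate) > 40.0)  # True
```

Any pass/fail threshold would have to be chosen empirically for this model; the source does not define one.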

Changes inside the forward after Flash Attention

ncu sampled 300 kernels from the MiniT2I cond forward with --diffusion-fa:

Category	Kernels	Sampled time	Share
CUTLASS/GEMM	46	9.223 ms	50.16%
Copy/Layout	81	3.229 ms	17.56%
Elementwise/Broadcast	113	2.438 ms	13.26%
FlashAttention	8	1.678 ms	9.13%
RMSNorm	39	1.507 ms	8.20%
Activation	9	0.216 ms	1.17%

Compared with no-FA, the previously significant soft_max_f32 / scale_f32 cost is replaced by flash attention kernels. The forward is clearly faster overall, and the next round of optimization should focus on:

GEMM shape / batch utilization;
Q/K/V concat and layout copies;
cond/uncond batching;
elementwise / RMSNorm fusion.

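Future optimization directions

1. Batch the CFG cond/uncond forwards

CFG currently needs two forwards per step:

Text Only
cond forward
uncond forward

If the MiniT2I batch=2 path is stable, cond and uncond can be stacked into one batch for a single forward, with the output split afterwards for CFG. Potential gains include:

Fewer graph builds / launches;
Better GEMM batch and tile utilization;
Fewer CPU/GPU round trips.

Risks:

MiniT2I batch semantics must match the original Python implementation;
The text mask, context, and pos/RoPE caches must broadcast correctly;
Output consistency needs to be re-verified.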
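The batching idea and its consistency check can be sketched with a toy batch-aware model. Everything here is hypothetical (`toy_model`, `cfg_two_pass`, `cfg_batched`); the point is only the shape of the check: the batched path must return the same cond/uncond predictions as two separate forwards.

```python
def toy_model(batch_x, t, batch_mask):
    """Toy stand-in for a batch-aware forward: one output per (x, mask) pair."""
    out = []
    for x, mask in zip(batch_x, batch_mask):
        strength = sum(mask) / len(mask)
        out.append([(1.0 - t) * v + t * strength for v in x])
    return out


def cfg_two_pass(model, x_t, t, prompt_mask, zero_mask):
    # Current behaviour: two forwards with batch size 1.
    cond = model([x_t], t, [prompt_mask])[0]
    uncond = model([x_t], t, [zero_mask])[0]
    return cond, uncond


def cfg_batched(model, x_t, t, prompt_mask, zero_mask):
    # Proposed behaviour: one forward with batch size 2, split afterwards.
    cond, uncond = model([x_t, x_t], t, [prompt_mask, zero_mask])
    return cond, uncond


x_t = [0.6, -1.4, 0.2]
prompt_mask, zero_mask = [1, 1, 0, 0], [0, 0, 0, 0]
two_pass = cfg_two_pass(toy_model, x_t, 0.3, prompt_mask, zero_mask)
batched = cfg_batched(toy_model, x_t, 0.3, prompt_mask, zero_mask)
print(two_pass == batched)  # True
```

On a real backend the two paths may differ in the last bits because batched GEMMs can pick different kernels, so the real check would likely need a tolerance rather than exact equality.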

2. Backend-resident sampler

Currently cond_v / uncond_v / v / x_t += ... are computed on host-side sd::Tensor objects. CUDA profiling shows this typically costs 7-17ms/step: there is room to optimize, but it is not the biggest bottleneck.

A MiniT2I-specific prototype could look like:

Text Only
x_t stays on backend
-> cond forward
-> uncond forward
-> backend graph computes CFG and x_next
-> x_next reused by next step
-> final readback only once

However, this changes the cross-step data flow and is riskier than flash attention and the position cache, so it should come later.

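3. Layout / concat optimization

ncu shows that cpy_scalar / concat_* take about 12ms without FA. They mainly come from:

text/image QKV concat;
permute/contiguous before and after RoPE;
final concat/slice/unpatchify.

Possible directions include:

Avoiding unnecessary materialization;
Letting the attention kernel accept inputs closer to their original layout;
Merging the layout handling of concat and attention.

4. GEMM utilization

GEMM is the largest compute item both without FA and with FA. ncu shows that some CUTLASS kernels have low achieved occupancy, which suggests they may be limited by small batches, small matrix shapes, or the shared memory/tile configuration.

Directions worth exploring:

cond/uncond batch=2;
Using GEMM kernels better suited to small batches;
Checking whether the Linear weight layout and input layout cause extra transposes/copies;
Adding specialized paths for specific hidden sizes / token shapes.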

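Summary

With this adaptation, stable-diffusion.cpp can run the MiniT2I b16 text-to-image model on both Metal and CUDA. The model is essentially a lightweight text-to-image architecture made of a T5 text encoder, an MMJiT/DiT joint image-text Transformer, and a patchify/unpatchify output head.

The most important engineering conclusions from this adaptation are:

MiniT2I is not a U-Net, so optimization should focus on the Transformer forward;
The two cond/uncond forwards per step are the main bottleneck;
Caching position encodings/RoPE brings a clear gain on Metal;
On CUDA without FA, the attention softmax/scale cost is very large;
--diffusion-fa correctly enables flash attention for MiniT2I and cuts 10-step sampling from 4.32s to 2.54s;
After Flash Attention, the optimization focus shifts to GEMM utilization, layout/copy, batching, and fusion of small operators.

Judging from this adaptation, when an on-device / local large-model inference framework adds support for a new model, the hard part is often not just "building the graph" but continuously answering a few questions:

Do the model structure and semantics match the original pipeline;
Which tensors are step-invariant, and can they be cached;
Which computations run on the host and which on the backend;
Does the backend actually hit the fused kernel instead of silently falling back;
Does the largest bottleneck seen in the profile match intuition.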