Mamba / Mamba-2 / Nemotron-H / Nemotron 3 / TensorRT-Edge-LLM

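An In-Depth Survey of Mamba and the Nemotron Model Family

These notes connect three things: how Mamba replaces a long KV cache with a fixed-size recurrent state, why Nemotron moved from a dense Transformer to a hybrid Mamba-Transformer MoE, and how the local TensorRT-Edge-LLM checkout exports, compiles, and runs this architecture while maintaining its state.

Survey date: 2026-05-05; local source commit: f673b99; output: development- and usage-oriented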

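Transformer Attention: historical tokens accumulate as K/V entries

Decode cost: each step reads the historical KV, so memory footprint and memory traffic grow with context length

Mamba / SSM: history is compressed into a recurrent state plus a conv state

Decode cost: each step updates a fixed-size state

Nemotron Hybrid: a few Attention layers retain retrieval ability; most layers are replaced by Mamba

Engineering cost: the runtime must manage both the KV cache and the SSM state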

Key Conclusions

Keep these four judgments in mind first, and the later details will fall into place.

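Mamba is not an attention optimization

It is a selective state space model: history is not stored as KV but compressed into a fixed-shape state through a state recurrence. Its appeal for long-context decode comes from the fact that this state does not grow linearly with the number of tokens.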

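Mamba-2 makes it more of a systems component

Mamba-2 uses structured state space duality (SSD) to connect SSMs with the matrix structure of attention. According to the official paper, its core layer is a refinement of Mamba's selective SSM that runs 2-8x faster.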

Nemotron's direction is hybrid

Nemotron-4 340B is still a dense Transformer; Nemotron-H and Nemotron 3 move explicitly to Mamba-Transformer, aiming to balance inference efficiency, long context, and reasoning.

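The hard part of deployment is state consistency

An engineering implementation cannot just swap out one layer. Prefill, decode, batch compaction, system prompt cache, and MTP/speculative decoding must all keep the KV cache and the Mamba state evolving in sync.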

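Mamba Architecture

To a user it looks like a sequence mixer; to an inference system it is a set of explicit state inputs and outputs.

1. The intuition behind selective SSM

The Mamba SSM state can be thought of as a learnable memory slot that is updated per token. Projecting the input token produces x, a dynamic step size dt, and the write/read coefficients B and C. Every token updates the state, and the output is then read from the new state.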

new_state = state * exp(A * dt) + B * dt * x
output = sum(new_state * C) + D * x

The comments in the local TensorRT plugin use the same formulas.
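To make the two formulas concrete, here is a minimal pure-Python sketch of a single decode step for one head. It assumes a scalar A and D per head and B, C vectors of length state_size (the Mamba-2 style layout); the function name `ssm_step` and the toy values are illustrative and are not taken from the plugin source.

```python
import math

def ssm_step(state, x, dt, A, B, C, D):
    """Advance one head's recurrent state by a single token.

    state: [head_dim][state_size] nested list, x: [head_dim],
    B and C: [state_size], A, D, dt: floats (dt assumed positive).
    """
    decay = math.exp(A * dt)  # with A < 0 this lies in (0, 1)
    # new_state = state * exp(A * dt) + B * dt * x
    new_state = [
        [s * decay + B[n] * dt * x[p] for n, s in enumerate(row)]
        for p, row in enumerate(state)
    ]
    # output = sum(new_state * C) + D * x
    output = [
        sum(new_state[p][n] * C[n] for n in range(len(C))) + D * x[p]
        for p in range(len(x))
    ]
    return new_state, output

# Toy example: head_dim = 2, state_size = 2, starting from an empty state.
state = [[0.0, 0.0], [0.0, 0.0]]
state, y = ssm_step(state, x=[1.0, 2.0], dt=0.5, A=-1.0,
                    B=[1.0, 0.0], C=[1.0, 1.0], D=0.0)
```

The state keeps the same shape after every call, which is the property that makes decode cost independent of how many tokens came before.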

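2. The common path through a Mamba block

in_proj projects the hidden state into gate, conv path, and dt.
causal conv1d maintains short-range local state.
The conv output is split into the SSM inputs x, B, and C.
The selective state update advances the recurrent state.
Gated RMSNorm and out_proj map back to the hidden size.

This is also the main flow of MambaMixer.forward in the local Nemotron-H implementation.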
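The causal conv1d step in the path above is the part that owns the short-range state. A rough sketch of one decode step follows, assuming a depthwise convolution whose state is a rolling window of the last K inputs per channel, oldest first. The name `conv1d_step` is hypothetical, and the activation that normally follows the conv is left out.

```python
def conv1d_step(conv_state, x_t, weight, bias):
    """One decode step of a depthwise causal conv1d.

    conv_state: [conv_dim][K] rolling window, oldest value first.
    x_t: [conv_dim] new input, weight: [conv_dim][K], bias: [conv_dim].
    """
    # Shift each channel's window left by one and append the new input.
    new_state = [row[1:] + [x] for row, x in zip(conv_state, x_t)]
    # Each channel is a dot product of its window with its own kernel.
    out = [
        sum(w * v for w, v in zip(weight[c], new_state[c])) + bias[c]
        for c in range(len(x_t))
    ]
    return new_state, out

# Toy example: one channel, kernel width 3.
window = [[0.0, 0.0, 0.0]]
window, out1 = conv1d_step(window, [1.0], weight=[[1.0, 2.0, 3.0]], bias=[0.0])
window, out2 = conv1d_step(window, [2.0], weight=[[1.0, 2.0, 3.0]], bias=[0.0])
```

Because only the last K inputs matter, this state is also fixed-size, but it has to be carried across decode steps just like the recurrent state.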

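The key to Mamba is not simply "no attention means faster". It brings content-dependent selectivity into the SSM: each token can dynamically adjust how it writes into the state and how it reads from it. Mamba-2 goes further and places SSMs and attention in a common structured semiseparable matrix framework, which opens clearer room for parallel scans and kernel optimization.

Decode naturally proceeds one token per step, which suits a single-step state update. Prefill instead has to scan a long prompt; looping token by token inside the plugin is correct but slow. The local implementation can use the step loop by default and reserves a CuTe DSL chunked SSD path for long prefill.

The official state-spaces/mamba package provides the Mamba, Mamba2, and Mamba3 modules and recommends installing causal-conv1d to accelerate the causal conv. Applications usually do not hand-write a selective scan; they load a specific checkpoint through a Hugging Face model or an inference framework.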

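The Nemotron Family

Nemotron is not a single model but a model family that NVIDIA advances jointly across data, training, post-training, and inference deployment.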

2024-06

Nemotron-4 340B

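An open model family positioned largely around synthetic data generation and reward modeling; architecturally it is still a large dense Transformer. It is an important precursor to the later Nemotron data and alignment pipelines.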

2025

Nemotron-H

A hybrid Mamba-Transformer series presented by NVIDIA Research, with 8B, 47B, and 56B variants. 56B-Base uses 54 Mamba-2 layers, 54 MLP layers, and 10 self-attention layers; 8B-Base uses 24 Mamba-2, 24 MLP, and 4 attention layers.
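The layer counts quoted above imply how small the attention share is. The short calculation below only adds up those published counts; the dictionary name and helper are illustrative.

```python
# Layer counts as quoted for the Nemotron-H base models.
NEMOTRON_H_LAYERS = {
    "56B-Base": {"mamba2": 54, "mlp": 54, "attention": 10},
    "8B-Base": {"mamba2": 24, "mlp": 24, "attention": 4},
}

def layer_summary(counts):
    """Return (total layer count, fraction of layers that are attention)."""
    total = sum(counts.values())
    return total, counts["attention"] / total

for name, counts in NEMOTRON_H_LAYERS.items():
    total, attn_fraction = layer_summary(counts)
    print(f"{name}: {total} layers, {attn_fraction:.1%} attention")
```

In both models fewer than one layer in ten keeps a KV cache, which is where the long-context decode savings come from.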

2025-08

Nemotron-Nano-9B-v2

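The NVIDIA arXiv report positions it as a compute-efficient reasoning model that continues the hybrid Mamba-Transformer line and emphasizes post-trained capabilities such as reasoning, tool calling, and long context.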

2026-06

Nemotron 3 Super / Ultra

The Nemotron 3 Ultra model card states 550B total parameters, 55B active, a Mamba-2 + MoE + Attention hybrid LatentMoE, MTP, and up to 1M tokens of context. Super is a hybrid Mamba-Transformer MoE with a smaller active footprint, aimed at agentic reasoning.

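Model line	Architecture keywords	Development focus	Usage focus
Nemotron-4 340B	Dense Transformer	Synthetic data, reward model, alignment data production	Closer to data/evaluation/alignment infrastructure; not a primary example for learning Mamba
Nemotron-H	Mamba-2 + MLP + a few Attention layers	Fewer self-attention layers to improve long-context inference efficiency; FP8 pre-training; 47B obtained by distillation and compression	Good for understanding hybrid state management; also the model family closest to the local TensorRT-Edge-LLM implementation
Nemotron Nano v2	Hybrid Mamba-Transformer reasoning model	Reasoning and tool use at small scale	Better suited to single-machine or lower-cost experiments; watch the custom chat template and the thinking switch
Nemotron 3 Ultra	LatentMoE: Mamba-2 + MoE + Attention + MTP	550B total / 55B active, NVFP4 training, 1M context, agentic reasoning	Mainly targets multi-node vLLM/SGLang/TensorRT-LLM; long context requires an explicitly configured max model length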

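TensorRT-Edge-LLM Implementation Pipeline

The deep dive you provided and the source code corroborate each other: here Mamba is not a standalone model type but one kind of stateful layer inside a hybrid decoder.

Config parsing for hybrid layers

_parse_layer_types reads layers_block_type, layer_types, or hybrid_override_pattern and normalizes types such as attention, mamba, mlp, and moe. _parse_mamba_cfg parses num_heads, head_dim, ssm_state_size, conv_dim, conv_kernel, and n_groups.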
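A rough sketch of this kind of normalization is shown below. It is not the local `_parse_layer_types`; the function name, the alias table, and in particular the character mapping for hybrid_override_pattern (`M` for Mamba, `*` for attention, `-` for MLP, `E` for MoE) are assumptions made for illustration.

```python
# Assumed character codes for hybrid_override_pattern (illustrative only).
_PATTERN_CHARS = {"M": "mamba", "*": "attention", "-": "mlp", "E": "moe"}
# Assumed aliases that collapse to the four canonical layer types.
_ALIASES = {"mamba2": "mamba", "full_attention": "attention"}
_CANONICAL = {"attention", "mamba", "mlp", "moe"}

def parse_layer_types(config):
    """Return a list of canonical layer types from an HF-style config dict."""
    for key in ("layers_block_type", "layer_types"):
        if config.get(key):
            raw = [str(t).lower() for t in config[key]]
            break
    else:
        pattern = config.get("hybrid_override_pattern")
        if not pattern:
            raise ValueError("config does not describe its layer types")
        raw = [_PATTERN_CHARS[ch] for ch in pattern]
    types = [_ALIASES.get(t, t) for t in raw]
    unknown = sorted(set(types) - _CANONICAL)
    if unknown:
        raise ValueError(f"unknown layer types: {unknown}")
    return types

layers = parse_layer_types({"hybrid_override_pattern": "M-M*-"})
```

Normalizing once at config time means every later stage (model build, export, runtime) can switch on the same four strings.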

The Python model keeps the Mamba2 structure

In MambaMixer, in_proj outputs gate, conv path, and dt; causal_conv1d maintains the conv state; update_ssm_state updates the recurrent SSM state; finally, gated RMSNorm and out_proj map back to the hidden size.

ONNX custom ops pass through dynamo export

causal_conv1d and update_ssm_state are trace-time stubs that return dummy tensors of the correct shape. The dynamo translation then turns them into custom ops in the trt_edgellm ONNX domain.

The TensorRT plugin performs the selective state update

MambaPlugin is registered as update_ssm_state. Its inputs are x, A, B, C, D, dt, dt_bias, state, and context_lengths; its outputs are output and state_out. enqueue copies the input state to the output state and then updates it in place.

The runtime manages KV and Mamba state together

MambaCacheManager holds a recurrent state and a conv state for each recurrent layer. HybridCacheManager routes each layer to KVCacheManager or MambaCacheManager according to its layer type.
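The routing idea can be sketched in a few lines. The class below is a hypothetical stand-in, not the local HybridCacheManager: it only shows how a list of stateful layer types could be turned into per-manager slot indices.

```python
class HybridCacheRouter:
    """Map each stateful layer to ("kv", slot) or ("mamba", slot)."""

    def __init__(self, layer_types):
        self.slots = []
        num_kv = num_mamba = 0
        for layer_type in layer_types:
            if layer_type == "attention":
                self.slots.append(("kv", num_kv))
                num_kv += 1
            elif layer_type == "mamba":
                self.slots.append(("mamba", num_mamba))
                num_mamba += 1
            else:
                raise ValueError(f"layer type {layer_type!r} carries no cache")
        self.num_kv_layers = num_kv
        self.num_mamba_layers = num_mamba

    def route(self, stateful_layer_idx):
        """Return which manager owns this layer and its index there."""
        return self.slots[stateful_layer_idx]

router = HybridCacheRouter(["mamba", "attention", "mamba"])
owner, slot = router.route(2)
```

Keeping two dense slot ranges lets each underlying manager allocate one contiguous buffer per state kind instead of one per decoder layer.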

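State Contract

The key to both development and usage is to make every piece of cross-token state explicit.

Attention state

The KV cache grows with context length. It keeps the key/value of historical tokens, and during decode the current token attends to that history.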

Mamba recurrent state

Shape [batch, num_heads, head_dim, state_size]; holds the long-term recurrent state of the SSM.

Mamba conv state

Shape [batch, conv_dim, conv_kernel]; holds the short-range window state of causal conv1d.
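The three state shapes above translate directly into memory formulas. The sketch below compares them; every dimension in the example call is a made-up placeholder rather than a real Nemotron configuration, and the KV formula assumes one K and one V tensor of shape [batch, seq_len, kv_heads, head_dim] per attention layer.

```python
def kv_cache_bytes(layers, batch, seq_len, kv_heads, head_dim, dtype_bytes):
    """K and V per attention layer: grows linearly with seq_len."""
    return 2 * layers * batch * seq_len * kv_heads * head_dim * dtype_bytes

def mamba_state_bytes(layers, batch, num_heads, head_dim, state_size,
                      conv_dim, conv_kernel, dtype_bytes):
    """Recurrent state plus conv state per Mamba layer: no seq_len term."""
    recurrent = num_heads * head_dim * state_size
    conv = conv_dim * conv_kernel
    return layers * batch * (recurrent + conv) * dtype_bytes

# Hypothetical dimensions, chosen only to show the scaling behaviour.
for seq_len in (4_096, 131_072):
    kv = kv_cache_bytes(layers=4, batch=1, seq_len=seq_len,
                        kv_heads=8, head_dim=128, dtype_bytes=2)
    ssm = mamba_state_bytes(layers=24, batch=1, num_heads=64, head_dim=64,
                            state_size=128, conv_dim=6144, conv_kernel=4,
                            dtype_bytes=2)
    print(f"seq_len={seq_len}: KV={kv / 2**20:.1f} MiB, Mamba={ssm / 2**20:.1f} MiB")
```

Only the KV figure changes between the two iterations, which is the whole argument for keeping attention layers few.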

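prefill and decode

Prefill input usually has a seq_len dimension and needs context_lengths to keep padding from contaminating the state. Decode usually has seq_len=1, in which case the cumulative context length must not be treated as the scan length.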
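Both rules can be illustrated with a toy scan over a scalar state per sequence. This is a sketch under simplifying assumptions (scalar A, B, C and a plain Python loop), not the plugin's kernel; clamping the valid length to seq_len is one possible way to guard the decode case and is an assumption here.

```python
import math

def prefill_scan(xs, dts, context_lengths, states, A=-1.0, B=1.0, C=1.0):
    """Scan each padded sequence, updating state only on real tokens.

    xs, dts: [batch][seq_len]; context_lengths, states: [batch].
    """
    outputs, final_states = [], []
    for seq, dt_seq, length, h in zip(xs, dts, context_lengths, states):
        # Never scan more steps than the tensor actually holds, so a
        # cumulative length passed during decode (seq_len=1) is harmless.
        valid = min(length, len(seq))
        out = []
        for t in range(len(seq)):
            if t < valid:
                h = h * math.exp(A * dt_seq[t]) + B * dt_seq[t] * seq[t]
                out.append(C * h)
            else:
                out.append(0.0)  # padding: state is left untouched
        outputs.append(out)
        final_states.append(h)
    return outputs, final_states

# The padded value 99.0 must not influence the final state.
_, padded_state = prefill_scan([[1.0, 2.0, 99.0]], [[0.5, 0.5, 0.5]], [2], [0.0])
_, exact_state = prefill_scan([[1.0, 2.0]], [[0.5, 0.5]], [2], [0.0])
```

If the padded and unpadded runs ever disagree, every later decode step for that sequence starts from a corrupted history.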

MTP / speculative decoding

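Candidate tokens produce intermediate Mamba states. How many tokens are finally accepted is determined by acceptLengths, and the runtime must scatter the corresponding intermediate recurrent/conv state back into the main state; otherwise the history seen by subsequent decode steps will be misaligned.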
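A minimal sketch of that commit step follows. It assumes that the intermediate state at index k is the state after consuming k+1 candidate tokens and that an accept length of 0 keeps the previous main state; the function name and this indexing convention are illustrative, not read from the runtime.

```python
def commit_accepted_states(main_states, intermediate_states, accept_lengths):
    """Pick, per sequence, the state that matches the accepted prefix.

    main_states: [batch] state before speculation.
    intermediate_states: [batch][num_candidates] state after each candidate.
    accept_lengths: [batch] number of accepted candidate tokens.
    """
    committed = []
    for main, steps, accepted in zip(main_states, intermediate_states,
                                     accept_lengths):
        if not 0 <= accepted <= len(steps):
            raise ValueError(f"accept length {accepted} out of range")
        # accepted == 0: nothing was accepted, keep the old state.
        committed.append(main if accepted == 0 else steps[accepted - 1])
    return committed

# Strings stand in for (recurrent state, conv state) pairs.
new_main = commit_accepted_states(
    main_states=["s0", "t0"],
    intermediate_states=[["s1", "s2", "s3"], ["t1", "t2", "t3"]],
    accept_lengths=[2, 0],
)
```

The same selection has to be applied to both the recurrent state and the conv state, and it must agree with how far the KV cache is rolled back for the attention layers.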

Development and Usage Roadmap

Splitting "studying the model" and "getting it running" into separate layers avoids many pitfalls.

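Developing a hybrid Mamba model adaptation

First parse layers_block_type, the Mamba dimensions, and the MoE/MLP layer types from the HF config into a unified internal config.
Keep a clear block structure in the Python model; do not collapse it into a single black-box plugin from the start.
For kernels that are not ONNX-friendly, use a custom op stub plus a translation.
Implement the real kernel in the TensorRT plugin and strictly fix the input/output dtype and shape conventions.
Bind past/present state explicitly at runtime so that the state lifecycle is controlled by the cache manager.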

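Using Nemotron models

Small-scale understanding and experiments: start with Nemotron-H-8B or Nemotron-Nano rather than going straight to Ultra.
Serving: choose vLLM, SGLang, or TensorRT-LLM according to the model card, and confirm that the framework supports the corresponding custom architecture.
Long context: do not rely on the model name alone; explicitly set max_model_len, the context length environment variables, and the memory budget.
Reasoning/tool use: pass the thinking or tool-call switches through the chat template; the Ultra model card specifically calls out the parameters for tool calls and reasoning parsing.
Low precision: distinguish BF16, FP8, NVFP4, and FP4. Low precision is not only weight quantization; it also affects kernels, state dtype, and serving framework compatibility.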

Minimal concept code: trying a Mamba-2 block

pip install "causal-conv1d>=1.4.0" --no-build-isolation
pip install "mamba-ssm[causal-conv1d]" --no-build-isolation

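import torch
from mamba_ssm import Mamba2

# Mamba-2 block; requires a CUDA GPU and the mamba-ssm kernels.
model = Mamba2(
    d_model=2560,  # model (hidden) dimension
    d_state=64,    # SSM state size
    d_conv=4,      # causal conv1d kernel width
    expand=2,      # d_inner = expand * d_model
    headdim=128,   # per-head dimension
).cuda()

# Example input; batch=2 and seq_len=64 are arbitrary.
x = torch.randn(2, 64, 2560, device="cuda")  # x: [batch, seq_len, d_model]
y = model(x)  # y has the same shape as x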

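This snippet is for understanding the block interface. Actually using Nemotron-H / Nemotron 3 additionally depends on the model code, chat template, and inference framework support released by NVIDIA.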

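Common Points of Confusion

These are the places where reading the papers, the model cards, and the TensorRT-Edge-LLM source is most likely to get out of alignment.

"Mamba layer" does not mean "stateless"

It has no KV cache, but it does have a recurrent state and a conv state. Batch eviction, prompt cache, and speculative accept/reject still have to maintain state; only the state shapes differ.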

The number of stateful hybrid layers is not the total decoder layer count

In Nemotron-H, MLP/MoE layers carry no cross-token cache. The local runtime config skips non-stateful layers and writes only attention and mamba into layer_types.
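The filtering itself is simple, but it changes how layer indices line up. The helper below is a hypothetical illustration of the idea, not the local runtime config builder.

```python
STATEFUL_TYPES = {"attention", "mamba"}

def runtime_layer_types(decoder_layer_types):
    """Keep only layers that own cross-token state, in decoder order."""
    return [t for t in decoder_layer_types if t in STATEFUL_TYPES]

decoder = ["mamba", "mlp", "attention", "moe", "mamba", "mlp"]
stateful = runtime_layer_types(decoder)
# Decoder layer 4 becomes stateful layer 2, so the two index spaces differ.
```

Any code that looks up a cache by decoder layer index therefore needs an explicit mapping into this shorter list.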

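The long-context benefit is mainly in decode

Mamba's fixed state reduces the pressure of reading and writing history during decode; prefill still has to process the entire prompt and can become the bottleneck without a parallel scan optimization.

Not all of Nemotron is Mamba

Nemotron-4 is a dense Transformer; Nemotron-H and Nemotron 3 are the hybrid Mamba-Transformer line that these notes focus on. Nemotron 3 Ultra additionally adds LatentMoE and MTP.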

Source Reading Order

Reading in this order takes you from model semantics all the way to runtime state binding.

Config parsing
MambaConfig and layer type parsing.

Model structure
MambaMixer and the Nemotron-H block forward.

Export boundary
custom op stubs and ONNX translation.

dtype fixup
The initializer fixup keeps A of update_ssm_state in FP32.

TensorRT plugin
MambaPlugin interface and enqueue.

Runtime state
MambaCacheManager and HybridCacheManager.

Engine binding
recurrent state binding and conv state binding.

runtime config
build_runtime_llm_config_dict and C++ config parsing.

References

Prefer papers, official repositories, official NVIDIA materials, and model cards.