
moe_sbo

v0.5.6.post2

SBO implementation in DeepseekV2MoE

SBO adds a plugin-style hook mechanism

https://github.com/sgl-project/sglang/pull/13327/changes

In the sglang framework, the SBO feature is currently applied in deepseek_v2.py.

Type 1: in the forward_normal implementation

# Overlap the shared_experts computation in forward; this flag records whether SBO is enabled
self._fuse_shared_experts_inside_sbo = SboFlags.fuse_shared_experts_inside_sbo()

def forward_normal(
    self,
    hidden_states: torch.Tensor,
    should_allreduce_fusion: bool = False,
    use_reduce_scatter: bool = False,
    gemm_output_zero_allocator: BumpAllocator = None,
) -> torch.Tensor:
    ...
    # Check whether SBO is enabled
    if self._fuse_shared_experts_inside_sbo:
        shared_output = None

        def _pre_combine_hook(
            dispatcher: BaseDispatcher, combine_input: CombineInput
        ):
            nonlocal shared_output
            # alt_stream waits for the work already queued on the current stream,
            # then runs shared_experts while combine proceeds
            self.alt_stream.wait_stream(torch.cuda.current_stream())
            with torch.cuda.stream(self.alt_stream):
                shared_output = self._forward_shared_experts(
                    hidden_states, gemm_output_zero_allocator
                )

            pre_combine_hook_handle.remove()

        def _post_combine_hook(
            dispatcher: BaseDispatcher, hidden_states: torch.Tensor
        ):
            nonlocal shared_output
            # The current stream waits for shared_experts on alt_stream to finish
            torch.cuda.current_stream().wait_stream(self.alt_stream)
            post_combine_hook_handle.remove()

        pre_combine_hook_handle = self.experts.dispatcher.register_pre_combine_hook(
            _pre_combine_hook
        )
        post_combine_hook_handle = (
            self.experts.dispatcher.register_post_combine_hook(_post_combine_hook)
        )

    # experts computation
    final_hidden_states = self.experts(
        hidden_states,
        topk_output,
    )
    ...

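_pre_combine_hook computes shared_experts on alt_stream, so the shared_experts computation overlaps with the combine phase.

forward_deepep implements the overlap in two variants, which overlap the shared_experts computation with the dispatch phase and with the combine phase respectively.

Type 2: forward_deepep implementation, overlapping the shared_experts computation in the _deepep_dispatch phase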
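
The hook pattern above can be illustrated without CUDA. The following is a minimal pure-Python sketch, not sglang's implementation: `ToyDispatcher`, `HookHandle`, `forward_once`, and the string payloads are hypothetical stand-ins. It shows a pre-combine hook starting the shared-expert work just before combine, a post-combine hook acting as the join point, and each hook removing itself through its handle so that it fires only once per forward pass.

```python
from typing import Callable, Dict, List, Optional, Tuple


class HookHandle:
    """Hypothetical handle; remove() unregisters the hook it was returned for."""

    def __init__(self, hooks: Dict[int, Callable], key: int) -> None:
        self._hooks = hooks
        self._key = key

    def remove(self) -> None:
        self._hooks.pop(self._key, None)


class ToyDispatcher:
    """Hypothetical stand-in for a MoE dispatcher that exposes combine hooks."""

    def __init__(self) -> None:
        self._pre_combine: Dict[int, Callable] = {}
        self._post_combine: Dict[int, Callable] = {}
        self._next_key = 0

    def _register(self, hooks: Dict[int, Callable], fn: Callable) -> HookHandle:
        key = self._next_key
        self._next_key += 1
        hooks[key] = fn
        return HookHandle(hooks, key)

    def register_pre_combine_hook(self, fn: Callable) -> HookHandle:
        return self._register(self._pre_combine, fn)

    def register_post_combine_hook(self, fn: Callable) -> HookHandle:
        return self._register(self._post_combine, fn)

    def combine(self, combine_input: str) -> str:
        # Iterate over copies because hooks may remove themselves while running.
        for fn in list(self._pre_combine.values()):
            fn(self, combine_input)
        output = f"combined({combine_input})"
        for fn in list(self._post_combine.values()):
            fn(self, output)
        return output


def forward_once(
    dispatcher: ToyDispatcher, hidden_states: str, trace: List[str]
) -> Tuple[str, Optional[str]]:
    shared_output = None

    def _pre_combine_hook(d: ToyDispatcher, combine_input: str) -> None:
        nonlocal shared_output
        trace.append("shared_experts")  # real code launches this on a side stream
        shared_output = f"shared({hidden_states})"
        pre_handle.remove()

    def _post_combine_hook(d: ToyDispatcher, output: str) -> None:
        trace.append("join")  # real code waits for the side stream here
        post_handle.remove()

    pre_handle = dispatcher.register_pre_combine_hook(_pre_combine_hook)
    post_handle = dispatcher.register_post_combine_hook(_post_combine_hook)
    routed = dispatcher.combine(f"experts({hidden_states})")
    return routed, shared_output
```

Because the hooks unregister themselves, the dispatcher carries no state from one forward pass to the next, and a layer that does not enable SBO never pays for hooks it did not register.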

if sbo_overlap_dispatch_flag:
    shared_output = None

    def _deepep_dispatch_hook(dispatcher: BaseDispatcher):
        nonlocal shared_output
        shared_output = self._forward_shared_experts(hidden_states)
        # The registration returns several handles; remove all of them
        for handle in deepep_dispatch_hook_handle:
            handle.remove()

    def _post_dispatch_hook(
        dispatcher: BaseDispatcher, dispatch_output: DispatchOutput
    ):
        ...

    def _post_combine_hook(
        dispatcher: BaseDispatcher, hidden_states: torch.Tensor
    ):
        ...

    assert isinstance(self.experts.dispatcher, MaybeTboDeepEPDispatcher)
    # hook registration
    deepep_dispatch_hook_handle = (
        self.experts.dispatcher.register_deepep_dispatch_hook(
            _deepep_dispatch_hook
        )
    )
    post_dispatch_hook_handle = (
        self.experts.dispatcher.register_post_dispatch_hook(_post_dispatch_hook)
    )
    post_combine_hook_handle = (
        self.experts.dispatcher.register_post_combine_hook(_post_combine_hook)
    )
# experts computation
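
In the snippet above, `deepep_dispatch_hook_handle` is iterated, so the registration evidently returns several handles. One plausible reading, stated here as an assumption rather than a description of sglang internals, is that the wrapper dispatcher forwards the registration to more than one inner dispatcher. The sketch below uses hypothetical classes (`ToyInnerDispatcher`, `ToyWrapperDispatcher`) to show why the hook removes every handle on its first call: the shared-expert work then runs once, no matter how many inner dispatchers would otherwise trigger it.

```python
from typing import Callable, List


class _Handle:
    """Hypothetical handle that unregisters one hook from one inner dispatcher."""

    def __init__(self, hooks: List[Callable], fn: Callable) -> None:
        self._hooks = hooks
        self._fn = fn

    def remove(self) -> None:
        if self._fn in self._hooks:
            self._hooks.remove(self._fn)


class ToyInnerDispatcher:
    def __init__(self) -> None:
        self.dispatch_hooks: List[Callable] = []

    def register_dispatch_hook(self, fn: Callable) -> _Handle:
        self.dispatch_hooks.append(fn)
        return _Handle(self.dispatch_hooks, fn)

    def dispatch(self) -> None:
        # Iterate over a copy because a hook may unregister itself.
        for fn in list(self.dispatch_hooks):
            fn(self)


class ToyWrapperDispatcher:
    """Hypothetical wrapper that fans a registration out to all inner dispatchers."""

    def __init__(self, num_inners: int) -> None:
        self.inners = [ToyInnerDispatcher() for _ in range(num_inners)]

    def register_dispatch_hook(self, fn: Callable) -> List[_Handle]:
        return [inner.register_dispatch_hook(fn) for inner in self.inners]


def run_shared_experts_once(wrapper: ToyWrapperDispatcher) -> int:
    calls = 0

    def _dispatch_hook(dispatcher: ToyInnerDispatcher) -> None:
        nonlocal calls
        calls += 1  # stands in for the shared-expert forward pass
        for handle in handles:
            handle.remove()

    handles = wrapper.register_dispatch_hook(_dispatch_hook)
    for inner in wrapper.inners:
        inner.dispatch()
    return calls
```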

Type 2: forward_deepep implementation, overlapping the shared_experts computation in the _deepep_combine phase

elif sbo_overlap_combine_flag:
    shared_output = None

    def _post_dispatch_hook(
        dispatcher: BaseDispatcher, dispatch_output: DispatchOutput
    ):
        ...

    def _pre_combine_hook(
        dispatcher: BaseDispatcher, combine_input: CombineInput
    ):
        nonlocal shared_output

        # Record the event (if one was provided) before starting shared_experts
        if (
            e := dispatcher.meta_overlap_args.get("record_event_after_down")
        ) is not None:
            e.record()

        # TODO reduce sm for non-deepgemm
        with deep_gemm_wrapper.configure_deep_gemm_num_sms(
            dispatcher.meta_overlap_args["compute_num_sms"]
        ):
            shared_output = self._forward_shared_experts(hidden_states)

        pre_combine_hook_handle.remove()

    def _post_combine_hook(
        dispatcher: BaseDispatcher, hidden_states: torch.Tensor
    ):
        ...

    # hook registration
    post_dispatch_hook_handle = (
        self.experts.dispatcher.register_post_dispatch_hook(_post_dispatch_hook)
    )
    pre_combine_hook_handle = self.experts.dispatcher.register_pre_combine_hook(
        _pre_combine_hook
    )
    post_combine_hook_handle = (
        self.experts.dispatcher.register_post_combine_hook(_post_combine_hook)
    )
# experts computation
final_hidden_states = self.experts(
    hidden_states=hidden_states,
    topk_output=topk_output,
)
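
The `configure_deep_gemm_num_sms(...)` block in `_pre_combine_hook` appears to limit the number of SMs available to the shared-expert GEMM for the duration of the `with` block, presumably so that the combine kernel running at the same time still has SMs to work with. A minimal sketch of that scoped-setting pattern follows. It assumes nothing about the real wrapper's internals: `_STATE`, `get_num_sms`, `configure_num_sms`, and `shared_experts_gemm` are hypothetical names, and the SM counts are illustrative values.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, Optional

# Hypothetical process-wide setting: how many SMs a GEMM may use.
_STATE: Dict[str, int] = {"num_sms": 132}


def get_num_sms() -> int:
    return _STATE["num_sms"]


@contextmanager
def configure_num_sms(num_sms: Optional[int]) -> Iterator[None]:
    """Temporarily override the SM budget; None leaves it unchanged."""
    if num_sms is None:
        yield
        return
    original = _STATE["num_sms"]
    _STATE["num_sms"] = num_sms
    try:
        yield
    finally:
        # Restore the previous budget even if the body raises.
        _STATE["num_sms"] = original


def shared_experts_gemm() -> int:
    """Stand-in for the shared-expert GEMM; reports the SM budget it ran with."""
    return get_num_sms()
```

Used as `with configure_num_sms(100): shared_experts_gemm()`, the GEMM sees the reduced budget only inside the block, and code after the block gets the original value back.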

deepep

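Before DeepEP, MoE token dispatch and combine were implemented with allgather + allreduce: after attention/MLA runs under DP, an allgather is performed, and after the MLP runs under EP, a full allreduce is performed. This approach suffers from high communication volume and low efficiency.

DeepEP is a communication library designed and optimized specifically for Mixture-of-Experts (MoE) and Expert Parallelism (EP). It provides high-throughput, low-latency all-to-all kernels for the Hopper GPU architecture (support for other GPU architectures is in progress). The all-to-all operator covers the two MoE phases, dispatch and combine, and supports FP8 low-precision operations, which makes it particularly well suited to MoE models such as the DeepSeek series.

Prefill and training stages:

  1. Exploit the NVLink:RDMA bandwidth ratio of about 3.2 (on H800, the NVLink bandwidth of 160 GB/s is roughly 3 times the RDMA bandwidth of 50 GB/s): move intra-node EP traffic from RDMA to NVLink, and overlap NVLink transfers with RDMA transfers.
  2. Reduce cross-rail traffic.
  3. Limit excessive cross-node traffic.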

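As an example, suppose the SLO is TPOT = 20ms and we want to hide the all-to-all communication as much as possible. This requires each all-to-all of about 1.79MB to finish within roughly 100us.

Take DeepSeekV3: the per-token transfer size is hidden_size = 7KB, each MoE layer performs two all-to-all operations (dispatch + combine), there are 61 layers in total, and tokens are assumed to be evenly distributed across experts. Assume batch_size = 256 (In the decoding stage, the batch size per expert is relatively small (usually within 256 tokens), and the bottleneck is memory access rather than computation.). Then:

  1. One all-to-all transfers about 256 * 7KB = 1.79MB.
  2. dispatch + combine, i.e. two all-to-all operations: 1.79MB * 2 = 3.58MB.
  3. All layers: 3.58MB * 61 = 218MB.

If the decode stage takes 20ms to produce one token, each GPU needs on average 218MB * (1000ms / 20ms) = 10.9GB/s of bandwidth.

Next, consider using two micro-batches to overlap the all-to-all communication. Unlike the prefill stage, attention accounts for a relatively large share of the computation in the decode stage, so attention is overlapped with dispatch + MoE + combine. For the overlap to be complete, if the attention time per layer is T and the MoE computation time is t, each all-to-all must finish in less than (T - t)/2. Assuming attention takes 320us per layer and the MLP takes 120us, each all-to-all must finish within 100us while transferring 1.79MB.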
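
The estimate above can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: decimal units (1KB = 1000 bytes), the inputs given in the text, and a hypothetical helper name `decode_alltoall_budget`. The figures in the text are rounded, so the exact values differ slightly. The last field is a derived quantity not stated in the text: the link rate needed to move one all-to-all payload within the (T - t)/2 budget.

```python
from typing import Dict

KB = 1000  # assumption: decimal units
MB = 1000 * KB
GB = 1000 * MB


def decode_alltoall_budget(
    batch_size: int = 256,
    bytes_per_token: int = 7 * KB,  # hidden_size payload per token, from the text
    num_layers: int = 61,
    tpot_ms: float = 20.0,
    attention_us: float = 320.0,
    moe_us: float = 120.0,
) -> Dict[str, float]:
    per_alltoall = batch_size * bytes_per_token  # one dispatch or one combine
    per_layer = 2 * per_alltoall                 # dispatch + combine
    per_step = per_layer * num_layers            # one decode step, all layers
    steps_per_second = 1000.0 / tpot_ms
    budget_us = (attention_us - moe_us) / 2      # (T - t) / 2
    return {
        "per_alltoall_mb": per_alltoall / MB,
        "per_layer_mb": per_layer / MB,
        "per_step_mb": per_step / MB,
        "avg_bandwidth_gb_s": per_step * steps_per_second / GB,
        "alltoall_budget_us": budget_us,
        "burst_rate_gb_s": per_alltoall / (budget_us * 1e-6) / GB,
    }


if __name__ == "__main__":
    for name, value in decode_alltoall_budget().items():
        print(f"{name}: {value:.3f}")
```

With the default inputs the average bandwidth comes out near 10.9 GB/s, while finishing a single 1.79MB all-to-all inside the 100us window needs a much higher burst rate of about 17.9 GB/s, which is why the latency of each all-to-all, and not only the average bandwidth, is the binding constraint in this example.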