Parallelism

https://zhuanlan.zhihu.com/p/1937556222371946860

NVIDIA NeMo Framework https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/features/parallelisms.html

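It has a few animations worth looking at, covers some parallelism schemes that other documents skip, and explains some engineering concepts.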

MoE animations: https://pytorch.org/blog/training-moes/

In large language model (LLM) training and inference, as context windows keep growing, the memory and compute capacity of a single GPU quickly become the bottleneck.

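1. Sequence Parallelism (SP)

Keywords: extension of Tensor Parallelism (TP), optimizing the layers outside the GEMMs, memory savings

Why is SP needed?

In conventional Tensor Parallelism (TP), as in early versions of Megatron-LM, the large matrix multiplications (linear layers) are sharded across GPUs and computed in parallel. Between two tensor-parallel blocks, however, there are usually operations that are not sharded, such as LayerNorm and Dropout.

The problem with TP: for these unsharded layers, every GPU holds a full copy of the sequence (duplicated inputs). When the sequence is long, the corresponding activation memory is large, it is replicated on every GPU, and every GPU repeats the same computation.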
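To put rough numbers on the replicated part, here is a small sketch using the per-layer activation estimates published with Megatron-LM's sequence parallelism work (fp16 activations, no fused attention kernel, no recomputation). The symbols `s`, `b`, `h`, `a`, `t` (sequence length, micro-batch size, hidden size, attention heads, TP size) and the example values are assumptions chosen only for illustration.

```python
def activation_bytes_per_layer(s, b, h, a, t, sequence_parallel):
    """Approximate activation bytes kept per transformer layer on one GPU."""
    attention_part = 5 * a * s / h / t   # score/softmax/dropout terms, sharded by TP
    if sequence_parallel:
        linear_part = 34 / t             # everything is sharded across the TP group
    else:
        linear_part = 10 + 24 / t        # the 10*s*b*h term stays replicated
    return s * b * h * (linear_part + attention_part)


# Illustrative sizes only (assumptions, not taken from the notes above).
s, b, h, a, t = 32768, 1, 8192, 64, 8

tp_only = activation_bytes_per_layer(s, b, h, a, t, sequence_parallel=False)
tp_sp = activation_bytes_per_layer(s, b, h, a, t, sequence_parallel=True)

print(f"TP only : {tp_only / 2**30:.2f} GiB per layer")
print(f"TP + SP : {tp_sp / 2**30:.2f} GiB per layer")
print(f"saved   : {(tp_only - tp_sp) / 2**30:.2f} GiB per layer")
```

The difference between the two cases is exactly the replicated LayerNorm/Dropout term, `10 * s * b * h * (1 - 1/t)` bytes per layer, so the saving grows linearly with sequence length.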

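How does SP work?

Sequence Parallelism addresses exactly this problem. It leaves the sharding of the matrix multiplications (GEMMs) unchanged, and shards the previously unparallelized parts (LayerNorm, Dropout) along the sequence dimension.

How the flow changes:
- TP only: All-Reduce (every GPU gets the full result) -> LayerNorm (every GPU computes the full sequence) -> next layer.
- With SP: Reduce-Scatter (each GPU gets the reduced result for its slice of the sequence) -> LayerNorm (each GPU computes only its slice) -> All-Gather (reassemble the full sequence for the next matrix multiplication).

Benefit: the activation memory of LayerNorm and Dropout is spread across the N GPUs, which noticeably reduces memory use and makes longer sequences trainable. The communication volume does not grow, because an All-Reduce is equivalent to a Reduce-Scatter followed by an All-Gather.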
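The two flows can be checked against each other with a small single-process simulation. This is only a sketch: the collectives are plain Python functions over nested lists, the helper names (`all_reduce`, `reduce_scatter`, `all_gather`, `layer_norm`) are made up for illustration, and the LayerNorm has no learned scale or bias.

```python
import math
import random


def layer_norm(rows, eps=1e-5):
    """Normalize each token (row) over its hidden dimension."""
    out = []
    for row in rows:
        mean = sum(row) / len(row)
        var = sum((x - mean) ** 2 for x in row) / len(row)
        out.append([(x - mean) / math.sqrt(var + eps) for x in row])
    return out


def all_reduce(partials):
    """Every rank ends up with the full element-wise sum of all partial outputs."""
    seq, hidden = len(partials[0]), len(partials[0][0])
    return [[sum(p[i][j] for p in partials) for j in range(hidden)] for i in range(seq)]


def reduce_scatter(partials, rank, world):
    """Rank `rank` ends up with the summed rows of its own sequence slice only."""
    chunk = len(partials[0]) // world
    hidden = len(partials[0][0])
    rows = range(rank * chunk, (rank + 1) * chunk)
    return [[sum(p[i][j] for p in partials) for j in range(hidden)] for i in rows]


def all_gather(chunks):
    """Concatenate the per-rank sequence slices back into the full sequence."""
    return [row for chunk in chunks for row in chunk]


random.seed(0)
world, seq, hidden = 4, 8, 6
# partials[r] is the partial output of a row-parallel linear layer on rank r.
partials = [[[random.uniform(-1, 1) for _ in range(hidden)] for _ in range(seq)]
            for _ in range(world)]

# TP only: every rank normalizes the full sequence.
tp_out = layer_norm(all_reduce(partials))

# TP + SP: every rank normalizes seq / world tokens, then the slices are gathered.
sp_chunks = [layer_norm(reduce_scatter(partials, r, world)) for r in range(world)]
sp_out = all_gather(sp_chunks)

max_diff = max(abs(x - y) for ra, rb in zip(tp_out, sp_out) for x, y in zip(ra, rb))
print("max difference between the two flows:", max_diff)
print("tokens normalized per rank:", seq, "(TP only) vs", len(sp_chunks[0]), "(with SP)")
```

Splitting along the sequence dimension works here because LayerNorm normalizes each token independently over the hidden dimension, so no token needs data from another rank's slice.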

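2. Context Parallelism (CP)

Keywords: sharding the Attention computation, sharding every layer, very long contexts

Why is CP needed?

SP saves the LayerNorm/Dropout memory, but it does not remove the bottleneck in **Attention** itself. The compute cost of Attention grows with the square of the sequence length ( $O(N^2)$ ). At sequence lengths of 100K or 1M tokens, a single GPU typically can neither hold the KV cache (or, in training, the activations) nor finish computing the Attention matrix in reasonable time.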
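A quick back-of-the-envelope calculation shows how fast the quadratic term grows. The sketch below assumes fp16 scores (2 bytes each) and a naive implementation that materializes the full $N \times N$ score matrix for one head; fused kernels avoid storing it, but the number of score entries to compute is still $N^2$.

```python
def score_matrix_gib(seq_len, bytes_per_elem=2):
    """GiB needed to hold one seq_len x seq_len attention score matrix (one head)."""
    return seq_len * seq_len * bytes_per_elem / 2**30


for n in (8_000, 100_000, 1_000_000):
    print(f"N = {n:>9,}: {score_matrix_gib(n):10.1f} GiB per head")
```

Doubling the sequence length quadruples the figure, which is why sharding the sequence across GPUs is attractive for long contexts.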

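How does CP work?

Context Parallelism shards the sequence dimension more thoroughly. It does not stop at LayerNorm: the input tensor is split along the sequence dimension across GPUs for all layers, so each GPU owns one part of the sequence (for example, GPU0 owns tokens 1-8000 and GPU1 owns tokens 8001-16000).

The hard part (Attention): when computing Attention, the tokens on GPU0 need to attend to the tokens on GPU1 (computing $Q \times K^T$ ), so keys and values have to cross GPU boundaries.
How it is implemented: CP solves this with carefully designed communication. Common implementations include:
- Ring Attention: GPUs pass KV blocks around a ring, compute the attention scores block by block, and accumulate the partial results.
- DeepSpeed Ulysses (All-to-All): an All-to-All converts the sequence sharding into a head sharding, Attention is computed per head, and a second All-to-All converts back to sequence sharding.
Benefit: very long contexts become feasible. With Ring Attention the supported sequence length grows with the number of GPUs; with Ulysses the degree of parallelism is capped by the number of attention heads.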
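The Ring Attention idea can be sketched in a single process: each rank keeps its own Q chunk, sees one KV chunk per ring step, and merges the partial results with a running (online) softmax. This is an illustrative, non-causal, single-head simulation in plain Python; the function names are hypothetical, and a real implementation runs the ranks concurrently and overlaps the KV transfer with the block computation.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_attention(q, k, v):
    """Single-device reference: softmax(q k^T / sqrt(d)) v."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [dot(qi, kj) / math.sqrt(d) for kj in k]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) / z
                    for c in range(len(v[0]))])
    return out


def ring_attention(q_chunks, k_chunks, v_chunks):
    """Each rank combines one KV chunk per ring step using an online softmax."""
    world = len(q_chunks)
    d = len(q_chunks[0][0])
    dv = len(v_chunks[0][0])
    outputs = []
    for rank in range(world):  # simulated one after another
        # Per-query running state: (max score, normalizer, unnormalized output).
        state = [(-math.inf, 0.0, [0.0] * dv) for _ in q_chunks[rank]]
        for step in range(world):
            src = (rank - step) % world  # owner of the KV chunk held at this step
            k_blk, v_blk = k_chunks[src], v_chunks[src]
            new_state = []
            for qi, (m, l, acc) in zip(q_chunks[rank], state):
                scores = [dot(qi, kj) / math.sqrt(d) for kj in k_blk]
                m_new = max(m, max(scores))
                scale = math.exp(m - m_new)  # rescale what was accumulated so far
                w = [math.exp(s - m_new) for s in scores]
                l_new = l * scale + sum(w)
                acc_new = [a * scale + sum(wj * vj[c] for wj, vj in zip(w, v_blk))
                           for c, a in enumerate(acc)]
                new_state.append((m_new, l_new, acc_new))
            state = new_state
        outputs.append([[a / l for a in acc] for _, l, acc in state])
    return outputs


def split(rows, world):
    chunk = len(rows) // world
    return [rows[i * chunk:(i + 1) * chunk] for i in range(world)]


random.seed(0)
world, seq, d = 4, 16, 8
q, k, v = ([[random.gauss(0, 1) for _ in range(d)] for _ in range(seq)]
           for _ in range(3))

parts = ring_attention(split(q, world), split(k, world), split(v, world))
ring_out = [row for part in parts for row in part]
ref = full_attention(q, k, v)
max_err = max(abs(x - y) for ra, rb in zip(ring_out, ref) for x, y in zip(ra, rb))
print("max abs error vs. full attention:", max_err)
```

Each rank only ever holds its own Q chunk plus one KV chunk at a time, which is what keeps per-GPU memory bounded as the total sequence grows.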

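3. Key differences between SP and CP

Aspect	Sequence Parallelism (SP)	Context Parallelism (CP)
Main goal	Remove the memory redundancy in Tensor Parallelism.	Handle Attention compute and a KV cache that no longer fit on one GPU at very long sequence lengths.
What is sharded	Mainly LayerNorm, Dropout, and other parts that TP leaves unsharded.	All layers, in particular the core Attention computation.
Communication pattern	A modified TP pattern (All-Reduce split into Reduce-Scatter + All-Gather).	P2P (Ring) or All-to-All.
Topology	Usually among GPUs within one node (intra-node), because the bandwidth demand is very high.	Can scale across nodes (inter-node); Ring Attention in particular tolerates lower bandwidth relatively well.
One-line summary	"A patch for TP": makes TP use less memory.	"Attention split across GPUs": several GPUs jointly compute one very large Attention matrix.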
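The All-to-All pattern in the CP column can be illustrated by tracking which (token, head) pairs each rank owns before and after the exchange. This is a layout-only sketch with a hypothetical `all_to_all` helper written in plain Python; no actual attention is computed, and the sizes are arbitrary.

```python
def all_to_all(send):
    """send[src][dst] becomes recv[dst][src], as in an all-to-all collective."""
    world = len(send)
    return [[send[src][dst] for src in range(world)] for dst in range(world)]


world, seq, heads = 4, 8, 4
tok_per_rank, heads_per_rank = seq // world, heads // world

# Sequence-sharded layout: rank r owns its token slice for every head.
seq_sharded = [[(t, h)
                for t in range(r * tok_per_rank, (r + 1) * tok_per_rank)
                for h in range(heads)]
               for r in range(world)]

# First all-to-all: send each (token, head) entry to the rank that owns that head.
send = [[[(t, h) for (t, h) in local if h // heads_per_rank == dst]
         for dst in range(world)]
        for local in seq_sharded]
head_sharded = [sorted(item for part in parts for item in part)
                for parts in all_to_all(send)]

for r, items in enumerate(head_sharded):
    tokens = sorted({t for t, _ in items})
    owned_heads = sorted({h for _, h in items})
    print(f"rank {r}: tokens {tokens}, heads {owned_heads}")

# Second all-to-all: send each entry back to the rank that owns that token slice.
send_back = [[[(t, h) for (t, h) in local if t // tok_per_rank == dst]
              for dst in range(world)]
             for local in head_sharded]
restored = [sorted(item for part in parts for item in part)
            for parts in all_to_all(send_back)]
print("layout restored:", restored == seq_sharded)
```

After the first exchange every rank sees the whole sequence for its own heads, so a standard single-GPU attention kernel can run unchanged. The sketch also makes the constraint visible: the number of heads has to be divisible by the number of ranks.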

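Summary

If you run standard model-parallel training with Megatron-LM, enabling SP saves memory and lets you run somewhat longer sequences or a larger batch size.
If you train or serve models with very long contexts such as 200K or 1M tokens, CP is effectively required, because it is what makes the Attention computation tractable.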

Context Parallelism

https://mp.weixin.qq.com/s/8fslGx6DjCL69bjoPvRYIA

learn

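Benefits of running multiple batches:

A GPU can process them in parallel, similar to SIMD.
Computation happens in stages, so multiple batches allow overlapping memory access with compute, and overlapping work across GPUs so that no GPU sits idle, for example with AF (attention/FFN) disaggregation (https://mp.weixin.qq.com/s/GTy44thBYF_6YfOy2EG8og).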
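The overlap argument can be made concrete with a toy timeline. The sketch below models two resources, a compute unit and a network link, and a list of stages that every micro-batch runs in order. It assumes, optimistically, that splitting a batch in two halves the duration of every stage; the function name and the numbers are illustrative only.

```python
def makespan(num_batches, stages):
    """Total time when every micro-batch runs `stages` in order.

    stages is a list of (resource, duration); each resource serves one
    micro-batch at a time, and a micro-batch runs one stage at a time.
    """
    resource_free = {"compute": 0.0, "comm": 0.0}
    batch_ready = [0.0] * num_batches
    for resource, duration in stages:
        for b in range(num_batches):
            start = max(resource_free[resource], batch_ready[b])
            end = start + duration
            resource_free[resource] = end
            batch_ready[b] = end
    return max(batch_ready)


layers = 4
# One full batch: 2.0 time units of compute, then 2.0 of communication, per layer.
full = [("compute", 2.0), ("comm", 2.0)] * layers
# Two micro-batches of half the size (assumed to take half the time per stage).
half = [("compute", 1.0), ("comm", 1.0)] * layers

single = makespan(1, full)
overlapped = makespan(2, half)
print("one batch, no overlap:", single)
print("two micro-batches    :", overlapped)
```

While one micro-batch is communicating, the other one is computing, so both resources stay busy except at the very start and end of the schedule.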

How SGLang implements two-batch overlap (TBO)

https://github.com/sgl-project/sglang/pull/8144

SBO (Single Batch Overlap): fine-grained overlap of expert computation and communication within a single batch, implemented with CUDA streams.

layers/attention/tbo_backend.py

srt/batch_overlap/two_batch_overlap.py

IB_DEVICES=$(find /dev/infiniband/* -maxdepth 1 -not -type d | xargs -I{} echo '--device {}:{}')
docker run  ${IB_DEVICES} --gpus all --ipc=host --cap-add SYS_NICE --cap-add IPC_LOCK -v /home/xxx/:/workspace --name xxx_vllm --entrypoint bash -it  vllm-openai:0.8.0

docker run  --gpus all --ipc=host -v /data/:/workspace --name xxx_vllm --entrypoint bash -it  vllm/vllm-openai:v0.8.3 

docker run  --gpus all --ipc=host -v /home/xxx/:/workspace --name xxx_vllm --entrypoint bash -it  vllm-openai:0.8.0

git clone https://github.com/deepseek-ai/DeepEP.git
wget https://developer.nvidia.com/downloads/assets/secure/nvshmem/nvshmem_src_3.2.5-1.txz
tar xf nvshmem_src_3.2.5-1.txz

######################## Install NVSHMEM ########################
cd nvshmem_src
git apply ../DeepEP/third-party/nvshmem.patch

# Note: the settings below differ slightly from the official DeepEP instructions, because our machines need them for the internode path to work
export NVSHMEM_IBGDA_SUPPORT=1
export NVSHMEM_IBRC_SUPPORT=1

export NVSHMEM_SHMEM_SUPPORT=0
export NVSHMEM_UCX_SUPPORT=0
export NVSHMEM_USE_NCCL=0
export NVSHMEM_PMIX_SUPPORT=0
export NVSHMEM_TIMEOUT_DEVICE_POLLING=0
export NVSHMEM_BUILD_TESTS=0
export NVSHMEM_BUILD_EXAMPLES=0
export NVSHMEM_MPI_SUPPORT=0
export NVSHMEM_BUILD_HYDRA_LAUNCHER=0
export NVSHMEM_BUILD_TXZ_PACKAGE=0

export NVSHMEM_DIR="${HOME}/nvshmem"
export LD_LIBRARY_PATH="${NVSHMEM_DIR}/lib:$LD_LIBRARY_PATH"
export PATH="${NVSHMEM_DIR}/bin:$PATH"
export CUDA_HOME=/usr/local/cuda

export NVSHMEM_USE_GDRCOPY=1
export GDRCOPY_HOME=/root/paddlejob/gdrcopy

cmake -G Ninja -S . -B build -DCMAKE_INSTALL_PREFIX="${NVSHMEM_DIR}"
cmake --build build/ --target install


######################## Install DeepEP ########################
cd ../DeepEP
python setup.py build
python setup.py install

python tests/test_intranode.py
# python tests/test_internode.py  (needs multiple nodes, see below)
python tests/test_low_latency.py

# https://blog.csdn.net/eloudy/article/details/143486017
https://github.com/NVIDIA/gdrcopy/archive/refs/tags/v2.5.1.tar.gz

sudo apt install build-essential devscripts debhelper fakeroot pkg-config dkms

cd packages
CUDA=/usr/local/cuda ./build-deb-packages.sh

cp -r /usr/src/nvidia-550.142 .
export NVIDIA_SRC_DIR=/workspace/examples/ep/nvidia-550.142/nvidia
# make prefix=<install-to-this-location> CUDA=/usr/local/cuda all install
make CUDA=/usr/local/cuda all install
make CUDA=/data/xxx/examples/ep/cuda-12.4 all install
sudo ./insmod.sh

Multi-node usage:

START_RANK=2
END_RANK=4
if [[ ${PADDLE_TRAINER_ID} -lt $START_RANK ]]; then exit; fi
if [[ ${PADDLE_TRAINER_ID} -ge $END_RANK ]]; then exit; fi

export WORLD_SIZE=$(($END_RANK - $START_RANK))
export RANK=$(($PADDLE_TRAINER_ID - ${START_RANK}))
echo "rank: ${RANK}, nnodes: ${WORLD_SIZE}"

unset PADDLE_ELASTIC_JOB_ID PADDLE_TRAINER_ENDPOINTS DISTRIBUTED_TRAINER_ENDPOINTS FLAGS_START_PORT PADDLE_ELASTIC_TIMEOUT
for name in `env | grep -E 'PADDLE|ENDPOINT' | awk -F'=' '{print $1}'`; do unset ${name}; done

export MASTER_ADDR="10.54.101.209"
export MASTER_PORT=58978

export FLAGS_eager_communication_connection=1

export NCCL_DEBUG=WARN
export NCCL_IB_QPS_PER_CONNECTION=8
export NCCL_IB_GID_INDEX=3
export NCCL_NVLS_ENABLE=0

export NVSHMEM_IB_GID_INDEX=3
export NVSHMEM_IB_TRAFFIC_CLASS=162
export NVSHMEM_BOOTSTRAP=UID
export NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME=eth0

python tests/test_internode.py

NVSHMEM

######################## Install nvshmrun ########################
cd nvshmem_src/scripts
export NVSHMEM_HOME="${HOME}/nvshmem"
./install_hydra.sh . "${NVSHMEM_HOME}"

######################## Build HelloWorld ########################
# Save the first code sample from https://docs.nvidia.com/nvshmem/api/using.html as nvshmem_hello.cu

nvcc -rdc=true -ccbin g++ -gencode=arch=compute_90,code=sm_90 \
  -I $NVSHMEM_HOME/include nvshmem_hello.cu -o nvshmem_hello.out \
  -L $NVSHMEM_HOME/lib -lnvshmem -lnvidia-ml -lcuda -lcudart

######################## Run HelloWorld ########################
nvshmrun -n 4 ./nvshmem_hello.out

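Compared with traditional dense architectures, MoE with large-scale expert parallelism (EP) poses new challenges for infrastructure and communication optimization. In particular, large-scale EP inevitably introduces two cross-node all-to-all transfers (dispatch and combine). How to raise throughput in training, lower latency in inference, and hide this communication effectively becomes a key consideration in infrastructure design.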
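The two all-to-all transfers can be traced with a toy example: tokens are sent to the rank hosting their expert (dispatch), processed there, and sent back to their source rank (combine). The sketch assumes one expert per rank, a deterministic top-1 "router", and a trivial expert function; all names are hypothetical and the collective is simulated in plain Python.

```python
def all_to_all(send):
    """send[src][dst] becomes recv[dst][src], as in an all-to-all collective."""
    world = len(send)
    return [[send[src][dst] for src in range(world)] for dst in range(world)]


world = 4  # assumption: one expert per rank
# tokens[rank] holds the token values that live on that rank.
tokens = [[10 * r + i for i in range(5)] for r in range(world)]


def route(value):
    """Toy top-1 router; a real router is a learned gating network."""
    return value % world


def expert(expert_id, value):
    """Stand-in for the expert FFN."""
    return value * (expert_id + 1)


# 1) Dispatch: first all-to-all, each token goes to the rank hosting its expert.
#    The local position travels with the token so results can be put back in order.
send = [[[(pos, v) for pos, v in enumerate(local) if route(v) == dst]
         for dst in range(world)]
        for local in tokens]
received = all_to_all(send)

# 2) Expert compute on each rank, results kept grouped by source rank.
results = [[[(pos, expert(rank, v)) for pos, v in part] for part in received[rank]]
           for rank in range(world)]

# 3) Combine: second all-to-all, results return to the rank that owns the token.
returned = all_to_all(results)
output = []
for rank in range(world):
    local_out = [None] * len(tokens[rank])
    for part in returned[rank]:
        for pos, y in part:
            local_out[pos] = y
    output.append(local_out)

expected = [[expert(route(v), v) for v in local] for local in tokens]
print("tokens received per expert rank:", [sum(len(p) for p in r) for r in received])
print("matches single-device result:", output == expected)
```

Both exchanges sit on the critical path of every MoE layer, which is why overlapping them with computation matters for throughput and latency.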

proxychains4 git clone https://github.com/NVIDIA/cuda-samples.git
CUDACXX=/usr/local/cuda-12/bin/nvcc cmake -B build -S .
CUDACXX=/usr/local/cuda-12/bin/nvcc cmake --build build --target p2pBandwidthLatencyTest
# vim Samples/5_Domain_Specific/p2pBandwidthLatencyTest/CMakeLists.txt  (remove the unneeded archs)
