profile and debug

Build the kernels

git clone --recursive git@github.com:flashinfer-ai/flashinfer.git
cd flashinfer
rm -rf ~/.cache/flashinfer
python -m pip install -v .

git clone git@github.com:sgl-project/sglang.git
cd sglang 
pip install -e "python[all]" 
cd sgl-kernel && make build -j8
pip install dist/sgl_xxx.whl --force-reinstall

profile

python3 -m sglang.launch_server --model openai/gpt-oss-120b --tp 8 --port 30000 --attention-backend trtllm_mha

python3 -m sglang.bench_serving --model openai/gpt-oss-120b --dataset-name random --backend sglang-oai --random-range-ratio 1 --random-input-len 1200 --random-output-len 20 --max-concurrency 1 --num-prompts 5 --profile --port 30000


Debugging C++ CUDA kernels bound through TORCH_LIBRARY_FRAGMENT

This is a multi-layer debugging problem: Python → PyTorch C++ extension → CUDA kernel. The main approaches are listed below.

1. Debug the CUDA kernel with `cuda-gdb`

This is the most direct approach.

Steps:

Compile with debug symbols: make sure the CUDA code is built with -g -G (-G emits device-side debug info):

# In setup.py
extra_compile_args={
    'nvcc': ['-g', '-G', '-O0', '--ptxas-options=-v'],
    'cxx': ['-g', '-O0']
}

Launch Python under cuda-gdb:
```
cuda-gdb --args python your_script.py
```

Set a breakpoint in the kernel:

(cuda-gdb) break my_kernel_name
(cuda-gdb) run
# After the breakpoint is hit
(cuda-gdb) cuda thread        # show the current CUDA thread
(cuda-gdb) cuda block         # show the current block
(cuda-gdb) print threadIdx.x  # print the thread index
(cuda-gdb) info cuda kernels  # list the active kernels

Switch focus to a specific thread:

(cuda-gdb) cuda thread (0,0,0) block (0,0,0)

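2. Use NVIDIA Nsight Systems / Nsight Compute for performance and correctness analysis

Nsight Compute (kernel-level analysis):

ncu --set full -o profile_output python your_script.py

Nsight Systems (whole-program timeline):

nsys profile -o timeline_output python your_script.py

These tools do not offer breakpoint debugging, but they help locate performance bottlenecks and abnormal kernel execution.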

3. Check for memory errors with `compute-sanitizer`

It plays a role similar to AddressSanitizer on the CPU:

compute-sanitizer --tool memcheck python your_script.py
compute-sanitizer --tool racecheck python your_script.py  # check for race conditions
compute-sanitizer --tool initcheck python your_script.py  # check for uninitialized memory

This is usually the fastest way to track down an illegal memory access.
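To run the three checks above back to back, a small driver can loop over the tools. The following is only a sketch: `sanitizer_command` and `run_all` are hypothetical helper names, not part of compute-sanitizer, and `your_script.py` is the same placeholder used in the commands above.

```python
import shutil
import subprocess
import sys

# The three compute-sanitizer tools used in the commands above.
TOOLS = ("memcheck", "racecheck", "initcheck")


def sanitizer_command(tool, script, *script_args):
    """Build the argv list for one compute-sanitizer run (hypothetical helper)."""
    if tool not in TOOLS:
        raise ValueError(f"unknown tool: {tool!r}")
    # sys.executable keeps the run inside the current Python environment.
    return ["compute-sanitizer", "--tool", tool, sys.executable, script, *script_args]


def run_all(script, *script_args):
    """Run the script once under each tool and return {tool: exit code}."""
    if shutil.which("compute-sanitizer") is None:
        raise RuntimeError("compute-sanitizer not found on PATH")
    results = {}
    for tool in TOOLS:
        cmd = sanitizer_command(tool, script, *script_args)
        print("running:", " ".join(cmd))
        # The sanitizer report goes to the terminal; read it rather than
        # relying on the exit code alone.
        results[tool] = subprocess.run(cmd).returncode
    return results
```

Calling `run_all("your_script.py")` on a machine with the CUDA toolkit installed would run the script three times, once per tool, so it is best paired with a small reproduction input.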

4. Printf debugging (the simplest practical option)

Call printf directly in the kernel code:

__global__ void my_kernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx == 0) {  // let only one thread print to keep the output manageable
        printf("data[0] = %f, n = %d\n", data[0], n);
    }
}

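Notes:

The CUDA printf buffer is limited (1 MB by default); it can be resized with cudaDeviceSetLimit(cudaLimitPrintfFifoSize, size).
Always restrict the number of printing threads, otherwise the output volume becomes huge.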

5. Debugging the C++ binding layer (the TORCH_LIBRARY_FRAGMENT part)

If the problem lies in the C++/Python binding layer rather than in the kernel itself:

# Debug the C++ side with gdb
gdb --args python your_script.py

(gdb) break my_cpp_wrapper_function
(gdb) run

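This can be combined with Python-side debugging:

# First add a breakpoint to the Python script
import pdb; pdb.set_trace()

# Once it stops, confirm in pdb that the arguments are correct, then continue into the C++ layer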
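The argument check done by hand at the pdb prompt can also be scripted. The sketch below uses a hypothetical helper, `check_arg`, which is not a PyTorch API; it relies only on the `shape`, `dtype`, and `device` attributes and the `is_contiguous()` method that `torch.Tensor` exposes, so it is demonstrated here on a stand-in object. The contiguity check assumes the kernel indexes raw pointers, which may not hold for every op.

```python
from types import SimpleNamespace


def check_arg(name, t, *, shape=None, dtype=None, device_type="cuda"):
    """Collect problems with a tensor-like argument before it reaches C++.

    Hypothetical helper: `t` only needs .shape, .dtype, .device and
    .is_contiguous(), all of which torch.Tensor provides.
    """
    problems = []
    if shape is not None and tuple(t.shape) != tuple(shape):
        problems.append(f"{name}: shape {tuple(t.shape)}, expected {tuple(shape)}")
    if dtype is not None and t.dtype != dtype:
        problems.append(f"{name}: dtype {t.dtype}, expected {dtype}")
    # A torch.device prints as "cuda:0" or "cpu", so compare on the prefix.
    if not str(t.device).startswith(device_type):
        problems.append(f"{name}: on device {t.device}, expected {device_type}")
    # Assumption: the kernel indexes raw pointers and needs contiguous memory.
    if not t.is_contiguous():
        problems.append(f"{name}: not contiguous")
    return problems


# Stand-in object so the sketch runs without a GPU; pass a real tensor in practice.
fake = SimpleNamespace(shape=(2, 3), dtype="float32", device="cpu",
                       is_contiguous=lambda: True)
for line in check_arg("x", fake, shape=(2, 3), dtype="float32"):
    print(line)
```

Running such a check on every argument just before the custom op is called narrows a failure down to either bad inputs or the C++/CUDA side.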

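6. A practical mixed-debugging workflow

Recommended end-to-end debugging flow:

Step	Tool	Purpose
1	Python `print` / `pdb`	Confirm that the shape, dtype, and device of the tensors passed to C++ are correct
2	`compute-sanitizer`	Quickly check for illegal memory accesses
3	`printf` inside the kernel	Verify the kernel's internal logic and data values
4	`cuda-gdb`	Pinpoint complex bugs
5	`ncu` / `nsys`	Performance tuning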

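7. Common tips

Synchronous execution: calling torch.cuda.synchronize() on the Python side makes a pending asynchronous CUDA error surface at that call, which helps pinpoint the failing kernel.
Environment variable: CUDA_LAUNCH_BLOCKING=1 python your_script.py forces every kernel launch to run synchronously, which makes errors easier to locate.
Small-scale reproduction: construct the smallest input that reproduces the problem to reduce debugging complexity.
CPU fallback: first write an equivalent CPU implementation and compare its output against the CUDA output to confirm the algorithm is correct.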
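The CPU-fallback comparison can be sketched as follows. `first_mismatch` is a hypothetical helper that works on flat Python sequences (for a tensor, something like `t.flatten().tolist()` would produce one). The tolerance rule |a - e| <= atol + rtol * |e| follows the one documented for `torch.allclose`; the default tolerances here are arbitrary assumptions and should be tuned to the dtype in use.

```python
import math


def first_mismatch(actual, expected, rtol=1e-3, atol=1e-5):
    """Return (index, actual, expected) for the first element outside
    tolerance, or None if every element is close (hypothetical helper)."""
    if len(actual) != len(expected):
        raise ValueError(f"length mismatch: {len(actual)} vs {len(expected)}")
    for i, (a, e) in enumerate(zip(actual, expected)):
        # NaN never compares close, so report it as a mismatch.
        if math.isnan(a) or math.isnan(e):
            return i, a, e
        if abs(a - e) > atol + rtol * abs(e):
            return i, a, e
    return None


# Example with made-up values: the second element differs by 0.5.
cuda_out = [1.0, 2.5, 3.0]
cpu_ref = [1.0, 2.0, 3.0]
print(first_mismatch(cuda_out, cpu_ref))
```

Reporting the first differing index rather than a single pass/fail flag makes it easier to map the error back to a specific block and thread in the kernel.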
