# What is GPU pin_memory?

## GPU pin_memory
pin_memory page-locks a region of RAM: the address range is locked and cannot be paged out. The word "pin" is apt, much like `Pin` in Rust: the memory is nailed in place, will not be moved or released out from under you, and is therefore safe to reference. When transferring data to the GPU, pinned memory can be copied via DMA, which is efficient. Ordinary pageable CPU memory, by contrast, may get swapped out, and accessing it then triggers a page fault, which makes the transfer much slower.
> The CUDA driver checks whether the memory range is locked and then uses a different code path. Locked memory is stored in physical memory (RAM), so the device can fetch it without help from the CPU (DMA, aka async copy; the device only needs the list of physical pages). Non-locked memory can generate a page fault on access, and it is not necessarily stored in RAM (e.g. it can be in swap), so the driver needs to access every page of the non-locked memory, copy it into a pinned buffer, and pass that to DMA (a synchronous, page-by-page copy).
Reference: why-is-cuda-pinned-memory-so-fast (Stack Overflow)
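A minimal sketch of how this shows up in PyTorch (which vLLM and LMDeploy both build on): `Tensor.pin_memory()` page-locks a host buffer, and a `.to("cuda", non_blocking=True)` copy from it can then run as an asynchronous DMA transfer.

```python
import torch

# Ordinary pageable host tensor: an H2D copy from it is synchronous
# (the driver stages it page by page through an internal pinned buffer).
pageable = torch.randn(1024, 1024)

if torch.cuda.is_available():
    # Page-locked (pinned) copy: the GPU can DMA from it directly, so the
    # host-to-device copy can run asynchronously and overlap with compute.
    pinned = pageable.pin_memory()

    # non_blocking=True only truly overlaps when the source is pinned.
    device_t = pinned.to("cuda", non_blocking=True)
    torch.cuda.synchronize()

# DataLoader exposes the same switch: pin_memory=True makes the loader
# collate batches into page-locked host buffers.
# loader = torch.utils.data.DataLoader(dataset, batch_size=32, pin_memory=True)
```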
## Usage in inference libraries

### Related code in vLLM
vLLM decides whether pin_memory is available based on the GPU platform and environment. For example, pinning memory in WSL is not supported:
```python
from functools import lru_cache

# From vllm/utils.py; in_wsl, print_warning_once and current_platform
# are vLLM-internal helpers.
@lru_cache(maxsize=None)
def is_pin_memory_available() -> bool:
    if in_wsl():
        # Pinning memory in WSL is not supported.
        # https://docs.nvidia.com/cuda/wsl-user-guide/index.html#known-limitations-for-linux-cuda-applications
        print_warning_once("Using 'pin_memory=False' as WSL is detected. "
                           "This may slow down the performance.")
        return False
    elif current_platform.is_xpu():
        print_warning_once("Pin memory is not supported on XPU.")
        return False
    elif current_platform.is_neuron():
        print_warning_once("Pin memory is not supported on Neuron.")
        return False
    elif current_platform.is_hpu():
        print_warning_once("Pin memory is not supported on HPU.")
        return False
    elif current_platform.is_cpu() or current_platform.is_openvino():
        return False
    return True
```
See also: https://github.com/vllm-project/vllm/issues/188
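A hedged usage sketch (the buffer name and shape are illustrative, not vLLM's actual allocation code): callers use `is_pin_memory_available()` to decide whether host-side buffers should be page-locked.

```python
import torch

# Illustrative caller: pin the host staging buffer only when the
# platform supports it; torch.empty accepts pin_memory directly.
pin = is_pin_memory_available()
cpu_cache = torch.empty(16, 1024, dtype=torch.float16, pin_memory=pin)
```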
### Usage in LMDeploy

LMDeploy has a similar pin_memory check, in:

`lmdeploy-0.6.1.2/lmdeploy/pytorch/engine/cache_engine.py`
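The file above holds the actual check; below is a minimal sketch of the general pattern only (the helper name `allocate_host_cache` is hypothetical, not LMDeploy's API): host-side KV-cache blocks are allocated pinned when possible and pageable otherwise.

```python
import torch

def allocate_host_cache(shape, dtype=torch.float16):
    """Hypothetical helper, not LMDeploy's actual API: allocate a
    host-side KV-cache block, page-locked when the platform allows."""
    pin = torch.cuda.is_available()  # assumption: pinning requires CUDA
    try:
        return torch.empty(shape, dtype=dtype, pin_memory=pin)
    except RuntimeError:
        # Fall back to pageable memory if pinning fails (e.g. under WSL).
        return torch.empty(shape, dtype=dtype, pin_memory=False)
```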