
What is GPU pin_memory?

GPU pin_memory

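pin_memory means pinning a block of host RAM: the pages in that range are page-locked. The word "pin" is apt, and close in spirit to Pin in Rust: the memory is nailed in place, so the operating system will not swap it out or relocate it, and its physical pages are safe to access directly. Because the pages are guaranteed to be resident, the GPU can transfer the data efficiently via DMA. Ordinary pageable CPU memory, by contrast, may be swapped out, and touching it then triggers a page fault, which makes the transfer much slower.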
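In PyTorch, which the inference libraries discussed later build on, pinned host memory can be requested when a tensor is allocated. The following is a minimal sketch, assuming PyTorch and a CUDA device may or may not be present; the function name `upload_with_pinned_buffer` is hypothetical, and the function simply returns None when no GPU is available.

```python
try:
    import torch
except ImportError:  # PyTorch missing: the sketch degrades to a no-op
    torch = None


def upload_with_pinned_buffer(num_elements: int = 1 << 20):
    """Copy a pinned host tensor to the GPU; returns None without CUDA."""
    if torch is None or not torch.cuda.is_available():
        return None
    # Allocate the host tensor directly in page-locked (pinned) memory.
    host = torch.empty(num_elements, dtype=torch.float32, pin_memory=True)
    host.fill_(1.0)
    assert host.is_pinned()
    # With a pinned source, non_blocking=True allows the copy to be
    # issued asynchronously with respect to the host thread.
    device = host.to("cuda", non_blocking=True)
    torch.cuda.synchronize()  # wait for the copy before using the result
    return device


result = upload_with_pinned_buffer(1024)
print("copied to GPU" if result is not None else "no CUDA device, skipped")
```

An existing pageable tensor can also be converted with `tensor.pin_memory()`, at the cost of one extra host-side copy.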

The CUDA driver checks whether the memory range is locked and then uses a different code path accordingly. Locked memory is stored in physical memory (RAM), so the device can fetch it without help from the CPU (DMA, a.k.a. async copy; the device only needs the list of physical pages). Non-locked memory can generate a page fault on access, and it is not necessarily stored only in RAM (e.g. it can be in swap), so the driver needs to access every page of the non-locked memory, copy it into a pinned buffer, and pass that to DMA (synchronous, page-by-page copy).

Reference: why-is-cuda-pinned-memory-so-fast
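The two code paths described in the quote can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not how the CUDA driver is implemented: `PAGE_SIZE`, `dma_copy`, `copy_pinned`, and `copy_pageable` are illustrative names, and a `bytearray` stands in for device memory.

```python
PAGE_SIZE = 4096  # bytes per page in this toy model


def dma_copy(src: bytes, dst: bytearray, offset: int) -> None:
    """Stand-in for a DMA transfer: one bulk copy into 'device' memory."""
    dst[offset:offset + len(src)] = src


def copy_pinned(host: bytes, device: bytearray) -> int:
    """Pinned source: the whole range goes to DMA in a single step."""
    dma_copy(host, device, 0)
    return 0  # number of CPU staging copies


def copy_pageable(host: bytes, device: bytearray) -> int:
    """Pageable source: each page is staged through a pinned bounce buffer."""
    staging = bytearray(PAGE_SIZE)  # plays the role of the pinned buffer
    staged = 0
    for off in range(0, len(host), PAGE_SIZE):
        chunk = host[off:off + PAGE_SIZE]
        staging[:len(chunk)] = chunk  # extra CPU copy, page by page
        dma_copy(bytes(staging[:len(chunk)]), device, off)
        staged += 1
    return staged


host = bytes(range(256)) * 40  # 10240 bytes, i.e. 2.5 pages
dev_a = bytearray(len(host))
dev_b = bytearray(len(host))
print(copy_pinned(host, dev_a), copy_pageable(host, dev_b))  # prints "0 3"
```

Both paths end with identical data on the "device"; the pageable path just pays for one extra CPU copy per page, which is the overhead pinning avoids.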

Usage in inference libraries

Relevant code in vLLM

vLLM decides whether pin_memory is usable based on the GPU platform and the runtime environment. For example: Pinning memory in WSL is not supported.

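# Excerpt from vLLM: in_wsl, print_warning_once, and current_platform are vLLM internals.
from functools import lru_cache


@lru_cache(maxsize=None)
def is_pin_memory_available() -> bool:
    # Cached, so the platform checks below run only once per process.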
    if in_wsl():
        # Pinning memory in WSL is not supported.
        # https://docs.nvidia.com/cuda/wsl-user-guide/index.html#known-limitations-for-linux-cuda-applications
        print_warning_once("Using 'pin_memory=False' as WSL is detected. "
                           "This may slow down the performance.")
        return False
    elif current_platform.is_xpu():
        print_warning_once("Pin memory is not supported on XPU.")
        return False
    elif current_platform.is_neuron():
        print_warning_once("Pin memory is not supported on Neuron.")
        return False
    elif current_platform.is_hpu():
        print_warning_once("Pin memory is not supported on HPU.")
        return False
    elif current_platform.is_cpu() or current_platform.is_openvino():
        return False
    return True

https://github.com/vllm-project/vllm/issues/188
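The excerpt does not show how `in_wsl()` works. One common heuristic, assumed here rather than taken from the excerpt, is to look for "microsoft" in the kernel's uname strings. The sketch below uses only the standard library; `looks_like_wsl` and `pin_memory_flag` are hypothetical names, and the optional argument exists only so the check can be exercised without running under WSL.

```python
import platform
from functools import lru_cache
from typing import Iterable, Optional


def looks_like_wsl(uname_fields: Optional[Iterable[str]] = None) -> bool:
    """Heuristic: WSL kernels report 'microsoft' in their uname strings."""
    if uname_fields is None:
        uname_fields = platform.uname()
    return "microsoft" in " ".join(uname_fields).lower()


@lru_cache(maxsize=None)
def pin_memory_flag() -> bool:
    """Hypothetical helper: evaluated once per process, then cached."""
    return not looks_like_wsl()


# A WSL2-style kernel release string versus an ordinary Linux one.
print(looks_like_wsl(["Linux", "host", "5.15.90.1-microsoft-standard-WSL2"]))
print(looks_like_wsl(["Linux", "host", "6.1.0-generic"]))
```

A flag like this would typically be passed as the `pin_memory=` argument when allocating host-side tensors, so that the same code runs on platforms where pinning is unavailable.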

Usage in lmdeploy

lmdeploy likewise contains a check on pin_memory.

lmdeploy-0.6.1.2/lmdeploy/pytorch/engine/cache_engine.py
