lightllm杂记

https://lightllm-cn.readthedocs.io/en/latest/framework/router.html#efficient-router

import numpy as np

def demo():
    max_total_token_num = 100
    req_list = [(5, 4), (4, 3), (5, 3), (3, 2), (4, 2)]  # (run_len, left_output_len)
    req_list.sort(key=lambda x: -x[1])

    left_out_len_array = np.array([e[1] for e in req_list])
    has_run_len_array = np.array([e[0] for e in req_list])
    cum_run_len_array = np.cumsum(has_run_len_array)
    size_array = np.arange(1, len(req_list) + 1, 1)
    need_max_token_num = (left_out_len_array * size_array + cum_run_len_array).max()

    if need_max_token_num <= max_total_token_num:
        print("ok")
    else:
        print("oom")

这段代码是一个 LLM 推理中的内存预算检测算法，用于判断一批并行请求在执行过程中是否会超出最大 token 限制（OOM 检测）。

核心思路

输入数据：每个请求是 (run_len, left_output_len) 元组
- run_len：已经运行/占用的 token 长度
- left_output_len：剩余需要生成的输出 token 数
按剩余输出长度降序排序：让"还需生成更多 token"的请求排在前面

关键计算：对于排序后的第 i 个位置（1-indexed）：

需要的空间 = 前i个请求已占用的token累计 + 第i个请求的剩余输出长度 × i

算法逻辑

因为按 left_output_len 降序排列，当第 i 个请求完成其所有输出时：

前 i 个请求各自至少还要生成 left_out_len_array[i] 个 token
所以此刻最坏情况下的总占用 = 累计已运行长度 + 剩余输出长度 × 并行请求数

数值示例

排序后：[(5,4), (4,3), (5,3), (3,2), (4,2)]

i	left_out	cum_run	need = cum_run + left_out×i
1	4	5	5 + 4×1 = 9
2	3	9	9 + 3×2 = 15
3	3	14	14 + 3×3 = 23
4	2	17	17 + 2×4 = 25
5	2	21	21 + 2×5 = 31

最大值 31 ≤ 100，所以输出 "ok"。

这是 continuous batching / iteration-level scheduling 中常用的贪心策略，确保在任意时刻都不会超出预算。

核心思路​

算法逻辑​

数值示例​

核心思路

算法逻辑

数值示例