(코드 탐색) vllm

(코드 탐색) vllm (github)

Class hierarchy

VllmConfig class 가 공통 config 로 사용됨.
개별 모델의 constructor 를 Model 로 통일.

Huggingface 모델 로딩 방법

config.json 에서 정보 취득.
AutoTokenizer 로 tokenize.
weight 는 AutoModel 이 아닌 vLLM 자체 코드 사용.

CUDA graphs

In vLLM V1, piecewise cudagraphs are captured between attention operation (i.e. the first graph before any attention operation, the last graph after all the attention operation)
cudagraphs are captured and managed by the compiler backend, and replayed when the batch size has corresponding cudagraph captured

# graph capture 예V시
bool graphCreated=false;
cudaGraph_t graph;
cudaGraphExec_t instance;
for(int istep=0; istep<NSTEP; istep++){
  if(!graphCreated){
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for(int ikrnl=0; ikrnl<NKERNEL; ikrnl++){
      shortKernel<<<blocks, threads, 0, stream>>>(out_d, in_d);
    }
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);
    graphCreated=true;
  }
  cudaGraphLaunch(instance, stream);
  cudaStreamSynchronize(stream);
}

vLLM-torch.compile overview

full graph is captured via TorchDynamo
TorchInductor to compile each graph into a compiled artifact (vLLM custom Inductor passes may be used to further optimize the graph)
compiled artifact is saved to vLLM’s compile cache for future use

Paged Attention

scalar_t: e.g. FP32, FP16
HEAD_SIZE: # elements at one head
BLOCK_SIZE: # tokens at one head
NUM_THREADS: # threads in each gpu thread block
VEC_SIZE: # elements for query & key
V_VEC_SIZE: # elements for value
THREAD_GROUP_SIZE: # threads that fetches 1 query token (or 1 key token) at a time
x: # total elements processed by one thread group
PARTITION_SIZE: # tensor parallel GPUs
WARP_SIZE: 32 threads per warp (warp processes the calculation between 1 query token and key tokens of one entire block at a time)

# query pointer
const scalar_t* q_ptr = q + seq_idx * q_stride + head_idx * HEAD_SIZE;

# query vector
__shared__ Q_vec q_vecs[THREAD_GROUP_SIZE][NUM_VECS_PER_THREAD];

# key pointer
const scalar_t* k_ptr = k_cache + physical_block_number * kv_block_stride
                    + kv_head_idx * kv_head_stride
                    + physical_block_offset * x;

# key vector
K_vec k_vecs[NUM_VECS_PER_THREAD]

# key

# output
scalar_t* out_ptr = out + seq_idx * num_heads * max_num_partitions * HEAD_SIZE
                + head_idx * max_num_partitions * HEAD_SIZE
                + partition_idx * HEAD_SIZE;

Plugins

Plugins: user-registered code that vLLM executes (allows users to add custom features without modifying the vLLM codebase)
- plugins 종류: model, platform, pre/post-processing, logger
IO Processor Plugins: pass a custom input to vLLM that is converted into one or more model prompts and fed to the model encode method.
LoRA Resolver Plugins: dynamically load LoRA adapters at runtime.

Prefix caching

cache the kv-cache blocks of processed requests, and reuse these blocks when a new request comes in with the same prefix as previous requests.

Hybrid KV Cache Manager

Sliding window attention (sw) + full attention (full): gpt-oss, Gemma 2/3, Ministral, cohere, etc.
Mamba + full: Bamba, Jamba, Minimax, etc.
Local chunked attention + full: Llama4

P2P NCLL Connector

Prefill 수행하는 gpu -> decode 수행하는 gpu
ZeroMQ (ZMQ): asynchronous messaging library
NCCL: NVIDIA Collective Communications Library

Dual Batch Overlap

Data Parallel + Expert Parallel 인 경우 대상.
2 개의 CPU worker threads 가 ping-pong 하면서 계산-통신하면서 MoE layer 처리.

Fused MoE Kernel

FusedMoE layer 의 expert parallelism (EP) 를 구현하기 위한 all2all communication backends 가 여럿 존재함 (e.g. triton, flashinfer).
Modular Kernel 도 지원.