(Code exploration) vllm (github)

Class hierarchy

  • The VllmConfig class is used as the common, engine-wide config.
  • The constructor of every individual model is unified under the common Model interface (a minimal sketch follows).
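A minimal sketch of the unified constructor convention, assuming the keyword-only vllm_config / prefix signature; MyModel and its layers are made up for illustration.

# unified model constructor (sketch)
import torch.nn as nn
from vllm.config import VllmConfig

class MyModel(nn.Module):
    def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
        super().__init__()
        # every model is built from the shared VllmConfig;
        # prefix identifies this module inside the full model
        hf_config = vllm_config.model_config.hf_config  # contents of config.json
        self.embed = nn.Embedding(hf_config.vocab_size, hf_config.hidden_size)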

How Hugging Face models are loaded

  • Model information is read from config.json.
  • Tokenization uses AutoTokenizer.
  • Weights are loaded with vLLM's own code, not AutoModel (see the sketch below).
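A rough sketch of this flow with the Hugging Face APIs; the model id is a placeholder, and the weight step is only indicated because vLLM uses its own loader instead of AutoModel.

# HF loading flow (sketch)
from transformers import AutoConfig, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"   # placeholder

# 1) architecture / shape information comes from config.json
hf_config = AutoConfig.from_pretrained(model_id)

# 2) tokenization uses the stock HF tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt_ids = tokenizer("Hello, vLLM!").input_ids

# 3) weights: vLLM does not call AutoModel.from_pretrained();
#    it builds its own model classes and streams the checkpoint
#    tensors into them with its own loading code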

CUDA graphs

  • In vLLM V1, piecewise cudagraphs are captured between attention operations (i.e. the first graph covers everything before the first attention operation, the last graph covers everything after the last attention operation).
  • Cudagraphs are captured and managed by the compiler backend, and replayed whenever the current batch size has a corresponding captured cudagraph.
# graph capture example
bool graphCreated = false;
cudaGraph_t graph;
cudaGraphExec_t instance;
for (int istep = 0; istep < NSTEP; istep++) {
  if (!graphCreated) {
    // capture the whole sequence of kernel launches into a graph once
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int ikrnl = 0; ikrnl < NKERNEL; ikrnl++) {
      shortKernel<<<blocks, threads, 0, stream>>>(out_d, in_d);
    }
    cudaStreamEndCapture(stream, &graph);
    // build an executable instance of the captured graph
    cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);
    graphCreated = true;
  }
  // replay all NKERNEL launches with a single graph launch
  cudaGraphLaunch(instance, stream);
  cudaStreamSynchronize(stream);
}

vLLM-torch.compile overview

  • The full model graph is captured via TorchDynamo.
  • TorchInductor compiles each graph into a compiled artifact (vLLM's custom Inductor passes may be used to further optimize the graph).
  • The compiled artifact is saved to vLLM's compile cache for future use (see the sketch below).
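The same mechanism can be seen with plain torch.compile; this standalone example is only illustrative, since vLLM drives Dynamo/Inductor through its own compilation config and cache rather than calling torch.compile like this.

# torch.compile flow (sketch)
import torch

def mlp(x, w1, w2):
    # TorchDynamo traces this into an FX graph; TorchInductor compiles it
    return torch.relu(x @ w1) @ w2

compiled_mlp = torch.compile(mlp, backend="inductor")

x = torch.randn(8, 64)
w1 = torch.randn(64, 256)
w2 = torch.randn(256, 64)
out = compiled_mlp(x, w1, w2)   # first call compiles; later calls reuse the artifact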

Paged Attention

  • scalar_t: the query/key/value element type (e.g. FP32, FP16)
  • HEAD_SIZE: # elements in each head
  • BLOCK_SIZE: # tokens in each KV-cache block
  • NUM_THREADS: # threads in each GPU thread block
  • VEC_SIZE: # elements fetched together as one vector for query and key data
  • V_VEC_SIZE: # elements fetched together as one vector for value data
  • THREAD_GROUP_SIZE: # threads that together fetch one query token (or one key token) at a time
  • x: # total elements processed by one thread group at a time
  • PARTITION_SIZE: # tensor-parallel GPUs
  • WARP_SIZE: 32 threads per warp (a warp processes the calculation between one query token and the key tokens of one entire block at a time)
# query pointer
const scalar_t* q_ptr = q + seq_idx * q_stride + head_idx * HEAD_SIZE;

# query vector
__shared__ Q_vec q_vecs[THREAD_GROUP_SIZE][NUM_VECS_PER_THREAD];

# key pointer
const scalar_t* k_ptr = k_cache + physical_block_number * kv_block_stride
                    + kv_head_idx * kv_head_stride
                    + physical_block_offset * x;

# key vector
K_vec k_vecs[NUM_VECS_PER_THREAD];

# output
scalar_t* out_ptr = out + seq_idx * num_heads * max_num_partitions * HEAD_SIZE
                + head_idx * max_num_partitions * HEAD_SIZE
                + partition_idx * HEAD_SIZE;
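How these compile-time constants relate to each other, as I read the kernel: the 16-byte fetch granularity and the derived formulas below are assumptions taken from the kernel source, with made-up example values.

# constant sizing (sketch)
HEAD_SIZE = 128
THREAD_GROUP_SIZE = 2
SIZEOF_SCALAR = 2            # FP16

# each thread group fetches 16 bytes of query/key data at a time
VEC_SIZE = max(16 // (THREAD_GROUP_SIZE * SIZEOF_SCALAR), 1)        # -> 4
# vectors one thread handles so the group covers a whole head
NUM_VECS_PER_THREAD = HEAD_SIZE // (THREAD_GROUP_SIZE * VEC_SIZE)   # -> 16
# x: elements packed along the innermost dim of the key cache layout
x = 16 // SIZEOF_SCALAR                                             # -> 8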

Plugins

  • Plugins: user-registered code that vLLM executes, which lets users add custom features without modifying the vLLM codebase (a registration sketch follows this list).
    • Plugin types: model, platform, pre/post-processing, logger
  • IO Processor Plugins: pass a custom input to vLLM that is converted into one or more model prompts and fed to the model encode method.
  • LoRA Resolver Plugins: dynamically load LoRA adapters at runtime.
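A minimal sketch of how a general plugin might be registered through Python entry points; the package and function names are made up, while the entry-point group and ModelRegistry call follow the plugin docs as I understand them.

# plugin registration (sketch)
# setup.py of a hypothetical out-of-tree package
from setuptools import setup

setup(
    name="vllm_add_dummy_model",
    version="0.1",
    packages=["vllm_add_dummy_model"],
    entry_points={
        # vLLM discovers and calls every function in this group at startup
        "vllm.general_plugins": [
            "register_dummy_model = vllm_add_dummy_model:register",
        ]
    },
)

# vllm_add_dummy_model/__init__.py
def register():
    from vllm import ModelRegistry
    # add a custom model class without modifying the vLLM codebase
    ModelRegistry.register_model(
        "MyDummyModelForCausalLM",
        "vllm_add_dummy_model.my_model:MyDummyModelForCausalLM",
    )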

Prefix caching

  • Cache the KV-cache blocks of processed requests and reuse them when a new request comes in with the same prefix as a previous request (toy sketch below).
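A toy sketch of the idea: each KV block is keyed by a hash that covers the whole prefix up to and including that block, so two requests sharing a prefix map to the same leading block keys. This is a conceptual illustration, not vLLM's actual hashing code.

# prefix-caching lookup (sketch)
from hashlib import sha256

BLOCK_SIZE = 4   # tokens per KV-cache block (toy value)

def block_hashes(token_ids):
    # hash of each full block, chained with the previous hash so it
    # identifies the entire prefix ending at that block
    hashes, prev = [], ""
    for i in range(0, len(token_ids) - len(token_ids) % BLOCK_SIZE, BLOCK_SIZE):
        prev = sha256((prev + str(token_ids[i:i + BLOCK_SIZE])).encode()).hexdigest()
        hashes.append(prev)
    return hashes

cache = {}                                   # block hash -> physical block id
for h in block_hashes([1, 2, 3, 4, 5, 6, 7, 8, 9]):
    cache.setdefault(h, len(cache))          # first request allocates blocks

reused = [h for h in block_hashes([1, 2, 3, 4, 5, 6, 7, 8, 42]) if h in cache]
print(len(reused), "blocks reused")          # the 2 shared full blocks hit the cache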

Hybrid KV Cache Manager

  • Sliding window attention (sw) + full attention (full): gpt-oss, Gemma 2/3, Ministral, Cohere, etc.
  • Mamba + full: Bamba, Jamba, Minimax, etc.
  • Local chunked attention + full: Llama 4 (a block-sizing sketch follows this list)
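A toy illustration of why the manager is "hybrid": for the same request, a full-attention layer needs KV blocks for every token, while a sliding-window layer only needs the most recent W tokens, so the two layer groups have different block requirements. The numbers below are made up.

# per-layer-type block needs (sketch)
BLOCK_SIZE = 16          # tokens per KV block (toy value)
SLIDING_WINDOW = 128     # window size of the sliding-window layers (toy value)

def blocks_needed(num_tokens):
    return -(-num_tokens // BLOCK_SIZE)      # ceil division

seq_len = 1000
full_attn_blocks = blocks_needed(seq_len)                      # keep everything -> 63
sw_attn_blocks = blocks_needed(min(seq_len, SLIDING_WINDOW))   # recent window only -> 8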

P2P NCCL Connector

  • Transfers the KV cache from the GPU that runs prefill to the GPU that runs decode (a minimal sketch follows this list).
  • ZeroMQ (ZMQ): asynchronous messaging library
  • NCCL: NVIDIA Collective Communications Library
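A conceptual sketch of the split of responsibilities: ZMQ carries small control messages, while the KV tensors themselves go GPU-to-GPU over NCCL (shown here via torch.distributed). Addresses, ranks, and tensor shapes are placeholders; the real connector has its own handshake protocol.

# prefill -> decode KV transfer (sketch, prefill side)
import torch
import torch.distributed as dist
import zmq

# control plane: tell the decode side what is about to be sent
ctx = zmq.Context()
sock = ctx.socket(zmq.PUSH)
sock.connect("tcp://decode-host:5555")         # placeholder address
sock.send_pyobj({"request_id": "req-1", "num_blocks": 8, "dtype": "float16"})

# data plane: the KV blocks travel over NCCL
dist.init_process_group("nccl", init_method="tcp://decode-host:29500",
                        rank=0, world_size=2)  # placeholder rendezvous
kv_blocks = torch.randn(8, 2, 16, 8, 128, dtype=torch.float16, device="cuda")
dist.send(kv_blocks, dst=1)                    # decode side calls dist.recv(...)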

Dual Batch Overlap

  • Targets the Data Parallel + Expert Parallel case.
  • Two CPU worker threads ping-pong between computation and communication while processing the MoE layers (toy sketch below).
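A toy sketch of the ping-pong pattern: the batch is split into two micro-batches, and while one thread runs compute for its micro-batch, the other issues the all-to-all communication for the other micro-batch, then they swap. Plain Python threads and sleeps stand in for GPU compute and dispatch/combine communication.

# dual batch overlap (sketch)
import threading, time

def compute(mb):       # stand-in for the expert GEMMs
    time.sleep(0.01); print("compute", mb)

def communicate(mb):   # stand-in for the all-to-all dispatch/combine
    time.sleep(0.01); print("all2all", mb)

def worker(my_mb, my_turn, other_turn):
    for _ in range(3):                 # a few MoE layers
        my_turn.wait(); my_turn.clear()
        communicate(my_mb)             # my communication runs while the
        other_turn.set()               # other thread computes its micro-batch
        compute(my_mb)

turn_a, turn_b = threading.Event(), threading.Event()
turn_a.set()
a = threading.Thread(target=worker, args=("ubatch-0", turn_a, turn_b))
b = threading.Thread(target=worker, args=("ubatch-1", turn_b, turn_a))
a.start(); b.start(); a.join(); b.join()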

Fused MoE Kernel

  • Several all2all communication backends exist to implement expert parallelism (EP) for the FusedMoE layer (e.g. triton, flashinfer); a conceptual dispatch/compute/combine sketch follows.
  • Modular Kernels are also supported.
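A conceptual sketch of why an all2all step is needed under expert parallelism: tokens are routed to experts that may live on other ranks, so the layer is structured as dispatch -> expert compute -> combine. This single-process toy only mimics that structure; the grouping/scatter steps are what the all2all backends perform across GPUs.

# MoE dispatch / compute / combine (sketch)
import torch

num_experts, hidden = 4, 8
tokens = torch.randn(10, hidden)
router = torch.nn.Linear(hidden, num_experts)
experts = torch.nn.ModuleList(torch.nn.Linear(hidden, hidden) for _ in range(num_experts))

# 1) route: top-1 expert per token (top-k with weights in the real kernel)
expert_ids = router(tokens).argmax(dim=-1)

# 2) "dispatch": group tokens by expert; under EP this is an all-to-all,
#    since each rank owns only a subset of the experts
order = torch.argsort(expert_ids)
grouped = tokens[order]

# 3) expert compute: each expert processes its contiguous slice of tokens
out_grouped = torch.empty_like(grouped)
counts = torch.bincount(expert_ids, minlength=num_experts).tolist()
start = 0
for e, c in enumerate(counts):
    out_grouped[start:start + c] = experts[e](grouped[start:start + c])
    start += c

# 4) "combine": scatter results back to the original token order
out = torch.empty_like(out_grouped)
out[order] = out_grouped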