(Model summary) Large Language Models (paper)

Key points

DeepSeek Sparse Attention (DSA)

  • lightning indexer + fine-grained top‑k token selection
  • $O(Lk)$ attention cost over contexts of up to 128K tokens (see the sketch below)
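
A minimal sketch of the selection mechanism for a single query position: a lightweight indexer scores all cached tokens, the top‑k are selected, and standard attention runs only over that subset. The shapes, the single indexer head, and the ReLU scoring form are illustrative assumptions, not the paper's exact lightning indexer.

```python
import torch
import torch.nn.functional as F

def sparse_attention_step(q, K, V, q_idx, K_idx, k=2048):
    """Attend from one query position to the top-k scored past tokens.

    q:     (h, d)    main attention queries for the current position
    K, V:  (L, h, d) cached keys / values for all L past positions
    q_idx: (d_i,)    indexer query (d_i much smaller than d)
    K_idx: (L, d_i)  indexer keys for all past positions
    """
    L = K.shape[0]
    # 1) Lightning-indexer scores: one cheap dot product per past token
    #    (single indexer head with ReLU, an assumed simplification).
    scores = F.relu(K_idx @ q_idx)                    # (L,)
    # 2) Fine-grained top-k token selection.
    top_idx = torch.topk(scores, min(k, L)).indices   # (k,)
    # 3) Softmax attention restricted to the selected tokens -> O(L*k) overall.
    K_sel, V_sel = K[top_idx], V[top_idx]             # (k, h, d)
    logits = torch.einsum('hd,khd->hk', q, K_sel) / q.shape[-1] ** 0.5
    w = logits.softmax(dim=-1)                        # (h, k)
    return torch.einsum('hk,khd->hd', w, V_sel)       # (h, d)
```

Because the indexer head dimension is much smaller than the main attention's, scoring all L cached tokens stays cheap, and the quadratic softmax cost is replaced by the stated $O(Lk)$.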

Continued pre‑training from DeepSeek‑V3.1‑Terminus

  • align the indexer to dense attention using a KL‑divergence objective (see the sketch below)
  • then attend only to the top‑k key‑value tokens, optimizing both the model and the indexer on 1T tokens
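
A minimal sketch of what the KL alignment objective between the frozen dense attention and the indexer could look like for one query position; the mean aggregation over heads and the direction of the KL are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def indexer_kl_loss(dense_attn, indexer_scores):
    """KL(dense attention || indexer distribution) for one query position.

    dense_attn:     (H, L) attention weights of the frozen dense model over L past tokens
    indexer_scores: (L,)   raw indexer scores for the same L tokens
    """
    target = dense_attn.mean(dim=0)                    # aggregate the H heads (assumed mean)
    log_pred = F.log_softmax(indexer_scores, dim=-1)   # indexer distribution in log space
    # kl_div expects log-probs as input and probs as target: sum target * (log target - log pred)
    return F.kl_div(log_pred, target, reduction='sum')
```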

Post‑training

  • specialist distillation and large‑scale mixed RL (GRPO; group‑relative advantages sketched below)
    • across reasoning, agent, and alignment data, unifying domains (math, programming, logical reasoning, general agent tasks, agentic coding, and search)
    • with both thinking and non‑thinking modes
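
For reference, GRPO's core step is the group‑relative advantage: several responses are sampled per prompt and each response's reward is normalized against its own group. A minimal sketch; clipping, the KL penalty, and the techniques listed below are omitted.

```python
import torch

def grpo_advantages(rewards, eps=1e-6):
    """rewards: (G,) scalar rewards for G responses sampled from the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: one prompt, a group of 4 sampled responses scored by a verifier / reward model.
adv = grpo_advantages(torch.tensor([1.0, 0.0, 1.0, 0.5]))  # each response's tokens share its advantage
```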

Large‑scale GRPO with several key techniques (two are sketched below)

  • An unbiased KL estimator that corrects the bias of the K3 estimator
  • Off‑Policy Sequence Masking based on KL divergence for negative‑advantage samples
  • Preserving the expert routing paths used during sampling in the inference framework and enforcing the same paths during training
  • Keep Sampling Mask to align the truncated action space of top‑p/top‑k sampling between rollout and training
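
Below is a minimal sketch of two of these ideas under assumed interfaces: re‑applying the rollout‑time top‑p/top‑k mask when computing training log‑probs, and dropping negative‑advantage sequences that have drifted too far off‑policy as measured by a KL estimate. Function names, the threshold, and the KL estimator used here are illustrative, not the paper's exact formulation.

```python
import torch

def masked_logprob(logits, sampled_token, keep_mask):
    """(a) Keep Sampling Mask: log-prob of the sampled token over the truncated action space.

    logits:        (V,)  current-policy logits for one generation step
    sampled_token: int   token id chosen during rollout
    keep_mask:     (V,)  bool, True for tokens kept by top-p/top-k at rollout time
    """
    truncated = logits.masked_fill(~keep_mask, float('-inf'))   # same action space as rollout
    return torch.log_softmax(truncated, dim=-1)[sampled_token]

def keep_sequence(logp_new, logp_old, advantage, kl_threshold=0.5):
    """(b) Off-policy sequence masking: drop a negative-advantage sequence that is too off-policy.

    logp_new, logp_old: (T,) per-token log-probs under the current / rollout policy
    kl_threshold:       illustrative value, not the paper's setting
    """
    kl_est = (logp_old - logp_new).mean().item()   # simple per-token KL estimate for the sequence
    return not (advantage < 0 and kl_est > kl_threshold)
```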

Large‑scale agentic task synthesis

  • a synthesis pipeline generates 1,800+ environments and 85k prompts
  • across search, code agents, code interpretation (Jupyter), and synthetic general tasks