(Model summary) Lightweight Llama Models (blog)

  • designed the current quantization scheme with PyTorch’s ExecuTorch inference framework and Arm CPU backend in mind

  • Quantization scheme
    • quantize all linear layers in all transformer blocks with a 4-bit groupwise scheme (group size of 32) for weights and 8-bit per-token dynamic quantization for activations (a fake-quantization sketch follows this sub-list)
    • The classification layer is quantized with 8-bit per-channel quantization for weights and 8-bit per-token dynamic quantization for activations
    • 8-bit per-channel quantization for the embedding layer
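A minimal fake-quantization sketch of the scheme above, assuming symmetric quantization. This is illustrative only, not the ExecuTorch/torchao kernels used for the released models; the function names and tensor sizes are assumptions.

```python
# Illustrative sketch: 4-bit groupwise weights (group size 32) and
# 8-bit per-token dynamic activations, both as fake quantization.
import torch


def quantize_weight_4bit_groupwise(w: torch.Tensor, group_size: int = 32) -> torch.Tensor:
    """Symmetric 4-bit groupwise fake-quantization of an [out_features, in_features] weight."""
    out_features, in_features = w.shape
    assert in_features % group_size == 0
    wg = w.reshape(out_features, in_features // group_size, group_size)
    # One scale per group of 32 input channels; signed 4-bit range is [-8, 7].
    scale = wg.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(wg / scale), -8, 7)
    return (q * scale).reshape(out_features, in_features)


def quantize_act_8bit_per_token(x: torch.Tensor) -> torch.Tensor:
    """Symmetric 8-bit per-token dynamic fake-quantization of [batch, seq, hidden] activations."""
    # The scale comes from each token's own max at runtime, hence "dynamic".
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127)
    return q * scale


# Usage: simulate one quantized linear layer from a transformer block.
linear = torch.nn.Linear(1024, 1024, bias=False)
x = torch.randn(2, 16, 1024)                              # [batch, seq, hidden]
w_q = quantize_weight_4bit_groupwise(linear.weight.data)  # 4-bit groupwise weights
x_q = quantize_act_8bit_per_token(x)                      # 8-bit per-token activations
y = torch.nn.functional.linear(x_q, w_q)
```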
  • Quantization-Aware Training with LoRA adaptors
    • utilize BF16 Llama 3.2 model checkpoints obtained after supervised fine-tuning (SFT) and perform an additional full round of SFT with QAT
    • then freeze the backbone of the QAT model and perform another round of SFT with low-rank adaptation (LoRA) adaptors applied to all layers within the transformer block (a freeze-and-adapt sketch follows this sub-list)
    • Finally, fine-tune the resulting model (both backbone and LoRA adaptors) using direct preference optimization (DPO)
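A minimal sketch of the freeze-and-adapt step: the QAT backbone's parameters are frozen and only low-rank adaptor weights are trained. This is not the actual training recipe for the released checkpoints; the wrapper class, rank, and alpha values are illustrative assumptions.

```python
# Illustrative LoRA wrapper around a frozen linear layer.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank (A, B) adaptor."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the QAT backbone weights
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)    # adaptor starts as a no-op
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))


# Usage: wrap a projection inside a transformer block and train only the adaptors.
proj = nn.Linear(1024, 1024, bias=False)
layer = LoRALinear(proj)
trainable = [p for p in layer.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)  # the LoRA-SFT round updates only these
```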
  • SpinQuant
    • post-training quantization
    • utilize WikiText, a small calibration dataset, to learn the rotation matrices
    • the rotation matrices smooth out outliers and enable more effective quantization (a rotation sketch follows this sub-list)
    • After this, quantization best practices such as range setting and generative post-training quantization (GPTQ) are applied.
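A sketch of the rotation idea: an orthogonal matrix R applied to both activations and weights leaves the full-precision output unchanged while spreading outliers across channels, so the rotated tensors quantize better. SpinQuant learns R on the WikiText calibration set; the snippet below just uses a random orthogonal matrix to demonstrate the algebra, and the sizes are assumptions.

```python
# Illustrative rotation-invariance check: (x @ R) @ (W @ R).T == x @ W.T for orthogonal R.
import torch

hidden = 512
w = torch.randn(hidden, hidden, dtype=torch.float64)  # a linear layer's weight, [out, in]
x = torch.randn(8, hidden, dtype=torch.float64)       # a few calibration activations

# Random orthogonal rotation via QR decomposition (stand-in for the learned matrix).
r, _ = torch.linalg.qr(torch.randn(hidden, hidden, dtype=torch.float64))

y_ref = x @ w.T                      # original output
y_rot = (x @ r) @ (w @ r).T          # rotated activations and rotated weights
assert torch.allclose(y_ref, y_rot)  # the rotation is lossless in full precision

# Quantizing the rotated tensors (e.g., with the 4-bit/8-bit schemes sketched earlier)
# then loses less accuracy, because extreme values have been smoothed out.
```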