Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free (paper)

  • Adds a query-dependent sparse sigmoid gate to each head's output (see the sketch after this list)
    • $Y' = Y \odot \sigma(X W_{\theta})$
      • $X$: hidden states after pre-normalization
      • $Y$: attention (SDPA) output of the head, before the output projection
    • elementwise: a separate gate value per output dimension
    • headwise: a single scalar gate shared across each head's dimensions
  • Effects
    • Reduces attention sink (excess attention on the first token)
    • Reduces loss spikes during training
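
A minimal PyTorch sketch of the gating idea, assuming the elementwise variant applied to the SDPA output before the output projection; the class name `GatedAttention` and the `gate_proj` parameter (playing the role of $W_{\theta}$) are illustrative, not taken from the paper's code.

```python
# Sketch (not the paper's implementation): elementwise, query-dependent
# sigmoid gate on each head's SDPA output, before the output projection.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)
        # Plays the role of W_theta: gate computed from the same
        # pre-normalized hidden states X that feed the attention block.
        self.gate_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) = hidden states X after pre-normalization
        B, T, D = x.shape

        def split(t: torch.Tensor) -> torch.Tensor:
            # (B, T, D) -> (B, n_heads, T, d_head)
            return t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))

        # Y: per-head SDPA output
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).reshape(B, T, D)

        # Y' = Y ⊙ sigmoid(X W_theta): elementwise, query-dependent sparse gate.
        # A headwise variant would instead produce n_heads scalars per token,
        # each broadcast over its head's d_head dimensions.
        y = y * torch.sigmoid(self.gate_proj(x))
        return self.o_proj(y)

# Usage sketch
attn = GatedAttention(d_model=512, n_heads=8)
out = attn(torch.randn(2, 16, 512))  # -> (2, 16, 512)
```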