(Paper Summary) GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection (Paper)
Key points
- Gradient Low-Rank Projection (GaLore)
- Reduces optimizer-state memory usage by up to 65.5% when pre-training LLaMA 1B and 7B architectures on the C4 dataset.
- 8-bit GaLore reduces optimizer memory by up to 82.5% and total training memory by 63.3% compared to a BF16 baseline.
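- Where the savings come from, as a back-of-the-envelope estimate (my own arithmetic, assuming the per-layer optimizer state is just the projection matrix plus the two Adam moments of the projected gradient): for a weight W ∈ R^{m×n} with m ≤ n, Adam stores 2mn optimizer entries, while GaLore stores mr + 2rn. For a 4096×11008 LLaMA-style MLP weight at rank r = 1024, that is ≈ 90.2M entries vs ≈ 26.7M, roughly a 70% reduction, in the same ballpark as the reported 65.5%.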
- Method
- Restricts the gradient update to a low-rank subspace.
- Optimizers that keep component-wise gradient statistics (e.g., Adam, Adafactor) then only need to store those statistics at the low-rank size.
- The subspace is updated periodically.
- Algorithm when using Adam (see the sketch below)
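Since the pseudocode isn't reproduced here, below is a minimal PyTorch sketch of the GaLore-Adam update (following Algorithm 2 of the paper). The function name `galore_adam_step` and the hyperparameter names `rank`, `update_proj_gap`, and `scale` are my own illustrative choices, not the official `galore-torch` API.

```python
import torch

def galore_adam_step(W, G, state, lr=1e-3, rank=128, update_proj_gap=200,
                     scale=0.25, betas=(0.9, 0.999), eps=1e-8):
    """One GaLore-Adam step for a single 2-D weight W with gradient G."""
    t = state["step"] = state.get("step", 0) + 1

    # Periodically recompute the projection P from an SVD of the current
    # gradient; between refreshes, updates stay in the same subspace.
    if (t - 1) % update_proj_gap == 0:
        U, _, _ = torch.linalg.svd(G, full_matrices=False)
        state["P"] = U[:, :rank]          # m x r projection matrix

    P = state["P"]
    R = P.T @ G                           # projected gradient, r x n

    # Standard Adam moments, but stored only at the r x n low-rank size.
    m = state.setdefault("m", torch.zeros_like(R))
    v = state.setdefault("v", torch.zeros_like(R))
    m.mul_(betas[0]).add_(R, alpha=1 - betas[0])
    v.mul_(betas[1]).addcmul_(R, R, value=1 - betas[1])
    m_hat = m / (1 - betas[0] ** t)
    v_hat = v / (1 - betas[1] ** t)
    N = m_hat / (v_hat.sqrt() + eps)

    # Project the normalized low-rank update back to full size and apply.
    W -= lr * scale * (P @ N)

# Toy usage: random "gradients" standing in for real backprop.
W = torch.randn(256, 512)
state = {}
for _ in range(5):
    G = torch.randn_like(W)
    galore_adam_step(W, G, state, rank=32, update_proj_gap=2)
```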
- Results