Table of contents
- (Paper Summary) Better & Faster Large Language Models via Multi-token Prediction
- (Paper Summary) BitNet
- (Paper Summary) CUT YOUR LOSSES IN LARGE-VOCABULARY LANGUAGE MODELS
- (Paper Summary) Extreme Compression of Large Language Models via Additive Quantization
- (Paper Summary) FlashAttention-3; Fast and Accurate Attention with Asynchrony and Low-precision
- (Paper Summary) From GaLore to WeLore; How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients
- (Paper Summary) GaLore; Memory-Efficient LLM Training by Gradient Low-Rank Projection
- (Paper Summary) LLM Pruning and Distillation in Practice; The Minitron Approach
- (Paper Summary) LLM.int8(); 8-bit Matrix Multiplication for Transformers at Scale
- (Paper Summary) Layer-Condensed KV Cache for Efficient Inference of Large Language Models
- (Paper Summary) LazyLLM; DYNAMIC TOKEN PRUNING FOR EFFICIENT LONG CONTEXT LLM INFERENCE
- (Paper Summary) MobileLLM; Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
- (Paper Summary) QuIP#; Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks
- (Paper Summary) RecurrentGemma; Moving Past Transformers for Efficient Open Language Models
- (Paper Summary) TOKEN MERGING; YOUR VIT BUT FASTER
- (Paper Summary) The Unreasonable Ineffectiveness of the Deeper Layers
- (Model Summary) LayerSkip; Enabling Early Exit Inference and Self-Speculative Decoding
- (Model Summary) Lightweight Llama Models
- (Model Summary) Mixture-of-Transformers; A Sparse and Scalable Architecture for Multi-Modal Foundation Models