Table of contents
- (Paper Summary) Better & Faster Large Language Models via Multi-token Prediction
- (Paper Summary) BitNet
- (Paper Summary) CUT YOUR LOSSES IN LARGE-VOCABULARY LANGUAGE MODELS
- (Paper Summary) Extreme Compression of Large Language Models via Additive Quantization
- (Paper Summary) FlashAttention-3; Fast and Accurate Attention with Asynchrony and Low-precision
- (Paper Summary) From GaLore to WeLore; How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients
- (Paper Summary) GaLore; Memory-Efficient LLM Training by Gradient Low-Rank Projection
- (Paper Summary) LLM Pruning and Distillation in Practice; The Minitron Approach
- (Paper Summary) LLM.int8(); 8-bit Matrix Multiplication for Transformers at Scale
- (Paper Summary) Layer-Condensed KV Cache for Efficient Inference of Large Language Models
- (Paper Summary) LazyLLM; DYNAMIC TOKEN PRUNING FOR EFFICIENT LONG CONTEXT LLM INFERENCE
- (Paper Summary) MobileLLM; Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
- (Paper Summary) QuIP#; Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks
- (Paper Summary) RecurrentGemma; Moving Past Transformers for Efficient Open Language Models
- (Paper Summary) TOKEN MERGING; YOUR VIT BUT FASTER
- (Paper Summary) The Unreasonable Ineffectiveness of the Deeper Layers
- (Model Summary) LayerSkip; Enabling Early Exit Inference and Self-Speculative Decoding
- (Model Summary) Lightweight Llama Models
- (Model Summary) Mixture-of-Transformers; A Sparse and Scalable Architecture for Multi-Modal Foundation Models