(Paper summary) Layer-Condensed KV Cache for Efficient Inference of Large Language Models
Key ideas
- A few layers keep standard attention; all other layers use only the keys and values of the top layer (a decoding sketch follows this list).
- Memory is saved because KVs are cached for only those few layers rather than for every layer.
- The remaining layers omit the key-value computation entirely and drop their key-value projection parameters.
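A minimal single-head decoding sketch of this idea, assuming the $w$ standard-attention ("warmup") layers sit at the bottom of the stack (their actual placement in the paper may differ); all names here (`q_proj`, `kv_proj_top`, `top_cache`, ...) are illustrative, not the authors' code:

```python
import math
import torch
import torch.nn.functional as F

L, w, d = 8, 2, 64                   # total layers, warmup layers, model dim (illustrative)

q_proj  = [torch.nn.Linear(d, d) for _ in range(L)]       # every layer still has queries
kv_proj = [torch.nn.Linear(d, 2 * d) for _ in range(w)]   # only warmup layers keep KV weights
kv_proj_top = torch.nn.Linear(d, 2 * d)                   # ...plus one KV projection for the top layer
o_proj  = [torch.nn.Linear(d, d) for _ in range(L)]

top_cache = []                           # (k, v) of the top layer, one entry per past token
warmup_caches = [[] for _ in range(w)]   # per-layer caches for the w standard-attention layers

def attend(q, kvs):
    """Single-head attention of one query vector over a list of cached (k, v) pairs."""
    k = torch.stack([kv[0] for kv in kvs])              # [t, d]
    v = torch.stack([kv[1] for kv in kvs])              # [t, d]
    return F.softmax(k @ q / math.sqrt(d), dim=0) @ v   # [d]

def decode_step(x):
    """Pass one new token embedding x ([d]) through the stack."""
    h = x
    for i in range(L):
        q = q_proj[i](h)
        if i < w:                                  # standard attention: cache this layer's own KV
            k, v = kv_proj[i](h).chunk(2)
            warmup_caches[i].append((k, v))
            h = h + o_proj[i](attend(q, warmup_caches[i]))
        elif top_cache:                            # condensed layer: reuse only the top layer's cache
            h = h + o_proj[i](attend(q, top_cache))
    k_top, v_top = kv_proj_top(h).chunk(2)         # after the full pass, store this token's
    top_cache.append((k_top, v_top))               # top-layer KV for future tokens
    return h

for _ in range(4):                                 # decode a few dummy tokens
    decode_step(torch.randn(d))
print(len(top_cache), [len(c) for c in warmup_caches])   # 4 top-layer KVs, 4 per warmup layer
```

In this sketch the cache grows by one (k, v) pair per token for the top layer plus each of the $w$ warmup layers, instead of one pair per token per layer.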
- Training
  - Perform $n$ iterations of bottom-up transformer computation on all tokens in parallel.
  - In each iteration, pair the queries of all layers with the top-layer KVs from the previous iteration.
  - Compute the cross-entropy loss only after the last iteration (a training-loop sketch follows below).
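A rough sketch of that training loop, assuming a hypothetical `model_forward(tokens, top_kv)` that runs one full parallel bottom-up pass and returns both the logits and the freshly computed top-layer KVs; backpropagating only through the final iteration is a simplifying assumption, not necessarily the paper's exact gradient scheme:

```python
import torch
import torch.nn.functional as F

def iterative_train_step(model_forward, tokens, targets, n_iters, optimizer):
    """One training step of the iterative parallel scheme sketched above.

    `model_forward(tokens, top_kv)` is assumed to run a full bottom-up pass over
    all tokens in parallel, pairing every layer's queries with the given
    top-layer KVs, and to return (logits, new_top_kv).
    """
    top_kv = None                        # first iteration: no previous top-layer KVs yet
    for it in range(n_iters):
        last = it == n_iters - 1
        # Assumption: only the final pass is backpropagated; earlier passes just
        # refresh the top-layer KVs that the next pass will read.
        with torch.set_grad_enabled(last):
            logits, top_kv = model_forward(tokens, top_kv)
    # cross-entropy loss is computed only after the last iteration
    loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())  # logits: [batch, seq, vocab]
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```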
Experimental results
- Throughput comparison (where $w$ is the number of layers that keep standard attention)
- Trained from scratch with TinyLlama on a 100B-token subset of the SlimPajama dataset