(논문 요약) Distributed Inference and Fine-tuning of Large Language Models Over The Internet

(논문 요약) Distributed Inference and Fine-tuning of Large Language Models Over The Internet (Paper)

핵심 내용

distributed inference
- client: Runs inference or fine-tuning jobs through the swarm of servers. Holds input and output embeddings and delegates running transformer blocks
- server: GPU-enabled node holding a set of consecutive transformer blocks
algorithm
- find_best_chain 은 D* algorithm 으로 실시간 throughput (compute time + network latency) 을 고려하여 계산