Table of contents
- (Paper summary) APIGen; Automated PIpeline for Generating Verifiable and Diverse Function-Calling Datasets
- (Paper summary) Data curation via joint example selection further accelerates multimodal learning
- (Paper summary) DataComp-LM; In search of the next generation of training sets for language models
- (Paper summary) MINT-1T; Scaling Open-Source Multimodal Data by 10x; A Multimodal Dataset with One Trillion Tokens
- (Paper summary) MTEB; Massive Text Embedding Benchmark
- (Paper summary) OBELICS; An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
- (Paper summary) Scaling Instructable Agents Across Many Simulated Worlds
- (Paper summary) WorkBench; a Benchmark Dataset for Agents in a Realistic Workplace Setting
- (Data summary) Common Crawl filtered data
- (Model summary) Automatic Data Curation for Self-Supervised Learning; A Clustering-Based Approach
- (Model summary) Reflection Llama 70B
- (Model summary) Yi-Coder