Data | Jaemin’s Arxiv

Table of contents

(Paper summary) APIGen; Automated PIpeline for Generating Verifiable and Diverse Function-Calling Datasets
(Paper summary) Data curation via joint example selection further accelerates multimodal learning
(Paper summary) DataComp-LM; In search of the next generation of training sets for language models
(Paper summary) LIMA; Less Is More for Alignment
(Paper summary) MINT-1T; Scaling Open-Source Multimodal Data by 10x; A Multimodal Dataset with One Trillion Tokens
(Paper summary) MTEB; Massive Text Embedding Benchmark
(Paper summary) OBELICS; An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
(Paper summary) Scaling Instructable Agents Across Many Simulated Worlds
(Paper summary) WizardLM; Empowering Large Language Models to Follow Complex Instructions
(Paper summary) WorkBench; a Benchmark Dataset for Agents in a Realistic Workplace Setting
(Data summary) Common Crawl filtered data
(Model summary) Automatic Data Curation for Self-Supervised Learning; A Clustering-Based Approach
(Model summary) Reflection Llama 70B
(Model summary) Yi-Coder