(Data summary) Common Crawl filtered data: datasets built by filtering CommonCrawl - dataset ablation
- FineWeb: 15T tokens, 45TB(hf-download), summer 2013 to March 2024
- RefinedWeb: ~600B tokens, 1.68TB(hf-download)
- C4: 242GB(hf-download)
- Dolma v1.6 (the CommonCrawl part): 3T tokens, 4.5TB(gzip)
- The Pile: 825GB
- SlimPajama: 627B tokens, 895GB(hf-download)
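
For a rough comparison, the entries above that list both a token count and a size can be turned into an on-disk bytes-per-token ratio. This is only a sketch: it assumes 1TB = 10^12 bytes and 1GB = 10^9 bytes, the sizes are the download or gzip sizes as listed (not raw text), and the token counts may come from different tokenizers. The names `DATASETS`, `bytes_per_token`, and `summarize` are made up for this illustration. C4 and The Pile are left out because no token count is listed for them.

```python
# Figures copied from the list above. Sizes are the listed download/gzip
# sizes, so the ratio is on-disk bytes per token, not raw-text bytes per token.
# Assumption: 1TB = 1e12 bytes, 1GB = 1e9 bytes.
DATASETS = {
    "FineWeb": {"tokens": 15e12, "size_bytes": 45e12},
    "RefinedWeb": {"tokens": 600e9, "size_bytes": 1.68e12},
    "Dolma v1.6 (CommonCrawl part)": {"tokens": 3e12, "size_bytes": 4.5e12},
    "SlimPajama": {"tokens": 627e9, "size_bytes": 895e9},
}


def bytes_per_token(tokens: float, size_bytes: float) -> float:
    """Return the listed size divided by the listed token count."""
    if tokens <= 0:
        raise ValueError("token count must be positive")
    return size_bytes / tokens


def summarize(datasets: dict) -> dict:
    """Map each dataset name to its bytes-per-token ratio, rounded to 2 places."""
    return {
        name: round(bytes_per_token(d["tokens"], d["size_bytes"]), 2)
        for name, d in datasets.items()
    }


if __name__ == "__main__":
    for name, ratio in summarize(DATASETS).items():
        print(f"{name}: {ratio} bytes/token")
```

With the listed figures, FineWeb and RefinedWeb come out near 3 bytes per token, while the Dolma CommonCrawl part and SlimPajama come out near 1.5. The gap most likely reflects differences in compression and storage format rather than in the text itself.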