(Paper summary) OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents (Paper)
OBELICS dataset
- open web-scale filtered dataset
- image-text documents (141 million web pages extracted from Common Crawl, 353 million associated images, and 115 billion text tokens)
- filtered in multiple steps
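
The multi-step filtering mentioned above can be pictured as a chain of keep/drop checks applied one after another to each document. The sketch below is only an illustration of that idea: the `Document` type, the three filters (`has_image`, `has_enough_text`, the URL-based dedup) and the `min_words` threshold are all hypothetical and are not taken from the OBELICS paper.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Document:
    """Hypothetical interleaved image-text document (not the OBELICS schema)."""
    url: str
    texts: List[str]
    image_urls: List[str]


# A filter step returns True to keep a document, False to drop it.
FilterStep = Callable[[Document], bool]


def has_image(doc: Document) -> bool:
    # Hypothetical step: keep only documents with at least one image.
    return len(doc.image_urls) > 0


def has_enough_text(doc: Document, min_words: int = 5) -> bool:
    # Hypothetical step: keep only documents with at least `min_words` words.
    return sum(len(t.split()) for t in doc.texts) >= min_words


def make_url_dedup_step() -> FilterStep:
    # Hypothetical step: drop documents whose URL was already seen.
    seen = set()

    def keep(doc: Document) -> bool:
        if doc.url in seen:
            return False
        seen.add(doc.url)
        return True

    return keep


def run_pipeline(docs: Iterable[Document], steps: List[FilterStep]) -> List[Document]:
    # Apply each step in order; later steps only see survivors of earlier ones.
    kept = list(docs)
    for step in steps:
        kept = [d for d in kept if step(d)]
    return kept


docs = [
    Document("a", ["one two three four five six"], ["img1.png"]),
    Document("b", ["short"], ["img2.png"]),                    # too little text
    Document("c", ["one two three four five six"], []),        # no image
    Document("a", ["one two three four five six"], ["img1.png"]),  # duplicate URL
]
result = run_pipeline(docs, [has_image, has_enough_text, make_url_dedup_step()])
print([d.url for d in result])
```

Running the steps in sequence keeps each check simple and makes it easy to count how many documents each step removes.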