Pretraining deduplication of data to prevent data leakage?

#55
by SS12444 - opened

Hi authors, I'm wondering whether you do any kind of filtering during the pretraining stage, since OBELICS and the other pretraining datasets are large and may contain the same images as the benchmarks. Will there be a release of the compiled pretraining dataset, like The Cauldron? Thank you

HuggingFaceM4 org

Yes, the pretraining datasets are already released:
OBELICS: https://huggingface.co/datasets/HuggingFaceM4/OBELICS
LAION COCO: https://huggingface.co/datasets/laion/laion-coco
Conceptual Captions, WIT, etc. are available from their official websites
The proportions of the mixture are given in the appendix of the paper.
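
For anyone who wants to inspect the released data, here is a minimal sketch of streaming OBELICS with the `datasets` library. Streaming is assumed here only to avoid downloading the full corpus; the record fields are printed rather than assumed.

```python
# A minimal sketch: stream OBELICS instead of materializing it on disk,
# since the corpus is large. Record fields are inspected, not assumed.
from datasets import load_dataset

obelics = load_dataset("HuggingFaceM4/OBELICS", split="train", streaming=True)

for document in obelics.take(3):
    # Each record is a dict; check its fields before processing further.
    print(document.keys())
```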
We have only deduplicated the benchmark images from the SFT dataset, not from the pretraining data. I think the risk is very low: for example, MMMU and MathVista were released last year, whereas OBELICS was built from Common Crawl dumps that predate them.
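
The thread does not describe how the SFT-side deduplication was implemented. Purely as an illustration, one common approach to catching near-duplicate images is perceptual hashing, sketched below with the `imagehash` library; the function names and the Hamming-distance threshold are hypothetical, not the authors' method.

```python
# Illustrative sketch only: perceptual hashing to flag training images that
# are near-duplicates of benchmark images. This is NOT necessarily how the
# authors deduplicated; names and the threshold here are hypothetical.
from PIL import Image
import imagehash

def build_benchmark_hashes(benchmark_image_paths):
    """Hash every benchmark image once, up front."""
    return [imagehash.phash(Image.open(p)) for p in benchmark_image_paths]

def is_possible_leak(candidate_path, benchmark_hashes, max_distance=4):
    """Flag a training image whose 64-bit pHash is within `max_distance`
    bits (Hamming distance) of any benchmark image's hash."""
    h = imagehash.phash(Image.open(candidate_path))
    return any(h - b <= max_distance for b in benchmark_hashes)
```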
