Data curation at the scale of pretraining

Abstract

While unsupervised learning has unlocked an enormous amount of data for models to train on, it also abandoned earlier notions of data quality, such as cleanliness, leaving it unclear which data should be included in pretraining. This talk goes over sensitivity results in pretraining and the challenges of building new tools for assessing data quality, from both a statistical and a systems perspective.

Date
Apr 26, 2024 3:45 PM — 4:15 PM
Location
Polytechnique Montreal
2500 Chem. de Polytechnique, Montréal, QC H3T 1J4
Josh McGrath
Founding Member of Technical Staff - DatologyAI

Josh is a member of technical staff at DatologyAI and an incoming PhD student at the University of Waterloo. He previously did data curation work at inductiv (acquired by Apple), Apple, and Snorkel AI. He is interested in ML for data curation and systems for ML.