Abstract
      While unsupervised learning has unlocked an enormous amount of data for models to train on, it has also abandoned earlier notions of data quality, such as cleanliness, leaving it unclear what data should be included in pretraining. This talk will go over sensitivity results in pretraining and the challenges of building new tools for assessing data quality, from both a statistical and a systems perspective.
        
          Date
            Apr 26, 2024, 3:45 PM to 4:15 PM
        
          Location
          Polytechnique Montreal
            2500 Chem. de Polytechnique, Montréal, QC H3T 1J4
    
      Josh McGrath
      Founding Member of Technical Staff - DatologyAI
      Josh is a member of technical staff at DatologyAI and an incoming PhD student at the University of Waterloo. Previously, Josh did data curation work at inductiv (acquired by Apple), Apple, and Snorkel AI. He is interested in ML for data curation and in systems for ML.