
Better data is still critical, even if bigger data isn't. The linked article emphasizes this.



I'd bet on a 2030 model trained on the same dataset as GPT-4 over GPT-4 trained with perfect-quality data, hands down. If data quality were that critical, practitioners could ignore the Internet and just train on books and scientific papers and only sacrifice <1 order of magnitude of data volume. Granted, that's not a negligible amount of training data to give up, but it places a relatively tight upper bound on the potential gain from improving data quality.
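To make that bound concrete, here is a rough back-of-envelope sketch (mine, not from the article) using a Chinchilla-style scaling law. The constants and token counts are illustrative assumptions, roughly in the range reported by Hoffmann et al. (2022), not measured values:

    # Chinchilla-style loss: L(N, D) = E + A / N**alpha + B / D**beta
    # Constants below are illustrative assumptions, not measurements.
    E, A, B = 1.69, 406.4, 410.7
    alpha, beta = 0.34, 0.28

    def loss(n_params, n_tokens):
        return E + A / n_params ** alpha + B / n_tokens ** beta

    N = 1e12        # hypothetical parameter count
    D_web = 1e13    # assumed web-scale corpus, in tokens
    D_clean = 1e12  # "books and papers only": ~1 order of magnitude fewer tokens

    print(loss(N, D_web))    # loss with the full, noisier corpus
    print(loss(N, D_clean))  # loss with 10x less (but presumably cleaner) data
    # The gap between these two is the most that perfect data quality could buy
    # under this toy model: the upper bound the comment is gesturing at.

With these illustrative numbers the data term shifts by less than a tenth of a nat when the corpus shrinks tenfold, which is the sense in which the bound is "relatively tight."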


It's possible that this effect washes out as data volume increases, but researchers have shown that at smaller dataset sizes, average data quality has a large impact on model output.


So true. There are still plenty of areas where we lack sufficient data to even approach applying this sort of model. How are we going to make similar advances in something like medical informatics, where we not only have less data readily available but it's also much more difficult to acquire more?



