
I figured this was pretty obvious given that MLPs are universal function approximators. A giant MLP could achieve the same results as a transformer. The problem is scale: we can't train a big enough MLP. Transformers are a performance optimization, and that's why they're useful.
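To make the scale point concrete, here is a rough back-of-the-envelope sketch (not from the original comment; the sequence length and model width are assumed, illustrative values). A fully connected MLP that sees the whole flattened sequence needs weight matrices whose size grows with the square of (sequence length × width), while a transformer block's weights are reused at every position and don't grow with sequence length at all.

  # Rough parameter-count sketch with assumed, illustrative sizes.
  seq_len = 2048      # tokens in the context window (assumption)
  d_model = 1024      # embedding width (assumption)
  d_ff = 4 * d_model  # feed-forward width, the common 4x convention

  # MLP over the flattened sequence: two dense layers whose input/output
  # width is the entire flattened sequence (biases ignored).
  mlp_in = seq_len * d_model
  mlp_params = 2 * mlp_in * mlp_in

  # One transformer block: Q, K, V, output projections plus a 2-layer FFN,
  # shared across all positions, so independent of seq_len.
  attn_params = 4 * d_model * d_model
  ffn_params = 2 * d_model * d_ff
  block_params = attn_params + ffn_params

  print(f"flattened-MLP layer : {mlp_params / 1e9:.0f}B parameters")
  print(f"transformer block   : {block_params / 1e6:.1f}M parameters")

With these assumed sizes the flattened MLP layer comes out to several trillion parameters versus roughly 13M for the transformer block, which is the practical gap the comment is pointing at.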

