
It's becoming ever more certain that the transformer architecture is one of the largest contributions to AI (not merely machine learning, but AI), often beating LSTMs despite LSTMs being, at least in theory, expressive enough to be Turing complete. Its main ideas are three: shorter paths between positions help gradient flow, the training setup, and the final key aspect, unhelpfully named self-attention. Self-attention is better thought of as a form of similarity-gated key-value soft memory; learning to operate on it is what lets transformers learn non-trivial programs with contextual weight look-ups.
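To make the soft-memory view concrete, here is a minimal toy sketch of single-head scaled dot-product self-attention (my own NumPy illustration, not OpenAI's code; the projection matrices and sizes are made up for the example):

  import numpy as np

  def softmax(x, axis=-1):
      x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
      e = np.exp(x)
      return e / e.sum(axis=axis, keepdims=True)

  def self_attention(X, Wq, Wk, Wv):
      """X: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_head) learned projections."""
      Q, K, V = X @ Wq, X @ Wk, X @ Wv          # project tokens to queries, keys, values
      scores = Q @ K.T / np.sqrt(K.shape[-1])   # pairwise query-key similarity
      weights = softmax(scores, axis=-1)        # soft, content-based addressing (the "gate")
      return weights @ V                        # gated read-out from the value memory

  # Toy usage: 5 tokens, model width 16, head width 8
  rng = np.random.default_rng(0)
  X = rng.normal(size=(5, 16))
  Wq, Wk, Wv = (rng.normal(size=(16, 8)) for _ in range(3))
  out = self_attention(X, Wq, Wk, Wv)           # (5, 8): one context-mixed vector per token

Each query scores every key by similarity, the softmax turns those scores into a gate, and the output is a weighted read from the value memory. That contextual look-up is the part LSTMs have to squeeze through a fixed-size recurrent state.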

I also notice the reported number of tries, which suggests some level of curation. While this level of generation is undoubtedly impressive and a sign of non-trivial understanding, the ability to project along arbitrary dimensions of similarity at a fine-grained level and to learn from text instructions is more useful than text generation. Although the unicorn story was a really fun read, better than many humans could manage, I doubt it could have gone on for much longer. It maintains a theme, but not coherently or fluently (see especially the Kennedy nanotech and recycling examples; comparing the disfluency there with the excellence of the Civil War report suggests at least some over-fitting). These relatively minor caveats aside, this is unambiguously an outstanding result.

Winograd Schemas are the single metric to track if you are interested in how language understanding is truly improving. OpenAI reports 71% but wrongly states the previous record as 63%; the current record is 65% (https://gluebenchmark.com/leaderboard), though the two are not fully comparable. Will OpenAI be submitting? Note that you can get to about 60% using one to two orders of magnitude less data and compute.

It concerns me that the results here are so dependent on such large amounts of data and computation. However, based on several papers I've read, I do not believe this to be inherent even to transformers. I plan to run some experiments on this when I free up some bandwidth.

If everyone is pulled in by the glamour of working for a well-funded, prestigious operation, then it should be no surprise that they do not consider paths which operate on several orders of magnitude less data and computational resources.

We should all consider forming a group of researchers who swear to an austere computational life: a single GPU, no more than 4-8x the average amount of RAM, and CPUs that do not exceed 90 watts. The Bicameral Order would be a good name for such a group.




Yeah, there are definitely still places the samples fall short! Keep in mind we're still using very naive sampling techniques.
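By "naive" I mean roughly: draw each next token straight from the (temperature-scaled) softmax over the logits. An illustrative toy sketch, not our actual code, with an optional top-k filter shown as one common, slightly less naive alternative:

  import numpy as np

  def sample_next_token(logits, temperature=1.0, top_k=None, rng=np.random.default_rng()):
      """logits: (vocab_size,) unnormalized next-token scores from the model."""
      logits = np.asarray(logits, dtype=np.float64) / temperature
      if top_k is not None:
          cutoff = np.sort(logits)[-top_k]                 # k-th largest logit
          logits = np.where(logits >= cutoff, logits, -np.inf)  # drop everything below it
      probs = np.exp(logits - logits.max())                # softmax over surviving tokens
      probs /= probs.sum()
      return rng.choice(len(probs), p=probs)

  # Toy usage with a fake 10-token vocabulary
  fake_logits = np.array([2.0, 1.5, 0.3, -1.0, 0.0, 0.2, -0.5, 1.0, 0.8, -2.0])
  print(sample_next_token(fake_logits, temperature=0.7, top_k=3))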

RE Winograd: WNLI is different, see https://arxiv.org/pdf/1804.07461.pdf


Amazing results, how excited are you? :)

You're right. I noted too that the comparison isn't direct, but then I wasn't justified in calling the gap claim wrong, so sorry for that. I think it would be nice, however, to have it undergo an external or more neutral test of performance. I say this without at all doubting the quality of the results.



