I never know if I have an inside scoop or an outside scoop. Has Hyena not addressed the scaling of context length [1]? I know this version is barely a month old, but it was shared with me by a non-engineer the week it came out. Still, giving interviews where the takeaway is that the main limitation is context length and that it requires a big breakthrough, when that breakthrough has arguably already happened, makes me seriously question whether he is qualified to speak on behalf of OpenAI. Maybe he and OpenAI are far beyond this paper and know it doesn't work, but surely it should be addressed?
As someone who is in the field: papers proposing to solve the context length problem come out every month. Almost none of the solutions stick or work as well as a dense or mostly dense model.
You'll know the problem is solved when model after model consistently uses a method. Until then (and especially if you're not in the field as a researcher), assume that every paper claiming to tackle context length is simply a nice proposal.
Yes. Hundreds of different approaches to solving context length have been tried, and yet most LLMs are almost identical to the original Transformer from 2017.
Just to name a few families of approaches: Sparse Attention, Hierarchical Attention, Global-Local Attention, Sliding Window Attention, Locality-Sensitive Hashing Attention, State Space Models, EMA-gated attention.
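To make the trade-off these families are chasing concrete, here's a toy NumPy sketch of one of them, sliding-window attention, where each token can only attend to the last few positions. The function name, the `window` parameter, and the shapes are my own illustration, not taken from any of the papers above; a real implementation would never materialize the full n x n score matrix, which is exactly where the hoped-for savings over dense attention come from.

```python
import numpy as np

def sliding_window_attention(q, k, v, window=4):
    """Toy single-head causal attention: position i only attends to
    positions i-window+1 .. i. Illustrative only; real kernels compute
    just the banded scores, giving O(n * window) instead of O(n^2)."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)          # (n, n) raw scores (toy version builds it all)
    idx = np.arange(n)
    # Block future positions and anything more than `window` tokens back.
    mask = (idx[None, :] > idx[:, None]) | (idx[:, None] - idx[None, :] >= window)
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Usage: 16 tokens, 8-dim vectors; no token ever looks more than 4 tokens back.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
print(sliding_window_attention(q, k, v, window=4).shape)  # (16, 8)
```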
Notably, human working memory isn't great either, which raises the question (if the comparison is valid) of whether that limitation might be fundamental.
The usual failure mode is that only long-context tasks benefit; short ones work fast enough with full attention, and with better quality. It's telling that OpenAI has never used them in any serious LLM, even though training costs are huge.
[1] - https://arxiv.org/pdf/2302.10866.pdf