So you agree that the problem is fundamentally sparse? Embeddings are used to make sparse (e.g. categorical) data usable on GPUs, and real-world models are limited by how large an embedding table they can fit in GPU memory. Embedding lookups are also a compute bottleneck:
An example is Facebook's DLRM: https://arxiv.org/pdf/1906.00091.pdf
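To make the point concrete, here is a minimal sketch (with made-up sizes, not DLRM's actual configuration) of why embedding tables dominate memory: a table is just a dense matrix with one row per category, so its footprint grows linearly with cardinality, and a lookup is a row gather.

```python
import numpy as np

# Hypothetical sizes for illustration only.
num_categories = 1_000_000   # cardinality of one categorical feature
embedding_dim = 64           # size of each dense vector

rng = np.random.default_rng(0)
table = rng.standard_normal((num_categories, embedding_dim)).astype(np.float32)

# A lookup maps sparse categorical IDs to dense rows the GPU can process.
batch_ids = np.array([3, 17, 999_999])
dense = table[batch_ids]     # shape: (3, 64)

# Table memory scales with cardinality: this single feature already
# needs 1e6 * 64 * 4 bytes = 256 MB, and production models have many
# such features, which is what runs into the GPU memory limit.
print(table.nbytes / 2**20, "MiB")
print(dense.shape)
```

The gather itself does almost no arithmetic, which is part of why lookups bottleneck differently from the dense layers that follow them.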