I think we are only at the beginning — call it the first generation. LLaMA even shipped without many improvements that already existed before it (xPos, Multi-Query Attention, Blockwise Parallel Transformer).
Research in this area is moving very fast, and I would bet that we will soon see much more efficient LLMs.
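To illustrate one of the improvements named above: Multi-Query Attention keeps a separate query projection per head but shares a single key/value projection across all heads, shrinking the KV cache by a factor of the head count. A minimal NumPy sketch (all shapes and weight matrices here are made up for demonstration, not taken from any particular model):

```python
import numpy as np

def multi_query_attention(x, Wq, Wk, Wv, n_heads):
    """Multi-Query Attention: per-head queries, but one shared
    key/value projection for all heads (vs. one per head in
    standard multi-head attention)."""
    seq, d_model = x.shape
    d_head = d_model // n_heads
    q = (x @ Wq).reshape(seq, n_heads, d_head)  # per-head queries
    k = x @ Wk                                  # shared keys   (seq, d_head)
    v = x @ Wv                                  # shared values (seq, d_head)
    out = np.empty((seq, n_heads, d_head))
    for h in range(n_heads):
        scores = q[:, h, :] @ k.T / np.sqrt(d_head)
        # numerically stable softmax over the key dimension
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[:, h, :] = weights @ v
    return out.reshape(seq, d_model)

# toy dimensions, purely illustrative
rng = np.random.default_rng(0)
seq, d_model, n_heads = 4, 8, 2
x = rng.standard_normal((seq, d_model))
Wq = rng.standard_normal((d_model, d_model))
Wk = rng.standard_normal((d_model, d_model // n_heads))  # single K head
Wv = rng.standard_normal((d_model, d_model // n_heads))  # single V head
y = multi_query_attention(x, Wq, Wk, Wv, n_heads)
print(y.shape)  # (4, 8)
```

At inference time only `k` and `v` need to be cached per token, so the cache is `n_heads` times smaller than with standard multi-head attention.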