You might say that, but the literal benchmark of LLMs (or any supervised learning algorithm, for that matter) is loss: roughly, the 'distance' between the model's predicted next token and the actual next token, measured on a held-out validation set after training on the training data.
When that loss gets close to zero (it's measured in nats per token, not a percentage), the model is assigning near-certain probability to the exact next token, which means it can pretty much recreate the training data.
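To make that concrete, here's a minimal sketch in plain Python (the thread has no code, so the function and the toy numbers are purely illustrative) of how per-token cross-entropy loss is computed, and why a near-zero loss corresponds to near-verbatim reproduction:

```python
import math

def cross_entropy(predicted_probs, target_ids):
    """Average negative log-probability the model assigns to each
    actual next token -- the standard LLM training/validation loss."""
    return -sum(math.log(p[t]) for p, t in zip(predicted_probs, target_ids)) / len(target_ids)

# Toy vocabulary of 4 tokens; each row is a model's predicted
# distribution over the *next* token at one position.
confident = [
    [0.97, 0.01, 0.01, 0.01],  # ~97% sure the next token is token 0
    [0.01, 0.97, 0.01, 0.01],  # ~97% sure the next token is token 1
]
targets = [0, 1]  # the tokens that actually come next

loss = cross_entropy(confident, targets)
print(f"loss = {loss:.3f} nats/token")   # ~0.030: near-certain predictions,
                                         # so greedy decoding replays the text
                                         # almost verbatim

uncertain = [[0.25] * 4, [0.25] * 4]     # model has no idea
print(f"loss = {cross_entropy(uncertain, targets):.3f} nats/token")
# ~1.386 (= ln 4): chance-level predictions over 4 tokens
```

The takeaway is the shape of the relationship: a loss near 0 nats/token means the model's probability mass sits almost entirely on the true next token at every step, which is what lets it regurgitate training sequences.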