You can measure batch size either way, in sequences or in tokens, and you’ll see both conventions in the literature. In this case it doesn’t matter much which one you use, because the sequence length is fixed at a typical pretraining value of 2048, so the two measures differ only by that constant factor. Either way, 300k sequences per batch is huge compared to typical batch sizes in the literature, which are closer to 2048 sequences, or about 4M tokens total.
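For concreteness, here is a quick back-of-the-envelope conversion between the two measures, assuming a fixed 2048-token sequence length (the function name and numbers are illustrative, not from any particular codebase):

```python
SEQ_LEN = 2048  # typical pretraining sequence length (tokens per sequence)

def tokens_per_batch(num_sequences: int, seq_len: int = SEQ_LEN) -> int:
    """Batch size in tokens, given batch size in sequences."""
    return num_sequences * seq_len

# A typical batch from the literature: ~2048 sequences
print(tokens_per_batch(2048))     # 4,194,304 ~= 4M tokens

# The 300k-sequence batch being discussed
print(tokens_per_batch(300_000))  # 614,400,000 ~= 600M tokens
```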