You can measure batch size either way, in sequences or in tokens, and you’ll see both conventions in the literature. In this case it doesn’t matter much which one you use, because the sequence length is fixed at a typical pretraining value of 2048, so the two measures differ only by that constant factor. Either way, 300k sequences per batch is huge compared to typical batch sizes in the literature, which are closer to 2048 sequences, or about 4M tokens total.
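For concreteness, here is a quick back-of-the-envelope conversion between the two measures, assuming a fixed 2048-token sequence length (the function name and numbers are illustrative, not from any particular codebase):

```python
SEQ_LEN = 2048  # typical pretraining sequence length (tokens per sequence)

def tokens_per_batch(num_sequences: int, seq_len: int = SEQ_LEN) -> int:
    """Batch size in tokens, given batch size in sequences."""
    return num_sequences * seq_len

# A typical batch from the literature: ~2048 sequences
print(tokens_per_batch(2048))     # 4,194,304 ~= 4M tokens

# The 300k-sequence batch being discussed
print(tokens_per_batch(300_000))  # 614,400,000 ~= 600M tokens
```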