+1, we still have a lot of performance left to extract! The main candidates are JIT-compiled train steps, more optimized data loading and sharding, gradient accumulation, and activation checkpointing. We will keep building and will publish another blog post soon after implementing these improvements!