+1, had the fortune to work with him at a previous startup and to meet up in person. Our convo very much broadened my perspective on engineering as a career and a craft, always excited to see what he's working on. Good luck Simon!
Had this exact problem (Heroku Postgres to RDS) at my old co. The data migration went as badly as it possibly could (dropped indices, foreign keys, everything but the data itself). This would have saved us months of pain.
Karpathy covers this in Makemore, but the tl;dr is that if you don't normalize the batch (essentially center and scale your activations so they're roughly normally distributed), then at gradient/backprop time, you may get values that are significantly smaller or greater than 1. This is a problem because as you stack layers in sequence (passing outputs to inputs), the gradients compound (because of the Chain Rule), so what may have been a well-behaved gradient at the final layers has either vanished (the upstream gradients were 0<x<1 at each layer) or exploded (the gradients were x>>1 upstream). Batch normalization helps control the vanishing/exploding gradient problem in deep neural nets by normalizing the values passed between layers.
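A minimal sketch of the idea in PyTorch (not the Makemore code itself; the `batch_norm` helper and shapes here are illustrative assumptions): each feature is centered and scaled using batch statistics, then shifted back by learnable gamma/beta.

```python
import torch

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (batch_size, hidden_dim) pre-activations.
    # Center and scale each feature with batch statistics so the values
    # passed to the next layer are roughly zero-mean, unit-variance.
    mean = x.mean(dim=0, keepdim=True)
    var = x.var(dim=0, keepdim=True, unbiased=False)
    x_hat = (x - mean) / torch.sqrt(var + eps)
    # Learnable scale and shift let the network undo the normalization if useful.
    return gamma * x_hat + beta

# Example: badly scaled pre-activations get pulled back to ~N(0, 1).
x = torch.randn(32, 100) * 5 + 3
gamma, beta = torch.ones(100), torch.zeros(100)
out = batch_norm(x, gamma, beta)
print(out.mean().item(), out.std().item())  # ~0 and ~1
```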
Interpreting Modular Addition in MLPs https://www.lesswrong.com/posts/cbDEjnRheYn38Dpc5/interpreti...
Paper Replication Walkthrough: Reverse-Engineering Modular Addition https://www.neelnanda.io/mechanistic-interpretability/modula...