laidoffamazon | 19 days ago | on: Writing Speed-of-Light Flash Attention for 5090 in...
Interesting. My assumption was that one of the innovations of DeepSeek and the modern GPT models was performing low-precision pretraining rather than just low-precision fine-tuning afterwards. I didn't realize you still need accumulation at a higher precision anyway.
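For concreteness, here is a minimal CUDA sketch (not from the thread, purely illustrative) of what "accumulate at a higher precision" means in practice: the operands are bf16, but each partial product is summed into an fp32 accumulator, which mirrors what tensor-core MMA instructions do in hardware. The kernel name and the single-warp launch are assumptions for brevity.

    // Dot product with bf16 inputs and an fp32 accumulator.
    // Assumes a launch with exactly one warp (32 threads) per block.
    #include <cuda_bf16.h>

    __global__ void dot_bf16_fp32_accum(const __nv_bfloat16* a,
                                        const __nv_bfloat16* b,
                                        float* out, int n) {
        // fp32 accumulator: bf16 has only 8 mantissa bits, so summing
        // thousands of terms in bf16 would drop most low-order
        // contributions to the running total.
        float acc = 0.0f;
        for (int i = threadIdx.x; i < n; i += blockDim.x) {
            // Widen each bf16 operand to fp32 before the multiply-add.
            acc += __bfloat162float(a[i]) * __bfloat162float(b[i]);
        }
        // Warp-level reduction of the per-thread partial sums.
        for (int offset = 16; offset > 0; offset >>= 1)
            acc += __shfl_down_sync(0xffffffff, acc, offset);
        if (threadIdx.x == 0) *out = acc;
    }

The same pattern holds at the training-framework level: FP8/bf16 storage and matmul inputs keep memory traffic and compute cheap, while the reduction dimension is still accumulated in fp32 to keep the result numerically stable.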