I don't see any explanation for why they trained 8B instead of 7B.
I thought that if you have a 16GB GPU, you can fit a 14GB model (7B params × 2 bytes at 16-bit) into it, but how does it fit if the model is exactly 16GB?
The bigger size probably comes from the larger vocabulary in the tokenizer. But most people run this model quantized to at least 8 bits, and it still holds up reasonably well down to 3-4 bpw.
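To make the arithmetic above concrete, here's a rough sketch of the weight-memory math (weights only; activations and KV cache add real overhead on top, so a 16GB model does not actually fit in 16GB of VRAM):

```python
# Approximate VRAM for model weights alone: params * bits / 8.
# This deliberately ignores activation/KV-cache overhead.
def weights_gb(params_billion: float, bits: int) -> float:
    # 1B params at 8 bits is roughly 1 GB
    return params_billion * bits / 8

for bits in (16, 8, 4):
    print(f"8B model at {bits}-bit: ~{weights_gb(8, bits):.0f} GB")
# 16-bit weights alone already equal a 16GB card's capacity,
# which is why quantized (8-bit or 3-4 bpw) variants are popular.
```

By the same formula, a 7B model at 16 bits comes to ~14GB of weights, matching the figure quoted above.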
No reason to go for a 4090, as it's no more capable for this (same 24GB of VRAM), and the 5090 probably won't have more than 24GB either, simply because Nvidia wants to maintain its margins through market segmentation: adding more VRAM to that card would obsolete their low-end enterprise AI cards that cost $6000+.
I'd also consider dual A6000-48GB (96GB total) if you have a budget of $8000 or dual V100-32GB (64GB) if you have a budget of $4000.
The V100 is older and slower, but for AI applications VRAM is king, and there are lots of enterprise V100s coming off racks and being sold cheap on eBay.