I haven't read the code, but it sounded like they encode in frequency space. So if they're already putting all the bits into encoding content below 20kHz, it seems like going from 44.1kHz to 48kHz would not change the size (as the extra bandwidth already has no bits allocated to it).
Since the MDCT is discrete, I assume it operates on power-of-2-sized batches of samples. So (like you, without looking at the code) I would have assumed that more samples/s means you need more transform blocks per second, which means you have to allocate fewer output bits per block to hit your target rate.
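To put rough numbers on that guess, here is a back-of-the-envelope sketch. The 1024-sample hop and the 128 kbit/s target are values I made up for illustration, not anything taken from the actual codec, and `bits_per_block` is just a hypothetical helper name. It also assumes a constant bitrate and a fixed block size at both sample rates.

```python
def bits_per_block(sample_rate_hz: int, bitrate_bps: int, hop: int = 1024) -> float:
    """Average bit budget per transform block at a fixed target bitrate.

    Assumption: each block consumes `hop` new input samples (a power of
    two), so the encoder emits sample_rate_hz / hop blocks per second.
    """
    blocks_per_second = sample_rate_hz / hop
    return bitrate_bps / blocks_per_second


if __name__ == "__main__":
    target = 128_000  # assumed target bitrate in bits per second
    for rate in (44_100, 48_000):
        print(f"{rate} Hz: {bits_per_block(rate, target):.1f} bits per block")
```

With the block size held fixed, the per-block budget scales as 44100/48000, so each block at 48kHz gets roughly 8% fewer bits than at 44.1kHz, whatever bitrate you plug in.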
You are probably right. I forgot about the whole power-of-two thing for FFTs. That would definitely irritate the same part of my brain that would be put off by interpolating discrete samples, even if they're inaudible. Same vein as how 7 is more random than 6.