Most of the memory usage comes from the user stack, and you can customize the size of the stack. So memory consumption isn't really the limiting factor here. More important is the cost of context switching.
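For what it's worth, here is a minimal sketch of what "customize the size of the stack" can look like from a high-level language. It uses Python's `threading.stack_size`, which sets the stack size for threads created afterwards. The helper name `run_with_stack_size` and the 256 KiB figure are just illustrative choices, not anything from the discussion above.

```python
import threading

def run_with_stack_size(size_bytes, fn):
    """Run fn in a new thread created with a size_bytes stack; return fn's result."""
    # stack_size() affects threads created after the call and returns the old setting.
    previous = threading.stack_size(size_bytes)
    result = []
    try:
        t = threading.Thread(target=lambda: result.append(fn()))
        t.start()
        t.join()
    finally:
        threading.stack_size(previous)  # restore the earlier setting (0 means platform default)
    return result[0]

if __name__ == "__main__":
    # 256 KiB is an arbitrary example; Python rejects sizes below 32 KiB.
    print(run_with_stack_size(256 * 1024, lambda: sum(range(1000))))
```

In C the equivalent knob would be `pthread_attr_setstacksize`. Either way the size is fixed when the thread is created.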
2KB is tiny; presumably that's a default that grows or is extended as required? Obviously it works out okay for them, but I feel like it would add nontrivial overhead to every function call if you always have to check whether you need to grow your stack to accommodate the new frame.
Linux does not immediately map 2MB of stack for every thread. That would be ridiculous. Look in /proc/<pid>/smaps to see how much space your stacks actually occupy.
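A rough sketch of how you might check this yourself: the function below sums the `Size` (reserved) and `Rss` (actually resident) fields for mappings whose name starts with `[stack`. The function name `stack_usage` is made up for illustration. One caveat: on many kernels only the main thread's stack carries the `[stack]` label, and other threads' stacks show up as unnamed anonymous mappings, so treat the result as a lower bound.

```python
import os
import re

# A mapping header looks like "7ffc1000-7ffc2000 rw-p 00000000 00:00 0   [stack]".
HEADER = re.compile(r"^[0-9a-f]+-[0-9a-f]+\s")

def stack_usage(smaps_text):
    """Sum Size and Rss (in kB) over mappings whose name starts with "[stack"."""
    totals = {"Size": 0, "Rss": 0}
    in_stack = False
    for line in smaps_text.splitlines():
        if HEADER.match(line):
            fields = line.split()
            # The optional sixth field is the mapping name.
            in_stack = len(fields) >= 6 and fields[5].startswith("[stack")
        elif in_stack:
            key, _, rest = line.partition(":")
            if key in totals:
                totals[key] += int(rest.split()[0])  # value is given in kB
    return totals

if __name__ == "__main__":
    path = "/proc/self/smaps"  # Linux only
    if os.path.exists(path):
        with open(path) as f:
            print(stack_usage(f.read()))
```

If `Rss` comes out much smaller than `Size`, that is the lazy mapping at work: the address range is reserved, but pages are only backed by memory once they are touched.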
A deeper problem is that with Linux threads you can’t grow the stack beyond whatever limit you choose, so you either have to write all code with very limited stack usage or set a stack big enough for the worst case, which will be too big for millions of threads.
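To put rough numbers on that, here is a back-of-the-envelope sketch using the 2MB and 2KB figures mentioned in this thread and a round one million threads. The helper names are made up, and the totals are reserved address space, not resident memory.

```python
def reserved_bytes(threads, stack_bytes):
    """Total address space reserved when every thread gets a fixed-size stack."""
    return threads * stack_bytes

def human(n):
    """Format a byte count with binary units, one decimal place."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if n < 1024 or unit == "TiB":
            return f"{n:.1f} {unit}"
        n /= 1024

if __name__ == "__main__":
    million = 1_000_000
    print(human(reserved_bytes(million, 2 * 1024**2)))  # 2 MiB stack per thread
    print(human(reserved_bytes(million, 2 * 1024)))     # 2 KiB stack per thread
```

With those assumptions the fixed 2 MiB stacks reserve about 1.9 TiB, against about 1.9 GiB for 2 KiB stacks. That factor of 1024 is what you pay for sizing every stack for the worst case.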
The kernel has to restore registers, including the stack pointer, on a context switch. But it doesn't care about how big the allocation underpinning that stack is.