The key thing is that I need to be able to suspend three to five levels deep in the call stack. The instruction dispatcher calls into an instruction, which calls into a bus memory read function, which triggers a DMA transfer that then needs to switch to the video processor. Once the video processor has caught up in time, I need to resume right there inside the DMA transfer function. So a separate stack for each fiber/thread is essential.
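
To make that concrete, here is a rough sketch of the control flow, not real emulator code. It fakes fibers with OS threads and semaphores so that only one runs at a time, which is far slower than a real context switch but shows the same property: each fiber keeps its own stack, so the switch can happen at any call depth. Every name in it (`Fiber`, `switch`, `run_demo`, the clock values, the `0x42` result) is made up for illustration.

```python
import threading


class Fiber:
    """Cooperative thread: owns a full stack, runs only when switched to."""

    def __init__(self, body=None):
        self.wake = threading.Semaphore(0)
        if body is not None:  # body=None wraps the already-running caller
            threading.Thread(target=self._start, args=(body,), daemon=True).start()

    def _start(self, body):
        self.wake.acquire()  # parked until the first switch into this fiber
        body()


def switch(current, target):
    """Hand control to `target` and park `current` until it is woken again."""
    target.wake.release()
    current.wake.acquire()


def run_demo():
    log = []
    clock = {"cpu": 0, "video": 0}

    def video_processor():
        while True:
            clock["video"] += 4
            log.append(f"video: advanced to t={clock['video']}")
            switch(video, cpu)

    def dma_transfer():
        clock["cpu"] += 8  # hypothetical cost of the transfer
        log.append("dma: start")
        while clock["video"] < clock["cpu"]:
            switch(cpu, video)  # suspended here, four calls deep
        log.append("dma: resumed, video caught up")
        return 0x42

    def bus_read():
        return dma_transfer()

    def instruction():
        return bus_read()

    def dispatch():
        value = instruction()
        log.append(f"cpu: read {value:#x}")
        switch(cpu, main)  # hand control back to the caller of run_demo

    main = Fiber()
    cpu = Fiber(dispatch)
    video = Fiber(video_processor)
    switch(main, cpu)
    return log
```

Note that `bus_read`, `instruction` and `dispatch` know nothing about the switch. With generators or any other stackless coroutine, each of those layers would have to be rewritten to pass the suspension up to its caller, which is exactly what the per-fiber stack avoids.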