Fortress makes a convincing argument that parallelism must be supported at the language level. It has a smart work-splitting algorithm that does for managing parallelism granularity what garbage collection did for managing memory allocation. It works much better when it's baked into the core and available from the ground up.
Not to detract from anything you said (which I agree with), but when Guy Steele gave a guest lecture on Fortress at my university a few months ago, he said that the work-splitting algorithm still needed work. In particular, the part that decides the right granularity for a given task (i.e., when to stop splitting the task into smaller subtasks) wasn't there yet.
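To make the granularity question concrete, here is a minimal sketch in Python (not Fortress, and not its actual algorithm) of divide-and-conquer splitting with a hand-picked cutoff. The names `split_ranges`, `parallel_sum`, and `cutoff` are made up for illustration; the hard part described above is choosing that cutoff automatically.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(lo, hi, cutoff):
    """Recursively halve [lo, hi) until each piece holds at most `cutoff` items."""
    if cutoff < 1:
        raise ValueError("cutoff must be at least 1")
    if hi - lo <= cutoff:
        return [(lo, hi)]  # small enough: stop splitting
    mid = (lo + hi) // 2
    return split_ranges(lo, mid, cutoff) + split_ranges(mid, hi, cutoff)


def parallel_sum(data, cutoff):
    """Sum `data` by handing each leaf range to a worker thread."""
    ranges = split_ranges(0, len(data), cutoff)
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda r: sum(data[r[0]:r[1]]), ranges)
        return sum(partials)
```

A cutoff that is too small produces many tiny tasks, so scheduling overhead dominates. One that is too large produces too few tasks to keep every core busy.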
Shouldn't it also be set up so that different algorithms can be swapped in? I think it's pretty well accepted at this point that different application scenarios perform better with differently tuned schedulers.
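As a rough illustration of what "swappable" could look like, here is a small Python sketch in which the splitting decision is a plain function passed in by the caller. Everything here (`SplitPolicy`, `fixed_cutoff`, `depth_limited`, `plan`) is hypothetical and says nothing about how Fortress is actually structured. It also covers only the splitting decision, not a full scheduler.

```python
from typing import Callable, List, Tuple

# A policy answers: should a range of `size` items at recursion `depth` be split again?
SplitPolicy = Callable[[int, int], bool]


def fixed_cutoff(cutoff: int) -> SplitPolicy:
    """Split while a piece is larger than `cutoff` items."""
    return lambda size, depth: size > cutoff


def depth_limited(max_depth: int) -> SplitPolicy:
    """Split until the recursion is `max_depth` levels deep."""
    return lambda size, depth: depth < max_depth


def plan(lo: int, hi: int, policy: SplitPolicy, depth: int = 0) -> List[Tuple[int, int]]:
    """Return the leaf ranges of [lo, hi) chosen by `policy`."""
    size = hi - lo
    if size < 2 or not policy(size, depth):
        return [(lo, hi)]  # cannot or should not split further
    mid = (lo + hi) // 2
    return plan(lo, mid, policy, depth + 1) + plan(mid, hi, policy, depth + 1)
```

The same `plan` call then works with either policy, e.g. `plan(0, 1000, fixed_cutoff(100))` or `plan(0, 1000, depth_limited(3))`, so a workload-specific policy can be dropped in without touching the splitting code.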