Writing good GPU code has two parts: parallelizing the algorithm and writing the kernels. Automating one part will not save time on the other.
Based on practical experience, the compilers are pretty good nowadays, and the fine details of a kernel do not matter that much. The performance issues tend to revolve around local memory usage, bank conflicts, and how much work one kernel instance does; these require hand tuning, and there the compilers still underperform. Thankfully, a poor kernel is 'just' a constant factor in the overall time complexity of the algorithm.
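To show the kind of detail that needs the hand tuning, here is a minimal CUDA sketch (my own illustration, not from the original text): a matrix transpose that stages data through shared (local) memory. Reading the tile column-wise puts a whole warp on the same bank; padding the tile by one element is the usual fix. The kernel names and TILE size are made up for the example.

    // Sketch only: shared-memory bank conflict and the common padding fix.
    // On current NVIDIA hardware shared memory has 32 banks of 4-byte words,
    // so a 32x32 tile of floats puts every element of a column in one bank.
    #define TILE 32

    __global__ void transpose_naive(float *out, const float *in, int n)
    {
        __shared__ float tile[TILE][TILE];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];
        __syncthreads();

        // Column read: all 32 threads of a warp hit the same bank,
        // a 32-way conflict, so the accesses are serialized.
        x = blockIdx.y * TILE + threadIdx.x;
        y = blockIdx.x * TILE + threadIdx.y;
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];
    }

    __global__ void transpose_padded(float *out, const float *in, int n)
    {
        __shared__ float tile[TILE][TILE + 1];   // +1 column skews the banks

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];
        __syncthreads();

        x = blockIdx.y * TILE + threadIdx.x;
        y = blockIdx.x * TILE + threadIdx.y;
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // conflict-free
    }

Both kernels compute the same thing; only the constant factor changes, which is exactly the point above.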
At the higher level, the most important thing is how the actual algorithm is described. If it is described as a serial one, there is no automated way (and most likely there never will be a general way) of parallelizing it, except running it to check the data dependencies, at which point you already have the result. And since the dependencies can change based on the inputs, the trace from one run cannot be generalized to the next.
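As a concrete (hypothetical) illustration of input-dependent dependencies, consider a scatter-style loop; nothing in the source code says whether two iterations touch the same element, only the contents of the index array do:

    // Hypothetical sketch: the dependency structure lives in the data, not
    // in the code. If idx contains no duplicates, the iterations are
    // independent and the loop is trivially parallel; if idx[3] == idx[7],
    // iterations 3 and 7 race on the same element. A compiler looking at
    // the source alone cannot know which case it is.
    void scatter_add(float *a, const int *idx, const float *v, int n)
    {
        for (int i = 0; i < n; ++i)
            a[idx[i]] += v[i];  // parallel-safe only if idx is duplicate-free
    }

A dependency trace from a run with one idx proves nothing about a run with another.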
This could probably be proved by the same diagonalization used for the halting problem: write a program that calls the autoparallelizer on itself. If the parallelizer says there is no data dependency between two of its parts, the program makes them dependent; if it says there is one, the program makes them independent. Either way the parallelizer's answer is wrong, so no such tool can be correct for all programs.
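A sketch of that reduction, with a made-up oracle name: assume independent(a, b) is an autoparallelizer that decides, for any two program parts, whether they are free of data dependencies. The adversary below does the opposite of whatever the oracle answers, so the oracle cannot be right about it.

    // Hypothetical diagonalization sketch, mirroring the halting-problem
    // proof. 'independent' is an assumed oracle (it cannot actually be
    // implemented, which is the point); the names are invented.
    extern bool independent(const char *part_a, const char *part_b);

    int shared_cell = 0;

    void adversary(void)
    {
        if (independent("adversary:part_a", "adversary:part_b")) {
            // Oracle claims no dependency, so create one:
            // part_b reads what part_a just wrote.
            shared_cell = 1;          // part_a
            int r = shared_cell;      // part_b
            (void)r;
        } else {
            // Oracle claims a dependency, so touch disjoint data:
            // there is no dependency at all.
            int x = 1;                // part_a
            int y = 2;                // part_b
            (void)x; (void)y;
        }
    }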
Thus let it be clear:
There is no way whatsoever to take the hard part away (thinking in parallel). Nothing will take a bunch of serial code in and spit a parallel program out.