
You can treat it as a multiplexing operation. The first y[i] is computed normally; y[i+1], y[i+2], etc. are computed with a parallel form, across as many cores as is optimal. Normally each core waits for data, finishes it very quickly, and then sits idle waiting on memory, but this lets each core return more results for a given time i without a serial readback (which introduces memory-bandwidth pressure). The optimal throughput strategy is to push the parallelization upward until the latency of the extra computation outstrips the memory-latency savings.
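A minimal sketch of the idea, assuming the recurrence in question is a simple linear one, y[i] = a*y[i-1] + x[i] (the thread doesn't name the exact operation, so this is an illustrative stand-in): instead of computing one output per serial step, each block of outputs is expanded so every element depends only on the block's incoming state, making the elements within a block independent of each other.

```python
import numpy as np

def serial(x, a, y0=0.0):
    # Baseline: one output per step, each dependent on the previous one.
    y = np.empty_like(x)
    prev = y0
    for i, xi in enumerate(x):
        prev = a * prev + xi
        y[i] = prev
    return y

def multiplexed(x, a, y0=0.0, block=4):
    # Expanded form: y[start+k] = a^(k+1)*prev + sum_{j<=k} a^(k-j)*x[start+j].
    # Every k within a block reads the same `prev`, so the k-loop could be
    # distributed over cores or SIMD lanes with no serial readback between
    # elements -- more compute per element, but no waiting on y[start+k-1].
    y = np.empty_like(x)
    powers = a ** np.arange(1, block + 1)   # a, a^2, ..., a^block
    prev = y0
    for start in range(0, len(x), block):
        xb = x[start:start + block]
        n = len(xb)
        for k in range(n):                  # each k is independent
            acc = powers[k] * prev
            for j in range(k + 1):
                acc += a ** (k - j) * xb[j]
            y[start + k] = acc
        prev = y[start + n - 1]             # carry state to the next block
    return y
```

The trade-off the comment describes is visible in the expanded form: the block version does O(block) extra multiply-adds per element, which only pays off while those cores would otherwise be stalled on memory.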



Interesting, thanks for the answer. From what you said it feels like it could be good for AVX-like CPU-accelerated instructions; with all those latencies it would be an optimization like loop unrolling. But for GPUs, really?



