For actual mobile devices, yes, there is no need. FP64 is only really useful in scientific research, maybe finance, and a few other fields, and even there you would do a lot of mixed-precision work.
The support was probably there because they wanted to design a single chip: remove or disable cores for truly mobile or general-purpose boards, while keeping the logic available for the customers that would actually want it.
I once needed FP64 in a GPU for physics calculations. One reason that impulse/constraint won out over spring/damper is that spring/damper has a total loss of precision problem with 32-bit floats.
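To make that concrete, here's a minimal sketch (not from the actual code, just an illustration): a tiny per-step spring correction applied to a body sitting far from the origin simply vanishes in FP32, while FP64 keeps it.

    #include <stdio.h>

    /* Illustration only: a body far from the origin receives a tiny
     * spring/damper correction each step. At this magnitude the correction
     * is below half a ULP in FP32, so the position never moves at all. */
    int main(void) {
        const float  dx_f = 0.001f;       /* per-step spring correction   */
        const double dx_d = 0.001;
        float  pos_f = 100000.0f;         /* position far from the origin */
        double pos_d = 100000.0;

        for (int i = 0; i < 1000; ++i) {  /* 1000 steps ~ move by 1.0     */
            pos_f += dx_f;
            pos_d += dx_d;
        }
        printf("FP32 moved: %.6f\n", pos_f - 100000.0f); /* 0.000000  */
        printf("FP64 moved: %.6f\n", pos_d - 100000.0);  /* ~1.000000 */
        return 0;
    }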
I don't remember there ever being a time when FP64 was considered a big deal for mobile GPUs.
The article is trying to debunk a claim that was never popular to begin with.
Allow me to refresh your memory then. How does the following sound to you?
"We also support Full Profile and 64-bit natively, in hardware. After years of evangelising the benefits of such an approach it is nice to see other players in the industry join down this avenue."
https://community.arm.com/groups/arm-mali-graphics/blog/2013...
"Mali-T622 was specifically tailored for this job. Mali-T622 also supports OpenCL Full Profile and includes double-precision FP64 and full IEEE-754-2008 floating-point support which are essential features in order to enhance the user experience"
https://community.arm.com/groups/arm-mali-graphics/blog/2013...
I could go on with the examples but I think there's no need to spam the thread with tens of blog articles that say FP64 and "native 64-bit" (whatever that means) are essential to the mobile experience.
Is there any point in actually damning FP64 this hard anymore? There is no reason, imo, for a modern GPU to run FP64 at worse than 1/3 of its FP32 rate.
Side note: consumer GeForces (as opposed to the "professional" Quadro/Tesla lines) have FP64 performance deliberately capped, and AMD has started doing the same with GCN-era chips.
Even if you're doing fintech simulations, FP16 could well be plenty of precision for a first pass, and then you'd get all those extra cores and ops/watt.
FP64 seems like a very small use case for most of the parallelized workflows I can imagine.
For scientific workloads, double precision is a must-have. The ~7 significant digits of FP32 are not enough. In my lab, we haven't updated our Kepler-based GPU since 2013 for this reason.
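The ~7 digits come from FP32's 24-bit significand; past 2^24 it can't even tell neighbouring integers apart. A trivial demo, if anyone wants to see it bite:

    #include <stdio.h>

    int main(void) {
        /* FP32 has a 24-bit significand: above 2^24 = 16777216 consecutive
         * integers are no longer representable (~7 significant decimal digits). */
        float  f = 16777216.0f;
        double d = 16777216.0;
        printf("FP32: 16777216 + 1 = %.1f\n", f + 1.0f); /* 16777216.0 */
        printf("FP64: 16777216 + 1 = %.1f\n", d + 1.0);  /* 16777217.0 */
        return 0;
    }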
Since I don't plan to run any DNA analysis or fintech simulations on my smartphone anytime soon, I am very satisfied having FP32/FP16 precision in mobile right now. And so should you be.
You are right, I thought vessenes meant: "FP64 seems like a very small use case for most of the parallelized workflows I can imagine" for any platform.
It may be that mobile graphics gets by fine with FP32 or less, but I worry that if FP64 gets sidelined then none of the effort going into this hardware today will benefit applications that need real precision like science, weather prediction, GPS, etc.
Not really a problem. NVIDIA sells cards based around 32-bit (and now increasingly, 16-bit) ALUs for desktop usage, while offering more expensive ones with more 64-bit-focused ALUs for workstations and compute. Compute is important enough to their bottom line to justify it.
The real problem is that NVIDIA has compute locked down with CUDA. Mobile chipset vendors can't expand into compute if they're barred from entry at the API level.
Vulkan/SPIR-V looks promising; it just needs chip vendors (ARM, Qualcomm, AMD, Intel) to come together and invest in cuDNN equivalents.
Although I reckon deep learning on mobile (at least for some use cases, like cameras) will use dedicated silicon from Movidius etc. and ultimately be embedded in the camera chips directly.
The cross-platform nature is actually part of the problem: the whole point of doing GPGPU work is that you're playing to the hardware's strengths, which is difficult when the hardware can be nearly anything from a CPU to a GPU to an FPGA.
It doesn't help that, until recently, AMD hadn't tried to push OpenCL nearly as hard as NVIDIA pushes CUDA.
Modern AMD and NVIDIA GPUs are fairly similar hardware-wise, and it is not hard to write OpenCL code that executes efficiently on both. I agree that it is pretty hopeless to write performance-portable OpenCL across entirely different architectures, however.
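For what it's worth, the kind of kernel I mean is nothing exotic; a plain SAXPY like the sketch below (shown as the source string an OpenCL host program would compile at run time) runs fine on both vendors without any vendor-specific tuning.

    /* Vendor-neutral OpenCL C kernel (SAXPY), as a host-side source string. */
    static const char *saxpy_src =
        "__kernel void saxpy(float a,                 \n"
        "                    __global const float *x, \n"
        "                    __global float *y)       \n"
        "{                                            \n"
        "    size_t i = get_global_id(0);             \n"
        "    y[i] = a * x[i] + y[i];                  \n"
        "}                                            \n";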
Sure, but if you go with NVIDIA, you also get access to all the other goodies they distribute (Thrust, cuFFT, cuDNN, etc.) and all the CUDA-compatible stuff other people have written, like Theano and TensorFlow.
It does seem like people have gotten a little more interested in OpenCL lately, but it still lags pretty far behind. As dharma1 says below, AMD seems weirdly uninterested in catching up. If I were in charge of AMD, I'd be throwing money and programmers at this: "Want to port your library to OpenCL? Here, have a GPU! We'll help."
AMD management has completely missed the memo on deep learning. There was no mention of deep learning or FP16 performance yesterday when Polaris was announced; it was all about VR.
They are just not turning up to the party, and as a company they are running out of time if Polaris and Zen don't sell.
> Given the quality of OpenCL and its cross platform nature
I'm sorry, WHAT? OpenCL is absolute shit. Cumbersome API definition, lack of low-level control, stringly typed programs (all programs are provided as source strings and kernels are looked up by name strings too). That means nearly no compile-time feedback, and it's hard to embed GPU kernels into a single binary. The API is woefully lacking in flexibility (no dynamic launch). OpenCL 2.0 is better (EDIT: apparently AMD supports it now; I'd have to check whether Intel/NVidia have also added support), but hardly anyone supports it, so it's also largely irrelevant.
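To illustrate the "stringly typed" point, this is roughly what the host side looks like (a sketch only: error handling stripped, context/device setup assumed to exist elsewhere). The program is a char* and the kernel is looked up by name, so typos and argument mismatches only show up at run time.

    #include <CL/cl.h>

    /* Sketch: ctx and dev are assumed to have been created elsewhere. */
    static const char *src =
        "__kernel void scale(__global float *v, float a) {"
        "    v[get_global_id(0)] *= a;                    "
        "}";

    void build_example(cl_context ctx, cl_device_id dev) {
        cl_int err;
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
        err = clBuildProgram(prog, 1, &dev, NULL, NULL, NULL); /* compiled at run time  */
        cl_kernel k = clCreateKernel(prog, "scale", &err);     /* found by name string  */
        (void)k; (void)err;
    }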
Not only that, AMD hardware is terrible. Atomics on NVidia's Maxwell are orders of magnitude faster than on AMD (to the point of being comparable to non-atomic operations under low contention).
CUDA's environment provides better documentation, better feature support, saner development and debugging, the possibility to ship both generic and specialised binary kernels, JIT-able kernels in an intermediate representation, better compile-time sanity checking, and the ability to generate your own IR/CUDA assembly from non-CUDA languages...
Everyone does CUDA and uses NVidia because there's zero real competition. AMD is the only company that cares about OpenCL; Intel and NVidia just implement the bare minimum needed for AMD's OpenCL code to be portable to them. Intel has OpenMP and TBB for the Phi, NVidia has CUDA.
To me it's crazy that anyone keeps mentioning OpenCL as a serious alternative. In theory I agree that an open standard would be nice, but over here in reality where I have to actually write code there is no realistic alternative to CUDA if you want to stay sane.
You write OpenCL if you want to target anything other than AMD/NVIDIA/Intel. If you're writing code for an embedded application (with some heterogeneous core), or for a mobile application, you absolutely have to write OpenCL code, as there's no alternative. OpenCL is shit, but it's cross platform shit.
If your aim is to get 100% performance in a GPU-heavy cluster, then sure, you're going to need to write CUDA code and buy some NVIDIA GPUs; however, there are a lot of applications which run in entirely different environments that _only_ support OpenCL.
Does anyone actually implement OpenCL 2.0 yet? Last I checked not even AMD supported it, and they're the only company that has a reason to care about advancing OpenCL.
A better approach might be to look at what you are trying to accomplish and figure out the scale that works best for what a local view looks like.
For example, if I imagine an open game world where there are natural limits to useful render distance, then it is possible to define absolute maximum scale sizes.
My new to this problem space view is that even if the world is larger than those sizes, there is probably still some limited observer and scale that makes sense. Build in some spare room and padding into the scale, and it can then be transformed to center on different points. As movement towards one of those points happens, the new centering for each object could be pre-computed in spare cycles (or at least spread out so it isn't a single noticeable hit).
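Something like this sketch is what I have in mind (all names made up): keep absolute positions in double on the CPU side, and whenever the observer drifts far enough, rebase the local origin so the FP32 coordinates that actually get rendered stay small.

    #include <stdio.h>

    typedef struct { double x, y, z; } Vec3d;    /* absolute world position        */
    typedef struct { float  x, y, z; } Vec3f;    /* local position sent to the GPU */

    static Vec3d origin = {0, 0, 0};             /* current local origin                  */
    #define REBASE_RADIUS 4096.0                 /* rebase when observer strays this far  */

    static Vec3f to_local(Vec3d p) {             /* world space -> local space */
        Vec3f out = { (float)(p.x - origin.x),
                      (float)(p.y - origin.y),
                      (float)(p.z - origin.z) };
        return out;
    }

    static void maybe_rebase(Vec3d observer) {   /* call per frame, or spread it out */
        double dx = observer.x - origin.x, dz = observer.z - origin.z;
        if (dx * dx + dz * dz > REBASE_RADIUS * REBASE_RADIUS)
            origin = observer;                   /* everything re-centers next frame */
    }

    int main(void) {
        Vec3d player = { 1.0e7, 0.0, 1.0e7 };    /* far from the global origin */
        maybe_rebase(player);
        Vec3f local = to_local(player);
        printf("local: %f %f %f\n", local.x, local.y, local.z); /* ~0 0 0 */
        return 0;
    }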
I'm going to be downvoted because this isn't particularly on-topic, but nonetheless I'd like to suggest you try hyphenating phrases like this to make them easier to read, so you don't construct garden-path sentences where the correct parse exists but isn't obvious.