I think a lot of that comes from corner cases in the C/C++ languages themselves.
Have you seen XLA (https://www.tensorflow.org/performance/xla/) or Glow (https://facebook.ai/developers/tools/glow)? These are very high-level abstractions, callable from C++, that can do a great job of vectorizing high-level operations for the target device.
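To make that a bit more concrete, here is a minimal sketch of what handing a high-level operation to XLA can look like. It goes through the JAX Python front end rather than C++, purely for brevity, and the function name `saxpy` and the array sizes are my own assumptions for illustration. How well the result is vectorized depends on the backend XLA targets.

```python
import jax
import jax.numpy as jnp

# Hypothetical example function: a whole-array expression with no explicit loop.
def saxpy(a, x, y):
    return a * x + y

# jax.jit traces the function and hands the traced computation to XLA,
# which compiles it for whatever device is available (CPU, GPU, TPU).
saxpy_xla = jax.jit(saxpy)

x = jnp.arange(8, dtype=jnp.float32)
y = jnp.ones(8, dtype=jnp.float32)
result = saxpy_xla(2.0, x, y)
print(result)
```

The point is that the source describes the operation on whole arrays and leaves the loop structure and SIMD choices to the compiler, so there are no C/C++ aliasing or overflow rules for it to reason around.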