The fact that batching is faster does not in itself imply some inefficiency; it mostly reflects that sequential memory access is faster than random access.
I am curious what the basis is for the idea that Python is the performance bottleneck for inference.
It's not that Python is inherently much slower than C++; rather, doing inference in C++ makes it much easier to control exactly when memory is initialised, copied, and moved between CPU and GPU. Especially for frame-by-frame models like object detectors this can make a big difference. The GIL can also be a real problem if you are trying to scale inference across multiple incoming video streams, for example.
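To make the copy-control point concrete: even from Python you can pin down the host-to-device transfer to some degree. A minimal sketch, assuming TensorFlow 2.x in eager mode and a GPU visible as /GPU:0; the shapes and the standardization op are placeholder assumptions, not the commenter's actual pipeline:

```python
import numpy as np
import tensorflow as tf

# A single "frame" already decoded into host memory (placeholder data).
host_frame = np.random.rand(1, 224, 224, 3).astype(np.float32)

# The host-to-device copy happens here, at a point we chose explicitly,
# instead of implicitly inside a later predict() call.
with tf.device("/GPU:0"):
    device_frame = tf.constant(host_frame)

@tf.function
def preprocess(x):
    # Operates on data that is already resident in GPU memory.
    return tf.image.per_image_standardization(x)

result = preprocess(device_frame)
print(result.device, result.shape)
```

The point of the sketch is only that the transfer is done once, up front, rather than on every call; C++ still gives far finer-grained control over allocation and buffer reuse.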
Control is probably the main point. The Python interface makes things easy but doesn't offer enough control for my case. I tested it with a cut-down example (no video decoding, no funny stuff) and it all comes down to the batch size that is passed to model.predict. Large batches level out at around 10,000 fps depending on the GPU, while batch size 1 drops to around 200 fps regardless of the GPU. This tells me that some kind of overhead (hidden from me) is slowing things down. I guess I'd have to go much deeper into the internals of TF to find out more; so far I haven't, because it's a large time sink that would only improve performance in a part that is not super critical right now.
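A minimal sketch of that kind of measurement, calling model.predict once per chunk and comparing frames per second across batch sizes. The small Keras model, input shape, and frame count are arbitrary stand-ins, not the actual detector:

```python
import time

import numpy as np
import tensorflow as tf

# Small stand-in model; a real detector would be loaded here instead.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(224, 224, 3)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),
])

frames = np.random.rand(256, 224, 224, 3).astype(np.float32)

for batch_size in (1, 8, 64, 256):
    # Warm-up call so graph tracing and initial allocations are not counted.
    model.predict(frames[:batch_size], verbose=0)
    start = time.perf_counter()
    for i in range(0, len(frames), batch_size):
        model.predict(frames[i:i + batch_size], verbose=0)
    elapsed = time.perf_counter() - start
    print(f"batch_size={batch_size}: {len(frames) / elapsed:.0f} fps")
```

In a setup like this, part of the gap at batch size 1 is per-call overhead in predict itself (it sets up an input pipeline on each call); the Keras docs suggest calling the model directly, model(x, training=False), for small inputs, which may recover some of the lost throughput before diving into TF internals.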
The GIL and the general slowness of Python become a problem when processing multiple streams or doing further time-consuming calculations in Python.
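One common way around the GIL for the multi-stream case is one worker process per stream, each with its own interpreter and its own model instance. A rough sketch under those assumptions; the tiny model and the random "frame" are placeholders, not a real decoding pipeline:

```python
import multiprocessing as mp

import numpy as np


def stream_worker(stream_id, result_queue):
    # Import TF inside the worker so each process gets its own runtime
    # (and its own GIL) instead of sharing one interpreter across streams.
    import tensorflow as tf

    for gpu in tf.config.list_physical_devices("GPU"):
        # Several processes share one GPU here, so don't grab all memory up front.
        tf.config.experimental.set_memory_growth(gpu, True)

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(224, 224, 3)),
        tf.keras.layers.Conv2D(8, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(2),
    ])

    # Stand-in for a decoded video frame from this stream.
    frame = np.random.rand(1, 224, 224, 3).astype(np.float32)
    scores = model(frame, training=False)
    result_queue.put((stream_id, scores.shape.as_list()))


if __name__ == "__main__":
    queue = mp.Queue()
    workers = [mp.Process(target=stream_worker, args=(i, queue)) for i in range(4)]
    for w in workers:
        w.start()
    for _ in workers:
        print(queue.get())
    for w in workers:
        w.join()
```

Whether this wins in practice depends on GPU memory and on how frames reach the workers; shipping decoded frames through queues can easily become the next bottleneck.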
It depends. For example, if you are moving data from memory into a Python data structure and then sending it to the GPU, you will have a huge performance bottleneck just loading the data into Python.
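A minimal sketch of that effect, comparing a contiguous NumPy buffer with the same pixels held as nested Python lists before conversion to a tensor; the frame size is an arbitrary assumption:

```python
import time

import numpy as np
import tensorflow as tf

frame_np = np.random.rand(720, 1280, 3).astype(np.float32)
frame_list = frame_np.tolist()  # same pixel values, but as nested Python lists

start = time.perf_counter()
t_from_numpy = tf.convert_to_tensor(frame_np)   # one contiguous buffer copy
numpy_ms = (time.perf_counter() - start) * 1e3

start = time.perf_counter()
t_from_list = tf.convert_to_tensor(frame_list)  # walks ~2.7 million Python floats
list_ms = (time.perf_counter() - start) * 1e3

print(f"from NumPy array: {numpy_ms:.1f} ms, from nested lists: {list_ms:.1f} ms")
```

The list path has to touch millions of individual Python objects, which is exactly the kind of hidden cost that disappears when the data never passes through Python data structures at all.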