All the deep learning libraries are Python wrappers around C/C++ cores (which in turn call into CUDA). If you call the C++ layers directly, you have control over the memory operations applied to your data. The biggest wins come from reducing the number of copies, reducing the number of transfers between CPU and GPU memory, and speeding up operations by moving them from the CPU to the GPU (or vice versa).
This is basically what the article does, but if you want to squeeze out all the performance, the Python layer is still an abstraction that gets in the way of directly choosing what happens to the memory.
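As a rough sketch of what that looks like from the Python/TensorFlow side (the model and shapes below are placeholders, not anything from the article): every detour through numpy forces a device-to-host copy, while keeping the whole chain in TF ops inside one tf.function lets the intermediates stay in GPU memory.

    import numpy as np
    import tensorflow as tf

    # Placeholder network; swap in whatever model you actually deploy.
    model = tf.keras.applications.MobileNetV2(weights=None)

    # Sub-optimal: the .numpy() call forces a device-to-host copy (when running
    # on a GPU) and the tf.constant() at the end copies the data back again.
    def infer_with_round_trips(frame_u8):
        x = tf.image.resize(tf.cast(frame_u8, tf.float32), (224, 224)) / 255.0
        x = x.numpy()                         # device -> host copy
        x = np.expand_dims(x, 0)              # host-side CPU work
        return model(tf.constant(x))          # host -> device copy again

    # Better: keep preprocessing and the forward pass in TF ops inside one
    # compiled function, so intermediates can stay in GPU memory.
    @tf.function
    def infer_on_device(frame_u8):
        x = tf.image.resize(tf.cast(frame_u8, tf.float32), (224, 224)) / 255.0
        return model(tf.expand_dims(x, 0), training=False)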
There are lots of cases where people use e.g. ROS on robots and Python to do inference, which basically means converting a ROS binary image message's data into a Python list of bytes (ugh), then converting that into a numpy array (ugh), and then feeding that into TensorFlow to run inference. This pipeline is extremely sub-optimal, but it's probably what most people do.
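A minimal sketch of that first conversion, assuming a standard sensor_msgs/Image with a tightly packed 3-channel encoding (the function names are just illustrative):

    import numpy as np

    def image_msg_to_numpy_slow(msg):
        # The "ugh" version: materialises every byte as a Python int object.
        pixels = list(msg.data)                    # copy #1, very slow
        arr = np.array(pixels, dtype=np.uint8)     # copy #2
        return arr.reshape(msg.height, msg.width, 3)

    def image_msg_to_numpy_fast(msg):
        # Zero-copy view over the message buffer; only shape metadata changes.
        arr = np.frombuffer(msg.data, dtype=np.uint8)
        return arr.reshape(msg.height, msg.width, 3)

Even with the fast path, TensorFlow may still make its own copy when it builds the input tensor, which is exactly the kind of thing you can only really control from the C++ side.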
All because nobody has really provided usable, off-the-shelf deployment libraries. That Bazel stuff if you want to use the C++ API? Big nope. Way too cumbersome. You're trying to move from Python to C++ and they want you to install ... Java? WTF?
Also, some of the best neural net research out there has you run "./run_inference.sh" or some abomination of a Jupyter notebook instead of shipping an installable, deployable library. To be fair, good neural net engineers aren't expected to be good software engineers, but I'm just pointing out that there's a big gap between good neural nets and deployable neural nets.
I could see this working for the evaluation step, which basically just glues OpenCV video reading to TensorFlow to extract a handful of parameters per frame. The rest could stay in Python.
Do you have experience with how single-frame processing compares between Python and C++? I see that batched processing in Python gives me a huge speed boost, which hints at inefficiencies somewhere, but I don't know if those are related to Python, TensorFlow or CUDA itself. (Or just bad resource management that requires re-initialization of some costly things between evaluations.)
The fact that batching is faster does not inherently imply some sort of inefficiency, but rather is indicative of the fact that sequential memory access is faster than random.
I am curious what the basis is for the idea that Python is the performance bottleneck for inference.
It's not that Python is by definition much slower than C++; rather, doing inference in C++ makes it much easier to control exactly when memory is initialised, copied and moved between CPU and GPU. Especially for frame-by-frame models like object detection, this can make a big difference. Also, the GIL can be a real problem if you are trying to scale inference across multiple incoming video streams, for example.
Control is probably the main point. The Python interface makes things easy but doesn't offer enough control for my case. I tested it with a cut-down example (no video decoding, no funny stuff) and it all comes down to the batch size that is passed to model.predict. Large batches level out at around 10000 fps depending on the GPU, while batch size 1 drops to about 200 fps independent of the GPU. This tells me that some kind of overhead (hidden from me) is slowing things down. I guess I'd have to go much deeper into the TF internals to find out more; so far I haven't, because it's a big time sink that would only improve performance in a part that isn't super critical right now.
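For what it's worth, part of that fixed per-call cost sits in Keras's predict machinery itself; the Keras docs recommend calling the model directly for single small batches. A rough way to see where the overhead lives (model and sizes below are placeholders):

    import time
    import numpy as np
    import tensorflow as tf

    # Placeholder model; substitute the network you are actually benchmarking.
    model = tf.keras.applications.MobileNetV2(weights=None)

    @tf.function
    def direct_call(x):
        # Direct __call__ skips the per-invocation setup that model.predict()
        # repeats every time (data adapters, callbacks, result collection).
        return model(x, training=False)

    predict = lambda x: model.predict(x, verbose=0)

    frame = np.random.rand(1, 224, 224, 3).astype(np.float32)
    batch = np.random.rand(64, 224, 224, 3).astype(np.float32)

    def fps(fn, x, n=50):
        fn(x)  # warm-up (and graph tracing for the tf.function path)
        t0 = time.perf_counter()
        for _ in range(n):
            fn(x)
        return n * x.shape[0] / (time.perf_counter() - t0)

    print("predict, batch=1 :", fps(predict, frame))
    print("__call__, batch=1:", fps(direct_call, frame))
    print("predict, batch=64:", fps(predict, batch))

If the direct call at batch size 1 is still far below the batched numbers, much of the remaining gap is likely GPU under-utilisation and per-launch overhead at batch size 1, and that part doesn't go away in C++ either.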
The GIL and the slowness of Python become a problem when processing multiple streams or doing further time-consuming calculations in Python.
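A minimal sketch of the multi-stream case, assuming the per-frame Python post-processing is the CPU-bound part: threads would all contend for the GIL, so one process per stream is the usual way out.

    from concurrent.futures import ProcessPoolExecutor

    def heavy_postprocess(detections):
        # Stand-in for pure-Python, CPU-bound work (tracking, filtering, ...).
        return sorted(detections)

    def handle_stream(stream_id):
        # In a real setup each worker would open its own video source and load
        # its own model; that is omitted here for brevity.
        results = []
        for frame_idx in range(100):
            fake_detections = [(frame_idx * 31 + k) % 97 for k in range(50)]
            results.append(heavy_postprocess(fake_detections))
        return stream_id, len(results)

    if __name__ == "__main__":
        # Separate processes each get their own interpreter and their own GIL.
        with ProcessPoolExecutor(max_workers=4) as pool:
            for stream_id, n_frames in pool.map(handle_stream, range(4)):
                print(f"stream {stream_id}: processed {n_frames} frames")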
It depends. If, for example, you are moving data from memory into a Python data structure and then sending it to the GPU, loading the data into Python will be a huge performance bottleneck.
Done that since 2015. You can look at https://github.com/jolibrain/deepdetect. C++ doesn't sound ideal to many, but when your target is production it's pretty powerful, and since C++11 it is probably much more comfortable than most non-practitioners think. For deep learning it is excellent on bare metal and for fitting into industrial applications. Never looked back. For R&D (GANs, flows, RL, ...) Python remains easier to play with.
Do you have any experience with that?