This is such a great post. It really shows how much room for improvement there is in all released deep learning code. Almost none of the open source work is really production ready for fast inference, and tuning the systems requires a good working knowledge of the GPU.
The article does skip the most important step for getting great inference speeds: Drop Python and move fully into C++.
I'd alter your conclusion that open source work isn't production ready. As long as it works as described, it is production ready for at least some subset of use cases. There's just a lot of low hanging fruit re: performance improvement.
It's entirely valid to trade off performance for a more straightforward design or reduced development time, and just throw hardware at the problem as needed... companies do it all of the time.
All the deep learning libraries are Python wrappers around C/C++ (which then call into CUDA). If you call the C++ layers directly, you have control over the memory operations applied to your data. The biggest wins come from reducing the number of copies, reducing the number of transfers between CPU and GPU memory, and speeding up operations by moving them from the CPU to the GPU (or vice versa).
This is basically what the article does, but if you want to squeeze out all the performance, the Python layer is still an abstraction that gets in the way of directly choosing what happens to the memory.
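To illustrate what I mean (a rough sketch in Python rather than C++, assuming a PyTorch-style pipeline), the same principle applies even before you drop down a level: keep the data on the GPU and avoid round-trips.

```python
import torch

device = torch.device("cuda")

# Hypothetical preprocessing step: normalize frames before inference.
# Slow pattern: float copy on the CPU, transfer, compute, then a needless transfer back.
def preprocess_with_roundtrips(frame_uint8):
    x = frame_uint8.float() / 255.0   # CPU float copy
    x = x.to(device)                  # CPU -> GPU transfer
    x = (x - 0.5) / 0.25              # GPU compute
    return x.cpu()                    # needless GPU -> CPU transfer

# Better: one pinned-memory upload, all math stays on the GPU until the model is done.
def preprocess_on_gpu(frame_uint8):
    x = frame_uint8.pin_memory().to(device, non_blocking=True)  # async host-to-device copy
    x = x.float().div_(255.0).sub_(0.5).div_(0.25)              # in-place ops on the GPU
    return x                                                     # stays on GPU for the model
```

Dropping to C++ gives you this control everywhere, but counting copies and transfers pays off in either language.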
There are lots of cases where people use e.g. ROS on robots and Python to do inferences, which basically converts a ROS binary image message data into a Python list of bytes (ugh), then convert that into numpy (ugh), and then feed that into TensorFlow to do inferences. This pipeline is extremely sub-optimal, but it's what most people probably do.
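For what it's worth, a chunk of that particular pipeline can be fixed without leaving Python. A rough sketch (assuming a sensor_msgs/Image-style message with `data`, `height` and `width` fields and a 3-channel encoding):

```python
import numpy as np

def ros_image_to_numpy(msg):
    # Bad: list(msg.data) materialises a Python list of ints, one object per byte.
    # Better: view the raw buffer directly (zero-copy), then reshape.
    frame = np.frombuffer(msg.data, dtype=np.uint8)
    return frame.reshape(msg.height, msg.width, 3)
```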
All because nobody has really provided off the shelf usable deployment libraries. That Bazel stuff if you want to use the C++ API? Big nope. Way too cumbersome. You're trying to move from Python to C++ and they want you to install ... Java? WTF?
Also, some of the best neural net research out there has you run "./run_inference.sh" or hands you some abomination of a Jupyter notebook instead of an installable, deployable library. To their credit, good neural net engineers aren't expected to be good software engineers, but I'm just pointing out that there's a big gap between good neural nets and deployable neural nets.
I could see this working for the evaluation which basically just glues OpenCV video reading with Tensorflow to extract a handful of parameters per frame. The rest could stay in Python.
Do you have experience with how single-frame processing compares between Python and C++? I see that batched processing in Python gives me a huge speed boost, which hints at inefficiencies at some point, but I don't know if those are related to Python, TensorFlow or CUDA itself. (Or just bad resource management that requires re-initialization of some costly things in between evaluations.)
The fact that batching is faster does not inherently imply some sort of inefficiency, but rather is indicative of the fact that sequential memory access is faster than random.
I am curious what the basis behind the idea that Python is the performance bottleneck for inference is.
It's not that Python is by definition much slower than C++; rather, doing inference in C++ makes it much easier to control exactly when memory is initialised, copied and moved between CPU and GPU. Especially for frame-by-frame models like object detection this can make a big difference. Also, the GIL can be a real problem if you are trying to scale inference across multiple incoming video streams, for example.
Control is probably the main point. The Python interface makes things easy but doesn't offer enough control for my case. I tested it with a cut-down example (no video decoding, no funny stuff) and it all comes down to the batch size passed to model.predict. Large batches level out at around 10000 fps depending on the GPU, while batch size 1 drops to 200 fps independent of the GPU. This tells me that some kind of overhead (hidden from me) is slowing things down. I guess I'd have to go much deeper into the internals of TF to find out more; so far I haven't, because it's a large time sink that only offers better performance in a part that is not super critical right now.
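For reference, a minimal benchmark of the kind I'm describing (a sketch assuming TensorFlow 2 / Keras and a toy CNN; the exact numbers will obviously differ per GPU):

```python
import time
import numpy as np
import tensorflow as tf

# Toy CNN just to measure per-call overhead vs batch size.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(64, 64, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),
])

def fps(batch_size, iterations=50):
    frames = np.random.rand(batch_size, 64, 64, 3).astype(np.float32)
    model.predict(frames, batch_size=batch_size, verbose=0)   # warm-up / graph build
    start = time.perf_counter()
    for _ in range(iterations):
        model.predict(frames, batch_size=batch_size, verbose=0)
    elapsed = time.perf_counter() - start
    return batch_size * iterations / elapsed

print("batch 1:  ", fps(1), "fps")
print("batch 256:", fps(256), "fps")
```

The per-call overhead of model.predict dominates at batch size 1, which is exactly the behaviour I'm seeing.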
The GIL and slowness of Python become a problem when processing multiple streams or doing further time consuming calculations in Python.
It depends, e.g. if you are moving data from memory into a Python data structure and then sending it to the GPU you will have a huge performance bottleneck in loading the data into Python.
Done that since 2015. You can look at https://github.com/jolibrain/deepdetect. C++ doesn't sound ideal to many, but when your target is production it's pretty powerful, and since C++11 it's probably much more comfortable than most non-practitioners think. For deep learning it is excellent for bare metal and fitting into industrial applications. Never looked back. For R&D (GANs, flows, RL, ...) Python remains easier to play with.
Funny how blaming the GIL for being a bottleneck is the least researched part of the article, with no before/after performance measurement to back it up. Everyone loves to hate the GIL. Maybe there should be T-shirts made for the C++-loving folks out there.
To me, seeing the GIL held for 40% of time and significant time spent waiting on GIL by other threads was a fairly strong indicator. Keen to hear your thoughts/experience on it.
> The solution to Python’s GIL bottleneck is not some trick, it is to stop using Python for data-path code.
At least for the PyTorch bits of it, using the PyTorch JIT works well. When you run PyTorch code through Python, the intermediate results will be created as Python objects (with GIL and all) while when you run it in TorchScript, the intermediates will only be in C++ PyTorch Tensors, all without the GIL.
We have a small comment about it in our PyTorch book in the section on what improvements to expect from the PyTorch JIT and it seems rather relevant in practice.
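For anyone curious, the gist is roughly this (a minimal sketch; how much it buys you depends heavily on the model):

```python
import torch

class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(128, 128)
        self.fc2 = torch.nn.Linear(128, 10)

    def forward(self, x):
        # In eager mode each intermediate below becomes a Python object,
        # which requires holding the GIL. In TorchScript the whole forward
        # runs in C++ and the intermediates are plain C++ tensors.
        h = torch.relu(self.fc1(x))
        return self.fc2(h)

scripted = torch.jit.script(TinyModel())     # compile the module to TorchScript
out = scripted(torch.randn(32, 128))          # runs with no Python in the inner loop
```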
The JIT is hands down the best feature of PyTorch, especially compared to the somewhat neglected suite of native inference tools for TensorFlow. Just recently I was trying to get a TensorFlow 2 model to work nicely in C++. Basically, the external API for TensorFlow is the C API, but it does not have proper support for `SavedModel` yet. Linking to the C++ library is a pain, and neither of them can do eager execution at all if you have a model trained in Python code :(
PyTorch will happily let you export your model, even with Python code in it, and run it in C++ :)
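The Python side of that export is only a couple of lines (a sketch, assuming a torchvision model for illustration; the C++ side would load the file with torch::jit::load):

```python
import torch
import torchvision

model = torchvision.models.resnet18(pretrained=True).eval()
example = torch.randn(1, 3, 224, 224)

traced = torch.jit.trace(model, example)   # or torch.jit.script if the model has control flow
traced.save("resnet18.pt")                  # load in C++ with torch::jit::load("resnet18.pt")
```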
The underappreciated (in my view/experience) part is that it also gets rid of a lot of GIL contention when used from Python, because the JIT-compiled part doesn't use Python anymore.
When you have multithreaded setups, this typically is more significant than the Python overhead itself (which comes in at 10% for the PyTorch C++ extension LLTM example, but would be less for convnets).
How do you keep track of the shutter clock in this kind of system? For example the camera clocks at 60fps, but the image processing is a few frames late, the gyroscope clocks at 4kHz, the accelerometer way slower, lidar is a slug, etc. Then you have to get all that stuff in your kalman filter to estimate the state and the central question is: “when did you collect this data?” I guess “no clue it comes from USB then disappeared into a GPU pipeline” is not a scientifically sound answer, you want to know if it goes before or after sample no 3864 of the gyroscope.
Long story short, that’s good, you’ve used a neural net to avoid using a human or an animal as a pose estimation datum, how do you correlate that to the rest of the sensor suite?
I've been trying to coax better performance out of a Jetson Nano camera, currently using Python's OpenCV lib with some threading, and can only manage at best about 29 fps.
I would love an alternative that is reasonably simple to implement. I dislike having to handle raw bits.
Author here. As other commenters are saying, the PyTorch JIT and TorchScript might be your friend here.
Alternatively, there are some quite fast OSS libraries for object detection. Nvidia's RetinaNet will export to a TensorRT engine which can be used with DeepStream.
Quite often, people seem to overestimate the performance overhead Python brings (one can take the PyTorch C++ extension example (LLTM) and create a 1-1 LibTorch implementation to see a ~10% speedup or so).
But Paul's situation is multithreaded and his analysis has numbers that seem to indicate that something is up with the GIL. We know this is a limitation in multithreaded PyTorch, because any Tensor creation at the Python level needs the GIL and these models typically create quite a few of them.
It's always easier to know what the performance impact of something is when you have an experiment that removes it. Maybe using the JIT or moving things to C++ gives us that; I look forward to seeing a sequel.
The advantage of involving something like TensorRT or TVM is that they'll apply holistic optimizations: they may eliminate writing to memory and reading back (which would not show up as an underutilized GPU, but can be a big win, see e.g. the LSTM speedup with the PyTorch JIT fuser). The current disadvantage of TVM is that it's a bit of an all-or-nothing affair, so you can't give it a JITed model and say "optimize the bits that you can do well". TensorRT with TRTorch is a bit ahead there.
Of course, PyTorch itself is getting better too, with the new profiling executor and new fusers for the PyTorch JIT, so we might hope that you can have good perf for more workloads with just PyTorch.
Good job digging into all of this Paul! At my company (onspecta.com) we solve similar problems (and more!), accelerating AI/deep learning/computer vision workloads across CPUs, GPUs, and other types of chips.
This is a fascinating space, and there are tons of speed-up opportunities. Depending on the type of workload you're running, you might be able to ditch the GPU entirely and run everything on the CPU, greatly reducing cost & deployment complexity. Or, at the very least, improve SLAs and decrease the GPU (or CPU) cost 10x.
I've seen this over and over again. Glad someone's documenting this publicly :-) If any one of you readers have more questions about this I'm happy to discuss in the comments here. Or you can reach out to me at victor at onspecta dot com.
I think this is a great explanation. Are these kinds of manual optimisations still needed when using the higher-level frameworks? Or at least those should make it clear in the types when a pipeline moves from CPU to GPU and vice versa.
How would one accelerate object tracking on a video stream where each frame depends on the result of the previous one? Batching and multi-threading doesn't work here.
Are there some CNN-libraries that have way less overhead for small batch sizes? Tensorflow (GPU accelerated) seems to go down from 10000 fps on large batches to 200 fps for single frames for a small CNN.
It depends on the algorithm you're using, but here are some places to start:
1. How many times is the data being copied, or moved between devices?
2. Are you recomputing data from previous frames that you could just be saving? For example, some tracking algorithms apply the same CNN tower to the last 3-5 images, and you could just save the results from the last frame instead of recomputing (see the sketch after this list). Of course, you also want to follow hint #1 and keep these results on the GPU.
3. Change the algorithm or network you're using.
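A rough sketch of hint #2 (hypothetical names, assuming a PyTorch-style feature extractor), keeping the cached per-frame features on the GPU:

```python
import torch

device = torch.device("cuda")

class FeatureCache:
    """Keep the last N frames' CNN features on the GPU instead of recomputing them."""

    def __init__(self, backbone, window=5):
        self.backbone = backbone.to(device).eval()
        self.window = window
        self.features = []                    # list of GPU tensors, newest last

    @torch.no_grad()
    def push(self, frame):                    # frame: (3, H, W) float tensor already on the GPU
        feats = self.backbone(frame.unsqueeze(0))
        self.features.append(feats)
        self.features = self.features[-self.window:]
        return self.features                  # everything stays on the GPU
```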
Really you should read the original article carefully. The article is showing you the steps for profiling what part of the runtime is slow. Typically, once you profile a little you'll be surprised to find that time is being wasted somewhere unexpected.
Great point - dependencies between frames are inherently problematic for many of these techniques.
Everything lostdog says. I've had experience speeding up tracking immensely using the same big hammer I talk about in the article - moving the larger parts of tracking compute to GPU.
Also, in a tracking pipeline you'll generally have the big compute on pixels done up front. Object detection and ReID take the bulk of the compute and can be easily batched and run in parallel. The results (metadata) can then be fed into a more serial process (but still doing the N<->N ReID comparisons on GPU).
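For example, the N-by-N ReID comparison reduces to a single matrix multiply on the GPU (a sketch, assuming L2-normalised embedding tensors):

```python
import torch

def reid_similarity(track_emb, det_emb):
    """Cosine similarity between all track and all detection embeddings.

    track_emb: (N, D) GPU tensor, det_emb: (M, D) GPU tensor, both L2-normalised.
    Returns an (N, M) similarity matrix, computed entirely on the GPU.
    """
    return track_emb @ det_emb.t()

# The (typically tiny) assignment step can then run serially on the CPU:
# matches = reid_similarity(tracks, detections).argmax(dim=1).cpu()
```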
I can't attest to the usefulness of PyTorch's multiprocessing module, but using Python's multiprocessing module feels like low-level programming (serializing, packing and unpacking data structures, etc., where you'd hope the environment would handle it for you).
I found python multiprocessing to work well to parallelize deep learning data loading and preprocessing, because all I needed to communicate was a couple of tensors which are easy to allocate in shared memory. I didn't need complex data structures or synchronization.
Processing separate video streams works well with separate processes. There is some cost related to starting the other processes and sometimes libraries may stumble (e.g. several instances of ML libraries allocating all the GPU memory) but once it's running it's literally two separate processes that can do their work independently.
Multiprocessing could be a pain if you need to pass frames of a single video stream. Traditionally you'd need to pickle/unpickle them to pass them between processes.
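One way around the pickling cost (a sketch using Python 3.8+ `multiprocessing.shared_memory` and a hypothetical frame size) is to keep the frame buffer in shared memory and only pass the buffer name and shape between processes:

```python
import numpy as np
from multiprocessing import shared_memory

H, W = 1080, 1920  # hypothetical frame size

# Producer process: allocate one shared buffer and write decoded frames into it.
shm = shared_memory.SharedMemory(create=True, size=H * W * 3)
frame = np.ndarray((H, W, 3), dtype=np.uint8, buffer=shm.buf)
frame[:] = 0  # write the decoded frame here

# Consumer (normally in another process, shown inline for brevity):
# attach by name, no copy and no pickling of the pixels themselves.
shm2 = shared_memory.SharedMemory(name=shm.name)
view = np.ndarray((H, W, 3), dtype=np.uint8, buffer=shm2.buf)
```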
Yes, I have done non-trivial implementations of a number of SoTA models in Julia. The framework I've used is Flux[1], which I love for its simplicity; it is very much like the DarkNet[2] framework in that regard, which is refreshing after using TensorFlow. PyTorch is much better about not having unnecessary complexity and has a sensible API, but Flux is certainly better.
The ability for Julia to compile directly to PTX assembly[3][4] means that you can even write the GPU kernels in Julia and eliminate the C/C++ CUDA code. Unfortunately, there is still a lot of work to be done to make it as reliably fast and easy as TensorFlow/PyTorch so I don't think it is usable for production yet.
I hope it will be production ready soon, but it will likely take some time to highly tune the compute stacks. They are already working on AMD GPU support with AMDGPU.jl[5], and with the latest NVIDIA GPU release, which IMHO has purposefully decreased performance (onboard RAM, power) for scientific compute applications, I would love to be able to develop on my AMD GPU workstation and deploy easily on whatever infrastructure in the same language.
I do have some gripes with Julia but the biggest of them are mostly cosmetic.
Has any company tried putting the GPU and CPU in the same chip, sharing the same data caches? That could greatly increase the performance of the CPU-GPU data transfers.