I've been trying to coax better performance out of a Jetson Nano camera, currently using Python's OpenCV library with some threading, and can only manage about 29 fps at best.
I would love an alternative that is reasonably simple to implement. I dislike having to handle raw bits.
Author here. As other commenters are saying, the PyTorch JIT and TorchScript might be your friend here.
Alternatively, there are some quite fast OSS libraries for object detection. NVIDIA's retinanet will export to a TensorRT engine, which can be used with DeepStream.
Quite often, people seem to overestimate the performance overhead Python brings: one can take the PyTorch C++ extension example (LLTM), create a 1:1 LibTorch implementation, and see a ~10% speedup or so.
But Paul's situation is multithreaded, and his analysis has numbers that seem to indicate that something is up with the GIL. We know this is a limitation in multithreaded PyTorch, because any Tensor creation at the Python level needs the GIL and these models typically create quite a few of them.
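To make the GIL point concrete, here is a minimal sketch using only the standard library. It assumes a standard (GIL-enabled) CPython build, and the function names and workload size are made up for illustration; it has nothing to do with Paul's actual code or with PyTorch internals. It simply runs the same pure-Python work sequentially and then across threads, so you can compare the two wall-clock times yourself.

```python
import threading
import time


def busy_python_work(n: int) -> int:
    """Pure-Python CPU-bound loop; the thread running it holds the GIL."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_sequential(n: int, workers: int) -> list:
    """Run the workload `workers` times, one after another."""
    return [busy_python_work(n) for _ in range(workers)]


def run_threaded(n: int, workers: int) -> list:
    """Run the same workload on `workers` threads at once."""
    results = [0] * workers

    def worker(idx: int) -> None:
        results[idx] = busy_python_work(n)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - start


if __name__ == "__main__":
    n, workers = 1_000_000, 4  # hypothetical sizes, tune for your machine
    _, seq_s = timed(run_sequential, n, workers)
    _, thr_s = timed(run_threaded, n, workers)
    print(f"sequential: {seq_s:.2f}s  threaded: {thr_s:.2f}s")
```

If the threaded time comes out close to the sequential one rather than several times smaller, the threads are mostly waiting on each other for the interpreter lock. Work that releases the GIL (large native kernels, I/O) behaves differently, which is why the amount of Python-level bookkeeping per step matters.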
It's always easier to know what the performance impact of something is when you have an experiment removing the bits. Maybe using the JIT or moving things to C++ gives us that; I look forward to seeing a sequel.
The advantage of involving something like TensorRT or TVM is that they'll apply holistic optimizations: they may eliminate writing to memory and reading back (which would not show up as an underutilized GPU, but can be a big win; see e.g. the LSTM speedup with the PyTorch JIT fuser). The current disadvantage of TVM is that it is a bit of an all-or-nothing affair, so you can't give it a JITed model and say "optimize the bits that you can do well". TensorRT with TRTorch is a bit ahead there.
Of course, PyTorch itself is getting better too, with the new profiling executor and new fusers for the PyTorch JIT, so we might hope that you can have good perf for more workloads with just PyTorch.