I've been trying to coax better performance out of a Jetson Nano camera, current...

briggers · on Oct 10, 2020

Author here. As other commenters are saying, the PyTorch JIT and TorchScript might be your friends here.

Alternatively, there are some quite fast OSS libraries for object detection. Nvidia's retinanet will export to a TensorRT engine which can be used with DeepStream.

t-vi · on Oct 11, 2020

Quite often, people seem to overestimate the performance overhead Python brings: one can take the PyTorch C++ extension example (LLTM) and create a one-to-one LibTorch implementation to see a speedup of roughly 10%.

But Paul's situation is multithreaded, and his analysis has numbers that seem to indicate that something is up with the GIL. We know this is a limitation in multithreaded PyTorch, because any Tensor creation at the Python level needs the GIL and these models typically create quite a few of them.
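
To make the GIL point concrete, here is a toy sketch, assuming a standard CPython build with the GIL enabled. It does not use PyTorch: `make_small_objects` is a hypothetical pure-Python stand-in for the per-op, Python-level work (such as creating Tensor objects) that has to hold the GIL, and the task counts are arbitrary. On such a build, the threaded run would typically take about as long as the serial one rather than scaling with the number of threads.

```python
import threading
import time


def make_small_objects(n):
    """Hypothetical stand-in for Python-level work that holds the GIL."""
    total = 0
    for i in range(n):
        total += len([i, i + 1])  # allocate a small object on every step
    return total


def run_serial(n_tasks, n):
    """Run the workload n_tasks times, one after another."""
    return [make_small_objects(n) for _ in range(n_tasks)]


def run_threaded(n_tasks, n):
    """Run the same workload in n_tasks threads that contend for the GIL."""
    results = [None] * n_tasks

    def worker(idx):
        results[idx] = make_small_objects(n)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - start


if __name__ == "__main__":
    _, serial_s = timed(run_serial, 4, 200_000)
    _, threaded_s = timed(run_threaded, 4, 200_000)
    print(f"serial:   {serial_s:.3f}s")
    print(f"threaded: {threaded_s:.3f}s")
```

Work that releases the GIL (large native kernels, I/O) does overlap across threads, which is why the number of small Python-level operations matters here more than the total amount of compute.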

It's always easier to know what the performance impact of something is when you have an experiment that removes it. Maybe using the JIT or moving things to C++ gives us that; I look forward to seeing a sequel.

The advantage of involving something like TensorRT or TVM is that they'll apply holistic optimizations - they may eliminate writing to memory and reading back (which would not show up as an underutilized GPU, but can be a big win, see e.g. the LSTM speedup with the PyTorch JIT fuser). The current disadvantage of TVM is that it is a bit of an all-or-nothing affair, so you can't give it a JITed model and say "optimize the bits that you can do well". TensorRT with TRTorch is a bit ahead there.
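
As a rough illustration of what "eliminate writing to memory and reading back" means, here is a plain-Python analogy. It is only a conceptual sketch, not how TensorRT, TVM, or the PyTorch JIT fuser are implemented, and the function names and the scale/shift/ReLU chain are made up for the example. The unfused version materializes a full intermediate result after every elementwise step, while the fused version computes the same values in a single pass.

```python
def unfused(xs):
    """Three elementwise steps, each writing a full intermediate list."""
    scaled = [x * 2.0 for x in xs]          # written out, then read back
    shifted = [s + 1.0 for s in scaled]     # written out, then read back
    return [max(v, 0.0) for v in shifted]   # ReLU-style clamp


def fused(xs):
    """Same arithmetic in one pass, with no intermediate lists."""
    return [max(x * 2.0 + 1.0, 0.0) for x in xs]


if __name__ == "__main__":
    data = [-1.0, 0.0, 2.5]
    assert unfused(data) == fused(data)
    print(fused(data))
```

On a GPU the intermediates would be round trips through device memory between separate kernels, so the kernels can all look busy while most of the time goes to memory traffic.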

Of course, PyTorch itself is getting better too, with the new profiling executor and new fusers for the PyTorch JIT, so we might hope that you can have good perf for more workloads with just PyTorch.

ilaksh · on Oct 10, 2020

I wonder if the tool used in the article can be applied?

Seems like Xavier NX is more realistic for my needs right now personally though. Of course it's much more expensive etc.