In terms of accuracy, the authors themselves mention it as a limitation, so it could well be a problem.
In terms of runtime, it should not matter. Generally speaking though, the overhead of optical flow is often overlooked. For video DL applications, computing optical flow often takes more time than inference itself. In academic settings, datasets are usually preprocessed ahead of time and the optical flow runtime goes unmentioned. Doing real-time video analysis with optical flow is quite impractical, though.
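To get a rough sense of the imbalance, a minimal timing sketch along these lines works; the 224x224 resolution and the ResNet-18 stand-in are placeholders, not anything from a real pipeline:

```python
# Rough sketch: time dense Farneback optical flow on one frame pair vs. one
# forward pass of a small CNN. Resolution and model are arbitrary stand-ins.
import time
import cv2
import numpy as np
import torch
import torchvision

prev_gray = np.random.randint(0, 255, (224, 224), dtype=np.uint8)
curr_gray = np.random.randint(0, 255, (224, 224), dtype=np.uint8)

t0 = time.time()
flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)
t_flow = time.time() - t0

model = torchvision.models.resnet18().eval()  # stand-in for the video model
with torch.no_grad():
    t0 = time.time()
    model(torch.randn(1, 3, 224, 224))
    t_infer = time.time() - t0

print(f"optical flow: {t_flow * 1000:.1f} ms, inference: {t_infer * 1000:.1f} ms")
```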
I couldn't agree more, especially with the latter part. I've worked on action recognition with I3D for over a year now, and found that seemingly equivalent implementations in Keras, TensorFlow 2 or PyTorch will produce wildly different results. Worse yet, I found a bunch of papers claiming SOTA by comparing against one of those non-original implementations, with margins of just a few percentage points. It makes no sense! It took me hundreds of hours to hunt down the differences in how these frameworks implement their layers before I could come even close to the expected accuracy...
This is very cool, I’ll be studying your implementation of I3D. Did you ever attempt to train I3D end-to-end as done in the Quo Vadis paper? And if so, did you get comparable Top1/Top5 accuracy?
The single RGB stream top1 goes up to 73.48% with resnet50, and up to 74.71% equipped with non-local. Both are much higher than the original paper with two-streams.
The JIT is hands down the best feature of PyTorch, especially compared to the somewhat neglected suite of native inference tools for TensorFlow. Just recently I was trying to get a TensorFlow 2 model to work nicely in C++. Basically, the external API for TensorFlow is the C API, but it doesn't have proper support for `SavedModel` yet. Linking to the C++ library is a pain, and neither of them can do eager execution at all if your model was trained from Python code :(
PyTorch will happily let you export your model, even with Python code in it, and run it in C++ :)
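A minimal sketch of the export path I mean (the model and file name here are just placeholders); the saved file can then be loaded from C++ with `torch::jit::load` and called via `module.forward(...)`:

```python
# Minimal TorchScript export sketch; model and path are placeholders.
import torch
import torchvision

model = torchvision.models.resnet18().eval()
scripted = torch.jit.script(model)    # or torch.jit.trace(model, example_input)
scripted.save("model_scripted.pt")    # later: torch::jit::load("model_scripted.pt") in C++
```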
The underappreciated (in my view/experience) part is that it also gets rid of a lot of GIL contention when used from Python, because the JITed part doesn't touch Python anymore and can run with the GIL released.
In multithreaded setups this is typically more significant than the Python overhead itself (which comes in at 10% for the PyTorch C++ extension LLTM example, but would be less for convnets).
It's not that Python is by definition much slower than C++; rather, doing inference in C++ makes it much easier to control exactly when memory is initialised, copied and moved between CPU and GPU. Especially on frame-by-frame models like object detection this can make a big difference. Also, the GIL can be a real problem if you are trying to scale inference across multiple incoming video streams, for example.
Control is probably the main point. The Python interface makes things easy but doesn't offer enough control for my case. I tested it with a cut-down example (no video decoding, no funny stuff) and it all comes down to the batch size that is passed to `model.predict`. Large batches level out at around 10,000 fps depending on the GPU, while batch size 1 drops to 200 fps regardless of the GPU. This tells me that some kind of overhead (hidden from me) is slowing things down. I guess I'd have to go much deeper into the internals of TF to find out more - so far I haven't, because it's a large time sink that only offers better performance in a part that is not super critical right now.
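The benchmark boiled down to something like this (the model and array sizes below are stand-ins, not my actual setup):

```python
# Sketch of the batch-size benchmark: same frames, only the batch size passed
# to model.predict changes. MobileNetV2 is a stand-in for the real model.
import time
import numpy as np
import tensorflow as tf

model = tf.keras.applications.MobileNetV2(weights=None)
frames = np.random.rand(512, 224, 224, 3).astype("float32")

for bs in (1, 32, 256):
    start = time.time()
    model.predict(frames, batch_size=bs, verbose=0)
    fps = len(frames) / (time.time() - start)
    print(f"batch_size={bs}: {fps:.0f} fps")
```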
The GIL and slowness of Python become a problem when processing multiple streams or doing further time consuming calculations in Python.
> The current hardware floor is nearer to the RTX 2080 TI's $1k/unit for 125 tensor-core TFLOPS, and that gives you $25/pflops-d.
It's definitely true that the RTX 2080 Ti would be more efficient money-wise, but the Tensor Cores are not going to get you the advertised speedup. Those speedups can only be reached in ideal circumstances.
Nevertheless, the article as a whole makes a very good point. The thing that is most scary about this is that it would become very hard for new players to enter the space. Large incumbents would be the only ones able to make the investments necessary to build competitive AI. Because of that, I really hope the author isn't right - unfortunately they probably are.
If we’re going down this road of theorizing about the human brain based on DNNs, what is the deal with dropout? Could we help human brains with generalization by randomly removing 10% of our newly created connections at the end of each day to improve long term learning? :)
That’s called synaptic pruning and, while it mostly happens as a human matures, there’s evidence indicating that it occurs during sleep in adults to help consolidate the most important connections and remove the unimportant ones. It’s not exactly like dropout, but at a high level it kind of looks like it.
Is there a way to do this easily on S3 if you’re hosting a static website with a custom domain? Last time I checked you still needed to put a CloudFront distribution in front of it.
Instead of using Cmd+V to paste, use Cmd+Option+V to cut from the original location and paste. I like it because it lets me postpone the decision to copy or cut until the very end :)
Very cool! Does anyone know what the software support for all these features looks like? It seems that TF doesn’t support the new TF32/BF16 types as of yet. Is this something only CUDA engineers can use right now?
It does seem a little fishy to me that NVIDIA often boasts figures like a 10x performance uplift, whilst in practice those are only achievable if you use one of their non-default float types, which are barely supported in most deep learning libraries :(
Both the PyTorch and TensorFlow teams have announced they'll support TF32. By design it interoperates well with existing code, since calling code can just treat it as a regular FP32 value.
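On the PyTorch side that looks roughly like the sketch below (assuming an Ampere GPU and a recent build that exposes these flags); nothing else in the calling code changes:

```python
# TF32 is opt-in via global flags; tensors stay plain float32 throughout.
import torch

torch.backends.cuda.matmul.allow_tf32 = True   # matmuls may use TF32 tensor cores
torch.backends.cudnn.allow_tf32 = True         # cuDNN convolutions likewise

a = torch.randn(1024, 1024, device="cuda")     # regular FP32 tensors
b = torch.randn(1024, 1024, device="cuda")
c = a @ b                                      # eligible for TF32 under the hood
```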
The connectedness of neurons in neural nets is usually fixed from the start (i.e. between layers, or somewhat more complicated in the case of CNNs etc.). If we could eliminate this and let neurons "grow" towards each other (like this article shows), would that enable smaller networks with similar accuracy? There's some ongoing research on pruning weights by finding "subnets" [1], but I haven't found any method yet where the network grows connections itself. The only counterpoint I can come up with is that it probably wouldn't generate a significant speed-up, because it defeats the use of SIMD/matrix operations on GPUs. Maybe we would need chips that are designed differently to speed up these self-growing networks?
I'm not an expert on this subject, does anybody have any insights on this?
I think this is a really interesting area of machine learning. Some efforts have been made on ideas tangential to this one: lots of papers in neuroevolution deal with evolving topologies. NEAT is probably the prime example (http://nn.cs.utexas.edu/downloads/papers/stanley.ec02.pdf), and another paper I read recently, PathNet (https://arxiv.org/abs/1701.08734), is different but very interesting.
I experimented with networks where weights were removed if they did not contribute much to the final answer.
My conclusion was that I could easily set >99% of the weights to zero on my (fully connected) layers with minimal performance impact after enough training, but the training time went up a lot (effectively, after removing a bunch of connections, you have to do more training before removing more), and inference speed wasn't really improved because sparse matrices are sloooow.
Overall, while it works out for biology, I don't think it will work for silicon.
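On the "sparse matrices are slow" point, a quick way to see it is to time the same weights dense vs. stored as a sparse COO tensor (the sizes and sparsity level below are made up, not my actual layers):

```python
# Sketch: dense matmul vs. the same ~99%-sparse weights stored as sparse COO.
import time
import torch

w = torch.randn(2048, 2048)
w[torch.rand(2048, 2048) > 0.01] = 0.0   # zero out ~99% of the weights
w_sparse = w.to_sparse()
x = torch.randn(2048, 64)

t0 = time.time()
for _ in range(100):
    w @ x
t_dense = time.time() - t0

t0 = time.time()
for _ in range(100):
    torch.sparse.mm(w_sparse, x)
t_sparse = time.time() - t0

print(f"dense: {t_dense:.3f}s  sparse: {t_sparse:.3f}s")
```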
Not really - I had to do multiple steps of 'prune a bit, train a bit' to be able to prune to 99%. If I had done all the pruning in one big step as they do, I don't think it would have trained well, even if I had been able to see the future and remove the same weights.
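For anyone curious, the "prune a bit, train a bit" loop can be sketched with PyTorch's built-in magnitude pruning. The model, the number of rounds and the 37%-per-round figure are just illustrative; ten rounds that each remove 37% of the remaining weights land at roughly 99% pruned overall:

```python
# Sketch of iterative magnitude pruning; model, rounds and amounts are illustrative.
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 10))

for _ in range(10):                                   # prune a bit...
    for module in model.modules():
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=0.37)
    # ...train a bit: fine-tune for a few epochs here before the next round
```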
The only reason we architect ANNs the way we do is to optimize computation. The bipartite graph structure is optimized for GPU matrix math. Systems like NEAT have not been used at scale because they are a lot more expensive to train, and the trained network is more expensive to run. ASICs and FPGAs have a chance to utilize a NEAT-generated network in production, but we still don't have a computer well suited to training a NEAT network.
So this might be an enormous opportunity for low-cost and more performant AI if someone were able to build an FPGA of some sort that could handle these types of computations as efficiently, right?
Running the post-training network is a solved problem (FPGAs and ASICs can do it just fine). TRAINING the network is the difficulty. The problem is that the structure of the network is arbitrary and is a result of the learning process: you can't optimize a computation for a structure you don't know yet. Bipartite layer networks have the benefit of never changing structure, while still being able to approximate other structures as subsets. I don't know if we could easily tell where we sit on that tradeoff - bipartite graphs train efficiently, but in practice they may just be inefficiently simulating a much smaller network.
It's not that simple. Backpropagating a bipartite graph of nodes works out to a series of matrix operations that parallelize efficiently on a GPU, as long as the matrices fit into the GPU's working memory. Running a GA (part of NEAT) doesn't normally work well on a GPU. The good NEAT algorithms even allow different neurons to have different firing response curves, which inherently defies the "same operation, multiple values" style of parallelization in GPUs. The way GPUs work just fundamentally isn't well suited to speeding up NEAT.
You don't. You need a different parallelism model than a GPU provides. It could work well on machines with very high CPU count, but the speedup on GPUs is the main reason bipartite graph algorithms have seen such investment.