In terms of accuracy, the authors themselves mention it as a limitation, so it could well be a problem.
In terms of runtime, it should not matter. Generally speaking though, the overhead of optical flow is often overlooked. For video DL applications, computing optical flow often takes more time than inference itself. In academic settings, datasets are usually preprocessed ahead of time and the optical flow runtime goes unmentioned. Doing real-time video analysis with optical flow is quite impractical, though.
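To get a rough sense of the imbalance, a minimal timing sketch along these lines works; the 224x224 resolution and the ResNet-18 stand-in are placeholders, not anything from a real pipeline:

```python
# Rough sketch: time dense Farneback optical flow on one frame pair vs. one
# forward pass of a small CNN. Resolution and model are arbitrary stand-ins.
import time
import cv2
import numpy as np
import torch
import torchvision

prev_gray = np.random.randint(0, 255, (224, 224), dtype=np.uint8)
curr_gray = np.random.randint(0, 255, (224, 224), dtype=np.uint8)

t0 = time.time()
flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)
t_flow = time.time() - t0

model = torchvision.models.resnet18().eval()  # stand-in for the video model
with torch.no_grad():
    t0 = time.time()
    model(torch.randn(1, 3, 224, 224))
    t_infer = time.time() - t0

print(f"optical flow: {t_flow * 1000:.1f} ms, inference: {t_infer * 1000:.1f} ms")
```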
I couldn't agree more, especially with the latter part. I've worked on action recognition with I3D for over a year now, and found that seemingly equivalent implementations in Keras, TensorFlow 2 or PyTorch will produce wildly different results. Worse yet, I found a bunch of papers claiming SOTA by comparing against one of those non-original implementations, with margins of just a few percentage points. It makes no sense! It took me hundreds of hours to hunt down the differences in how these frameworks implement their layers before I could come even close to the expected accuracy...
This is very cool, I’ll be studying your implementation of I3D. Did you ever attempt to train I3D end-to-end as done in the Quo Vadis paper? And if so, did you get comparable Top1/Top5 accuracy?
The single RGB stream top1 goes up to 73.48% with resnet50, and up to 74.71% equipped with non-local. Both are much higher than the original paper with two-streams.
The JIT is hands down the best feature of PyTorch, especially compared to the somewhat neglected suite of native inference tools for TensorFlow. Just recently I was trying to get a TensorFlow 2 model to work nicely in C++. Basically, the external API for TensorFlow is the C API, but it doesn't have proper support for `SavedModel` yet. Linking to the C++ library is a pain, and neither of them can do eager execution at all if your model was trained from Python code :(
PyTorch will happily let you export your model, even with Python code in it, and run it in C++ :)
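A minimal sketch of the export path I mean (the model and file name here are just placeholders); the saved file can then be loaded from C++ with `torch::jit::load` and called via `module.forward(...)`:

```python
# Minimal TorchScript export sketch; model and path are placeholders.
import torch
import torchvision

model = torchvision.models.resnet18().eval()
scripted = torch.jit.script(model)    # or torch.jit.trace(model, example_input)
scripted.save("model_scripted.pt")    # later: torch::jit::load("model_scripted.pt") in C++
```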
The underappreciated (in my view/experience) part is that it also gets rid of a lot of GIL contention when used from Python, because the JITed part doesn't touch Python anymore and can run with the GIL released.
In multithreaded setups this is typically more significant than the Python overhead itself (which comes in at 10% for the PyTorch C++ extension LLTM example, but would be less for convnets).
It's not that Python is by definition much slower than C++; rather, doing inference in C++ makes it much easier to control exactly when memory is initialised, copied and moved between CPU and GPU. Especially on frame-by-frame models like object detection this can make a big difference. Also, the GIL can be a real problem if you are trying to scale inference across multiple incoming video streams, for example.
Control is probably the main point. The Python interface makes things easy but doesn't offer enough control for my case. I tested it with a cut-down example (no video decoding, no funny stuff) and it all comes down to the batch size that is passed to `model.predict`. Large batches level out at around 10,000 fps depending on the GPU, while batch size 1 drops to 200 fps regardless of the GPU. This tells me that some kind of overhead (hidden from me) is slowing things down. I guess I'd have to go much deeper into the internals of TF to find out more - so far I haven't, because it's a large time sink that only offers better performance in a part that is not super critical right now.
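The benchmark boiled down to something like this (the model and array sizes below are stand-ins, not my actual setup):

```python
# Sketch of the batch-size benchmark: same frames, only the batch size passed
# to model.predict changes. MobileNetV2 is a stand-in for the real model.
import time
import numpy as np
import tensorflow as tf

model = tf.keras.applications.MobileNetV2(weights=None)
frames = np.random.rand(512, 224, 224, 3).astype("float32")

for bs in (1, 32, 256):
    start = time.time()
    model.predict(frames, batch_size=bs, verbose=0)
    fps = len(frames) / (time.time() - start)
    print(f"batch_size={bs}: {fps:.0f} fps")
```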
The GIL and slowness of Python become a problem when processing multiple streams or doing further time consuming calculations in Python.
> The current hardware floor is nearer to the RTX 2080 TI's $1k/unit for 125 tensor-core TFLOPS, and that gives you $25/pflops-d.
It's definitely true that the RTX 2080 Ti would be more efficient money-wise, but the Tensor Cores are not going to get you the advertised speedup. Those speedups can only be reached in ideal circumstances.
Nevertheless, the article as a whole makes a very good point. The thing that is most scary about this is that it would become very hard for new players to enter the space. Large incumbents would be the only ones able to make the investments necessary to build competitive AI. Because of that, I really hope the author isn't right - unfortunately they probably are.
If we’re going down this road of theorizing about the human brain based on DNNs, what is the deal with dropout? Could we help human brains with generalization by randomly removing 10% of our newly created connections at the end of each day to improve long term learning? :)
That’s called synaptic pruning and, while it mostly happens as a human matures, there’s evidence indicating that it occurs during sleep in adults to help consolidate the most important connections and remove the unimportant ones. It’s not exactly like dropout, but at a high level it kind of looks like it.
Is there a way to do this easily on S3 if you’re hosting a static website with a custom domain? Last time I checked you still needed to put a CloudFront distribution in front of it.
Instead of using Cmd+V to paste, use Cmd+Option+V to cut from the original location and paste. I like it because it lets me postpone the decision to copy or cut until the very end :)
Very cool! Does anyone know what the software support for all these features looks like? It seems that TF doesn’t support the new TF32/BF16 types as of yet. Is this something only CUDA engineers can use right now?
It does seem a little fishy to me that NVIDIA often boasts figures like a 10x performance uplift, whilst in practice those are only achievable if you use one of their non-default float types, which are barely supported in most deep learning libraries :(
Both the PyTorch and TensorFlow teams have announced they'll support TF32. By design it interoperates well with existing code, since calling code can just treat it as a regular FP32 value.
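On the PyTorch side that looks roughly like the sketch below (assuming an Ampere GPU and a recent build that exposes these flags); nothing else in the calling code changes:

```python
# TF32 is opt-in via global flags; tensors stay plain float32 throughout.
import torch

torch.backends.cuda.matmul.allow_tf32 = True   # matmuls may use TF32 tensor cores
torch.backends.cudnn.allow_tf32 = True         # cuDNN convolutions likewise

a = torch.randn(1024, 1024, device="cuda")     # regular FP32 tensors
b = torch.randn(1024, 1024, device="cuda")
c = a @ b                                      # eligible for TF32 under the hood
```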
The connectedness of neurons in neural nets is usually fixed from the start (i.e. between layers, or somewhat more complicated in the case of CNNs etc.). If we could eliminate this and let neurons "grow" towards each other (like this article shows), would that enable smaller networks with similar accuracy? There's some ongoing research on pruning weights by finding "subnets" [1], but I haven't found any method yet where the network grows connections itself. The only counterpoint I can come up with is that it probably wouldn't generate a significant speed-up, because it defeats the use of SIMD/matrix operations on GPUs. Maybe we would need chips that are designed differently to speed up these self-growing networks?
I'm not an expert on this subject, does anybody have any insights on this?
I think this is a really interesting area of machine learning. Some efforts have been made on ideas tangential to this one: lots of papers in neuroevolution deal with evolving topologies. NEAT is probably the prime example (http://nn.cs.utexas.edu/downloads/papers/stanley.ec02.pdf), and another paper I read recently, PathNet (https://arxiv.org/abs/1701.08734), is different but very interesting.
I experimented with networks where weights were removed if they did not contribute much to the final answer.
My conclusion was that I could easily set >99% of the weights to zero on my (fully connected) layers with minimal performance impact after enough training, but the training time went up a lot (effectively, after removing a bunch of connections, you have to do more training before removing more), and inference speed wasn't really improved because sparse matrices are sloooow.
Overall, while it works out for biology, I don't think it will work for silicon.
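On the "sparse matrices are slow" point, a quick way to see it is to time the same weights dense vs. stored as a sparse COO tensor (the sizes and sparsity level below are made up, not my actual layers):

```python
# Sketch: dense matmul vs. the same ~99%-sparse weights stored as sparse COO.
import time
import torch

w = torch.randn(2048, 2048)
w[torch.rand(2048, 2048) > 0.01] = 0.0   # zero out ~99% of the weights
w_sparse = w.to_sparse()
x = torch.randn(2048, 64)

t0 = time.time()
for _ in range(100):
    w @ x
t_dense = time.time() - t0

t0 = time.time()
for _ in range(100):
    torch.sparse.mm(w_sparse, x)
t_sparse = time.time() - t0

print(f"dense: {t_dense:.3f}s  sparse: {t_sparse:.3f}s")
```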
Not really - I had to do multiple steps of 'prune a bit, train a bit' to be able to prune to 99%. If I had done all the pruning in one big step as they do, I don't think it would have trained well, even if I had been able to see the future and remove the same weights.
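For anyone curious, the "prune a bit, train a bit" loop can be sketched with PyTorch's built-in magnitude pruning. The model, the number of rounds and the 37%-per-round figure are just illustrative; ten rounds that each remove 37% of the remaining weights land at roughly 99% pruned overall:

```python
# Sketch of iterative magnitude pruning; model, rounds and amounts are illustrative.
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 10))

for _ in range(10):                                   # prune a bit...
    for module in model.modules():
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=0.37)
    # ...train a bit: fine-tune for a few epochs here before the next round
```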
The only reason we architect ANNs the way we do is to optimize computation. The bipartite graph structure is optimized for GPU matrix math. Systems like NEAT have not been used at scale because they are a lot more expensive to train, and the trained network is more expensive to run. ASICs and FPGAs have a chance to utilize a NEAT-generated network in production, but we still don't have a computer well suited to training a NEAT network.
So this might be an enormous opportunity for low-cost and more performant AI if someone were able to build an FPGA of some sort that could handle these types of computations as efficiently, right?
Running the post-training network is a solved problem (FPGAs and ASICs can do it just fine). TRAINING the network is the difficulty. The problem is that the structure of the network is arbitrary and is a result of the learning process: you can't optimize a computation for a structure you don't know yet. Bipartite layer networks have the benefit of never changing structure, while still being able to approximate other structures as subsets. I don't know if we could easily tell where we sit on that tradeoff - bipartite graphs train efficiently, but in practice they may just be inefficiently simulating a much smaller network.
It's not that simple. Backpropagating a bipartite graph of nodes works out to a series of matrix operations that parallelize efficiently on a GPU, as long as the matrices fit into the GPU's working memory. Running a GA (part of NEAT) doesn't normally work well on a GPU. The good NEAT algorithms even allow different neurons to have different firing response curves, which inherently defies the "same operation, multiple values" style of parallelization in GPUs. The way GPUs work just fundamentally isn't well suited to speeding up NEAT.
You don't. You need a different parallelism model than a GPU provides. It could work well on machines with very high CPU count, but the speedup on GPUs is the main reason bipartite graph algorithms have seen such investment.