Hacker News

Maybe “TF 3” or whatever they call it will be ergonomic and a pleasure to use, but that was the promise of TF 2, and unless you wanted to use Keras it was anything but. I’m glad I have been able to work only in PyTorch and Jax. Maybe TF will get better but I’m not holding my breath. On the other hand, XLA is very nice, and I hope they continue to develop it.

I hope torch has a convincing distributed tensor API coming soon. Their development on ShardedTensor seems to have slowed or stopped recently, so TF’s DTensor is definitely ahead, which is a shame. And of course TF’s ecosystem for the whole lifecycle is more mature with TFX, TF.js, etc, but torch is slowly closing those gaps, and hopefully that will continue.



The ergonomics of TensorFlow are probably always going to be behind PyTorch (or Keras, for that matter). The fact that the API has not been stable for the past six years has burned me one too many times; I now flinch at using it. It is basically an internal Google tool that has been made available to the public, and like most internal tools at Google the deprecated/developmental dichotomy applies (https://goomics.net/50/). That said, the deployment of TensorFlow models onto mobile devices or the browser is really good, so sometimes the pain is necessary.


That comic reminds me of the current state of Azure Machine Learning. What a mess. I'd actually love to use GCP at this point; I've never run into such a weird tangle of SDKs with Google Cloud.

Though I agree that it is very weird that Google is treating TF as an internal product rather than something more akin to GCP. There's no reason for them to do so, especially after they had the chance to break and redo tons of stuff for TF2.


That Goomics comic was shown in my Noogler training. I thought it was a joke, but I've learned it's definitely not a joke.


PyTorch does have issues with both distributed tensor (which is easier to solve) and deployment (which is harder to solve, but solvable).

I also think Keras' Functional API is superior to PyTorch's OOP model in terms of composability, but I am biased as a software engineer. It does feel like the community thinks the OOP model is more hackable and thus easier to use.
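As a toy illustration of the difference (plain Python stand-ins, not real Keras or PyTorch code): in the functional style, layers are values you apply and compose directly, while in the OOP style they live as attributes on a module class.

```python
# Toy stand-ins for "layers": callables built by a factory.
# (Illustrative only -- not the real Keras or PyTorch APIs.)
def dense(units):
    def layer(x):
        # Fake computation standing in for a real dense layer.
        return [sum(x)] * units
    return layer

# Functional style: compose layer values directly into a model.
hidden = dense(4)
output = dense(1)
def functional_model(x):
    return output(hidden(x))

# OOP style: layers are attributes; forward() wires them together.
class OOPModel:
    def __init__(self):
        self.hidden = dense(4)
        self.output = dense(1)
    def forward(self, x):
        return self.output(self.hidden(x))

print(functional_model([1, 2, 3]))    # [24]
print(OOPModel().forward([1, 2, 3]))  # [24]
```

Both compute the same thing; the functional version treats the graph itself as a first-class value, which is what makes composition and reuse feel cheaper.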

All in all, it is still early days. We didn't have a competent all-in-one OSS SQL database until the late 2000s, 20-ish years after the theory was ready and taught extensively in schools. And even after that, we saw plenty of innovation around databases in the 2010s for new use cases. Frameworks for differentiable programming have a long way to go.


Their development on ShardedTensor seems to have slowed or stopped recently

What makes you say that?


Just looking at commit frequency on the main branch, and activity on the relevant RFCs in GitHub issues. I have no visibility into what’s going on at Meta, of course, so it could just be that there’s a lot of internal development right now, that work is going into a different branch at the moment, or that they’re waiting on progress in torch’s RPC support; I’m not sure. Or they could simply be waiting to release what they have as a beta in the 1.13 or 1.14 release.

I’ll note that, checking right now, there was a commit 15 hours ago, but the last commit before that seems to be 28 days old. So some work is still going on at least, thankfully :)


ShardedTensor was merged / subsumed into DTensor: https://dev-discuss.pytorch.org/t/rfc-pytorch-distributedten...

Lots of development and traffic happening here: https://github.com/pytorch/tau/


So is DTensor a Google thing like XLA or more of an open standard?


Looks like it's just a name collision. It's a tensor used in distributed models, thus Distributed Tensor, or DTensor for short.


Awesome, thanks for pointing this out!


how do you envision a distributed tensor API working? (perhaps a code snippet of an ideal API?)


Sorry, on my phone at the moment so I don’t think I can really type some decent code right now!

I actually think that torch’s ShardedTensor looks very promising. Essentially you can initialize a sharded tensor from an already-initialized tensor, or initialize one on a meta device where it’s not allocated locally and each shard gets initialized on the specified remote devices (useful for extremely large tensors).

The sharding is described by a ShardingSpec: you can either split the tensor into equally sized shards along a single dimension across the requested devices, or do grid sharding along multiple dimensions. They also have a more general sharding spec that lets you choose explicitly which indices go on which devices, if you need non-uniform shards.
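To make the single-dimension chunking concrete, here's a small hypothetical sketch (my own function and names, not the actual torch.distributed API) of how an equal-chunk spec might assign contiguous slices along one dimension to a list of devices:

```python
# Hypothetical sketch of chunk-style sharding: split one dimension
# of a tensor of shape `size` into near-equal shards across devices.
# Not the real ShardedTensor/ShardingSpec API.
def chunk_shards(size, dim, devices):
    """Return (device, offset, length) for each shard along `dim`."""
    n = len(devices)
    base, rem = divmod(size[dim], n)
    shards, offset = [], 0
    for i, dev in enumerate(devices):
        length = base + (1 if i < rem else 0)  # spread the remainder
        shards.append((dev, offset, length))
        offset += length
    return shards

# Split the rows of a (10, 4) tensor across three devices:
shards = chunk_shards((10, 4), 0, ["rank:0/cuda:0", "rank:1/cuda:1", "rank:2/cuda:2"])
# -> first device gets 4 rows, the other two get 3 rows each
```

Grid sharding would just apply the same idea along more than one dimension at once, and the "general" spec would replace the computed offsets/lengths with explicitly listed index ranges.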

I think once these are implemented (along with some special cases like cloned tensors, and things like that), and once the distributed autograd engine has full support for CUDA, it should be pretty easy to start building out distributed versions of common neural net operations.

The one thing I haven’t thought about a ton, to be frank (and I’m sure other, smarter people have :)), is that you’ll end up with a sharding spec for the weights as well as one for the inputs, and it’s not obvious how best to make sure everything matches up. Is the best way to handle that custom logic for each operation? And do you have each operation just reshard the input automatically? That seems like a potentially big performance pitfall.




