I normally use Netron for quickly inspecting models and building a 'mental picture' of the architecture, but this hierarchical approach seems a better fit for my needs.
I'm just starting and the first impression is pretty good!
These tools are eye candy; they've been around since tensorflow/tensorboard 0.x ten years ago, but I never used them again after trying them for fun. You need to read the source code; there's no easy way around it.
Looks cool but seems like it doesn't work on torch 2.0
"AttributeError: module 'torch' has no attribute 'export'"
The torch.export API is currently in active development with planned breaking changes, and the installation guide for this is still very minimal. Does anyone know how to get it working on torch 2.0?
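For what it's worth, torch.export only showed up after 2.0 (around the 2.1 release, if I remember right), so on 2.0 the attribute simply doesn't exist. A minimal sketch of a guard, with a toy stand-in model:

```python
import torch

# Minimal sketch: on torch 2.0 there is no torch.export submodule at all,
# so check for it before calling. The Linear model is just a placeholder.
model = torch.nn.Linear(4, 4).eval()
example_input = torch.randn(1, 4)

if hasattr(torch, "export"):
    program = torch.export.export(model, (example_input,))
    print(program)
else:
    raise RuntimeError(
        f"torch.export is unavailable in torch {torch.__version__}; upgrade"
    )
```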
I haven't managed to successfully export my custom ViT model yet, but I've not had an issue accessing the export methods in torch 2.3 within the nvcr.io/nvidia/pytorch:24.02-py3 container.
I may have some more time to debug my trace tonight (i.e., remove conditionals from the model and make sure everything is on CPU) and will update if I have any new insights.
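For anyone hitting the same wall, here's a hedged sketch of what "remove conditionals" means in practice: branches on tensor *values* (e.g. `if x.mean() > 0:`) are data-dependent and tend to break the trace, while a plain Python flag fixed at construction time traces fine. `Block` and `use_residual` are invented names:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, use_residual: bool = True):
        super().__init__()
        self.use_residual = use_residual  # static flag, resolved at trace time
        self.proj = nn.Linear(64, 64)

    def forward(self, x):
        y = self.proj(x)
        # A check on tensor data here would be data-dependent control flow
        # and fail to export; the static Python bool keeps the graph clean.
        return x + y if self.use_residual else y

block = Block().cpu().eval()  # keep everything on CPU, as mentioned above
exported = torch.export.export(block, (torch.randn(1, 64),))
```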
I've never really understood the point of these visualizer things. The idea that a model is always well represented by a directed acyclic graph seems extremely dated.
I really would love a PyTorch/JAX profiler that shows, in annotated Python, where your code is allocating memory, using compute, or doing device copies.
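Not quite annotated Python, but torch.profiler gets partway there today: with `with_stack` and `profile_memory` it can attribute compute and allocations back to Python call sites. A small sketch (toy model, untuned settings):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(512, 512)
x = torch.randn(8, 512)

with profile(
    activities=[ProfilerActivity.CPU],
    profile_memory=True,  # record allocations per op
    with_stack=True,      # record the Python call sites
) as prof:
    model(x)

# Group by call stack so each row maps back to source lines.
print(prof.key_averages(group_by_stack_n=5).table(
    sort_by="self_cpu_memory_usage", row_limit=10))
```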
It may be that the way you like to think about things and the way others like to are different.
I find that quickly grasping a new architecture is easiest with a graph-based diagram first. Then code for details. All with the goal of internalizing the information processing steps. Not memory allocation per se.
In my mind, how the network implementation allocates memory is a different question.
But I think both of our desires just reflect our jobs, our interests, and simply how our brains conceptualize things differently.
I think it's a trap of visual elegance. When you start thinking of models this way, you miss how a lot of models are actually written.
E.g., how do you represent an online fine-tuning process? I want to randomly switch between a reference implementation and an approximation method, but when using the approximation I want to back-propagate so that it gets better over time.
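Something like this sketch (all names invented), where the dispatch and the training step are plain Python control flow that no static DAG captures:

```python
import random
import torch
import torch.nn.functional as F

# Frozen reference implementation: exact but (presumably) expensive.
reference = torch.nn.Linear(16, 16).eval()
for p in reference.parameters():
    p.requires_grad_(False)

# Learned approximation, trained online against the reference.
approx = torch.nn.Linear(16, 16)
optimizer = torch.optim.SGD(approx.parameters(), lr=1e-3)

def step(x):
    if random.random() < 0.5:
        return reference(x)             # exact path, no learning
    y = approx(x)                       # approximate path
    loss = F.mse_loss(y, reference(x))  # supervise against the reference
    optimizer.zero_grad()
    loss.backward()                     # approximation improves over time
    optimizer.step()
    return y.detach()

out = step(torch.randn(4, 16))
```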
full disclosure: I've written plenty of these little visualizers and also fallen for the trap of "everything should be a declarative graph."
Lol, I interned on PyTorch a few years ago (you and I even met/talked about tangential things :)) and worked on tracking such allocations (although I didn't hook it up to the profiler). Spoiler alert: you can't track such provenance because everything gets muddled in the dispatcher.
EDIT: it's not completely accurate to say you can't do it. I prototyped a little allocator that stamped every allocation (in the unused bits of the pointer itself, a trick I learned from zach) with the thread id and a timestamp (just an incrementing counter) and then percolated that up to the surface. Obviously that didn't land, lol.
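For the curious, the trick looks roughly like this in Python (the real prototype lived in the C++ allocator; everything here is a simplified, invented rendition):

```python
import itertools
import threading

# On x86-64, user-space addresses leave the top 16 bits unused, so an
# allocator can stash a small provenance tag there and mask it off before
# the pointer is ever dereferenced.
TAG_SHIFT = 48
ADDR_MASK = (1 << TAG_SHIFT) - 1
_counter = itertools.count(1)

def stamp(addr: int) -> int:
    tag = (threading.get_ident() ^ next(_counter)) & 0xFFFF
    return (addr & ADDR_MASK) | (tag << TAG_SHIFT)

def strip(addr: int) -> int:
    return addr & ADDR_MASK          # always mask before dereferencing

def tag_of(addr: int) -> int:
    return addr >> TAG_SHIFT         # recover provenance at the surface

ptr = stamp(0x7F00DEADBEEF)
assert strip(ptr) == 0x7F00DEADBEEF
print(hex(ptr), tag_of(ptr))
```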
But also, while generally "in all things I defer to Horace" (ok not really), maybe I'm not looking closely enough (and missed it), but the bottom of that stack shows (roughly) the autograd dispatch key and not the Python call site (or some such). Maybe it's a pedantic difference (depends on what bram wants), but I wanted provenance back to the TorchScript (TS) op so that I could then do static memory allocation things with that representation (now I've probably fully de-anonymized myself...). And for that use case, even what you have now isn't enough: you can't get a total sum of how much each TS op allocates and when the corresponding free happens.
Not mocking you, but I'd guess you don't like to draw block diagrams when discussing designs or code architecture with others either? Some folks aren't "visual" thinkers, and it took me a long time working with people to realize that.