An attempt at a summary: They use a sigmoid function to make differentiable "soft" branches, and stack them to construct a binary tree, with the goal of only taking one branch at inference time (but training the whole tree) leading to log(W) instead of W inference cost. They gradually harden the branches so they become hard branches at the end of training.
A branch is computed as branch(input, N), with a neural network N computing a scalar c=N(input), then using a sigmoid to do a soft branch by returning the weighted sum of the recursive call s(c)*branch(input, N_left) + (1-s(c)) * branch(input, N_right) (the two weights s(c) and 1-s(c) sum to 1). They only do "proper processing" using the leaf nodes.
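A minimal sketch of that recursion as I read it (not the authors' code; the `node_fn` networks and the leaf modules are stand-ins):

```python
import torch

def soft_branch(x, tree):
    # `tree` is either a leaf module (the "proper processing") or a
    # (node_fn, left_subtree, right_subtree) triple for an internal node.
    if not isinstance(tree, tuple):
        return tree(x)                       # leaf: do the real work
    node_fn, left, right = tree
    c = node_fn(x)                           # scalar logit per example
    w = torch.sigmoid(c)                     # soft routing weight in (0, 1)
    # weighted mixture of both subtrees; the two weights sum to 1
    return w * soft_branch(x, left) + (1 - w) * soft_branch(x, right)
```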
Then they add a new loss term that encourages hard decisions by minimising the entropy of the Bernoulli distribution, making the 2 weights converge to 0 and 1, at which point only one branch needs to be taken at inference. They also state that this hardening often happens automatically though.
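The hardening term would then just be the Bernoulli entropy of each routing weight, summed over branch nodes and added to the task loss. A sketch of my reading, not the paper's exact formulation:

```python
import torch

def bernoulli_entropy(w, eps=1e-8):
    # Zero only when w is exactly 0 or 1, so minimising it pushes every
    # soft branch towards a hard left/right decision.
    return -(w * torch.log(w + eps) + (1 - w) * torch.log(1 - w + eps))
```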
It's a simple idea, but the loss formulation is nice; you usually want your loss terms to be a measure of information.
From the previous paper you cited
>Pushing FFFs to the limit, we show that they can use as little as 1% of layer neurons for inference in vision transformers while preserving 94.2% of predictive performance.
This feels like that often-misinterpreted Einstein meme/quote about humans only using a fraction of their brain power.
Is this only for inference, though? Could it boost training?
That's an interesting question. It actually provides a nice way to parallelize training: pretrain e.g. the first 3 branch levels, which effectively fragments the model into 8 separate parts that you can continue training across 8 independent servers/nodes with no further communication between them. A central server would run the first 3 levels and mark the parts of the training set that each node has to train on. Maybe you could do this for the whole network and distribute the training SETI@home style all over the world.
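Roughly what that central server could do, as a sketch (the `node_nets` here are hypothetical: one small frozen network per branch node, numbered heap-style):

```python
import torch

@torch.no_grad()
def assign_to_workers(x, node_nets, levels=3):
    # node_nets[i] is the frozen branch network of tree node i, numbered
    # heap-style: root = 0, children of node i are 2*i + 1 and 2*i + 2.
    workers = []
    for xi in x:                              # one example at a time, for clarity
        node = 0
        for _ in range(levels):
            go_right = torch.sigmoid(node_nets[node](xi)) > 0.5
            node = 2 * node + 1 + int(go_right)
        workers.append(node - (2 ** levels - 1))  # leaf id -> worker 0..7
    return workers
```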
Hold on, you don't even need to freeze the branches completely: each node could train 1 branch on the path to its leaf node and communicate a change in the branch node to a central server, so you can distribute training without having to pre-freeze the branches. Still would need some pre-training though, and the splits would change slowly, and the attention mechanism could complicate things.
Currently, distributed SETI@home-style neural network training looks like a complete pipe dream that nobody is taking seriously. But a smart branching mechanism like this could suddenly make it possible. Folding@home reached 1.5 exaflops, which made it the world's largest supercomputer. Imagine the models we could train that way; they would far surpass whatever OpenAI or Google could train, and would be public.
You should check out Hivemind[1]. It is very similar to what you described except it used MoE for "fragmentation". They have a couple of examples of pre-training in their repo. Hivemind was also used to build Petals[2] but it only supports fine-tuning and inference[3] afaik.
Apologies for layman question: how much tera/peta/exa-flops do current models use to train?
Well, I'm assuming they'd use whatever they're given, so maybe the question should be "how much less time would training take on a 1.5 exaflops computer?"
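Very rough back-of-envelope, using the widely cited ~3.14e23 FLOPs estimate for GPT-3's training run and pretending a volunteer network could be fully and efficiently utilised (it couldn't):

```python
total_flops = 3.14e23   # rough published estimate of GPT-3's total training compute
sustained   = 1.5e18    # 1.5 exaFLOP/s, the Folding@home peak (mixed precision)
seconds = total_flops / sustained
print(f"{seconds / 86400:.1f} days")   # ~2.4 days at perfect utilisation
```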
A lot of clusters are totally homogeneous, at least within some very large domains, so for a given interconnect and GPU generation you know the maximum message latency, the peak sustained PFLOP rate, and so on. But what often matters is some combination of the depreciation cost per unit time and the watt-hours per unit time, both of which you can roughly approximate if you ignore the unfortunate realities, which then act as a multiplier.
For example, one problem is network issues, and not just at scale: training often involves billions of short, bursty compute-sync cycles (e.g., all-to-all, barrier, compute, barrier, all-to-all, ...) between which there isn't enough time to engage low-power modes, so you're burning money on slack and waste. This is true in different ways for a lot of training approaches.
You can approximate this, but it's so sensitive to data set size, specific training schedule, etc. that you won't be able to get the most important answer.
It's mentioned briefly in the paper(1), but I'm more interested in the interpretability implications of this approach. In some respects, this marries the interpretability/editability of a small decision tree with the expressive power of a large neural network. Usually you see those two on extreme opposite ends of a tradeoff spectrum, but this approach, if it scales, might shift the Pareto frontier.
(1): As a byproduct, the learned regions can also be used as a partition of the input space for interpretability, surgical model editing, catastrophic forgetting mitigation, reduction of replay data budget, etc.
ETH Zurich is an illustrious research university that often cooperates with DeepMind and other hyped groups; they're right there at the frontier too, and have been for a very long time. They don't have massive training runs of their own, but pound for pound I'd say they have better papers.
ETH Zurich is one of the top labs in the world. Disney Research also works with them a lot. Another "sleeper" is the University of Amsterdam, which has rockstars like Max Welling and his students Kingma, Salimans, van den Berg, and Hoogeboom.
It's easy to get hyped up on the big tech labs because they have the most compute, but many of the best papers come from smaller labs, which unfortunately have lately faced larger challenges in getting published. It's the smaller works that create the foundations that end up in these giant models. ML is in a really weird space right now.
From the first author on Twitter:
"It could quite a big deal for people who don't have access to a colocated cluster of GPUs:
e.g. with DiLoCo you could train your model, with data-parallelism, across all GPU providers, looking in real-time for the cheapest price, even if pre-emptable, even across continents"
It is not surprising. The assumption is that they have the best people. That you can objectively search 8 billion people for the best people globally is folly, of course. There are geniuses without US citizenship / visas / green cards. And so outside brains are going to figure this out. Mix in the fact that the GDP of $rest_of_world represents far more resources than any single company has, plus the luck-driven nature of making AI discoveries, and I reckon most progress will happen outside of OpenAI etc., driven by a problem the big guys don't need to solve: how do I avoid buying a $5k graphics card.
I wouldn't be so quick to call it a conspiracy. I'm the author of a work and a famous blog post that trains a particular common architecture much faster (don't want to dox myself too much) and with far fewer parameters, but it has been rejected several times and is now arxiv-only. Our most common complaint was "who would use this? Why not just take a large model and tune it?" That question alone held us back a year (we had over a hundred citations by then, and it remains my most cited work) until it switched to "use more datasets" and "not novel" (by that time true: others had built off of us, cited us, and published in top venues).
I don't think this was some conspiracy by big labs to push back against us (we're nobodies) but rather that people get caught up in hype and reviewers are lazy and incentivized to reject.

You're trained to be critical of works and especially to consider that, post hoc, most solutions appear far simpler than they actually are. But context matters, because if you don't approach every paper with nuance it's easy to say "oh, it's just x." But if those ideas were so simple and obvious they would also be prolific.

I see a lot of small labs suffer the same fate simply due to lack of compute. If you don't make your new technique work on many datasets, it becomes the easiest thing to reject a paper by. ACs aren't checking that reviews are reasonable. I've even argued with fellow reviewers about papers in workshops -- papers I would have accepted in the main conference -- that are brushed off, and the reviewers admit in their reviews that they do not work on these topics. I don't understand what's going on, but at times it feels like a collective madness. A 10 page paper with 4 very different datasets that solves a problem, is clearly written, has no major flaws, and is useful to the community should not need defending when submitted to a workshop just because reviewers aren't qualified to review the work (this paper got in btw).

We are moving into a "pay to play" ecosystem and that will only create bad science due to group think. (Another aspect of "pay to play" is in the tuning. Spending $1M to tune your model to be the best doesn't mean it is better than a model that could not afford the search. Often more than half of resources are spent on tuning now.)
Is there a place where you guys discuss... things? I'm interested in this topic as a layman, akin to pop-physics/maths, but have no realistic chance of just reading papers and "getting it". On the other hand, the immediately available resources focus more on the how-to part than on what's going on overall. Also, do you have something like 3b1b/pbs/nph for it? Content that you can watch and say "well, yep, good job".
I don't have any great recommendations, and unfortunately my advice may not be what you want to hear. What I tell my students is "You don't need to know math to build good models, but you need to know math to know why your models are wrong." But even this is a contentious statement within the community.

(Personally I'm more interested in exploring what we can build and understand rather than focusing on throwing more compute and data at problems. There's a lot of work to be done that does not require significant compute, but it isn't flashy and you'll get little fame. Every famous model you know has some unsung hero(s) who built the foundation before compute was thrown at the problem.)

I was previously a physicist, and we similarly frequently say that you do not know the material unless you can do the math. Physicists are trained in generating analogies, as they help communication, but this sometimes leads to people convincing themselves that they understand things far more than they actually do. They say the devil is in the details, and boy are there a lot of details. (Of the science communicators, I'm happy those are the ones you mention, though!)

But do not take this as gatekeeping! These groups are often happy to help with the math and recommend readings. ML is kinda a wild west and you can honestly pick a subdomain of math and probably find it useful, but I would start by making sure you have a foundation in multivariate calculus and linear algebra.
As to paper reading, my suggestion is to just start. This is a fear I faced when I began grad school: it feels overwhelming, like everyone is leagues ahead of you and you have no idea where to begin. I promise that is not the case. Start anywhere; where you end up will not depend too much on where you begin. Mentors help, but they aren't necessary if you have dedication. As you read you will become accustomed to the language and start to understand the "lore."

I highly suggest following topics you find interesting backwards through time, as this has been one of the most beneficial practices in my learning. I still find that revisiting old works reveals many hidden gems that were forgotten. Plus, they'll be easier to read! Yes, you will have to reread many of those works later, as your knowledge matures, but that is not a bad thing. You will come back with newer eyes.

Your goal should be to first understand the motivation/lore, so do not worry if you do not understand all the details. You will learn a lot through immersion. It is perfectly okay if you barely understand a work when first starting, because a mistake many people make (including a lot of researchers!) is assuming that a paper can be self-contained; it is not and cannot be. You cannot truthfully read a work without understanding its history, and that only comes with time and experience. Never forget this aspect; it is all too easy to deceive yourself that things are simpler than they are (the curse of hindsight).
I'd also suggest just getting building. To learn physics you must do physics problems. To learn ML you must build ML systems. There are no shortcuts, but progress is faster than it looks. There are hundreds of tutorials out there and most are absolute garbage, but I also don't have something comprehensive I can point to. Just keep in mind that you're always learning, and so are the people writing tutorials. I'm going to kinda just dump some links, they aren't in any particular order, sorry haha. It's far from comprehensive, but this should help you get started; nothing in here is too advanced. If it looks complicated, spend more time and you'll get it. It's normal if it doesn't click right away and there's nothing wrong with that.
Unless they were very confident of acceptance, a top research prof would rewrite and resubmit before publishing on arxiv so that others could "build on it" (scoop you at a top conference).
Welcome to ML. And idk, I'd feel pretty confident that a paper that gets so many citations gets accepted. The review system is like a slot machine if you aren't a big tech lab
They certainly have an incentive to keep these kinds of improvements in-house and not publish them, since they are commercial entities and this represents a competitive advantage.
Nvidia can't make GPUs fast enough. I doubt 10xing training and/or inference efficiency would result in a decrease in demand. I would be surprised if it didn't instead increase demand. Mind you, Nvidia is pushing hard on TensorRT which optimizes models at inference time and results in major increases in throughput (not 10x though lol).
But if things get too efficient for individual users, you won't need an Nvidia GPU anymore. People will use cheaper hardware instead. I'm looking forward to running good models at decent speed on a low-end CPU or whatever crappy GPU is in my phone.
I had the same thought this morning and was debating selling my nvda stock when I saw this - feels like they are well-positioned right now, as with crypto a few years ago, but if there were an efficiency breakthrough that allowed commodity CPUs to do the inference instead, this advantage could vanish quickly.
Many labs doing foundational work like this and making progress don't have anything near the budget or compute to implement it at scale. In other words, they don't have a Sam and his backers or a Zuck and his budget.
"""
One may ask whether the conditionality introduced by the
use of CMM does not make FFFs incompatible with the
processes and hardware already in place for dense matrix
multiplication and deep learning more broadly. In short, the
answer is “No, it does not, save for some increased caching
complexity."
"""
You can do that today; the only advantage today, though, is being able to fit the model in memory. It's sequential and slower due to communication costs, though batching might be faster?
You can't get flops on a Hailo-8, they're fixed-point only. As much as these specialised inference chips are cool, we're a long way from just being able to drop them in where a GPU was. Not to mention the memory is hugely constrained. The Hailo chips I've worked with were all limited to 20MiB for the weights which is a squeeze even at 4-bit.
This approach feels like pruning, but the speedup is considerably higher. I'm curious how this will play out on more recent transformer architectures, though: I guess the speedup will matter most for the largest architectures, but even a 2x or 10x speedup on Mistral/Zephyr, Orca 2, or OpenChat 3.5 would be a tremendous achievement!
I find running 7B models on my 6 year old small form factor HP EliteDesk to be fast enough for casual everyday use. If this speedup can be applied generally to commonly used models, I can serve a local ChatGPT experience for both friends and family from my tiny homelab in my basement.
This is why I'm not understanding the excitement around open source models: they pale in comparison to GPT-4 quality, so I have no use for them until we have something comparable.
We use a Llama model that frankly competes directly with GPT-4, to the point that I can't tell which model we are using (we randomly switch between providers to ensure a robust backend).
I think it likely depends on the use case, but many Llama models can be fine-tuned, and there are quite literally thousands of free versions available.
You're using an LLM as if you don't understand how or why it works. Math is not what it's for. Here, look at what ChatGPT 4 does (for a case that hasn't been fed to it by a hundred other users yet):
What is the sum of odd numbers in this set: 12345654321, 123456543212, 123456543213, 12345654324? Output only the sum, no code.
The sum of the odd numbers in the set {12345654321, 123456543212, 123456543213, 12345654324} is 246913086434.
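(For reference, the correct answer is easy to check with a trivial script, which shows how far off that output is:)

```python
nums = [12345654321, 123456543212, 123456543213, 12345654324]
# only 12345654321 and 123456543213 are odd
print(sum(n for n in nums if n % 2))   # 135802197534, not 246913086434
```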
What does that prove? Only that I am using an LLM poorly and/or do not understand what it is. Using OpenChat-3.5 for what LLMs are actually good at (e.g. asking it for shell commands to perform certain operations, getting some general information about a topic) seems to work surprisingly well for a 7B model.
Not sure what you mean by 'using'. The prompt is from one of the benchmarks used in scoring LLM models - don't recall which. Here is another from BBH [1]:
"Unfit for deployment" or "not intended for deployment" is semi-standard wording for research models that are just raw language models with none of the safety/bias/offensiveness filtering that is usually desired for product applications. For example, if you deploy it as a customer-service chatbot, it might tell your customers to kill themselves, or call them racial slurs.
It doesn't mean that there's anything technically wrong with the language model per se as a model of language, just that there has been no effort made to ensure it's fit to be deployed as-is for any given generative-AI use case, and the model authors would prefer you didn't do that.
This requires retraining from scratch, so no, you can't use Llama 2 pretrained weight.
As far as I can tell you can take the Llama 2 modelling code, training infrastructure, and training data, apply the proposed modification (they provide a PyTorch nn.Module which should be a drop-in replacement for nn.Linear), and run the training if you have enough compute, and it should work. That doesn't mean it will work out of the box (there are always lots of practical problems), but it should work in principle.
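A rough sketch of what "drop-in" would mean in a generic transformer FFN block (the `FFFLayer` name and constructor here are hypothetical stand-ins, not the authors' actual API):

```python
import torch.nn as nn

class FeedForwardBlock(nn.Module):
    """A generic transformer FFN block with the dense layers swapped out."""
    def __init__(self, hidden, intermediate, FFFLayer=nn.Linear):
        super().__init__()
        # FFFLayer is assumed to take (in_features, out_features) like nn.Linear
        self.up = FFFLayer(hidden, intermediate)
        self.down = FFFLayer(intermediate, hidden)
        self.act = nn.GELU()

    def forward(self, x):
        return self.down(self.act(self.up(x)))
```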
I'm not 100% sure, but those seem mostly mutually exclusive (or redundant), with the decision tree in MADDNESS taking on a similar function to the binary tree in FFF that decides which neurons to activate.
Just from the abstract I don't see why not; it's just replacing the feed-forward network that's part of all of these models with a very sparse one. The bigger problem is that you seemingly have to retrain the model, so you couldn't just drop in Llama 2 weights from Meta and have it work, which makes it much more limiting. Something that used existing weights would be a lot more practical (like quantization, for example).
For BERT, I can see this being useful if you had to make a really fast embedding model. There was a discussion about a fast embedding use case not long ago https://news.ycombinator.com/item?id=37898001
It certainly could, and I wouldn't be surprised if the authors want to try it out on those. There is the usual issue that past improvements often don't enhance more powerful models nearly as much. I'd expect this to possibly not work as well: something like the bigger models ending up with more polysemantic neurons, because they're given more "incentive" (training time, neuron count, dataset size they're encouraged to be able to reconstruct) to extract as much as possible. That intermingling might make the method perform worse.
(See the transformer circuits website for that)
(Though I expect there's ways to recover a good chunk of extra lost throughput/accuracy, maybe by doing extra steps to directly steer the training towards breaking apart polysemantic neurons)
There are two issues here -- for one, in big transformers, more compute is in the attention layers, while this work improves only feed-forward layers, which are more important for smaller models and smaller sequence lengths. Second, in many typical scenarios LLM inference is memory bandwidth bound, I'm not sure if it's possible to utilize their approach to reduce required memory bandwidth.
Yes, it might. The "reduction in the number of neurons" is not static here; unlike traditional pruning approaches, they still keep all the weights, but the network dynamically selects which sub-portion of them to use. There is a related discussion of this in section 3.2 (page 4), but I don't think they mention actual memory bandwidth requirements/wins of their implementation, and there can probably be different tradeoffs for different devices.
Another noob question: so a ~50% size reduction in BERT? Let's see if I am getting these numbers right. At inference time you need only a fraction of the neurons in the FF layer to do the inference, based on the input data and the previous dot product. Here's some quick math for BERT-Base, which has 110M params according to the original paper:
----
L (Number of Layers): 12 transformer blocks.
H (Hidden Size): 768 units in the hidden layers.
A (Number of Attention Heads): 12 attention heads.
Each transformer block has the following components:
Self-Attention Layer: Each attention head has 768 / 12 = 64 units.
Query (Q), Key (K), Value (V) matrices: 3 * (64 * 768) = 147,456 parameters per head.
Across 12 heads: 147,456 * 12 = 1,769,472 parameters.
Output layer of the attention mechanism: 768 * 768 = 589,824 parameters.
Feed-Forward Network (FFN):
First layer: 768 (input) * 3,072 (intermediate size) = 2,359,296 parameters.
Second layer: 3,072 * 768 = 2,359,296 parameters.
Total FFN parameters per block: 2,359,296 + 2,359,296 = 4,718,592 parameters. -----------------> *This is the number to keep in mind.*
Total Parameters per Block: 1,769,472 (self-attention) + 589,824 (output) + 4,718,592 (FFN) = 7,077,888 parameters.
Total for 12 Blocks: 7,077,888 * 12 = 84,934,656 parameters.
Layer Norm and Other Parameters:
Each transformer block also includes layer normalization and other small components, which add a relatively small number of parameters.
Total Parameters:
Embeddings: 23,835,648
Transformer Blocks: 84,934,656
Layer Norm and Others: A small number, completing the total to around 110 million.
--------------------------------------
4.718M FFN params per block * 12 ≈ 56.6M of the 110M total params, which would be a staggering ~50% reduction in active size at inference time if you only use 0.3% of the FFN neurons with FFF??
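(The arithmetic above checks out; compressed into a few lines:)

```python
H, FFN_DIM, LAYERS = 768, 3072, 12   # hidden size, FFN intermediate size, blocks

qkv       = 3 * H * H                # 1,769,472 (= 3 * 64 * 768 per head * 12 heads)
attn_out  = H * H                    #   589,824
ffn       = 2 * H * FFN_DIM          # 4,718,592
per_block = qkv + attn_out + ffn     # 7,077,888
print(per_block * LAYERS)            # 84,934,656 transformer parameters
print(ffn * LAYERS)                  # 56,623,104 FFN parameters, ~51% of 110M total
```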
Noob question: so the idea is to load only specific branches (and by extension on the order of log(n) neurons) based on the input data, right? Would this be something a compiler would do using a JIT trick (because the input needs to be known to pick the right branch), issuing loads of just the right neurons into memory (SIMD?) to do the feed-forward?
Both. Cheaper CPU-based inference, GPUs are not as competitive for sparse linear algebra.
This could lead to much larger models, as you only touch a small portion of the matrix during inference. However, the training here is still dense-LA on a GPU, so you still blow up the compute cost when increasing model size.
GPU utilization should be down when using this technique. I’m hoping this could allow for more efficient batch inference on GPUs. If you can predict 10 tokens for the price of 1 it should allow you to do tree of thought much more efficiently.
A lot of CPU inference libraries (llama.cpp included) use as much SIMD as possible, sometimes by hand-writing loops. The one I hack on, llama.rs, uses portable_simd but specializes to your CPU at compile time.
My experience has been that most CPU inference is actually not compute limited, but memory bandwidth limited, since most weights are used for a few operations per token (how quickly can you load and unload the entire 70 GB of weights into your registers?). It's not quite that bad but I found most vectorization changes didn't meaningfully change performance.
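A quick illustration of why it's bandwidth-bound (the numbers here are ballpark assumptions, not measurements):

```python
weights_gb   = 70    # e.g. a 70B-parameter model at 8 bits per weight
bandwidth_gb = 50    # ballpark sustained RAM bandwidth of a desktop CPU, in GB/s
# every weight has to be streamed from RAM roughly once per generated token
print(f"upper bound: {bandwidth_gb / weights_gb:.2f} tokens/s")  # ~0.71
```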
I think with that magnitude of a speed improvement it should become feasible to do just-in-time embedding creation for semantic search for much larger documents.
How would this scale for a use case like writing code? I could imagine that some inputs would require a large number of neurons. Would this architecture be able to do that if it were scaled up?
I'm also curious if this model architecture would achieve the grokking of more complex concepts at scale.
I would have to go back and reread the paper to be sure, but FF layers are applied position-wise, meaning independently and in parallel on all input tokens/positions. Because of that, I could imagine contexts where the sequence dimension isn't relevant, i.e., for computational complexity.
> Language models only really need to use an exponential fraction of their neurons for individual inferences. As proof, we present UltraFastBERT, a BERT variant that uses 0.3% of its neurons during inference while performing on par with similar BERT models. UltraFastBERT selectively engages just 12 out of 4095 neurons for each layer inference. This is achieved by replacing feedforward networks with fast feedforward networks (FFFs). While no truly efficient implementation currently exists to unlock the full acceleration potential of conditional neural execution, we provide high-level CPU code achieving 78x speedup over the optimized baseline feedforward implementation, and a PyTorch implementation delivering 40x speedup over the equivalent batched feedforward inference. We publish our training code, benchmarking setup, and model weights.
Conclusions
> We present UltraFastBERT, a modified version of the (crammed)BERT architecture that uses fast feedforward instead of feedforward networks in its intermediate layers. UltraFastBERT serves as proof that large language models only really need to engage an exponential fraction of their parameters to perform individual inferences. UltraFastBERT-1x11, our deepest model with the highest promise of acceleration, uses only 0.3% of its neurons during inference and already achieves a 78x CPU speedup over the inference time of the corresponding feedforward layer. With a theoretical speedup promise of 341x at the scale of BERT-base models, we hope that our work will inspire an effort to implement primitives for conditional neural execution as a part of device programming interfaces.
They are getting a 78x speedup without hardware support, which is pretty good, and they think they could get roughly another 4x with the right hardware support. So it looks useful now, with the possibility of getting better.
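(Where the theoretical 341x comes from: the dense baseline touches all 4095 FFN neurons per layer while the FFF path touches only 12, and 4095 / 12 ≈ 341.)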
For as long as I've been involved with neural networks for text analysis, it's seemed to me that we really should be using sparse activations, because any particular document only involves a limited set of concepts.
For instance a search engine for patents might be looking at a patent for adhesive tape which activates a certain set of concepts but is not going to activate concepts involved with bicycle derailleurs or public key cryptography: a sparse representation reflects this and dense representations don't.
As far as I understood it: Forget GPUs, this thing is plenty fast on CPUs.
In general, GPUs are bad at branching. The fastest way to implement it on GPUs is probably to let it calculate both sides of the branch and then only use the result of the one that was taken. Which won't be faster than a normal NN.
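i.e. something like this sketch of the "compute both sides, mask one" approach, which does all the work of a dense layer anyway:

```python
import torch

def masked_branch(x, node_fn, left_fn, right_fn):
    # Evaluate both subtrees for every example, then select per example;
    # no divergence on the GPU, but also no compute saved.
    go_right = torch.sigmoid(node_fn(x)) > 0.5            # (batch, 1) bool mask
    return torch.where(go_right, right_fn(x), left_fn(x))
```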
The bottleneck for "easy integration" into games and applications right now is as much the RAM usage as is the slowness. This would probably bring the speed to an acceptable level but you would still have to hold the whole model in RAM.
That would make it a lot more feasible to run models in the cloud (triple-digit gigabytes of RAM are a lot more abundant than VRAM), but wouldn't do that much for consumer hardware.
Interesting idea. Like texture streaming, you'd just stream in the parts of the model from disk to fill up all available RAM. If the NPC needed to think about something not cached in RAM, you'd throw up a "hmm, let me think about this" while stuff loads from disk.
But in reality, a sparse NN simply loses performance, meaning it loses precision and recall. Lower precision means a larger probability of errors; lower recall means that if it works with a piece of information consisting of, say, a few predicates, it will not see all of the predicates.
To be concrete: for a well-trained full-scale NN, precision and recall are usually considered to be around 70-90%; but if you use only a small fraction of the weights, you will usually see a drop to about 40-70%, which is still good enough for many cases considering the savings in size and computation.
Yep. And we only notice the obvious ones. And the quality of AI comments will only improve over time. Which raises the question (unanswered as far as I can tell) of what to do about it, and whether it even really matters...
If the comment is a quality comment, I don't care if the poster used AI; if it's a low-quality or inappropriate comment (and a summary of the existing comments counts), then downvote or, for egregious cases, flag it.
This is rather scary. I feel we are witnessing the evolution of language models and artificial intelligence, which seems intellectually laudable until you realize that the underlying evolutionary framework for this evolution is the global capitalist system, whose only criterion for selection is short-term monetary gain.
I absolutely disagree. I believe everyone else is blind, the same way we are blind to the fact that our current lifestyles are an exercise in extreme violence against the nonhuman world.
Rather than looking to capitalism, which has provided tremendous benefits to society as well as unintended consequences, you may want to update your thinking to focus on the incentive-alignment problem in general.
https://arxiv.org/abs/2308.14711