New Method for Compressing Neural Networks Better Preserves Accuracy (amazon.com)
120 points by georgecarlyle76 on Jan 15, 2019 | 36 comments



Interesting 'cause once you think of it as a lossy compression task, lots of possible tricks come to mind. One is that you could "cheat" to correct the most important errors: look at which input words have high (distance from original point * frequency of use), and save the (in their example) 300-element error vector (original embedding - estimate) for those. Since we use the most common words a lot, you might be able to recover a noticeable chunk of accuracy without using much additional space.

Really, you want to correct the errors that are important, and importance to the task is a different thing from error magnitude * word frequency. You might run the network on real tasks with the uncompressed table and the compressed table, and add correction vectors for the words to which you can attribute the most mistakes made by the compressed net. And you don't necessarily have to correct the whole vector or correct it perfectly; you could round values in the correction vector down to zero when they're small enough, saving space and computation, or save your corrections with low precision, or try another round of SVD on some error vectors and only store 30 more weights for correcting important errors (and one 30x300 matrix) not 300, or some combo of tricks like that.

(I think this paper is saying that after you get the SVD-compressed lookup table, you do more NN training that tweaks the embedding and possibly the rest of the net. The links and weights for correcting important errors could be tweaked in the same process.)
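
(A rough numpy sketch of that correction trick, with random stand-ins for the real embedding table, its compressed estimate, and the word frequencies; names and sizes are made up:)

    import numpy as np

    V, d, k = 100_000, 300, 1000                        # vocab size, embedding dim, corrections kept
    rng = np.random.default_rng(0)
    original = rng.normal(size=(V, d))                  # stand-in for the full embedding table
    approx = original + 0.1 * rng.normal(size=(V, d))   # stand-in for the compressed estimate
                                                        # (really reconstructed from the small factors)
    freq = rng.zipf(1.5, size=V).astype(float)          # stand-in word frequencies

    error = original - approx
    importance = np.linalg.norm(error, axis=1) * freq   # error magnitude * frequency of use

    top = np.argsort(importance)[-k:]                   # the words most worth patching
    corrections = {int(i): error[i] for i in top}       # extra storage: only k * d floats

    def embed(word_id):
        # compressed estimate, plus an exact patch for the words that matter most
        return approx[word_id] + corrections.get(word_id, 0.0)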

It's tempting to treat neural networks as their own black box in a world apart from other CS sometimes, when in fact there can be lots of interesting connections. Now where all those connections ought to be and their ideal weights I don't know :P


The article's title is actually "New Method for Compressing Neural Networks Better Preserves Accuracy".

Summary of a key result: "In one set of experiments, for instance, we showed that our system could shrink a neural network by 90 percent while reducing its accuracy by less than 1%. At the same compression rate, the best prior method reduced the accuracy by about 3.5%."

Link to the 2019 AAAI paper: https://arxiv.org/pdf/1811.00641.pdf


We've reverted the title from the submitted “Method shrinks NNs by 90% with sub-1% accuracy dropoff, vs. 3.5% for prior SOA”. Submitters: there's no need to try to pack the article into 80 characters! Titles and articles are two separate things.


Do they mean compressing the number of neurons involved in the computation, or compressing the output model?


It's the lookup table that takes up most of the space. This work is about breaking it into two layers and continuing to train to regain accuracy. The output model ends up about 90% smaller than the original model.
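
(Roughly, the storage math with a plain truncated SVD looks like the sketch below; sizes are made up for illustration, and in the actual method the two factors then keep training rather than stopping here:)

    import numpy as np

    V, d, r = 100_000, 300, 30         # vocab, embedding dim, retained rank (illustrative sizes)
    E = np.random.randn(V, d)          # stand-in for the trained lookup table

    U, S, Vt = np.linalg.svd(E, full_matrices=False)
    A = U[:, :r] * S[:r]               # (V, r)  first, much smaller lookup table
    B = Vt[:r, :]                      # (r, d)  second, tiny projection layer
    E_approx = A @ B                   # rank-r estimate; A and B then continue training

    print(E.size, A.size + B.size)     # 30,000,000 vs 3,009,000 parameters, ~90% smaller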


It would be interesting if they also showed whether those 90% of the weights are actually necessary to get that 1% back. We kind of suspect they probably aren't; the interesting part would be to show a proof, or at least some solid evidence.


This is shown in one of the plots: the less you compress, the less you lose. While there is some analysis in the paper of how the computation is reduced, etc., the results are mostly empirical.


The goal of the post is to compress an already trained network.

A more ambitious related goal is to perform the training of the network with at most 10% (or some other small percentage) of the network weights being nonzero at any time during training. If this is possible without a great loss in accuracy, it could be particularly useful for saving memory during training. The paper https://arxiv.org/abs/1711.05136 proposes a method to achieve this.


I would not consider word embeddings to be state of the art anymore. Word embeddings today are like TF-IDF was when word embeddings came out. Have a look at the BERT model that was just recently published and is outperforming prior methods on all kinds of NLP tasks with one main architecture. I would consider the BERT language model two levels above word embeddings, as it produces fully context-sensitive embeddings that depend on the text to the left and right of the word at the same time.


I would second this; sentence embeddings outperform word embeddings on basically all tasks where you actually have sentences to work with. The only downside is that they're significantly more computationally intensive, especially for Transformer models like BERT.

(Note: I'm fairly biased, since I work on https://www.basilica.ai, which among other things makes sentence embeddings available over a REST interface.)


Is BERT computationally (and sample-wise) equivalent to previous SOTA?

(I do DRL but not NLP)

I sometimes read these DL papers and the requirements are not really feasible if you have to re-implement them in a modified domain.


BERT is more computationally expensive. It might end up giving better results on the task mentioned in the paper, but we don't know. At the time of writing, all of the contextual word embedding techniques were fairly new and hadn't been tried.


I can see utility in demonstrating a breakthrough with the simpler, compatible technique as opposed to a more complicated state-of-the-art one. The goal of scientific communication is to convey the rationale and effect of a proposal with the simplest examples possible. Then anyone using more advanced frameworks can understand it and consider implementing it in theirs.


Just to clarify a bunch of questions below. The idea is: start with a full embedding layer and train the network; once training stabilizes, break the first layer into two using the SVD formulation, in an online fashion, and continue training. After only a few more epochs you regain almost all of the accuracy. This is particularly interesting and efficient compared to quantization/pruning methods, where there is no way to regain the lost accuracy. Quantized distillation can be an option but is too slow, as the authors point out, and offline embedding compression like FastText.zip compresses the embedding matrix offline, which seems to preserve less information than this method and leads to worse performance.

Another important thing to note: this method shows significant improvement over quantization/pruning/hashing when the network is deep. The deeper the network, the larger the quantization error, because more FLOPs are done in low precision, whereas this algorithm, being principled from an information-theoretic perspective, maintains steady performance irrespective of network depth; deeper nets actually give better gains. In many of our experiments on smaller tasks this method also acts as a cheap regularizer, boosting performance. I also think the mathematical elegance and ease of implementation deserve some appreciation, as does the numerical analysis, which provides guarantees and bounds that quantization methods don't.
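
(For anyone who wants the gist in code, here is a rough PyTorch sketch of the split-then-keep-training step; the sizes and names are illustrative and this is not the actual implementation:)

    import torch
    import torch.nn as nn

    vocab, dim, rank = 100_000, 300, 30            # illustrative sizes

    # 1. train the network as usual with a full embedding layer until it stabilizes
    emb = nn.Embedding(vocab, dim)                 # (pretend this has been trained)

    # 2. factor the trained weight matrix with a truncated SVD
    U, S, Vt = torch.linalg.svd(emb.weight.detach(), full_matrices=False)
    low = nn.Embedding(vocab, rank)
    proj = nn.Linear(rank, dim, bias=False)
    low.weight.data = U[:, :rank] * S[:rank]       # (vocab, rank)
    proj.weight.data = Vt[:rank, :].T              # Linear stores (out, in) = (dim, rank)

    # 3. swap the two small layers in for the big one and continue training the whole net
    def embed_tokens(token_ids):
        return proj(low(token_ids))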


So they took an NN that takes embeddings as input, added a first layer going from one-hot encodings to embeddings (you just use the embedding matrix as the first layer), did an SVD on it, and retrained the network on top.

Using SVD plus retraining to compress a neural network is a common idea; their gains come mostly from applying it to the embeddings, which are usually considered separately from the network (but do take up space on your device).


Interesting. It took me a while to figure out what the main contribution here is, since doing dimensionality reduction on embeddings is fairly common. I think the main contribution is an empirical measure of A) how little is lost by reducing the dimensionality of the embeddings, and B) how much the dimensionality-reduced embeddings benefit from being fine-tuned on the final dataset.

It looks like fine-tuning the lower-dimensionality embeddings helps a lot (1% degradation vs. 2% for the offline dimensionality reduction according to the paper).

This makes me pretty happy, since I run a startup that trains and hosts embeddings (https://www.basilica.ai), and we made the decision to err on the side of providing larger embeddings (up to 4096 dimensions). We made this decision based on the efficacy of offline dimensionality reduction, so if fine-tuned dimensionality reduction works even better, that's great news.


My background is in physics, but I have been reading and following AI papers for many years without actually applying this knowledge.

I have an idea I wish to try out regarding word embeddings, and I am looking for either a minimalist or a very clearly documented library for word embeddings. By minimalist I do not mean simplest to use, but simplest to get acquainted with at the level of the underlying codebase.

I have no interest in becoming a proficient user of a commercially relevant library, since machine learning is not how I earn money. But I can imagine that one of the more theoretical machine learning courses has a codebase for students to tinker with, i.e. one where the educational facet of how the word embedding algorithms work is prioritized over commercial relevance and computational efficiency. Do you know of any such minimal implementations?

The idea concerns objective bias removal (potentially at the cost of loss of factual knowledge), so if it works it may be interesting outside of word embeddings as well...

EDIT: The only functional requirement I have of the codebase is that, given a corpus, it can generate an embedding such that the classic "king" + "woman" - "man" ≈ "queen" relation is satisfied. The only requirement regarding computational efficiency is that it can generate this embedding on a single-core laptop within a day for a roughly billion-word corpus. Any additional functionality is, for my purposes, overcomplication. So I am looking for the simplest (well-documented, preferably in paper form) codebase that achieves this.
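
(For concreteness, the check I mean is roughly the following, assuming something like a recent gensim as the implementation and a placeholder corpus path; gensim may well be bigger than what I'm after:)

    from gensim.models import Word2Vec

    # corpus.txt is a placeholder path: one whitespace-tokenized sentence per line
    model = Word2Vec(corpus_file="corpus.txt", vector_size=300, workers=1)

    # the classic analogy: king + woman - man ~= queen
    print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))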


>I have an idea I wish to try out regarding word embeddings, and I am looking for either a minimalist or a very clearly documented library for word embeddings.

Using AI on natural language is still in its infancy. You're not going to find a clean off-the-shelf solution if you're doing something unique.

Google Cloud has a natural language AI module that makes solving certain problems pretty easy (little to no coding). The next step is to look at packages like NLTK, which is more complex but can handle a larger set of natural language processing tasks (you need to be comfortable with Python). If neither of the above tools does the job, you're going to need to dive much deeper, learn the fundamentals of neural nets and natural language processing, and become familiar with tools like Tensorflow.

Natural language processing is a much more open field than image processing, despite seemingly much lower bandwidth and data sizes. As far as data size goes, all of Shakespeare's works can fit into a single iPhone image, but understanding what's going on in natural language is often a far harder task than understanding what's going on in an image.


I apologize for my late edit, and thank you for your response. I just need a minimal codebase that generates a word embedding that satisfies these quasi linear relationships.


This leads in a very interesting direction. Previously I read that a DNN is just as content to fit random input as it is to fit meaningful categories of input. This makes it seem that a DNN is just a memory trick and can't possibly extract any meaning.

Now that the network can be compressed after training, it's evident that not all of the nodes were needed/used. A good experiment would be to train a DNN on random input samples and verify that the result is far less compressible. If so, the degree of compressibility could be used as a proxy estimate of how much 'conceptualization' is being done.


The overcomplete nature of DNNs is a common phenomenon, and is often advocated as a way to convert saddle points into local minima; understanding the actual loss landscape is difficult due to the inherent nonlinearity. This does lead to an interesting research direction, and many prominent research groups have started investing heavily in it.


> Previously I read that DNN is just as content to fit random input as it is to fit meaningful categories of input.

That's not exactly right. DNNs can memorise randomly assigned labels, but it's harder than learning real data, because real data has more regularities.


Very cool idea. How does fine tuning the SVD initialization compare to training from random initialization using the same architecture? I couldn't find this in the paper.


Something I don't understand:

I thought that when using an LSTM for classification, the input for the sentence was a single vector of length 300, recurrently updated with each word in the text. And for the averaging method, it was a vector of length 300, the average of the length-300 vectors for the words in the sentence. At what stage in the standard process would a matrix of size V*300 be used to represent a sentence or document?


I think the big matrix in the post is the "dictionary" used to map words to NN input vectors, not a representation of one input sentence or document. So its size is (total words understood by NN * dimensions of embedding).
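
(In numpy terms, with made-up sizes and indices:)

    import numpy as np

    V, d = 1_000_000, 300
    table = np.random.randn(V, d)      # the big "dictionary": one row per word the NN knows

    sentence_ids = [17, 4203, 99]      # word indices for one short sentence
    vectors = table[sentence_ids]      # (3, 300): per-word vectors fed to the LSTM one step at a time
    avg = vectors.mean(axis=0)         # (300,): the single vector used by the averaging method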


Yes, sorry that's actually quite clear in the text now that I read it again.


"NLP" would be welcome in the title since this seems specific to natural language processing.



They're just SVDing the embeddings? Is that really new?


I think the new things are integrating the SVD output into the NN as a couple of layers and then training it more, and adjusting the learning rate differently from how it was done before. The training step will tend to tune the compression to work well for the specific task.

Getting good results on the task is different from making the compressed representation of the embedding better on all-purpose metrics (like average magnitude of difference from the original embedding). For example, you might end up with changes that improve representation of common words, or words where errors actually mess up results, at the cost of making the decompression worse by simple metrics.

(It wasn't clear to me if the training adjusts the whole net's weights or just the two layers they add up front. If it adjusts everything, you could also get joint changes to the original net + decompression machinery that only work well together.)


Training adjusts the whole net's weights. And you are right, we get the joint changes to the net and the decompression machinery working together.


Can someone translate the title into non-obfuscated language, please?

This is ridiculous.


NN = Neural Network

SOA = State of the Art

The rest should be clear: it's a new method that shrinks the size of neural nets by 90% while dropping accuracy by less than 1%. The previous state of the art for the same task of shrinking neural nets dropped accuracy by about 3.5%.

Compressing/shrinking neural nets is an important task because most edge devices (e.g. smart devices like thermostats, security cams, etc.) don't have the compute power to run large neural networks and need more compact ways to do the same tasks. Currently the solution is to use internet connectivity to query an API, but that's not ideal for a variety of reasons.


Not neural nets in general, just the word embedding layer. This layer tends to be large, like 1,000,000 x 300, so there are problems deploying it to production, especially in mobile environments. Using SVD, they can reduce it to a 1,000,000 x 30 factor plus a 30 x 300 factor, replace the large layer with those two smaller ones, and then retrain.
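
(For those sizes: 1,000,000 x 300 = 300M parameters for the original table, versus 1,000,000 x 30 + 30 x 300 ≈ 30M for the two factors, which is where the roughly 90% reduction comes from.)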

Note that whole networks can also be compressed to similar levels with a variety of techniques; this one is just slightly better for a specific type of layer.


This isn't a productive way to criticize something. Please state what you were confused about and suggest an alternative.



