
Hmm, I would love for someone else to give their opinion, because I find this very interesting but don't quite understand it yet. The way I see it, at the very beginning the transformer has a large number of choices of numbers a, b that allow it to solve the problem. If randomness is present, it will (pseudo-)randomly choose a pair a, b, with the intention of writing ab,a,b.

After writing the first digit of ab, as I understand it, it wants to recover its features by performing the same operations as before on the past sequence. But since the computation of the features is non-deterministic, it can't arrive at the same pair a, b.

Let me try to specify a more difficult task for a transformer with randomness: I want you to generate exactly three numbers in the form c,a,b where a and b are prime numbers with 250-300 digits and c=ab. I want these numbers to be randomly chosen with a distribution such that the range of 250-300-digit primes is approximately uniformly covered.
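To see that this task is easy for any generator with internal randomness plus memory, here is a minimal sketch using Miller-Rabin primality testing. The digit count is scaled down from 250-300 to 25-30 so it runs quickly; `random_prime` is an illustrative helper, not anything from the thread:

```python
import random

def is_probable_prime(n, rounds=25):
    """Miller-Rabin probabilistic primality test."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % p == 0:
            return n == p
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False
    return True

def random_prime(digits):
    """Uniformly sample a prime with the given number of decimal digits."""
    lo, hi = 10 ** (digits - 1), 10 ** digits
    while True:
        n = random.randrange(lo, hi) | 1  # odd candidate
        if is_probable_prime(n):
            return n

# The task: sample a, b uniformly, then emit c,a,b with c = a*b.
digits = random.randint(25, 30)
a, b = random_prime(digits), random_prime(digits)
c = a * b
print(f"{c},{a},{b}")
```

The crux of the argument is the emission order: the sampler knows a and b before it prints the first digit of c, whereas a transformer conditioning only on already-emitted tokens does not.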

Suppose now a transformer has uniformly picked a and b and generated the first digit of ab. Let's in fact say it has already generated all the digits of ab. If the transformer has weights that make it now successfully print a, b (say with probability > 0.99), then you have constructed a method for factoring products of primes in the 250-300-digit range, i.e. you just initialize the context window with the desired 500-600-digit number ab to be factorized, and let the transformer do its work.

I.e. such a transformer has to be computationally powerful enough to factorize large prime products with >0.99 success probability.

On the other hand, an RNN or a human, both with randomness, can solve this task without having to be computationally powerful enough to do the factorization.
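The contrast can be made concrete: a stateful emitter (an RNN cell, or a person with scratch paper) samples a and b once, stores them in hidden state, and then emits the string ab,a,b one character per step. Printing a after ab is just a memory read, never a factorization. A minimal sketch, with small known primes standing in for 250-digit ones (the names are illustrative):

```python
class StatefulEmitter:
    """Samples (a, b) once, keeps them in hidden state, then emits
    the string "ab,a,b" one character per step -- no factoring needed."""

    def __init__(self, a, b):
        # Hidden state: the chosen factors, carried across steps
        # independently of what has been emitted so far.
        self.buffer = f"{a * b},{a},{b}"
        self.pos = 0

    def step(self):
        # Each step reads memory; it never has to re-derive a, b
        # from the digits of a*b it already emitted.
        ch = self.buffer[self.pos]
        self.pos += 1
        return ch

# Small stand-in primes; the argument is unchanged at 250 digits.
a, b = 104729, 1299709
emitter = StatefulEmitter(a, b)
out = "".join(emitter.step() for _ in range(len(f"{a * b},{a},{b}")))
print(out)
```

A stateless transformer sampling each token from scratch has no such carried-over memory: everything it "knows" about a and b must be recomputed from the visible context, which after ab has been printed amounts to factoring.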



>you just initialize the context window with the desired 500-600-digit number ab to be factorized

You also have to initialize the random context (before the prompt, aka the initial state) so that it is correctly correlated with your ab prompt, since that is the state from which it generated this number ab. For example, give it a feature vector corresponding to a, b, and ab written in binary (in whatever way the network has learned to represent them). It won't ever learn to invert the one-way function ab → a,b from only being shown ab.

In practice, the learning signal will be quite weak, because only once it has seen the whole ab,a,b — i.e. at the last character — can the loss be back-propagated through to the initial time step. In training, all the earlier character-level predictions will amount to noise that it first has to learn to model in order to ignore (a "Benford's law"-style distortion?), or you can just give zero weight to these intermediate character predictions that only add noise.

About determinism vs. randomness: the task is fundamentally deterministic, since it has a single answer, so noise won't help — but a non-deterministic network can and will learn to produce deterministic output without problem. In fact, if at any point the network outputs a wrong digit, it won't be able to recover (unless you give it some character like backspace, in which case, each time it outputs a digit not matching the one its initial state intended, it can output a backspace as the next character and try its luck again), and the answer will be wrong. The output will still look like ab,a,b where a and b are real primes, but with some wrong digits, as if they had been corrupted by a noisy channel.
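The backspace idea can be simulated directly: a noisy emitter that, whenever the digit it just produced disagrees with the one its hidden state intended, emits a backspace token and retries that position. A toy sketch — the noise model, token choice, and function names are all made up for illustration:

```python
import random

BACKSPACE = "<"

def emit_with_backspace(target, error_rate=0.2, rng=None):
    """Emit target's characters through a noisy channel; after each
    wrong digit, emit a backspace token and retry that position."""
    rng = rng or random.Random(0)
    out = []
    i = 0
    while i < len(target):
        ch = target[i]
        if ch.isdigit() and rng.random() < error_rate:
            # A wrong digit slips out; the emitter notices the mismatch
            # with its hidden state and corrects with a backspace.
            out.append(rng.choice("0123456789"))
            out.append(BACKSPACE)
        else:
            out.append(ch)
            i += 1  # position confirmed, move on
    return "".join(out)

def apply_backspaces(s):
    """What a reader applying the backspaces would reconstruct."""
    stack = []
    for ch in s:
        if ch == BACKSPACE:
            stack.pop()
        else:
            stack.append(ch)
    return "".join(stack)

# Small stand-in primes; target plays the role of the intended ab,a,b.
a, b = 104729, 1299709
target = f"{a * b},{a},{b}"
noisy = emit_with_backspace(target)
print(apply_backspaces(noisy))
```

With the backspace token the reconstructed output is always exactly the intended string; without it, each uncorrected error leaves a corrupted digit behind, which is the noisy-channel picture above.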



