
Hmm, I would love for someone else to give their opinion, because I find this very interesting but don't quite understand it yet. The way I see it, at the very beginning the transformer has a large number of choices of numbers a, b that allow it to solve the problem. If randomness is present, it will (pseudo-)randomly choose a pair a, b, with the intention of writing ab,a,b.

After writing the first digit of ab, as I understand it, it wants to recover its features by performing the same operations as before on the past sequence. But since the computation of the features is non-deterministic, it can't arrive at the same pair a, b.

Let me try to specify a more difficult task for a transformer with randomness: I want you to generate exactly three numbers in the form c,a,b where a and b are prime numbers with 250-300 digits and c=ab. I want these numbers to be randomly chosen with a distribution such that the range of 250-300-digit primes is approximately uniformly covered.
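To see that this task is easy for any generator with internal randomness plus memory, here is a minimal sketch using Miller-Rabin primality testing. The digit count is scaled down from 250-300 to 25-30 so it runs quickly; `random_prime` is an illustrative helper, not anything from the thread:

```python
import random

def is_probable_prime(n, rounds=25):
    """Miller-Rabin probabilistic primality test."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % p == 0:
            return n == p
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False
    return True

def random_prime(digits):
    """Uniformly sample a prime with the given number of decimal digits."""
    lo, hi = 10 ** (digits - 1), 10 ** digits
    while True:
        n = random.randrange(lo, hi) | 1  # odd candidate
        if is_probable_prime(n):
            return n

# The task: sample a, b uniformly, then emit c,a,b with c = a*b.
digits = random.randint(25, 30)
a, b = random_prime(digits), random_prime(digits)
c = a * b
print(f"{c},{a},{b}")
```

The crux of the argument is the emission order: the sampler knows a and b before it prints the first digit of c, whereas a transformer conditioning only on already-emitted tokens does not.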

Suppose now a transformer has uniformly picked a and b and generated the first digit of ab. Let's in fact say it has already generated all the digits of ab. If the transformer has weights that make it now successfully print a, b (say with probability > 0.99), then you have constructed a method for factoring products of primes in the 250-300-digit range, i.e. you just initialize the context window with the desired 500-600-digit number ab to be factorized, and let the transformer do its work.

I.e. such a transformer has to be computationally powerful enough to factorize large prime products with >0.99 success probability.

On the other hand, an RNN or a human, both with randomness, can solve this task without having to be computationally powerful enough to do the factorization.
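The contrast can be made concrete: a stateful emitter (an RNN cell, or a person with scratch paper) samples a and b once, stores them in hidden state, and then emits the string ab,a,b one character per step. Printing a after ab is just a memory read, never a factorization. A minimal sketch, with small known primes standing in for 250-digit ones (the names are illustrative):

```python
class StatefulEmitter:
    """Samples (a, b) once, keeps them in hidden state, then emits
    the string "ab,a,b" one character per step -- no factoring needed."""

    def __init__(self, a, b):
        # Hidden state: the chosen factors, carried across steps
        # independently of what has been emitted so far.
        self.buffer = f"{a * b},{a},{b}"
        self.pos = 0

    def step(self):
        # Each step reads memory; it never has to re-derive a, b
        # from the digits of a*b it already emitted.
        ch = self.buffer[self.pos]
        self.pos += 1
        return ch

# Small stand-in primes; the argument is unchanged at 250 digits.
a, b = 104729, 1299709
emitter = StatefulEmitter(a, b)
out = "".join(emitter.step() for _ in range(len(f"{a * b},{a},{b}")))
print(out)
```

A stateless transformer sampling each token from scratch has no such carried-over memory: everything it "knows" about a and b must be recomputed from the visible context, which after ab has been printed amounts to factoring.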



>you just initialize the context window with the desired 500-600-digit number ab to be factorized

You also have to initialize the random context (before the prompt, aka the initial state) so that it is correctly correlated with your ab prompt, since that is the state from which it generated this number ab. For example, give it a feature vector corresponding to a, b, and ab written in binary (in whatever way the network has learned to represent them). It won't ever learn to invert the one-way function ab → a,b from only being shown ab.

In practice, the learning signal will be quite weak, because only once it has seen the whole ab,a,b — i.e. at the last character — can the loss be back-propagated through to the initial time step. In training, all the earlier character-level predictions will amount to noise that it first has to learn to model in order to ignore (a "Benford's law"-style distortion?), or you can just give zero weight to these intermediate character predictions that only add noise.

About determinism vs. randomness: the task is fundamentally deterministic, since it has a single answer, so noise won't help — but a non-deterministic network can and will learn to produce deterministic output without problem. In fact, if at any point the network outputs a wrong digit, it won't be able to recover (unless you give it some character like backspace, in which case, each time it outputs a digit not matching the one its initial state intended, it can output a backspace as the next character and try its luck again), and the answer will be wrong. The output will still look like ab,a,b where a and b are real primes, but with some wrong digits, as if they had been corrupted by a noisy channel.
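The backspace idea can be simulated directly: a noisy emitter that, whenever the digit it just produced disagrees with the one its hidden state intended, emits a backspace token and retries that position. A toy sketch — the noise model, token choice, and function names are all made up for illustration:

```python
import random

BACKSPACE = "<"

def emit_with_backspace(target, error_rate=0.2, rng=None):
    """Emit target's characters through a noisy channel; after each
    wrong digit, emit a backspace token and retry that position."""
    rng = rng or random.Random(0)
    out = []
    i = 0
    while i < len(target):
        ch = target[i]
        if ch.isdigit() and rng.random() < error_rate:
            # A wrong digit slips out; the emitter notices the mismatch
            # with its hidden state and corrects with a backspace.
            out.append(rng.choice("0123456789"))
            out.append(BACKSPACE)
        else:
            out.append(ch)
            i += 1  # position confirmed, move on
    return "".join(out)

def apply_backspaces(s):
    """What a reader applying the backspaces would reconstruct."""
    stack = []
    for ch in s:
        if ch == BACKSPACE:
            stack.pop()
        else:
            stack.append(ch)
    return "".join(stack)

# Small stand-in primes; target plays the role of the intended ab,a,b.
a, b = 104729, 1299709
target = f"{a * b},{a},{b}"
noisy = emit_with_backspace(target)
print(apply_backspaces(noisy))
```

With the backspace token the reconstructed output is always exactly the intended string; without it, each uncorrected error leaves a corrupted digit behind, which is the noisy-channel picture above.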



