You don't have to precisely represent the float in decimal. You just need each float to have a unique decimal representation, which you can guarantee by including enough digits: 9 significant digits for 32-bit floats and 17 for 64-bit floats.
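For example, in Python (the particular values are just illustrative):

```python
x = 0.1 + 0.2                        # 0.30000000000000004
s = format(x, ".17g")                # 17 significant digits
assert float(s) == x                 # exact binary -> decimal -> binary round trip

# 16 digits is not always enough: the double just above 1.0 collapses back to 1.0.
y = 1.0000000000000002               # nearest double: 1 + 2**-52
assert float(format(y, ".16g")) != y
assert float(format(y, ".17g")) == y
```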
And you need to trust that whoever is generating the JSON you’re consuming, or will consume the JSON you generate, is using a library which agrees about what those representations round to.
Note that the consumer side doesn't really have much ambiguity. You just read the number, compute its exact value as written, and round it to the closest binary representation using round-to-nearest, ties-to-even (banker's rounding). You'd do anything other than this only under very special circumstances. Virtually all the ambiguity lies on the producer side, and it can be eliminated by using any formatting algorithm with a round-trip guarantee.
EDIT:
If you're talking about decimal->binary->decimal round-tripping, it's a completely different story though.
This is one of those really common misunderstandings, in my experience. Indeed, JSON doesn't encode any specific precision at all. It's just a decimal number of whatever length you want, with the knowledge that parsing libraries will likely decode it into something like IEEE 754. This is why libraries like Python's json let you supply a custom parser if, say, you want a Decimal object for numbers.
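For example, with the standard library (the key name is just illustrative):

```python
import json
from decimal import Decimal

# Parse JSON numbers into exact Decimals instead of binary doubles.
doc = json.loads('{"price": 0.1}', parse_float=Decimal)
print(doc["price"])                   # 0.1, stored as Decimal('0.1')
print(doc["price"] + Decimal("0.2"))  # 0.3 exactly, no binary rounding
```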
Like it or not, JSON data types are inherently linked to the primitives available in JavaScript. You can, of course, write JSON that can't be handled with the native types available in JavaScript, but the native parser will always deserialize to a native type. Until very recently all numbers in JavaScript were IEEE 754 doubles, although arbitrary-precision bignums do exist now. So the de facto precision limit for a JSON number that needs to be broadly compatible is an IEEE 754 double. If you control your clients you can do whatever you want, though.
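A quick Python illustration of that de facto limit (Python happens to parse JSON integers exactly, so the rounding only shows up once you force the value into a double):

```python
import json

n = 2**53 + 1                     # 9007199254740993, just past the exact-integer range of a double
print(json.loads(str(n)))         # Python keeps it exact: 9007199254740993
print(float(n) == float(n - 1))   # True: as an IEEE 754 double, both collapse to 9007199254740992
```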
The standard definitely limits what precision you should expect to be handled.
But how JSON numbers are handled by different parsers might surprise you. This blog post actually does a good job of detailing the subtleties and the choices made in a few standard languages and libraries: https://github.com/bterlson/blog/blob/main/content/blog/what...
I think one particular surprise is that the standard C# and Java parsers both use OpenAPI schema hints that a piece of data is of type 'number' to map the value to a decimal floating-point type, not a binary one.
Not sure which parser you consider standard, as Java doesn't have one at all (in the standard libraries). Other than that, the existing ones just take the target type they deserialize into (not JSON), e.g. int, long, etc.
That's a bit much - (unfortunately) the codebase uses at least 4 different JSON libraries (perhaps 5 if I count one non-general-purpose, personally written one). Gson is generally very popular as well. The blog post mentions BigDecimal, and at that point I'd not dare to trust it much.
The de facto standard is similar to the expectation that everyone uses Spring Boot.
Indeed - you could be serializing to or from JSON where the in-memory representation you're aiming for is actually a floating point decimal. JSON doesn't care.
What you're describing (search + NN) currently goes by the term "test-time compute".
The rules, dynamics, and objectives of chess (and Go) are trivial to encode in a search formulation. I personally don't really get what that tells us about AGI.
No, I don't think that test-time compute is the same thing at all. It's a little challenging to find a definitive definition of TTC, but AFAICT it is just a fairly simple control loop around an LLM. What I'm describing is a merging of components with fundamentally different architectures, each of which is a significant engineering effort in its own right, to produce a whole that is greater than the sum of its parts. Those seem different to me, but to be fair I have not been keeping up with the latest tech so I could be wrong.
I think search is a fairly simple control loop. Beam search is an example of TTC in this modern era.
It is a very wide term, IME, that means anything besides "one-shot through the network".
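To make "simple control loop" concrete, here's a beam-search sketch in Python; `score_next` is a stand-in for whatever model call scores candidate continuations:

```python
import math

def beam_search(score_next, start, beam_width=3, steps=5):
    # Keep the `beam_width` best partial sequences at every step.
    # `score_next(seq)` stands in for a model call and returns
    # (token, log_prob) candidates for extending `seq`.
    beam = [(0.0, [start])]
    for _ in range(steps):
        candidates = []
        for logp, seq in beam:
            for tok, tok_logp in score_next(seq):
                candidates.append((logp + tok_logp, seq + [tok]))
        # The "control loop": rank, prune, then call the network again.
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beam

# Toy scorer: slightly prefers repeating the last token.
def toy_scorer(seq):
    return [(seq[-1], math.log(0.6)), (seq[-1] + 1, math.log(0.4))]

print(beam_search(toy_scorer, start=0))
```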
I think the point about the search formulation being amenable to domains like chess and Go, but not to other domains, is critical. If LLMs are coming up with effective search formulations for "open-ended" problems, that would be a big deal. Maybe this is what you're alluding to.
That's like saying that Darwinian evolution is simple. It's not entirely wrong, but it misses the point rather badly. The thing that makes search useful is not the search per se, it's the heuristics that reduce an exponential search space to make it tractable. In the case of evolution (which is a search process) the heuristic is that at every iteration you select the best solution on the search frontier, and you never backtrack. That heuristic produces a certain kind of interesting result (life) but it also has certain drawbacks (it's limited to a single quality metric: reproductive fitness).
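To make that heuristic concrete, a toy sketch of the "keep the best on the frontier, never backtrack" loop (all names and numbers here are made up):

```python
import random

def mutate(x):
    # Hypothetical mutation operator over a vector of floats.
    return [xi + random.gauss(0, 0.1) for xi in x]

def greedy_evolve(fitness, seed, n_mutants=50, generations=100):
    # Keep only the fittest candidate on the frontier; never backtrack;
    # a single quality metric drives everything.
    best = seed
    for _ in range(generations):
        frontier = [mutate(best) for _ in range(n_mutants)]
        challenger = max(frontier, key=fitness)
        if fitness(challenger) > fitness(best):
            best = challenger
    return best

# Example: maximize -(x - 3)^2 starting from 0.
print(greedy_evolve(lambda v: -(v[0] - 3) ** 2, seed=[0.0]))
```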
> Beam search is an example of TTC in this modern era.
That's an interesting analogy. I'll have to ponder that.
But my knee-jerk reaction is that it's not enough to say "put reactivity and deliberation together". The manner in which you put them together matters, and in particular, it turns out that putting them together with a third component that manages both the deliberation and the search is highly effective. I can't say definitively that it's the best way -- AFAIK no one has ever actually done the research necessary to establish that. But empirically it produced good results with very little computing power (by today's standards).
My gut tells me that the right way to combine LLMs and search is not to have the search manage the LLM, but to provide search as a resource for the LLM to use, kind of like humans use a pocket calculator to help them do arithmetic.
> If LLMs are coming up with effective search formulation for "open-ended" problems, that would be a big deal.
AFAICT, at the moment LLMs aren't "coming up" with anything, they are just a more effective compression algorithm for vast quantities of data. That's not nothing. You can view the scientific method itself as a compression algorithm. But to come up with original ideas you need something else, something analogous to the random variation and selection in Darwinian evolution. Yes, I know that there is a random element in LLM algorithms, and again I don't really understand the details, but the way in which the randomness is deployed just feels wrong to me somehow.
I wish I had more time to think deeply about these things.
- the amount of information sufficient to evolve the system. The state of a pendulum is its position and velocity (or momentum). If you take a single picture of a pendulum, you do not have a representation that lets you make predictions.
- information that is persisted through time. A stateful protocol is one where you need to know the history of the messages to understand what will happen next. (Or, analytically, it's enough to keep track of the sufficient state.) A procedure with some hidden state isn't a pure function. You can make it a pure function by making the state explicit.
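A tiny Python illustration of that last point (the counter here is made up, just to show the state being made explicit):

```python
# Hidden state: the result depends on call history, not just the arguments.
_count = 0
def next_id():
    global _count
    _count += 1
    return _count

# Explicit state: a pure function; the state is threaded through inputs and outputs.
def next_id_pure(count):
    new_count = count + 1
    return new_count, new_count
```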
I'm not sure what you mean by "hidden state". If you set aside chain of thought, memories, system prompts, etc. and the interfaces that don't show them, there is no hidden state.
These LLMs are almost always, to my knowledge, autoregressive models, not recurrent models (Mamba is a notable exception).
If you don't know, that's not necessarily anyone's fault, but why are you dunking on the conversation? The hidden state is a foundational part of a transformer's implementation. And since we're not allowed to use metaphors because that is too anthropomorphic, you're just going to have to go learn the math.
The comment you are replying to is not claiming ignorance of how models work. It is saying that the author does know how they work, and they do not contain anything that can properly be described as "hidden state". The claimed confusion is over how the term "hidden state" is being used, on the basis that it is not being used correctly.
I don't think your response is very productive, and I find that my understanding of LLMs aligns with the person you're calling out. We could both be wrong, but I'm grateful that someone else spoke up to say that it doesn't seem to match their mental model; we would all love to learn a more correct way of thinking about LLMs.
Telling us to just go and learn the math is a little hurtful and doesn't really get me any closer to learning the math. It comes across as gatekeeping.
Hidden state in the form of attention-head outputs, intermediate activations, and so on. Logically, in autoregression these are recalculated every time you run the sequence to predict the next token. The point is, the entire NN state isn't output for each token. There is lots of hidden state that goes into selecting that token, and the token isn't a full representation of that information.
Hidden layer is a term of art in machine learning / neural network research. See https://en.wikipedia.org/wiki/Hidden_layer . Somehow this term mutated into "hidden state", which in informal contexts does seem to be used quite often the way the grandparent comment used it.
That's not what "state" means, typically. The "state of mind" you're in affects the words you say in response to something.
Intermediate activations aren't "state". The tokens that have already been generated, along with the fixed weights, are the only data that affect the next tokens.
Sure it's state. It logically evolves stepwise per token generation. It encapsulates the LLM's understanding of the text so far so it can predict the next token. That it is merely a fixed function of other data isn't interesting or useful to say.
All deterministic programs are fixed functions of program code, inputs and computation steps, but we don't say that they don't have state. It's not a useful distinction for communicating among humans.
I'll say it once more: I think it is useful to distinguish between autoregressive and recurrent architectures. A clear way to make that distinction is to agree that the recurrent architecture has hidden state, while the autoregressive one does not. A recurrent model has some point in a space that "encapsulates its understanding". This space is "hidden" in the sense that it doesn't correspond to text tokens or any other output. This space is "state" in the sense that it is sufficient to summarize the history of the inputs for the sake of predicting the next output.
When you use "hidden state" the way you are using it, I am left wondering how you make a distinction between autoregressive and recurrent architectures.
I'll also point out the most important part of your original message:
> LLMs have hidden state not necessarily directly reflected in the tokens being produced, and it is possible for LLMs to output tokens in opposition to this hidden state to achieve longer-term outcomes (or predictions, if you prefer).
But what does it mean for an LLM to output a token in opposition to its hidden state? If there's a longer-term goal, it either needs to be verbalized in the output stream, or somehow reconstructed from the prompt on each token.
There’s some work (a link would be great) that disentangles whether chain-of-thought helps because it gives the model more FLOPs to process, or because it makes its subgoals explicit—e.g., by outputting “Okay, let’s reason through this step by step...” versus just "...." What they find is that even placeholder tokens like "..." can help.
That seems to imply some notion of evolving hidden state! I see how that comes in!
But crucially, in autoregressive models, this state isn’t persisted across time. Each token is generated afresh, based only on the visible history. The model’s internal (hidden) layers are certainly rich and structured and "non verbal".
But any nefarious intention or conclusion has to be arrived at on every forward pass.
The LLM can predict that it may lie, and when it sees tokens which are contrary to some correspondence with reality as it "understands" it, it may predict that the lie continues. It doesn't necessarily need to predict that it will reveal the lie. You can, after all, stop autoregressively producing tokens at any point, and the LLM may elect to produce an end-of-sequence token without revealing the lie.
Goals, such as they are, are essentially programs, or simulations, the LLM runs that help it predict (generate) future tokens.
Anyway, the whole original article is a rejection of anthropomorphism. I think the anthropomorphism is useful, but you still need to think of LLMs as deeply defective minds. And I totally reject the idea that they have intrinsic moral weight or consciousness or anything close to that.
You're correct, the distinction matters. Autoregressive models have no hidden state between tokens, just the visible sequence. Every forward pass starts fresh from the tokens alone. But that's precisely why they need chain-of-thought: they're using the output sequence itself as their working memory. It's computationally universal but absurdly inefficient, like having amnesia between every word and needing to re-read everything you've written. https://thinks.lol/2025/01/memory-makes-computation-universa...
The words "hidden" and "state" have commonsense meanings. If recurrent architectures want a term for their particular way of storing hidden state they can make up one that isn't ambiguous imo.
"Transformers do not have hidden state" is, as we can clearly see from this thread, far more misleading than the opposite.
No, that's not quite what I mean. I used the logits in another reply to point out that there is data specific to the generation process that is not available from the tokens, but there's also the network activations adding up to that state.
Processing tokens is a bit like ticks in a CPU, where the model weights are the program code, and tokens are both input and output. The computation that occurs logically retains concepts and plans over multiple token generation steps.
That it is fully deterministic is no more interesting than saying a variable in a single threaded program is not state because you can recompute its value by replaying the program with the same inputs. It seems to me that this uninteresting distinction is the GP's issue.
Do LLMs consider future tokens when making next-token predictions?
E.g. pick 'the' as the next token because there's a strong probability of 'planet' as the token after?
Is it only past state that influences the choice of 'the'? Or is the model predicting many tokens in advance and only returning the one in the output?
If it does predict many, I'd consider that state hidden in the model weights.
The most obvious case of this is in terms of `an apple` vs `a pear`. LLMs never get the a-an distinction wrong, because their internal state 'knows' the word that'll come next.
If I give an LLM a fragment of text that starts with, "The fruit they ate was an <TOKEN>", regardless of any plan, the grammatically correct answer is going to force a noun starting with a vowel. How do you disentangle the grammar from planning?
Going to be a lot more "an apple" in the corpus than "an pear".
Caveat: Coercions exist in Lean, so subtypes actually can be used like the supertype, similar to other languages. This is done via essentially adding an implicit casting operation when such a usage is encountered.
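For instance (a small sketch in plain Lean 4; assuming I'm remembering the built-in Subtype coercion correctly, both #eval lines print 5):

```lean
-- A subtype bundles a value of the base type with a proof of the predicate.
def two : {n : Nat // n > 0} := ⟨2, by decide⟩

#eval two.val + 3      -- explicit projection to Nat
#eval (two : Nat) + 3  -- implicit coercion inserted by the elaborator
```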
I'm not quite following. According to the OP and the docs you linked, a subtype is defined by a base type and a predicate. In other words: You can view it as a subset of the set of elements of the base type. That's pretty much the standard definition of a subtype.
Object-oriented programming languages are not that different: the types induced by classes can easily be viewed as sets. A child class is a specialized version of its parent class, hence a subtype/subset thereof if you define all the sets by declaring `instanceof` to be their predicate function.
B. Meyer made an attempt to formulate many concepts in programming using simple set theory. It might help in discussions like this. I say might since I'm not mathematically inclined enough to know for sure.
> You can view it as a subset of the set of elements of the base type.
Technically speaking, the elements of the supertype are all distinct from the elements of the subtype and vice versa. Neither is a subset of the other, hence why it's improper to consider one a subtype of the other.
Right, though the embedding is trivial, the conceptual distinction is not. In Lean, a subtype is a refinement that restricts by proof. In OOP, a subclass augments or overrides behavior. It's composition versus inheritance. The trivial embedding masks a fundamental shift in what "subtype" means.