Llama: Add grammar-based sampling (github.com/ggerganov)
417 points by davepeck on July 21, 2023 | 105 comments



Here's my understanding of how this works (please someone correct me if I'm getting this wrong).

Language models emit tokens one at a time, starting with the prompt that you give them.

If you have a conversation with an LLM, effectively you can think of that as you giving it a sequence of tokens, then it generates some, then you generate more, and so on.

This grammar trick takes advantage of that by giving you much more fine-grained control over the tokens. So you can do things like this:

    Give me the address of the
    White House as JSON:
    
    {"street": "
Then the LLM can return:

    1600 Pennsylvania Ave NW"
The moment you see that closing double quote, you take over again and inject:

    ",
    "City": "
It fills in:

    Washington, DC"
And so on.
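
Here's a rough sketch of that handoff loop; the model call is just a canned stub, and none of these function names come from the PR:

    # Sketch of the "inject fixed text, let the model fill in the values" loop.
    # generate_until() stands in for a real inference call that streams tokens
    # until the stop string appears; here it's a canned stub.
    def generate_until(prompt, stop='"'):
        canned = {'"street": "': '1600 Pennsylvania Ave NW',
                  '"city": "': 'Washington, DC'}
        for opener, value in canned.items():
            if prompt.endswith(opener):
                return value
        return ''

    def fill_template(intro, keys):
        out = intro + ' {'
        for i, key in enumerate(keys):
            out += f'"{key}": "'            # we inject the key and opening quote
            out += generate_until(out)      # the "model" fills in the value
            out += '"' + (', ' if i < len(keys) - 1 else '}')
        return out

    print(fill_template('Give me the address of the White House as JSON:', ['street', 'city']))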

But because this is all based on a grammar, you can do way more with it than just JSON.

I saw a brilliant suggestion relating to this on Twitter a while ago:

> @OpenAI should add an API argument allowing passing up a deterministic context free grammar.

> [...]

> While I think DCFL is what you want here in the short term, the really best thing is passing up a small WASM binary that simply is the sampler.

> Allow a user to pass up a few KB of WASM binary and give it a few megabytes of RAM to run. Would enable next level LLM superpowers.

https://twitter.com/grantslatton/status/1637692033115762688


Not just that: the LLM doesn't output individual tokens, it outputs a weighted recommendation across all possible tokens. The most probable (“best”) token has the highest weight, but there may be many alternatives, including JSON symbols like quote characters.

The “temperature” setting adjusts how likely it is that a token other than the top-rated option gets chosen, which helps prevent repetitive output.

Forcing an LLM to obey a grammar is mostly about filtering the list before the token choice is made. There may still be a random element controlled by the temperature!
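A toy illustration of that order of operations (mask by the grammar first, then temperature-sample from what's left); plain Python, nothing llama.cpp-specific:

    import math, random

    def sample(logits, allowed, temperature=0.8):
        """Pick one token: mask everything the grammar forbids, then
        sample from a temperature-scaled softmax over what's left."""
        masked = {t: l for t, l in logits.items() if t in allowed}
        if not masked:
            raise ValueError("grammar leaves no valid continuation")  # this is where you'd backtrack
        scaled = {t: l / temperature for t, l in masked.items()}
        m = max(scaled.values())
        weights = {t: math.exp(l - m) for t, l in scaled.items()}
        r = random.random() * sum(weights.values())
        for t, w in weights.items():
            r -= w
            if r <= 0:
                return t
        return t

    # toy logits over three candidate tokens; only '"' and ',' are grammar-legal here
    print(sample({'"': 2.1, ',': 1.7, 'hello': 3.0}, allowed={'"', ','}))
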

A more advanced feature not commonly used is to also enable back-tracking if the AI gets stuck and can’t produce a valid output.


> A more advanced feature not commonly used is to also enable back-tracking if the AI gets stuck and can’t produce a valid output.

Technically that part is mandatory if you don't just want it to produce an output, but to produce an output that correctly matches the temperature (i.e. one that you could have gotten by randomly sampling the LLM until you got a correct one). Randomly picking the next token that isn't grammatically incorrect works, but it oversamples paths where most of the options are invalid. The ultimate example of this is that it can get stuck at a branch with probability 0.

From a probabilistic standpoint what you'd need to do is not just make it backtrack but make it keep generating until it generates a grammatically correct output in one go.

Maybe there is something clever that can be done to avoid regenerating from the start? What you'd need to achieve is that a token that has an x% probability of leading to an incorrect output also has an x% probability of being erased.
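
For reference, the "regenerate from scratch until it parses" baseline from the previous paragraph is just rejection sampling; a minimal sketch (the model and the grammar check are stubs):

    import random

    def rejection_sample(generate, is_valid, max_tries=100):
        # Sample whole outputs unconstrained and throw away the invalid ones.
        # Unlike per-token masking this doesn't oversample "lucky" branches,
        # but it can burn arbitrarily many generations before one parses.
        for _ in range(max_tries):
            out = generate()
            if is_valid(out):
                return out
        raise RuntimeError("no grammatical sample within budget")

    # toy usage: a coin-flip "model" and a trivial "grammar"
    print(rejection_sample(lambda: random.choice(['{"a":1}', 'oops']),
                           lambda s: s.startswith('{') and s.endswith('}')))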


The way LLMs work is they output probabilities for every _token_, so you don't really need to backtrack; you can just always pick a token that matches the provided grammar.

That said, you might want to do something like (backtracking) beam-search which uses various heuristics to simultaneously explore multiple different paths because the semantic information may not be front-loaded, i.e. let's say we had a grammar that had a key "healthy" with values "very_unhealthy" or "moderately_healthy." For broccoli, the LLM might intend to say "very_healthy" and choose "very" but then be pigeonholed into saying "very_unhealthy" because it's the only valid completion according to the grammar.

That said, there are a lot of shortcuts you can take to make this fairly efficient thanks to the autoregressive nature of (most modern) LLMs. You only need to regenerate / recompute from where you want to backtrack from.


Whether or not backtracking is needed is really down to the grammar's ambiguity.

The auto-regressive nature of LLMs is actually something that counts against them, at least as some tell it. Although, really, the root problem is that generating autoregressively from LLMs precludes planning ahead while also lacking any iterative refinement stage.

Backtracking, look-ahead, early failure pruning and staged generation are all very useful for fitting both concepts (refinement and planning ahead) in an auto-regressive generation framework.


This is what Google DeepMind is working on: treating the output of LLMs as a tree to be searched instead of just linearly outputting tokens in a "greedy" manner and hoping for the best.

Apparently GPT-4 gets a lot of its quality from generating many alternatives (16?) and then picking the best one, but this takes 16x as much compute.

A clever tree search (which itself could be a neural net!) could improve the efficiency of this many-fold while simultaneously improving the quality by a huge factor as well.


Arguably a '1 token at a time' model is itself a tree search, so it's more of a perspective than anything. It's really when you start pruning this tree that this distinction becomes interesting. And of course treating the tree as an explicit object may allow the model to do interesting stuff like jumping to a different branch entirely (deletions, insertions, etc.).

Generating 16 alternatives and picking the best one only makes sense to me if your standard for picking one is orthogonal to the model itself; if you just pick the one that your model deems the most likely, you've just figured out a very crude and expensive way to lower the temperature.


That is stretching "arguably" too far. If you are taking 1 sample path, you are not in any meaningful sense searching a tree. In the context of sampling a probability distribution, which is what LLMs do in effect, there is extra depth to this. Any random response need not be representative of what the model "thinks". And, maybe counter-intuitively to some, the most likely generation might actually be unrepresentative as well.

Drawing lots of samples and then marginalizing (as a kind of vote) is methodologically more principled where appropriate. Constraining generation according to some gating function, continually redrawing samples, can be used to significantly reduce error rates at the cost of longer generation times.

LLMs are not being used to their full potential because it is too costly to do so.


Isn’t that the whole point of using RL with these things, that the chain of likeliest tokens one by one doesn’t lead to the best overall generation by the model (according to the model itself)? I believe that is one reason RLHF uses RL and not supervised learning; credit assignment for a good sentence to each token is not trivial after all.


When we talk about tree search we allow for backtracking, so if a node has 3 children, generally all 3 will be explored, or at least a subsample of the children will be. In LLM sampling, you generally pick a single token/child and then just go on with that until the end of the generation.

If DeepMind is indeed applying something similar to AlphaZero to language modelling, one would expect they would generate multiple "rollouts" from the current context, then use some kind of function/network to predict which next token will lead to the best final generation, and then output that token. How to do all of that using a sensible amount of compute remains to be seen.


Talking about efficiency: LLMs are often more efficient running batches, i.e. several lines at a time. Which means we could at some point branch new lines and run them in parallel; that would be more efficient than running them one after another. Moreover, with some tricks we can share the 'history' instead of recomputing it. This requires going deep into the model, though.


What tricks are you thinking about? Sharing the history still means you need to save the state of the autoregressive transformer, which is usually prohibitively large?


I'm talking about inference. What we need to save is the keys (we need all of them to compute the next tokens); we don't need the queries. But we can exploit the fact that each next token depends only on the previous ones, and the same holds for whatever comes out of each transformer block. Let's call that the 'history'. It's a 2D array [prev_size, embed_size]; typically 1024x512 = 0.5M, maybe more depending on the model, but still affordable. prev_size here grows from 0 to max_prompt_size as we do inference. The idea is that we don't need to recompute it every time; we just add one element as we compute each next token. And if we want to try several alternative tokens, we can put them in one batch and they will share the same 'history'. We need just a copy, or better, a reference. This way branching is almost free, as opposed to the 'normal' way where everything is recomputed for each alternative token.
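
Here's a toy picture of that bookkeeping (class and field names are made up; a real implementation stores per-layer key/value tensors rather than one flat list):

    # Each branch keeps a reference to the shared prefix cache plus its own tail,
    # so forking a new candidate token costs a small copy of references, not a
    # recompute of the whole prompt.
    class Branch:
        def __init__(self, shared_cache, own=None):
            self.shared = shared_cache      # prompt 'history', reused by all branches
            self.own = own or []            # entries this branch added itself

        def fork(self):
            return Branch(self.shared, list(self.own))

        def append(self, entry):            # one new entry per generated token
            self.own.append(entry)

        def full_cache(self):
            return self.shared + self.own

    prompt_cache = [("k%d" % i, "v%d" % i) for i in range(1024)]  # prompt of length 1024
    a = Branch(prompt_cache)
    b = a.fork()                            # branching is (almost) free
    a.append(("k_a", "v_a")); b.append(("k_b", "v_b"))
    print(len(a.full_cache()), len(b.full_cache()))  # 1025 1025, prefix stored once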


This isn't true; it's a telephone-game version of "it's a mixture of experts model" that was used to explain the impossible claim that "it's a 1 trillion parameter model" back in fall '22.


Well, if LLM suggests "moves", and an Expert Model judges the whole output, then combining the two with a tree search suspiciously resembles the AlphaGo idea.


It’s not true.


Apparently it's both. There's a bunch of experts, and then those output many alternatives, of which you see the "best" one as selected by a final quality-check neural net.


I can’t say this strongly enough: it’s not true. You’re just the latest victim.


I understand that the people who claim this don't provide any evidence. But do you have any pointers for the claim that it is not true?


Alas, no, though I'm going to think out loud a bit. I've had to go from making a comment like this once a month to twice a week, so I'm curious what pops out as helpful to point to.

Forgive the opinionated language; it's more concise and makes it clearer what exactly I can give evidence of:

- December 22: proto-AI influencers are latching onto GPT4 rumors as a source of engagement. A bunch of people start repeating "RUMORS say GPT4 has ONE TRILLION parameters." Altman laughs, most people laugh; it's not quite so big a community yet.

This percolates, but you kinda ignore it: it's aimed at non-tech people and it's unfalsifiable.

- Feb 23: GPT3.5 API announcement, run out of news, and GPT4 stuff circulates again. An MS Euro executive throws gas on the fire by confirming its release 1.5 weeks early. These claims circulate in coverage of what GPT4 might be. However, the circulation is 99.99% in non-tech circles still.

- Mar 23: GPT4 comes out; by now "Chinchilla scaling laws" went from something 10% of tech following AI knows about to maybe 0.1%. OpenAI releases ~0 information on # of parameters, training, or runtime details, just a visualization of a Chinchilla-fit scaling curve and a note that they were able to predict the model's abilities in advance based on scaling laws.

- Apr 23: GPT4 release content is old now; people needing content venture into claiming details about the model from leaks -- it's just the same trillion-parameter thing.

- May 23: Tech substacks begin offering a perspective on AI. They're new and don't know enough to know Altman laughed it off...and that it would be absurd for 100 other reasons. It comes up. A particularly famous blog handwaves about "mixture of experts" to explain how the trillion-parameter number could make sense, despite the most basic reason it wouldn't (Chinchilla scaling) and the most factual reason it isn't (Altman laughing it off). "Altman was just parsing the idea closely to hide details, it was a showman stunt!"

- Jun 23: The tech community interested in AI outstrips the sober-minded/experienced with LLMs by 1000:1, and this sounds plausible, and it's unfalsifiable. There is no proof it _isn't_ true, and it could be true, and it's a comfortable way to "understand" without putting in the work to understand. People start laundering it to HN in subdiscussions. I see it once the whole month.

- end of July 23: I've seen it every week in July, twice this week.

This is the first time I've seen the mixture of experts simplified to "it generates 16 answers and picks one" ---

which is a thing!

Except that's top-K.

And it's a _completely independent claim_ from the original misunderstandings, and it is a misunderstanding of the misunderstandings that shores up the weak points of the misunderstandings.

Yet the claim would only make sense if the misunderstandings were true at face value, weak points and all: generating 16 outputs from the same model has existed for a very, very long time. I only got in on this in 2019, but it's been around since then, and I'm almost certain someone with formal ML training will pop in and say "1965 bro"


Wait, so it was never even confirmed or actually leaked by OpenAI that they're using a MoE model? That was just invented by some blog? I've seen it mentioned everywhere as though it's true.

I think it's likely they're using a technique that is similar to or a descendant of the Tree of Thought technique, because in Karpathy's talk, where he was not allowed to discuss GPT-4's architecture and so had to discuss only information in the public domain about other models, he pretty strongly indicated that the direction of research he thought people should pursue was ToT. In the past, Karpathy has communicated basically as much as he can to try and educate people about how these models are made and how to do it yourself; he has put up one of the best YouTube tutorials on making an LLM. I suspect that he personally probably does not agree with OpenAI's level of secrecy, but at minimum he shares a lot more information publicly than most OAI employees.


We already do tree searches: see beam search and “best of” search. Arguable whether it is a “clever” tree search, but it's not entirely unguided either, since you prune your tree based on factors like perplexity, which is a measure of how probable/plausible the model rates a branch as it stands so far.

In beam search you might keep the top n branches at each token generation step. “Best of” is in a sense the same, but you take many steps using regular sampling before pruning.
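
A toy sketch of the “best of” idea, scoring each completion by its length-normalized total log-probability (the model call is a stub):

    import math, random

    def best_of(generate_with_logprobs, n=16):
        # Draw n independent completions and keep the one the model itself
        # rates most probable (sum of token log-probs, length-normalized).
        best, best_score = None, -math.inf
        for _ in range(n):
            text, logprobs = generate_with_logprobs()
            score = sum(logprobs) / max(len(logprobs), 1)
            if score > best_score:
                best, best_score = text, score
        return best

    # stub "model": returns a string and fake per-token log-probs
    stub = lambda: (random.choice(["A", "B", "C"]),
                    [math.log(random.random()) for _ in range(5)])
    print(best_of(stub, n=4))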


> Maybe there is something clever that can be done to avoid regenerating from the start? What you'd need to achieve is that a token that has an x% probability of leading to an incorrect output also has an x% probability of being erased.

Like giving the LLM a backspace token? There is a paper related to this:

https://news.ycombinator.com/item?id=36425375


I mean you're going to need to include a probability to backtrack one way or another, but simply having a backtrack character seems more like a trick to make fitting the model easier than a way to make constraining it more accurate.

Simply having the probability to backtrack does turn the whole generation process into an ergodic Markov chain though, so you might be able to use something like MCMC to make it work. Technically those only sample the target distribution asymptotically, but picking the first or nth full output might be good enough for all practical purposes. Especially at low temperatures, where there aren't many reasonable options in the first place.


No, the way it works is that the current output + potential next tokens to be sampled are checked with the grammar. All potential tokens that don't match are removed. Then, with the list of valid tokens left, normal sampling strategies are used.


I don't think this is correct; previously you could already control output by reading tokens one at a time from the LLM until you hit a stop character.

My take from the grammar-based sampling PR is that you ask llama.cpp to constrain the next output token to a restricted set of possible tokens, using the grammar.


Right, which is the same idea - it's just that the code in llama.cpp is running your grammar as part of its token generation decisions as opposed to pausing and waiting for your other code to pick the next token.

(I'm trying for a very high level explanation here.)


You could also always specify the logit bias parameter in the OpenAI APIs.


That's true, and one can bias logits in llama.cpp and friends too, but those are global biases that affect the entire output rather than being specified per-token. Uploading a grammar or a wasm binary to the inference engine does seem more expressive.


Another detailed description of how to do this: https://github.com/normal-computing/outlines/pull/131

That's from one of the developers of Outlines, another cool LLM workflow library.


There's a paper as well. :) https://arxiv.org/pdf/2307.09702.pdf


I'm struggling to understand what he's talking about. Starting with "passing up," did he invent this terminology? The only input you have to an LLM is the prompt, which gets tokenized. And if you were to send DCFG rules or a compiled version of it as part of the request, how would that fundamentally alter the way that the tokens are predicted? If the model predicts something that doesn't conform to the grammar you require, is he proposing re-prompting until it gets it right?


You have more inputs to an LLM than just the prompt. For example, people commonly pass parameters which control the sampling of tokens.

Implementing grammar based sampling does NOT require "re-prompting until it gets it right". Imagine a point in time when the LLM is generating some particular token. Which token will it produce? To decide that, it evaluates and assigns a score to each potential token. Then it chooses one of these options based on some rules. Rules could be as simple as "pick the token with the highest score". That is called a greedy strategy. Usually more complex strategies are used and they typically have some randomness. That is called sampling. You can imagine a grammar based sampling strategy to force specific tokens at specific positions in the output, for example, to close a bracket in json.


I think he's proposing this kind of API:

    POST /openai/gpt4
    {
        "prompt": "The address of the White House",
        "sampler_wasm": "base64 encoded WASM binary blob here"
    }
That WASM would be a program that you write yourself that is run as part of the tokenizer - so it could be a grammar but it could be anything else too.

It's WASM which means it can be safely and performantly run in a sandbox by the OpenAI servers as part of their execution of your prompt.


I think you mean "run as part of the sampler"; the tokenizer (and tokenization) is fixed for a given model. The sampler blob would basically:

1. Modify the output token probabilities to fit any arbitrary use case

2. Perhaps trigger some sort of backtracking / beam search

(I'm not Grant but we've chatted on twitter and built similar things)


Yes, I meant sampler, not tokenizer.


The model returns probabilities across the full set of tokens. This restricts the tokens to those that conform to the grammar, and samples from those.


This reminds me of something I did once using a 1D CRF at the output of a sequence model. I set "impossible" transitions to -inf so that transitions between certain state pairs were simply never predicted. I could imagine executing something like this conditional on the current grammatical context.

In fact, it's a bit surprising to me how little I see CRFs mentioned in the context of language models. They are useful whenever you want to model or learn transition probabilities.
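
For example, something along these lines (a toy tag set just to show the -inf masking; not tied to any particular CRF library):

    import math

    # transition scores; "impossible" transitions get -inf so they are never predicted
    trans = {("KEY", "COLON"): 1.0, ("COLON", "VALUE"): 1.0, ("VALUE", "KEY"): 0.5}

    def transition_score(prev, nxt):
        return trans.get((prev, nxt), -math.inf)

    def sequence_score(path, emissions):
        s = emissions[0][path[0]]
        for i in range(1, len(path)):
            s += transition_score(path[i - 1], path[i]) + emissions[i][path[i]]
        return s    # any forbidden transition drags the whole path to -inf

    emissions = [{"KEY": 2.0, "COLON": 0.1, "VALUE": 0.1},
                 {"KEY": 0.1, "COLON": 1.5, "VALUE": 0.3},
                 {"KEY": 0.1, "COLON": 0.1, "VALUE": 2.0}]
    print(sequence_score(["KEY", "COLON", "VALUE"], emissions))   # finite score
    print(sequence_score(["KEY", "VALUE", "KEY"], emissions))     # -inf, never chosen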


Isn’t this what Microsoft Guidance does?

https://github.com/microsoft/guidance


I read the code. Guidance seems designed to work well with OpenAI's chat completion API. When you ask Guidance to choose from a set of options, it breaks the list into a tree of tokens and then walks this tree, providing the next set of possible tokens in the logit_bias parameter with value set to +100.

For example, suppose that you specify this as your Guidance "program" and suppose (for sake of simplicity) that the token for "lea" is 1300, the token for "ther" is 1500, and the token for "ves" is 5300:

  "armor": "{{#select 'armor'}}leather{{or}}leaves{{/select}}",
Guidance will send OpenAI a chat completion starting with

  "armor": "
... providing a logit_bias map {"1300": "100"}. This bias forces the model to choose "lea" as the next token. Following this call, we have the prefix

  "armor": "lea
... and now Guidance calls chat completion again setting the logit_bias map to {"1500": "100", "5300": "100"} to indicate that the tokens for "ther" or "ves" are equally probable and really the only tokens the model is allowed to select between, unless some other token is maximally probable given the context. OpenAI now replies with token "1500" (let's say) and Guidance completes the string as follows:

  "armor": "leather
... because "ther" is represented by token number 1500. Guidance then tacks on the closing quote and other stuff specified by the user:

  "armor": "leather",
... and it sets the value of "armor" to "leather" so that you can use that value later in your code if you wish to. Guidance is pretty powerful, but I find the grammar hard to work with. I think the idea of being able to upload a bit of code or a context-free grammar to guide the model is super smart.

https://github.com/microsoft/guidance/blob/d2c5e3cbb730e337b...
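
In code, the calls described above boil down to building a logit_bias map per step; the token ids 1300/1500/5300 are the made-up ones from the example, and the actual chat completion request is left out:

    # Walk the token tree for the options "leather" / "leaves", building the
    # logit_bias map sent at each step (+100 effectively forces one of these tokens).
    steps = [
        {"prefix": '"armor": "',    "allowed_tokens": {1300: "lea"}},  # both options start with "lea"
        {"prefix": '"armor": "lea', "allowed_tokens": {1500: "ther", 5300: "ves"}},
    ]
    for step in steps:
        logit_bias = {str(tok): 100 for tok in step["allowed_tokens"]}
        print(step["prefix"], "->", logit_bias)
        # a chat completion would be called here with logit_bias=logit_bias, max_tokens=1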


OTOH, AFAIK when running a model locally Guidance does something really similar to what OP is doing.


Just to +1 mmoskal, while Guidance does somewhat work with OpenAI APIs, AFAIK it was first designed around having direct access to the logits, thus is superior when using local models.


Thank you! I finally get what Guidance is doing now.


Ha, yay! So now: Do you think it might be useful/possible to incorporate as a plugin [or otherwise] into your LLM app?


So does this mean I can detect "Sorry" at the start of a response and prevent it?


I’m sure ggml would accept that as a PR if the wasm vm had a tiny surface area.


Thank you for laying it out like that.


I think it should be noted that this enforces grammatical constraints on the model's generated text, but it doesn't do anything to properly align the content. This would be useful if you needed to ensure a server delivered well-formatted JSON, but I suspect it won't solve a lot of alignment issues with current language generation. For example, current iterations of Llama and GPT often do not label markdown code blocks correctly. Using grammar-based sampling, you could enforce that it labels code blocks, but you couldn't enforce correct labeling, since this is context-dependent. You also couldn't invent a novel domain-specific language without aligning against that language and expect good output.


Also important to call out that anytime you have a freeform string, it's pretty much an open invitation for the LLM to go completely haywire and run off into all sorts of weird tangents. So these methods are best used with other heuristics to bias sampling once you get to free-form text territory (e.g. a repetition penalty, etc.).


But since it's Llama, some examples could be trained into a LoRA.

I can imagine a system where, for instance, a markdown LoRA and a markdown grammar file can be hot-swapped in and out.


I am in love with this, I tried my hand at building a Constrained Text Generation Studio (https://github.com/Hellisotherpeople/Constrained-Text-Genera...), and got published at COLING 2022 for my paper on it (https://paperswithcode.com/paper/most-language-models-can-be...), but I always knew that something like this or the related idea enumerated in this paper: https://arxiv.org/abs/2306.03081 was the way to go.

I will have to think about how I can build grammars that force things like syllable counts or syntactic rules. Current LLMs do very poorly on those kinds of tasks due to the tokenization schemes...


I was surprised, but Nous Hermes does a half decent job at writing haikus.


I implemented this for PyTorch too at https://github.com/Shopify/torch-grammar. I have a hacked version of text-generation-inference that uses it—happy to share that if it’s useful to anyone.


Yes, please share.

I've been meaning to play around with dumping the token probability vectors inside one of the LLM UIs. Having a diff starting point would help a bunch.


Sure, here it is. It's extremely rough but it works as a starting point: https://gist.github.com/burke/6d035758f7492612e2ad86bb7de2d5...


Specifically for multi-choice string enums (essentially dropdowns), I wonder if this would work better if the full (joint/product) probability given the logits were considered when picking the final choice, rather than using a greedy algorithm. This would favor the right choice, as opposed to e.g. one of the choices that contains the most common start token, if a start token is shared among many items in the list.

Of course the probability needs to be adjusted once a subset of the logits goes to zero so it actually makes sense...
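
Concretely, something like this sketch (per-token log-probs are stubbed; the tokenization reuses the lea/ther/ves example from the Guidance comment above):

    import math

    def pick_choice(choices, logprob_of):
        # Score each full option by the product of its token probabilities
        # (sum of log-probs), instead of greedily committing to the first token.
        best, best_lp = None, -math.inf
        for choice in choices:
            lp = sum(logprob_of(tok, prefix=choice[:i]) for i, tok in enumerate(choice))
            if lp > best_lp:
                best, best_lp = choice, lp
        return best

    # stub: tokenized options and a fake conditional log-prob function
    options = [("lea", "ther"), ("lea", "ves")]
    fake_lp = lambda tok, prefix: {"lea": -0.1, "ther": -2.0, "ves": -0.4}[tok]
    print(pick_choice(options, fake_lp))   # ('lea', 'ves') wins on joint probability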


This grammar "library" was cited as an example of what the format could look like:.

https://github.com/antlr/grammars-v4

There is everything from assembly and C++ to GLSL and scripting languages, arithmetic, games, and other weird formats like freedesktop shortcuts, LLVM IR, or Verilog.


A convenience feature in any inference API would be to specify a shortcut to a standardized grammar such as HTML, JSON, Python, etc. It is frankly strange to me that OpenAI have not already done this, considering the obvious effort they undertook to fine-tune the Code Interpreter model.


It would be awesome if they supported ANTLR4 grammar syntax. Such a great tool.


Can someone ELI5 what's going on here? I'm reasonably familiar with LLMs, but I can't quite grok what Georgi is doing here and why it's so exciting for some.


An LLM does not generate "the next token" - from an input text, it generates a vector of probabilities where each slot in the vector corresponds to a token. The value in a token's slot is (approximately) the probability that that particular token might appear next in the text.

Programs like ChatGPT "interpret" that vector of probabilities to generate text by selecting (sampling) one of the top tokens. But sometimes this is too flexible -- for example, ChatGPT might generate invalid JSON when you want JSON output because it chose a token that does not conform to the JSON grammar.

A way to "force" an LLM to generate e.g. JSON is to change the sampling process. Instead of choosing any top token, we first filter the tokens to just those that conform to the JSON grammar. Then, we sample one of the top tokens from that subset.


And to build on this, take a look at the code change. Currently in llama.cpp there are many techniques for sampling the next token:

llama_sample_token_greedy - just take the top probability

llama_sample_top_k - sample only from the top k probabilities

etc ...

this code change adds a new sampler:

llama_sample_grammar - sample only from tokens which match the grammar


If the performance was ok, is there any reason why a sampler couldn't call an API or use a separate fine tuned model?


And given that the inference code has access to the entire vector, it's the logical place to put this filtering... OpenAI and other LLM APIs probably don't want to return the entire token probability vector to the user because it's a lot of data. That being said, it wouldn't surprise me if Microsoft has such access as part of their deal because of the obviously superior position this puts them in vs. regular API customers.


Damn, I thought that making DiffuserAST is smart, but a sampler based on TypeScript closures might be much easier to implement. Gotta try this.


If you ask an LLM to generate JSON or another language that has a grammar, it will sometimes produce invalid syntax. This pull request constrains the LLM so that it can only output valid syntax according to whatever grammar you supply. It's a modification to the sampling procedure.

What is the sampling procedure? Well, the way an LLM generates text is one token (short sequence of characters) at a time. First the giant neural net assigns a probability to every possible token (this is the hard part). Then a sampling procedure uses the probabilities to pick one of the tokens, and the process repeats.

The sampling procedure is not a neural net and can be modified in many different ways. You might think that the sampling procedure should always simply pick the token with the highest probability (greedy sampling). You can do that, but it's usually better to pick at random weighted by the probabilities. This gives more diversity and is less likely to get stuck in loops. But this means that literally any token with nonzero probability might get picked, so you can see how this might lead to invalid JSON being generated sometimes. This pull request zeros out the probabilities of all the tokens that wouldn't be valid according to your grammar, so they can't be picked.

BTW there are lots of other interesting modifications to the sampling process you could consider. For example, maybe you can see that in the process of sampling tokens one after the other you might paint yourself into a corner and end up with no good options to choose from. So maybe it makes sense to allow backtracking. In fact, maybe at each sampling step we can consider multiple options, making a tree of possible outputs, and at the end we can pick the path through the tree with the highest overall probability. Of course we can't consider every option; it would be a complete tree with a branching factor of the number of possible tokens, which would grow exponentially. Let's prune the tree at each step and only consider the top, say, five paths we've seen so far. This is called "beam search". It's not normally used for LLMs because the neural net that generates the probabilities is very expensive to run and multiplying that cost by a factor of e.g. five is unpalatable. But it can be done, and produces somewhat better results. You could also consider using MCTS like chess engines do.
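
For the curious, here is a minimal beam search over a stubbed next-token distribution, just to make the expand-then-prune step concrete:

    import math

    def beam_search(next_logprobs, beam_width=5, max_len=10, eos="<eos>"):
        # Each beam is (tokens, total logprob). At every step expand all beams,
        # then keep only the best `beam_width` partial outputs.
        beams = [([], 0.0)]
        for _ in range(max_len):
            candidates = []
            for tokens, score in beams:
                if tokens and tokens[-1] == eos:
                    candidates.append((tokens, score))
                    continue
                for tok, lp in next_logprobs(tokens).items():
                    candidates.append((tokens + [tok], score + lp))
            beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
            if all(t and t[-1] == eos for t, _ in beams):
                break
        return beams[0]

    # toy "model": prefers to say "hello world" and then stop
    def toy(tokens):
        table = {(): {"hello": math.log(0.9), "hi": math.log(0.1)},
                 ("hello",): {"world": math.log(0.8), "there": math.log(0.2)}}
        return table.get(tuple(tokens), {"<eos>": 0.0})

    print(beam_search(toy, beam_width=2))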


This is a sort of modern version of https://wiki.c2.com/?AlternateHardAndSoftLayers, one of the most useful software patterns.


Say more… I read the link, and it seems to be advocating for replacing specific business logic with a generic code interpreter?


I swear the content of that page was a lot more helpful the last time I looked at it.

Anyway, the idea is that if you can architect your program into layers (rather than spaghetti), it can get a lot more powerful if some of your layers are "soft" (dynamic typing/scripting/etc) and some are "hard" (static typing/lower-level code/etc). Because some things like UI are too inconvenient to do in static languages, so they're better as a soft layer, but you want that resting on top of something more proven and type-checked so it still catches bugs easily.

In this case, LLMs are a big blob of unproven data and we don't know why they work, i.e. they're not tested. But that's also the reason they can do everything they do.


LLMs are happy to generate arbitrary strings. You might want it to spit out something along the lines of "Alice: 42" and then it spits out "hi, i'm helpful and Alice is exactly forty two, as far as I can tell, but I'm just a language model."

So you give it a grammar that says the response has to be an uppercase letter followed by lowercase letters, then a colon, then a space, then digits, then it's done. Now, when it looks for that first token, it will only consider tokens that are compatible with that pattern. Then it'll continue with only next tokens that are compatible with the next parts of the pattern.

These grammars do that with a flexible and useful kind of pattern.
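
For the "Name: number" pattern above, the per-step filter can be as small as a hand-rolled state machine over allowed next characters (character-level here for clarity; real grammar sampling works on tokens):

    import string

    def allowed_next(prefix):
        # Pattern: one uppercase letter, lowercase letters, ": ", then digits.
        if not prefix:
            return set(string.ascii_uppercase)
        if ":" not in prefix:
            return set(string.ascii_lowercase) | {":"}
        if prefix.endswith(":"):
            return {" "}
        return set(string.digits)

    out = ""
    for ch in "Alice: 42":                  # pretend these are the model's top picks
        assert ch in allowed_next(out), ch  # anything outside the pattern is masked out
        out += ch
    print(out)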



I'm interested in this and I'm going to try incorporating it into something I'm doing. That said, I feel like this could be one of those Bitter Lesson situations where it's not the most effective approach in anything but the very short term: http://www.incompleteideas.net/IncIdeas/BitterLesson.html


It may be a stop-gap, but it's an important one, as it is not obvious that LLMs in the next few years will "organically" solve their issues with generating text under constraints.


Not an expert at all, but I believe that OpenAI uses this in some of their GPT APIs which are meant for programmatic use. I've seen it theorized that offloading the rote grammar stuff to a simple process that is meant for it lets the LLM use its "brainpower" on the complicated stuff more effectively. No idea if this is true.


It makes sense to my uninformed intuition, which is that a strict grammar reduces the search space for the token generation and so the AI can eliminate possibilities that would otherwise be ambiguous.



Can anyone recommend some paper or overview on how "sampling" / "decoding" is done in the e2e neural network age? I know how decoding was done for machine translation and speech recognition back in the HMM times (i.e. https://en.wikipedia.org/wiki/Viterbi_algorithm and https://en.wikipedia.org/wiki/Beam_search). These days I get the impression people just do "greedy" - but I don't really know. Any recommendations for info on that topic?

Edit: Forgot Viterbi


It's greedy and random :) Instead of a paper, I would recommend reading the sampling code of most LLM implementations (rwkv.cpp has a relatively clean implementation in Python: https://github.com/saharNooby/rwkv.cpp/blob/master/rwkv/samp...)


I guess I need to sit down and study this stuff in more detail, but do I understand correctly that the code you shared makes the decisions for each position independently? I am just astonished that this produces any coherent output. Also it is not clear to me how the length of the output sequence is determined.


Once the stop token is likeliest


Just reading through the GPT4 documentation it doesn’t seem like there’s a ton of difference with what you’ve mentioned.

https://platform.openai.com/docs/api-reference/completions/c...

Of course we now know that GPT4 is a Mixture of Experts, so under the hood they’re parallelizing computation. They also include a way to modify the logits with presence/frequency penalty terms.


How is this different from Guidance and LMQL?


Looks like a tool Guidance could use to make better use of the sampling from a local llama model.


This is great and all.

But LLMs are usually very good at following grammars. I rarely see an LLM generating code that is OOD. Of course, this is only true for popular languages (JSON/Python/Java, etc.); I can see how this is handy for more niche and in-house DSLs.

You still need quite a lot of prompt engineering to get the desired outputs; this just adds another layer of output verification IMO. But does it really save much compared to getting the output, then parsing and rejecting output that doesn't follow the grammar? Might be debatable.

But great work regardless.


> this just adds another layer of output verification IMO.

It's not verifying the output after it's done, it's constraining the output as it's generated.

> But does it really save much compared to getting the output, then parsing and rejecting output that doesn't follow the grammar? Might be debatable.

I don't think it's debatable at all. Forcing the model to conform to a grammar during generation means there is never a need to discard and regenerate because it got the grammar wrong.

Think of the compute involved in generating the whole output and then re-generating if it's non-conformant. There is no comparison.


> It's not verifying the output after it's done, it's constraining the output as it's generated.

It is verifying, by filtering to only the tokens that are allowed by the grammar. I think we are talking about the same thing.


So, umm, if you want to walk BNF and emit likely tokens you can do that without any "machine learning" or whatever you want to call it. So what is being added here? Training to tie the prompt to the output?


The difference is in the word “likely”. You can put unstructured data in the prompt and get structured data out. You could put in the beginning of a list and ask for a continuation.


I get that


Interesting that the second commenter is Tobias Lütke, CEO of Shopify.


Also interesting how Shopify is making a lot of moves in this space using both the OpenAI APIs and using self-hosted models.


Ah finally, this was discussed a lot and is well overdue. Remains to be seen how well the models will adapt to this new constraint, though the demo seems promising.


Isn’t this approach forcing the LLM to adapt? E.g. it is throwing tokens away that don’t match the grammar.


Well, the grammar will be correct as enforced by the sampler, but the content it's filled with could be anything at all. Sort of like how when you change the prompt template, the output can be garbage for some models. I haven't tried it out yet myself, but apparently even OpenAI's implementation of this exact principle in their API still has function hallucination issues, even with GPT-4.


Has anyone tested FreeWilly2 (the new Llama2 fine-tune released today by Stable Foundation) on code generation?


Does anyone know Japanese well enough to comment on the output from the Japanese example?


It is vaguely Japanese, I guess, but pretty incoherent:

  1. What is the purpose?
  2. Remember the customer
  3. About the customer [incomplete sentence?]


Note that there are actual Japanese Llama fine-tunes that would be much more coherent with these grammar constraints.


Something I’m wondering lately is if you are generating tokens fast enough, is restricting the logits actually worth it computationally? If tokens are cheap enough it might be more efficient to validate/discard them as they come rather than place constraints on how they come out. I don’t know how this one works, but the sampling or renormalizing scheme would cost something too right?


There are at least 6 orders of magnitude of difference between the computational cost of computing a token (a pass through a multi-billion-parameter model) and doing a single step of a program. Even if your validation is really naive, it's hard to beat 6 orders of magnitude.

So no, you're not generating tokens fast enough.


Could someone help me with context? I'm OOTL and don't understand what is going on here.


This can constrain an LLM's output to an arbitrary grammar/format as it is generated, rather than asking the model to output a specific format and hoping it outputs something valid.


This is important for "smaller" models, because you don't have to waste some of the potential "intelligence" (parameter space) on training it how to generate valid JSON or YAML or anything like that.


You still do... The model has to know JSON and YAML; it's just more reliable when the generation is enforced by a grammar.


Right, but there is a big difference between "generally knows what JSON looks like and gets it right most of the time" and "generates perfect JSON every time".



