
Reminds me of "Gadsby", a 50,000-word novel without the letter "e":

https://en.m.wikipedia.org/wiki/Gadsby_(novel)




I'd be curious to know if it was easier or harder (or perhaps just as difficult) to write than the French equivalent. [0]

The Wikipedia article goes on to discuss interesting aspects of how the book was translated into different languages, with different self-imposed constraints.

[0] https://en.wikipedia.org/wiki/A_Void


I can’t say for certain, but I’d guess that writing without the letter “e” is slightly more difficult in French than in English. For one, “e” is a bit more common in French (around 15% of all letters, versus about 12% in English). But more importantly, French grammar adds extra challenges—like gender agreement, where feminine forms often require an “e”, and the frequent use of articles like le and les, which become unusable.

That said, I think the most impressive achievement is the English translation of the French novel. Writing an original constrained novel is hard enough, but translating one means you can’t just steer the story wherever you like. You have to preserve the plot, tone, and themes of the original, all while respecting a completely different set of linguistic limitations. That’s a remarkable balancing act.


Georges Perec did the same with his novel "La Disparition".

What is almost as impressive is that these novels (at least Perec's) have been translated to other languages.


I imagine LLMs would excel in this kind of writing these days.

But really impressive for the time.


I think it's the exact opposite: they operate on a token level, not a character level, which makes tasks like these harder for them. So they would generate a sentence with multiple e's in it and just proclaim that they didn't.

(Just tried it, "write a short story of 12 sentences without one occurrence of the letter e" - it had 5 e's.)


You're assuming all you can do is prompt it. Surely you could also constrain its output to tokens that genuinely contain no e's (or allow at most 4 letters per word). LLMs actually output a probability distribution over next tokens; ChatGPT just picks from the top of that distribution, but you could filter the list by any constraint you want.
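
A rough sketch of that filtering idea, assuming a Hugging Face causal LM (gpt2 purely as a stand-in) and greedy decoding for simplicity; the prompt and the no-'e' constraint are just illustrative:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # any causal LM; gpt2 is just an illustrative choice
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Precompute every vocabulary entry whose text contains an 'e' (or 'E').
    banned = torch.tensor(
        [i for i in range(len(tok)) if "e" in tok.decode([i]).lower()]
    )

    prompt = "A story about a cat:"
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        for _ in range(40):
            logits = model(ids).logits[0, -1]      # scores for the next token
            logits[banned] = float("-inf")         # forbid e-containing tokens
            next_id = logits.argmax().view(1, 1)   # greedy pick from what's left
            ids = torch.cat([ids, next_id], dim=1)

    print(tok.decode(ids[0]))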


But the problem is that the tokens are subwords, which means that if you simply disallowed tokens with es, you'd make it hard to complete a word given a prefix.

For example, it may start like "This is a way to solv-" or "This is th-".


If I understand it correctly, that's a valid concern, but structured generation libraries like outlines[1] can work around it by keeping multiple candidate continuations in flight at once (beam search).

One beam could be "This is a way to solv-", with no obvious "good" next token. Another beam could be "This way is solv-", with "ing" as the obvious next token.

It will select the best beam for the output.
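
For reference, constrained generation through outlines looked roughly like this in its 0.x releases; the function names and the beam-search sampler are from memory and may have changed in newer versions, so treat it as a sketch rather than the current API:

    import outlines

    # Wrap any Hugging Face causal LM; gpt2 is just a stand-in here.
    model = outlines.models.transformers("gpt2")

    # Keep several beams alive so a dead-end prefix can lose to a better one.
    # (Sampler name as I remember it from the 0.x docs.)
    sampler = outlines.samplers.beam_search(beams=4)

    # The output is forced to match the regex: no 'e' or 'E' anywhere.
    generator = outlines.generate.regex(model, r"[^eE]+", sampler=sampler)

    # With a beam-search sampler this may return several candidates.
    print(generator("A short story about a cat:", max_tokens=60))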

[1]: https://github.com/dottxt-ai/outlines


... What if you retrained it from scratch, on an e-less corpus?


Yes, that would probably work quite well, given enough training data. However, I interpreted the question/claim as being about a task that LLMs already excel at, i.e. that a general-purpose LLM can write text while avoiding a certain character.


I tried something like that some time ago. The problem with that strategy is the lack of backtracking.

Let's say I prompt my LLM to exclusively use the letters 'aefghilmnoprst' and the LLM generates "that's one small step for a man, one giant leap for man-"[1]. Since the next token with the highest probability ("-kind") isn't allowed, it may very well be that the next appropriate word is something really generic or, if your grammar is really restrictive, straight up nonsense because nothing fits. And then there's pathological stuff like "... one giant leap for man, one small step for a man, one giant leap for man- ...".

[1] Toy example - I'm sure these specific rules are not super restrictive and "management" is right there.


The next token is obviously "goes". (Any language model that disagrees is simply wrong.)


I'm not sure if my chain's bein' yanked right now, but surely you mean "gos"‽


The plural of mangoe is mangoes. https://en.wiktionary.org/wiki/mangoe


I was going to point that out.

What I will add is that constrained generation is supported by the major inference engines like llama.cpp, vLLM and the like, so what you are describing is actually straightforward on locally hosted models: you just have to provide a grammar or regex that prevents the letter 'e' from appearing in the output.
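
As an illustration, against a vLLM OpenAI-compatible server the regex can go in the extra request body (the "guided_regex" parameter name is as I recall vLLM's guided-decoding support, and may differ between versions; llama.cpp takes a GBNF grammar instead):

    from openai import OpenAI

    # Assumes a locally running vLLM server exposing the OpenAI-compatible API.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

    resp = client.completions.create(
        model="my-local-model",                  # whatever the server is serving
        prompt="A short story about a cat:",
        max_tokens=60,
        extra_body={"guided_regex": r"[^eE]+"},  # output must match: no 'e'
    )
    print(resp.choices[0].text)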


You can do this more properly with the antislop sampler, and we are working on a follow-up paper to our previous work on this exact problem.

https://github.com/sam-paech/antislop-sampler

https://arxiv.org/abs/2306.15926


All the training data contains 'e's.


That is not a counterpoint! The output is a probability distribution, so you can assign zero to any e-containing token and scale everything else up accordingly.


I think an LLM would do well on this if you gave it a function that located words with an e so it could change them.
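
The checker itself is trivial; something like this (a hypothetical tool the model would call on each draft):

    import re

    def words_with_e(text: str) -> list[str]:
        """Return every word in `text` that contains an 'e' (case-insensitive)."""
        return [w for w in re.findall(r"[A-Za-z']+", text) if "e" in w.lower()]

    print(words_with_e("The quick brown fox jumps over a lazy dog"))
    # ['The', 'over']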


They'd probably suck at a challenge like that because they work on tokens and don't really see individual letters.

There was a post here a little while back asking AI models to count the number of Rs in the word raspberry and most failed.



You don't need to go all the way to LLMs when a simpler approach may do.

Here's a "What if?" on a very similar issue that uses Markov chains: https://what-if.xkcd.com/75/
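
In that spirit, a toy version of the Markov-chain approach fits in a few lines: build word-level bigram counts from any corpus, drop every word containing an 'e', and random-walk the chain (the corpus string is a stand-in for real text, and dropping words does distort adjacency a bit):

    import random
    from collections import defaultdict

    corpus = "a cat sat on a mat and a dog lay on a mat and a cat ran off"  # stand-in text
    words = [w for w in corpus.lower().split() if "e" not in w]

    # Word-level bigram chain: each word maps to the words seen right after it.
    chain = defaultdict(list)
    for prev, nxt in zip(words, words[1:]):
        chain[prev].append(nxt)

    # Random-walk the chain to produce e-less text.
    w = random.choice(words)
    out = [w]
    for _ in range(20):
        if not chain[w]:
            break
        w = random.choice(chain[w])
        out.append(w)
    print(" ".join(out))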


LLMs are usually shit at this kind of wordplay; they don't understand the rules - words that begin or end or include particular letters, words that rhyme, words with particular numbers of syllables. They'll get it right more often than wrong, maybe, but in my experience they just aren't capable of catching wrong answers before returning them to the reader, even if they're told to check their work.


8 of them on the cover!



