Note that the author's algorithm allows lines wider than the target. (He does not hide this fact.) If you gave the other algorithms this wiggle room (which is basically the same as making the box wider), they'd probably also find better layouts...
For example, the first place the algorithms diverge is exactly on such a wider line.
I would like to see evidence that their approach beats TeX in practice before bothering to look into it further. They carefully avoided putting up a comparison picture. TeX occasionally gets stuck and requires an intervention, but that is rare enough that it was never an issue for me.
In the GPT-3 era, why not let the algorithm change the author’s wording (reordering or phrase rewrites) to distribute long and short words more evenly and give the line breaker more raw material to work with? /s
This is something I've thought about, inspired by the old textfiles where authors manually space or rewrite to avoid hyphenation and fit exactly into 80 columns.
You don't want to use GPT-3 raw, because the number of candidate versions would bankrupt you if you did it naively: each word has far too many synonyms, and the combinations multiply. And if you do it in a single pass, it's not much different from greedy linebreaking, since it can't 'look forward' to minimize the combined badness of linebreaks and synonym choices.
The system I have in mind would do dynamic programming as usual to find linebreaks, and each line-end word would have a list of synonyms, each with a distance given by a standard word embedding like word2vec. Then you have a standard DP which tries to minimize the sum of the number of hyphens, the characters left/right of the ideal position, and the distance of each synonym rewrite multiplied by a configurable weight. This gives you a linebreaking algorithm which can be tuned to do pure linebreaking (set the synonym weight to ∞) and perfectly preserve meaning even at the cost of appearance, or to treat synonyms as free (set it to 0) and spend like a drunken sailor to wedge the text in regardless of meaning distortion. You need to generate a lot of synonyms, but you can easily cache that, and word-embedding NNs are extremely small/fast/easy to run locally, unlike GPT-3. So it won't break the bank.
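A minimal sketch of that DP, with hyphenation left out for brevity; the SYNONYMS table, its word2vec-style distances, the target width, and the weight below are all made up for illustration, standing in for a cached embedding lookup:

    # Minimal sketch of the DP described above (hyphenation omitted for brevity).
    # SYNONYMS and its distances are invented; a real version would generate and
    # cache them from a word-embedding model.
    from functools import lru_cache

    TARGET = 40            # ideal line width in characters
    SYNONYM_WEIGHT = 5.0   # large value ~ pure linebreaking, 0 ~ rewrite freely

    SYNONYMS = {
        "enormous": [("huge", 0.12), ("vast", 0.18)],
        "rapidly":  [("quickly", 0.05), ("fast", 0.15)],
    }

    def candidates(word):
        # The word itself (distance 0) plus its cached synonyms.
        return [(word, 0.0)] + SYNONYMS.get(word, [])

    def line_cost(width, syn_distance):
        # Squared deviation from the target width plus the weighted rewrite cost.
        return (TARGET - width) ** 2 + SYNONYM_WEIGHT * syn_distance

    def best_layout(words):
        n = len(words)

        @lru_cache(maxsize=None)
        def solve(i):
            # Minimal cost of laying out words[i:], plus the chosen lines.
            if i == n:
                return 0.0, ()
            best_cost, best_lines = float("inf"), ()
            for j in range(i + 1, n + 1):            # words[i:j] form one line
                base = sum(len(w) for w in words[i:j]) + (j - i - 1)  # + spaces
                for repl, dist in candidates(words[j - 1]):  # swap only the line-end word
                    width = base - len(words[j - 1]) + len(repl)
                    rest_cost, rest_lines = solve(j)
                    cost = line_cost(width, dist) + rest_cost
                    if cost < best_cost:
                        line = " ".join(words[i:j - 1] + [repl])
                        best_cost, best_lines = cost, (line,) + rest_lines
            return best_cost, best_lines

        return solve(0)

    cost, lines = best_layout(
        "the enormous dog ran rapidly across the cold open field".split())
    print("\n".join(lines))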
Now, you might say that rewriting words individually, even if the meaning is very close thanks to word2vec, can still lead to weirdness in global meaning, coherency, or style. True! Then you can run it through GPT-3 as a final phase: take several different candidates (generated by setting the weight to a few different values), feed them through the GPT-3 Search API to find the closest (or the embedding API to do inner-product distance), and take the best one (again with a weighting factor for how faithful to the original semantics you want to force the rewrites to be).
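A sketch of that reranking step; embed() here is a toy stand-in (hashed character trigrams), meant to be replaced by whatever embedding model or API you actually use, and faithfulness is the weighting factor mentioned above:

    # Sketch of the final reranking pass: generate candidates at several synonym
    # weights, then keep the one whose embedding stays closest to the original.
    import numpy as np

    def embed(text):
        # Toy stand-in embedding (hashed character trigrams); replace with a
        # real embedding model or API call.
        vec = np.zeros(4096)
        for i in range(max(len(text) - 2, 0)):
            vec[hash(text[i:i + 3]) % 4096] += 1.0
        return vec

    def rerank(original, candidates, faithfulness=10.0):
        # candidates: (rewritten_text, layout_cost) pairs, e.g. from the DP above
        # run at a few different SYNONYM_WEIGHT values. faithfulness trades off
        # semantic drift against layout quality.
        orig = embed(original)

        def total_cost(item):
            text, layout_cost = item
            vec = embed(text)
            cosine = float(np.dot(orig, vec) /
                           (np.linalg.norm(orig) * np.linalg.norm(vec) + 1e-9))
            return layout_cost + faithfulness * (1.0 - cosine)

        return min(candidates, key=total_cost)[0]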
My guess is that, given a large dictionary of synonyms creating many possible rewrites and the final GPT-3 pass to score for naturalness, the DP search will be able to find extremely subtle rewrites which eliminate the the need for hyphenation or ragged margins while still reading completely naturally. Text run through it will look identical (because we can be careless readers - did you notice the 'the the' in the previous sentence?), but will just magically fit perfectly in its column.
If true, I wonder if that mixup is somehow connected to the author’s original language. (Probably Finnish, from the name?)
In German, “chapter” has a very direct translation, “Kapitel” (English and German are closely related, after all), and though “paragraph” in its common text sense has its own word, “Absatz”, the literal word “Paragraf” also exists and is used mostly when referring to laws.
So I wonder whether Finnish, which I know next to nothing about, is significantly different there.
Oh yeah, I was just curious whether the author's native language (assuming it's not English) had interesting features that would lead to such a mistake.
> For each line measure the difference between the desired width and the actual width and square the value. Then add these values together.
seems like you could add something to the loss function like a sum over hyphens of k*(1/(1+d))^2, where d is the vertical distance to the nearest following hyphen and k is a tunable weight.
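Concretely, something like this on top of the article's sum-of-squared-width-errors loss (a sketch; the k/(1+d)^2 form and the endswith("-") hyphen test are just illustrative choices):

    # Sketch: the article's squared-width-error loss plus the suggested
    # hyphen-proximity term. Each hyphenated line pays k/(1+d)^2 where d is
    # the number of lines down to the next hyphenated line, so hyphens on
    # adjacent lines cost the most.
    def layout_cost(lines, target_width, k=100.0):
        cost = sum((target_width - len(line)) ** 2 for line in lines)
        hyphen_rows = [i for i, line in enumerate(lines) if line.endswith("-")]
        for row, next_row in zip(hyphen_rows, hyphen_rows[1:]):
            d = next_row - row
            cost += k * (1.0 / (1.0 + d)) ** 2
        return cost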
Yikes; the author is simply wrong about TeX in the quoted bit about squaring the width difference (there is some cubing going on, but of the space stretch and shrink ratios relative to the natural space width specified per font). He also seems to imply that TeX doesn't take care to avoid two hyphenated lines in a row, which is also not true ("doublehyphendemerits"). And, finally, he implies that TeX doesn't prune its search tree while looking for a globally optimal set of line breaks, when of course it does. So it's not clear what exactly his algorithmic improvement is supposed to be.
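For reference, TeX's per-line scoring looks roughly like this (paraphrasing the TeXbook; a simplified sketch, not TeX's actual code):

    # Rough paraphrase of TeX's line scoring (TeXbook ch. 14), not its real code.
    # Note the cube is of the stretch/shrink ratio of the line's glue, not of the
    # raw width difference, and consecutive hyphenated lines pay extra demerits.
    def badness(adjustment, available):
        # ~100 * (stretch or shrink used / stretch or shrink available)^3, capped.
        if available <= 0:
            return 0 if adjustment == 0 else 10000
        r = abs(adjustment) / available
        return min(int(100 * r ** 3 + 0.5), 10000)

    def demerits(bad, penalty, prev_hyphenated, this_hyphenated,
                 line_penalty=10, double_hyphen_demerits=10000):
        d = (line_penalty + bad) ** 2
        if 0 <= penalty < 10000:
            d += penalty ** 2
        elif -10000 < penalty < 0:
            d -= penalty ** 2
        if prev_hyphenated and this_hyphenated:
            d += double_hyphen_demerits   # the "doublehyphendemerits" mentioned above
        return d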
now that i read more carefully, looks like they're just making an attempt at a line breaking algorithm for their own edification and sharing the results.
some parts of it make it sound like they're trying to beat tex, but on second read, nah. should probably be clearer (and maybe more correct) about paraphrasing how tex works though.