Note that the author's algorithm allows lines wider than the target. (He does not hide this fact.) If you gave the other algorithms this wiggle room (which is basically the same as making the box wider), they'd probably also find better layouts...
For example, the first place the algorithms diverge is exactly on such a wider line.
I would like to see evidence that their approach beats TeX in practice before bothering to look into it further. They carefully avoided putting up a comparison picture. TeX occasionally gets stuck and requires an intervention, but that is rare enough that it was never an issue for me.
In the GPT-3 era, why not let the algorithm change the author’s wording (reordering or phrase rewrites) to distribute long and short words more evenly and give the line breaker more raw material to work with? /s
This is something I've thought about, inspired by the old textfiles where authors manually space or rewrite to avoid hyphenation and fit exactly into 80 columns.
You don't want to use GPT-3 raw, because the number of candidate versions would bankrupt you if you did it naively: each word has far too many synonyms, and the combinations multiply. And if you do it in a single pass, it's not much different from greedy linebreaking, since it can't 'look forward' to minimize the combined badness of linebreaks and synonym choices.
The system I have in mind would do dynamic programming as usual to find linebreaks, and each line-end word would have a list of synonyms, each with a distance given by a standard word embedding like word2vec. Then you have a standard DP which tries to minimize the sum of the number of hyphens, the characters left/right of the ideal position, and the distance of each synonym rewrite multiplied by a configurable weight. This gives you a linebreaking algorithm which can be tuned to do pure linebreaking (set the synonym weight to ∞) and perfectly preserve meaning even at the cost of appearance, or to treat synonyms as free (set it to 0) and spend like a drunken sailor to wedge the text in regardless of meaning distortion. You need to generate a lot of synonyms, but you can easily cache that, and word-embedding NNs are extremely small/fast/easy to run locally, unlike GPT-3. So it won't break the bank.
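A minimal sketch of that DP, with hyphenation left out for brevity; the SYNONYMS table, its word2vec-style distances, the target width, and the weight below are all made up for illustration, standing in for a cached embedding lookup:

    # Minimal sketch of the DP described above (hyphenation omitted for brevity).
    # SYNONYMS and its distances are invented; a real version would generate and
    # cache them from a word-embedding model.
    from functools import lru_cache

    TARGET = 40            # ideal line width in characters
    SYNONYM_WEIGHT = 5.0   # large value ~ pure linebreaking, 0 ~ rewrite freely

    SYNONYMS = {
        "enormous": [("huge", 0.12), ("vast", 0.18)],
        "rapidly":  [("quickly", 0.05), ("fast", 0.15)],
    }

    def candidates(word):
        # The word itself (distance 0) plus its cached synonyms.
        return [(word, 0.0)] + SYNONYMS.get(word, [])

    def line_cost(width, syn_distance):
        # Squared deviation from the target width plus the weighted rewrite cost.
        return (TARGET - width) ** 2 + SYNONYM_WEIGHT * syn_distance

    def best_layout(words):
        n = len(words)

        @lru_cache(maxsize=None)
        def solve(i):
            # Minimal cost of laying out words[i:], plus the chosen lines.
            if i == n:
                return 0.0, ()
            best_cost, best_lines = float("inf"), ()
            for j in range(i + 1, n + 1):            # words[i:j] form one line
                base = sum(len(w) for w in words[i:j]) + (j - i - 1)  # + spaces
                for repl, dist in candidates(words[j - 1]):  # swap only the line-end word
                    width = base - len(words[j - 1]) + len(repl)
                    rest_cost, rest_lines = solve(j)
                    cost = line_cost(width, dist) + rest_cost
                    if cost < best_cost:
                        line = " ".join(words[i:j - 1] + [repl])
                        best_cost, best_lines = cost, (line,) + rest_lines
            return best_cost, best_lines

        return solve(0)

    cost, lines = best_layout(
        "the enormous dog ran rapidly across the cold open field".split())
    print("\n".join(lines))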
Now, you might say that rewriting words individually, even if the meaning is very close thanks to word2vec, can still lead to weirdness in global meaning, coherency, or style. True! Then you can run it through GPT-3 as a final phase: take several different candidates (generated by setting the weight to a few different values), feed them through the GPT-3 Search API to find the closest (or the embedding API to do inner-product distance), and take the best one (again with a weighting factor for how faithful to the original semantics you want to force the rewrites to be).
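A sketch of that reranking step; embed() here is a toy stand-in (hashed character trigrams), meant to be replaced by whatever embedding model or API you actually use, and faithfulness is the weighting factor mentioned above:

    # Sketch of the final reranking pass: generate candidates at several synonym
    # weights, then keep the one whose embedding stays closest to the original.
    import numpy as np

    def embed(text):
        # Toy stand-in embedding (hashed character trigrams); replace with a
        # real embedding model or API call.
        vec = np.zeros(4096)
        for i in range(max(len(text) - 2, 0)):
            vec[hash(text[i:i + 3]) % 4096] += 1.0
        return vec

    def rerank(original, candidates, faithfulness=10.0):
        # candidates: (rewritten_text, layout_cost) pairs, e.g. from the DP above
        # run at a few different SYNONYM_WEIGHT values. faithfulness trades off
        # semantic drift against layout quality.
        orig = embed(original)

        def total_cost(item):
            text, layout_cost = item
            vec = embed(text)
            cosine = float(np.dot(orig, vec) /
                           (np.linalg.norm(orig) * np.linalg.norm(vec) + 1e-9))
            return layout_cost + faithfulness * (1.0 - cosine)

        return min(candidates, key=total_cost)[0]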
My guess is that, given a large dictionary of synonyms creating many possible rewrites and the final GPT-3 pass to score for naturalness, the DP search will be able to find extremely subtle rewrites which eliminate the the need for hyphenation or ragged margins while still reading completely naturally. Text run through it will look identical (because we can be careless readers - did you notice the 'the the' in the previous sentence?), but will just magically fit perfectly in its column.
If true, I wonder if that mixup is somehow connected to the author’s original language. (Probably Finnish, from the name?)
In German, “chapter” has a very direct translation, “Kapitel” (English and German are closely related, after all), and though “paragraph” in its common text sense has its own word, “Absatz”, the literal word “Paragraf” also exists and is used mostly when referring to laws.
So I wonder whether Finnish, which I know next to nothing about, is significantly different there.
Oh yeah, I was just curious whether the author's native language (assuming it's not English) had interesting features that would lead to such a mistake.
> For each line measure the difference between the desired width and the actual width and square the value. Then add these values together.
seems like you could add something to the loss function like a sum over hyphens of k*(1/(1+d))^2, where d is the vertical distance to the nearest following hyphen and k is a tunable weight.
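Concretely, something like this on top of the article's sum-of-squared-width-errors loss (a sketch; the k/(1+d)^2 form and the endswith("-") hyphen test are just illustrative choices):

    # Sketch: the article's squared-width-error loss plus the suggested
    # hyphen-proximity term. Each hyphenated line pays k/(1+d)^2 where d is
    # the number of lines down to the next hyphenated line, so hyphens on
    # adjacent lines cost the most.
    def layout_cost(lines, target_width, k=100.0):
        cost = sum((target_width - len(line)) ** 2 for line in lines)
        hyphen_rows = [i for i, line in enumerate(lines) if line.endswith("-")]
        for row, next_row in zip(hyphen_rows, hyphen_rows[1:]):
            d = next_row - row
            cost += k * (1.0 / (1.0 + d)) ** 2
        return cost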
Yikes; the author is simply wrong about TeX in the quoted bit about squaring the width difference (there is some cubing going on, but of the space stretch and shrink ratios relative to the natural space width specified per font). He also seems to imply that TeX doesn't take care to avoid two hyphenated lines in a row, which is also not true ("doublehyphendemerits"). And, finally, he implies that TeX doesn't prune its search tree while looking for a globally optimal set of line breaks, when of course it does. So it's not clear what exactly his algorithmic improvement is supposed to be.
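For reference, TeX's per-line scoring looks roughly like this (paraphrasing the TeXbook; a simplified sketch, not TeX's actual code):

    # Rough paraphrase of TeX's line scoring (TeXbook ch. 14), not its real code.
    # Note the cube is of the stretch/shrink ratio of the line's glue, not of the
    # raw width difference, and consecutive hyphenated lines pay extra demerits.
    def badness(adjustment, available):
        # ~100 * (stretch or shrink used / stretch or shrink available)^3, capped.
        if available <= 0:
            return 0 if adjustment == 0 else 10000
        r = abs(adjustment) / available
        return min(int(100 * r ** 3 + 0.5), 10000)

    def demerits(bad, penalty, prev_hyphenated, this_hyphenated,
                 line_penalty=10, double_hyphen_demerits=10000):
        d = (line_penalty + bad) ** 2
        if 0 <= penalty < 10000:
            d += penalty ** 2
        elif -10000 < penalty < 0:
            d -= penalty ** 2
        if prev_hyphenated and this_hyphenated:
            d += double_hyphen_demerits   # the "doublehyphendemerits" mentioned above
        return d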
now that i read more carefully, looks like they're just making an attempt at a line breaking algorithm for their own edification and sharing the results.
some parts of it make it sound like they're trying to beat tex, but on second read, nah. should probably be clearer (and maybe more correct) about paraphrasing how tex works though.