Ask HN: I just wrote an O(N) diffing algorithm – what am I missing?

ridiculous_fish · on Oct 29, 2019

Diffing means computing the longest common subsequence (LCS); the edit script is everything that's not part of the LCS. The proposed algorithm greedily finds a matching element. However it may be that it would be better to skip this element, and not incorporate it into your sequence.

Consider the following sequences (here each letter represents a line).

   AAAB
   BAAA

Where the first one is "left" and the second one is "right".

The LCS is AAA. The minimal edit script from left to right is:

1. Insert B at beginning.

2. Delete B at end.

The proposed O(N) algorithm finds B as the first matching line, and thereby exhaust `left`. So it finds the common subsequence of just "B" and its edit script is:

1. Delete 3 As at beginning

2. Insert 3 As at end

which is longer.

keithwhor · on Oct 29, 2019

Thanks! Yep, definitely not finding the LCS and associated edits with the current implementation. From my view right now, I'm basically going to have to use a suffix tree to efficiently find the LCS and go from there - which means it's likely that I'm just going to end up approaching Myers' algorithm.

leshow · on Oct 29, 2019

Is it subsequence or substring? Don't you want to find the longest common substring, i.e. contiguous block? I always understood subsequence to be non-contiguous.

ruds · on Oct 29, 2019

It's subseqence. For instance, in

ABXCDYEFGZ QARBCSTDEFG

You want to find [Q]A[R]B(X)C[ST]D(Y)EFG(Z)

ABCDEFG is the longest common subsequence, QRST are the additions and XYZ are the deletions.

robertelder · on Oct 29, 2019

I spent a bit of time learning the Myers Diff algorithm, (I made this page: https://blog.robertelder.org/diff-algorithm/)

Like someone else said, I think the key is that you're not getting the shortest edit script, you're just getting a edit script.

Interestingly, even git's implementation of diff (which uses a variant of the Myers Diff algorithm) doesn't actually find the shortest edit script by default. It has a hard-coded heuristic that just quits early if it finds a 'good enough' solution:

https://github.com/git/git/blob/master/xdiff/xdiffi.c#L143

Same this for unix diff (unless you specify the commandline flag to find shortest):

https://github.com/Distrotech/diffutils/blob/distrotech-diff...

Having said that, if you're still confident that you've made an improvement, you can port the function to python and try it out as a test case in my unit test file that I wrote when I developed my variant of the Myers Diff algorithm:

https://github.com/RobertElderSoftware/roberteldersoftwaredi...

On that topic, I'm pretty sure my variant of the Myers Diff algorithm is actually slightly faster/uses less memory than the original (I'd be happy to be proved wrong). At least, I've had my blog post up for a few years and no one has corrected me, but then again, I think the original paper has a few mistakes/obvious flaws and it still has hundreds of citations and no one has written much about improvements.

ridiculous_fish · on Oct 29, 2019

Nice work on your visualization! Thank you for rebutting the Fraser article; I came to the same conclusion as you (that Myers algorithm is in fact correct) but had never written it up.

I wrote a vectorized Myers implementation for diffing binary files. Perhaps the most interesting aspect is the progress bar: it's not obvious how to create a monotonic progress estimate for Myers diff.

https://github.com/ridiculousfish/HexFiend/blob/master/frame...

robertelder · on Oct 29, 2019

Cool. You've got a fairly impressive following on that tool. I wish more people took an interest in studying these sorts of algorithms. Most people assume that some genius figured out the optimal way to do things years ago, but more often than not if you spent a few weeks focusing on a specific paper/algorithm, you'll come up with ideas for improvement (although it might not always be in the way you initially expected). Sometimes the improvement is just figuring out a better way to explain it and that in itself can lead other people to create breakthroughs.

I think there are still a few improvements that could be done the the Myers algorithms. One specifically, is that I'm pretty sure it's tail recursive. You won't get any big O improvements off of that, but less stack space is still an improvement. Plus better cache locality. I'd work on some demos, but too many things to do these days...

smrq · on Oct 29, 2019

Hey! You made the hex editor I use. Thanks! :)

pstch · on Oct 29, 2019

To anyone that wants to learn or understand the Myers Diff algorithm, the first link in the parent post has great interactive visualization, which also cover the various improvements made on it by robertelder. And by the way, thanks a lot for all this, it has helped me a lot !

hombre_fatal · on Oct 29, 2019

Wow, you weren't kidding. That's some good stuff.

im3w1l · on Oct 29, 2019

Since the problem has already been resolved, I'll go on a slight tangent and mention there are some very interesting probabilistic diffing algorithms invented by the bio-informatics people.

E.g https://en.wikipedia.org/wiki/BLAST_(biotechnology)

terrelln · on Oct 29, 2019

Slightly off topic, but one great way to diff is to compress the new file using the old file as a dictionary. Both zlib and zstd support this (but zlib is limited to a 32KB file).

You can tune how much CPU you want to spend getting a better diff by adjusting the compression ratio, the compressor is well tuned for speed and is generally more efficient than a diffing algorithm, it can find repetitions in the new content, and you get entropy coding.

andai · on Oct 29, 2019

That's really neat, do you maybe have any links on this?

anonytrary · on Oct 29, 2019

You're most likely not getting a minimal diff, but it's highly likely that your diff is good enough for your use-case. React and most other similar VDOM libraries also fail to achieve minimal diffs, however, that's not necessarily a bad thing. Just because a diff isn't minimal doesn't mean it's wrong.

The ideal trade-off (diff-size vs space-time characteristics) depends on the use-case. It might be more important for a VDOM library to obtain a good-enough diff in O(N) time, as opposed to an LCS in O(NM) space and time. Linear diffing for VDOMs works well, especially with heuristics and things like explicit and implicit key matching. Sometimes the asymptotic characteristics of a program are irrelevant if your system is bounded.

victorNicollet · on Oct 29, 2019

React also doesn't need a diff to be of the "longest common subsequence" kind. For instance, the diff between "AB" and "BA" involves no insertion or deletion, just transposition, which React supports but traditional diffs don't.

anonytrary · on Oct 30, 2019

What you're talking about is called "implicit key matching" in React-speak. AFAIK it isn't implemented as a transposition, rather an "insert before" because the DOM API supports it directly. I have written a VDOM algorithm that relies only on swaps, deletions and insertions, but it doesn't map well to the DOM API.

devit · on Oct 29, 2019

React doesn't compute diffs, and it would be wrong for it to do so since components are stateful, and so it matters how they are matched (the idea that VDOM diffing is just an optimization is a very common misconception).

Instead, you just specify keys for elements (either explicitly or by index in the sequence) and it matches elements with the same key in the old and new trees.

anonytrary · on Oct 30, 2019

> React doesn't compute diffs

I think this is a bit pedantic. React computes what is essentially an edit path from one tree to another, so yes, it is computing diffs for all intents and purposes. That said, you're right that implicit and explicit key matching are not only for optimizations; they're for persisting state as well.

heavenlyblue · on Oct 29, 2019

But elements with ids and their attributes can be changed after a re-render and you need to sync those with DOM.

kazinator · on Oct 29, 2019

I just found an O(M + N) diffing algorithm:

  $ my-diff yours mine
  --- yours 2019-10-28 15:11:07.150118090 -0700
  +++ mine  2019-10-28 16:46:42.093923066 -0700
  @@ -1,50 +1,75 @@
  - yours
  - yours
  [...]
  - yours
  + mine
  + mine
  [...]
  + mine

Spit out the diff and hunk header, then dump yours preceded by "- " and then mine preceded by "+ ".

phinnaeus · on Oct 29, 2019

This is pretty great, the edit script is always two steps.

1. delete left

2. insert right

jerf · on Oct 29, 2019

There's a non-trivial sense in which you'd be justified calling this O(1), inasmuch as the edit script "delete yours, insert mine" can be yielded up as an edit script without even reading "yours" or "mine", simply yielding "yours" and "mine" by reference. That's in a more abstract sense than something "diff" could take. However, with at most a slight change to "diff" (and I'm not sure it's even that, I think it may have this already I'm just not checking for an HN post), you can yield an O(1) "delete yours" and then insert mine, so it's O(N) without the M [1].

In UNIX terms you can just spec out "cp mine yours" as the edit script, which is O(1) from the perspective of the diff algorithm. Applying that script is O(N), of course, but generating it is constant.

1: Super pedants will note that O(N + M) isn't terribly meaningful here since adding two linear terms means you can reduce that to O(N) under the big-O rules. However, I've always like keeping the information like that in the O notation, in some sense precisely because O(N+M) is-equivalent-to O(N) in big-O terms, but if you have additional knowledge like for some reason, M is exponential in a term you care about you can generate the new big-O term that you care about. In graph theory, my professor tended to want to express O(edges) as O(nodes^2), but I always liked retaining the O(edges) because O(edges) yields useful information if you know you have a sparse graph, for instance, whereas O(nodes^2) removes that. To his credit, he never marked me down for it, which puts him ahead of a lot of professors and their personal preferences...

syrak · on Oct 29, 2019

O(N + M) is not equivalent to O(N) if you make no assumptions about the relative growths of N and M, that's why it's actually meaningful to keep both terms around. You can only reduce it to O(N) when N dominates M (M = O(N)).

jerf · on Oct 29, 2019

In the case of file size inputs to a diff algorithm, it would not normally make sense to discuss how one file is going to be "exponentially" larger than another or something. They just are what they are, the bare inputs to the function in question.

By contrast, graphs can have significantly different relationships between nodes and edges, for instance, with sparse vs. dense graphs.

One could construct a situation in which file size inputs are varying related to each other in some O(...)-significant way, but it would be very strained, especially in the context of a diff algorithm (why are you trying to diff things significantly different in size at all?). You can fiddle with almost any O(...) pronouncement that way, e.g., "adding two integers isn't O(1)". True enough, but generally, more detail then we are looking for.

lorenzhs · on Oct 30, 2019

You’re wrong, the sizes of both input files are commonly considered when analysing algorithms that work on two files. Consider a problem that could be solved by sorting one of the files and then stepping through both files item by item (not saying that any meaningful problem with this property exists, it’s just an example). Then you would have a complexity of O(n log n + m) assuming n is the size of the file being sorted.

There is valuable information in running time being linear in the size of both files. You could define n as the combined size if you wanted to, but that could be confusing.

ysleepy · on Oct 29, 2019

Does this algorithm find the shortest edit script?

Finding any diff is not really the problem.

mattb314 · on Oct 29, 2019

This is basically the issue. If a single line from the bottom of left is present at the top of right, it will fast-forward all the way through left and fail to match anything else from right. The hard part with diffing isn't finding differences, it's deciding whether they're inserts, edits, or deletions.

keithwhor · on Oct 29, 2019

Nope! That's related to the Longest Common Substring Problem I think, I mentioned it in another comment [0]. Though I have a feeling you could find the shortest edit script with a few easy tweaks, I might check it out. I imagine the tweaks are going to make the the difference between O(N) and O(ND) though.

The problem I had was that I just wanted a simple diff! It's for HTML rendering and not ultra complex at that: usually folks are only changing a few lines at a time, max, between any re-render.

[0] https://en.wikipedia.org/wiki/Longest_common_substring_probl...

dllthomas · on Oct 29, 2019

I think shortest is probably a good proxy if we can't find something better but I think in an editor "easiest to understand" is probably what's actually important.

Not that I have any particular expectation that this algorithm does unexpectedly well at that.

eru · on Oct 29, 2019

Yes. That reminds me of lossy image compression: the real goal is to find a representation that looks as close as possible to the input image to the human eye. In practice, we use proxies.

Not all proxies are equally good. Eg just going by Hamming distance or L2 norm distance would be awful. Proxies that take more features of our visual system into account are going to fare better.

jacquesm · on Oct 29, 2019

This doesn't answer your question but you really should read up on Needleman-Wunsch for matching up strings.

Love this post by the way, wished we would see more of these on HN.

bodeadly · on Oct 29, 2019

Ah, no, the Myers algorithm is also known as the "Sortest Edit Script" and your algo does not follow the shortest path or produce a script (just -1 for mismatches).

More specifically, it's not just something at the end appearing at the beginning causing skipping like others have pointed out. The skipping can happen in the middle. Consider the two sequences A B B C and B C A. Your algo will match the first Bs and then the Cs and skip the common B C together.

Even if you could somehow make it work, it would be O(N) but note that with all of the split() and map() and array indexing, it starts to be more like O(N+NX) where X is that potentially non-insignificant overhead.

Don't let me pooh-pooh you from trying but if you do give-in and use the Myers' algo, here's my C diff implementation:

  http://www.ioplex.com/~miallen/libmba/dl/src/diff.c

This is actually pretty compact as diff implementations go and would probably be a reference for implementing diff in JS.

keithwhor · on Oct 29, 2019

Thank you: I'll definitely reference this if (and when) the bottleneck becomes our diffing implementation, or if as part of the editing experience we want to be able to show diffs between files that aren't stored in git.

Keep in mind that right now the implementation is for an in-browser code editor. The importance of the diff algorithm is primarily to quickly determine which HTML elements to re-render on a line-by-line basis. The changes a user makes between frames are at best single-grouping insertions / deletions (cut, paste, delete, insert). For a 2k LOC file, the render step (pre-diff) was on the order of 200ms: I was repopulating thousands of DOM elements every time a line was changed. Now it's at like 3ms, so under the 16.67ms frame cutoff.

I'm floored by the responses here though, this has been super helpful! I knew I was missing something academic here. :)

victorNicollet · on Oct 29, 2019

Question: why do you ensure that you are only moving forward? Does it matter, for line highlighting, that a line has moved up, or now appears twice?

pratchett · on Oct 29, 2019

Your approach is called hash join in databases.

Not sure about the diffing research though.

vince14 · on Oct 29, 2019

React has absolutely nothing to do with that.

username90 · on Oct 29, 2019

[flagged]

ian0 · on Oct 29, 2019

Not sure about the rest of the folks on this thread but I was delighted to read through this and the comments.

StreamBright · on Oct 29, 2019

Which school gives your the problem to implement a diffing algo as homework?

username90 · on Oct 29, 2019

[flagged]

StreamBright · on Oct 29, 2019

All the ones I went to.

hhmc · on Oct 29, 2019

hash map worst case insertion/search is O(N) so this is potentially worst case O(N^2). IIRC a typical implementation would be more likely to be O(N log N) for some high log factor in the limit, but still worse than O(ND). And this is before we account for the fact that hashing a line probably has an implicit factor of D in it.

keithwhor · on Oct 29, 2019

Hmm -- assumed time complexity for well-built hash functions for get / put operations is O(1), is it not? Sure, absolute worst-case means every operation is going to collide. Hash functions themselves are O(1)... I'm not sure what you mean by "implicit factor of D"? Like hashing is going inflate the runtime by a small factor basically equivalent to the edit distance in Myers' approach?

I've been looking through a little bit more literature and it's clear to me that my implementation as it stands is not equipped to solve the Longest Common Substring problem [0]. Right now it's too greedy. I think with a tweak it could.

I think I'm mostly just surprised because when I sat down at my computer today, I had the impression that diffing was a "hard problem" -- to be able to implement something that (apparently) works in a short timeframe made me feel as though I did something wrong. (I have this code running in a staging environment now with a code editor, and it seems to be working smoothly and as expected.)

[0] https://en.wikipedia.org/wiki/Longest_common_substring_probl...

hhmc · on Oct 29, 2019

Replying here because I can't edit my orignal comment. But I stand by what I said -- in the pathalogical case all of your keys map to the same bucket, and you hit O(N^2) or O(N log N) behaviour (depending on what exactly happens within your bucket).

I appreciate that in the typical case this won't happen, but big O denotes limiting behaviour, which in this case in O(N^2) or O(N log N) depending on your hash map implementation.

hhmc · on Oct 29, 2019

My understanding is that from a purely theoretical point of view you can call the operations Θ(1), but not O(1) -- since the worst case must account for the pathological case.

Yessing · on Oct 29, 2019

by definition Θ(1) implies O(1).

A sandwich means you have a slice of bread on the upper side.

heavenlyblue · on Oct 29, 2019

I have just been interviewing at several companies that have highly technical interviews in London, and about 80% of the interviewers who asked questions about hash maps expected me to assume that hashmaps imply constant-time lookups.

People tend to dismiss the statistics related to hash collisions because they think they are absolutely irrelevant to “real-life scenarios”.

Shoop · on Oct 29, 2019

This is because hash tables resize when they are filled with too many elements. Assuming a suitably good hash function, this gives O(1) lookup w.h.p. and insertion stays O(1) amortized. The mathematics of hash collision probabilities is factored into the resizing thresholds.

Many standard library implementations of hash tables (such as rust's) also include special hash functions which are salted with random values to prevent DoS attacks which create large numbers of hash collisions.

eru · on Oct 29, 2019

If you are really paranoid, you can even use a salted cryptographic hash function. But the constant factors for those are usually worse than for simpler salted hashes, so they are not worth it for hashtables.

Your fallback for collisions could also be something with O(log size_of_bucket) runtime, instead of linked lists. But again, when you don't have many collisions, that's going to be slower than something simpler.

(I half remember a result that if you tune your parameters right, you can get away with a hashtable that only has enough size for storing one element in each slot; and if a collision happens, you replace that element with a tomb stone, and store put the elements involved in a single global linked list.

Basically, for that to work you need to keep your total number of collision smaller than a constant with high probability.)

LgWoodenBadger · on Oct 29, 2019

Once your hashes collide, you have no other mechanism to use besides equality. How can you use something other than a list in order to find the actual value? This being for the general case where items don't have an "order/rank/sort" property between them.

eru · on Oct 29, 2019

Most of the time, when you talk about collisions we mean that two different keys get hashed to the same bucket.

What mechanisms you have available depends on your implementation.

For example Cuckoo Hashing (https://en.wikipedia.org/wiki/Cuckoo_hashing) relies on having two different hash functions available.

And yes, for having something like a balanced tree, having a comparison would be useful.

In general, most hash functions work by consuming something like a stream of bits. So for your implementation it makes a lot of sense for the datatypes you want to use as keys to export a way to convert themselves into that stream, and leave the actual hashing into a fixed size as a detail of your hashtable.

That way you can eg do fall-back comparisons directly on stream of bits (including for equality). Or you can transparently support multiple hash functions.

Even in languages where the hashable interface works by giving you a method to spit out eg a 64 bit number only, you still have to map that number to one of your buckets. So for your fall-back, you can choose a different mapping.

LgWoodenBadger · on Oct 29, 2019

My point was "what else can you use besides a linked list to store hash-colliding values?"

If you have a second (or a third, or a fourth...) hashing algorithm, then make it hash tables all the way down. At the end, you still need some data structure to store hash-colliding values. And if so, what other structure could you possibly use besides a list (linked-, array-, etc.) ?

eru · on Oct 30, 2019

> If you have a second (or a third, or a fourth...) hashing algorithm, then make it hash tables all the way down. At the end, you still need some data structure to store hash-colliding values.

Why? You can have just two layers: the primary hash table and your buckets are made up of one small secondary hashtable each. If there's a collision in the bucket hashtable, pick a now random hash function for that bucket and re-hash everything in the bucket.

If that fails after trying a few times, pick a new random hash for the primary hash table and consider resizing.

I bet you can make that scheme workable in O(1) expected amortised time for inserts and lookups.

Cuckoo hashing (https://en.wikipedia.org/wiki/Cuckoo_hashing) is a related idea: you just have to hash functions. If you had three elements that collide for both hash functions each, you just re-hash your entire table.

(From Wikipedia:)

> When a new key is inserted, and one of its two cells is empty, it may be placed in that cell. However, when both cells are already full, it will be necessary to move other keys to their second locations (or back to their first locations) to make room for the new key. A greedy algorithm is used: The new key is inserted in one of its two possible locations, "kicking out", that is, displacing, any key that might already reside in this location. This displaced key is then inserted in its alternative location, again kicking out any key that might reside there. The process continues in the same way until an empty position is found, completing the algorithm. However, it is possible for this insertion process to fail, by entering an infinite loop or by finding a very long chain (longer than a preset threshold that is logarithmic in the table size). In this case, the hash table is rebuilt in-place using new hash functions: [...]

You say:

> And if so, what other structure could you possibly use besides a list (linked-, array-, etc.) ?

Any datastructure you feel like. You can also use a balanced search tree, if you want to. See eg https://en.wikipedia.org/wiki/Hash_table#Separate_chaining_w...

hhmc · on Oct 29, 2019

But in the pathological case where all the inputs collide to the same hash, I don't see how you avoid degrading to at least O(N log N), even with resizing -- you must account for the resizing as an O( g(N) ) operation.

eru · on Oct 29, 2019

Well, it depends on your hashtable implementation, and where your data comes from.

Python's old hashtables used to be good enough for most practical uses; but when the input data was controlled by an adversary, it was easy to run a Denial of Service attack.

See https://bugs.python.org/issue13703 They started randomising the hash function at runtime.

Hashtables can be made to run in O(1) with arbitrarily high probability for insert and lookup, if you are willing to be very careful with the implementation, and willing to endure slightly worse constant factors in some cases. (Of course, it also depends on what model of computation you are using. At some scale your keys have to grow like O(log n) of the number of elements, just so that you can distinguish all elements. But we usually abstract away from that.)

But yes, just programming any old hashtable doesn't magically give you O(1) performance, you need to work for it.

flukus · on Oct 29, 2019

Another common one is see is conflating constant-time with no-time. It's near enough to true for a single lookup, but 30,000 individuals lookups in a tight loop and you've got some real performance issues that are relatively invisible.

eru · on Oct 29, 2019

Hash tables also effectively randomize your access patterns.

That can be bad for memory locality, especially if you outgrow your main memory (or just various smaller caches before that.) All without changing the O(1) w.h.p. asymptotics.

flukus · on Oct 29, 2019

Usually you pay the memory locality price twice too, once for the index and once for the actual data, especially in OO languages. Thankfully fitting in main ram has never been an issue for me, but I've seen the results when some of those fetches are in a database or external memory cache.

eru · on Oct 29, 2019

One standard example I use is finding duplicates in a big collection of ite s either via a hashtable or via merge sort. The collection would be big enough not to fit into RAM, so you get swapping onto a hard drive.

Merge sort will mostly just work, because all its memory access patterns are sequential. Hashtables would need to pay for seeks all the time.

So O(n log n) mergesort would be faster than O(n) hashtable approach in this case.

(There are ways to improve the absolute running time further compared to these naive approaches.)