Show HN: Data Diff – compare tables of any size across databases

alexkoay · on June 23, 2022

This seems to be very useful!

I notice that it casts everything to string for MD5 to work. In that case, how does it handle two databases having different types for the same columns? I'm thinking about floats and numerics (decimal places), timestamps (some have timezone support, some don't) and bytes (hex, base64) in particular, but there are definitely others that I'm missing as well.

erezsh · on June 23, 2022

Great question, this is something we put a lot of effort into.

For both dates and numerics, we format them in a normalized format, and round them to the lowest mutual precision. It gets even more complicated, because when databases reduce the precision of timestamps, some of them round it, and some of them truncate. We implemented both, and we either truncate or round both timestamps, according to which database has the column with the lower precision.

We haven't got to bytes and strings yet, but it's on our list, and I imagine we'll use a similar approach.

For now, we print a warning whenever we don't have special handling for the column type. If you see a value mismatch where it shouldn't be, let us know and we'll implement it next.

oa335 · on June 23, 2022

Hi, data engineer here. There are umpteen data engineering tools that have “Data” in the title. Have you considered a different name?

hichkaker · on June 23, 2022

As a data engineer myself, totally agree about the abuse of the word “data”. What we strived for when naming was to make it self-describing as much as possible. Since the tool does one thing - diff datasets - we could name it “dataset diff” but that seemed more clunky. “table diff” wouldn’t work since we’re working on adding APIs (e.g. Stripe) as a data source to make validation of API-to-database syncs possible, and that goes beyond just tables. There is always an option to give a nondiscriptive or metaphorical name but we were concerned that would make the tool far less discoverable by potential users.

higeorge13 · on June 22, 2022

Awesome tool, we will definitely give it a try! 2 questions:

- how do you handle the data replication lag in the comparison?

- i assume that this works in identical tables between 2 databases, right? Any support for “similar” tables based on a column set? Imagine that we have a use case where we have a table X in one db, and another table Y in another db, with some columns from X and enhanced attributes.

Sirupsen · on June 22, 2022

(1) You can specify `--min-age` to exclude records in the replication lag window

(2) We can do that, as long as they have the same name and compatible-ish types. There isn't support for differently named columns yet (but it's on the list)

Cheers. Open an issue if you run into trouble!

pbnjay · on June 23, 2022

Pretty cool! I did a similar project for flatfiles, but used bloom filters to generate an “index” of row contents to test against later. I feel like a similar idea could work for identifying divergent rows within your segments more quickly/with less repeated work.

Making that work across databases could be a huge pain though, I had some success in Postgre but bitfields in the other DBs were painful.

erezsh · on June 23, 2022

> Making that work across databases could be a huge pain though

That was indeed the main challenge. Each DB has a different syntax, different set of features, different format for timestamps and floats, different max precision, and so on. I'd say most of our work on data-diff went to making sure the behavior of the different DBs aligned with each other.

snidane · on June 23, 2022

What benefit does it give me over running table diff query in SQL?

  select * exclude (date_uploaded) from dev_table
  except
  select * exclude (date uploaded) from prod_table

or

  select *
  from (select * exclude (date_uploaded) from dev_table) dev
  natural full join
  (select * exclude (date_uploaded) from prod_table) prod
  where dev.date_uploaded is null
    or prod.date_uploaded is null

The only issue with the above is that EXCLUDE/EXCEPT is missing from standard SQL and even from market leaders like Snowflake, making this a massive pain in the ass. Second, natural joining in presence of null fields is going to produce a mess instead of something useful. Again - analytics db providers would rather boast about adhering to an ancient standard from the 1970s than listening to users and actually making SQL work after all those decades of pain.

Without the stupid default behavior of SQL, this wouldn't be a problem. I'm curious if Data Diff solves this or some other use case.

erezsh · on June 23, 2022

data-diff has several advantages over running the query you wrote here.

- data-diff can compare tables across different databases. Your query is limited to one database.

- For very big tables, your 'select' will time-out. data-diff splits the diff into small segments, so we side-step this issue.

- data-diff supports running in threaded mode, which means it can finish a LOT faster. (especially for cloud databases.)

malcolp · on June 23, 2022

I think snowflake supports this.

https://docs.snowflake.com/en/sql-reference/operators-query....

snidane · on June 23, 2022

Snowflake doesn't support this kind of except, which is needed for the diff to work seamlessly:

  select * except (column1, column2, ...)

neural_thing · on June 23, 2022

I've been a Datafold customer for a year at two different companies. Great experience. Very useful tool in your CI flow, the team is very responsive.

jnsie · on June 22, 2022

Just curious - is there a reason that SQL Server doesn't make the list of supported platforms (i.e. it appears that there is no plan to support in future)?

Sirupsen · on June 22, 2022

There is plans to support pretty much every database. The reason it’s not supported currently is because its md5 hashing is too slow, so we need to do something different for it, e.g. just sum for types that support it. It’s similar for databases we plan to support that don’t support MD5 too, for example ElasticSearch.

If anyone knows a dev on the MSSQL team we could speak to, we’d be eager to be connected

See https://github.com/datafold/data-diff/issues/51

go_prodev · on June 22, 2022

If you install Visual Studio Community edition you can do free cross server MSSQL data validation and syncing using a "database project". It won't be as flexible, or as scalable, as the Data Diff tool advertised though.

cryptonector · on June 22, 2022

Does FDW let you do performant `FULL OUTER JOIN`s and/or `NATURAL FULL OUTER JOIN`s? If so then I would think that would be a decent place to start for remote DB diffs for PG. If might not be enough, of course, if the tables are huge, in which case taking a page from rsync and using some sort of per-row checksum as TFA does is clearly a good idea.

Sirupsen · on June 22, 2022

I'm not completely sure I understand your comment, so pardon me if I misunderstand. I don't think a foreign data wrapper would fundamentally to be more efficient with whatever table is ~foreign~, especially for an OUTER JOIN? Unless you're basically implementing something similar to data-diff with an OUTER JOIN with FDW, which seems possible

If you're doing in-database diffs, however, a join-based approach will likely outperform data-diff though.

Ideally databases would have support a standard MERKLE TREE INDEX so we could get extremely fast comparisons.

cryptonector · on June 23, 2022

A naive FULL OUTER JOIN is O(N), which is not efficient, indeed.

An RDBMS could implement something like the rsync algorithm, or history tables, etc., to speed up a FULL OUTER JOIN.

The point is that FULL OUTER JOIN is the SQL table source "diff" primitive. Thus it seems natural to use that and let the RDBMS optimize it.

michalg · on June 22, 2022

Can you diff a table with a view? Or only tables are supported?

Sirupsen · on June 22, 2022

If they are materialized into a table, they will work! Diffing queries is not supported yet.

data-diff only relies on a `sum(md5(concat(col1, col2)))` aggregation as well as `min(id), max(id)` to get the id bookends.