Launch HN: Synth (YC S20) – Realistic, synthetic test data for your app

brosky117 · on Aug 18, 2020

Congrats on shipping Christos, Damien, and Nodar! I really like this idea. I have this problem at my company.

Two questions:

First, we’re using Postgres and some of our tables use JSON. Would Synth be able to generate realistic JSON? Sometimes this is configuration (which would need to be straight copied) and other times it would be data (which would need to keep the same keys but have generated values). Is this use case supported?

Second, I’m concerned about giving Synth access to my data as much of it is sensitive. I understand that you need access to production data to offer the service. What can you tell me about your data security to help me feel more comfortable? (i.e. What kind of data would you have stored on your end? How does the CLI work? etc)

Congrats again and good luck!

openquery · on Aug 18, 2020

Thanks and great questions!

> First, we’re using Postgres and some of our tables use JSON...

We've seen this before when we were talking to a company we were considering to pilot pre-launch - it's on our roadmap. Currently the JSON text would be treated as a string, i.e. it is classified as a categorical type or text.

What we would want is for the classifier to traverse the JSON object instead of treating it like text. This feature is going to be implemented when we extend to NoSQL databases.

> Second, I’m concerned about giving Synth access to my data as much of it is sensitive.

Absolutely. This has been one of the guiding principles in building Synth. We've built it so that our servers never have to see any sensitive information. (Hence why you can use Synth via a CLI tool instead of an API)

Also:

1) The CLI is soon to be OSS giving full visibility into exactly what's happening when you use it. (Really it's OSS now since you can just take a look at the source code running in the container, we just haven't had the time to make our repo public)

2) The models are designed to be transparent. You can inspect them by running `synth model inspect <model-id>`. This gives you visibility into exactly what the model looks like. (Looking at the data which has been sampled is still a WIP)

3) If something goes wrong and sensitive information is uploaded to the Synth platform, you can easily purge all traces of it using `synth model rm <model-id>`

sbecker · on Aug 18, 2020

> We've built it so that our servers never have to see any sensitive information.

If true, this is a key selling point and should probably be somewhere near the top of the homepage. I didn't get that point from reading any of the copy.

openquery · on Aug 18, 2020

Thanks for the feedback. I'll make sure this is clear.

Why is this important for you?

imInGoodCompany · on Aug 19, 2020

(not OP, but) from a European perspective, it means one less GDPR headache. At the company I work for I know having PII going through a 3rd party server for this kind of purpose would be a no-go.

cowb0yl0gic · on Aug 18, 2020

This is almost identical to a project idea I've had banging around for...um...6 years now. :) Glad to see someone is running with it, and also that you have data privacy as a 1st-class citizen. One idea for the data model: domain-specific descriptors (ex., not just a date, but a human birthdate with specific parameters (think healthcare applications: pediatrics vs general inpatient); this could be derived from sample/production data, but when designing a new application, one might need to have finer control over things like distribution (normal vs. skewed), min/max, etc.). If someone is designing a new report for an existing application, but wants synthetic data to use for dev/testing and UAT, the report "target data profile" may diverge from historical production data in very specific ways (ex., introducing new types/classes of products).

openquery · on Aug 18, 2020

Thanks for your comment :)

These are all very good points. We are in the process of figuring out a natural way to express user-specified semantic types. We have some ideas but more on this coming soon!

nartz · on Aug 18, 2020

Hey guys - here's just some critical feedback from a fellow dev - here's my n of 1 perspective - of course this could be a very different perspective for e.g. large enterprise companies struggling with this.

Feedback:

It seems overly complicated. You lost me when you said i have to train models? Are you assuming that software developers want to train machine learning models to do something as simple as creating some test data? In reality - I reach for tools that make things easier for me, which includes not having to read a ton of documentation, download new external tools, and things that 'just work'.

It is 100% easier for me to export a little production data to test on (and maybe sanitize), or to write a small script to generate a few users and those things I need to test. Plus - then I know exactly what I'm going to get. A lot of times, after I've done this once, it will work for a good while as well - if I do change the schema, I can add some additional data for that column, and go from there, or otherwise.

For those companies who have 'messy' fixture data - is the tool the issue? My take is that the difficulty with maintaining the data could contribute to this issue, but is also more an issue of simply bad housekeeping - e.g. rushing and not tending the garden. While your system might handle this, your system also seems to require a different skillset (e.g. specific training/knowledge) than the standard QA developer might have.

If I did use it, i'd prefer it to be much easier to use - if I could include a ruby gem, and incorporte it into the testing progress, e.g. an 'after' hook after migrating the db, that would be ideal. Then, I dont really need to know much. However, I would still be concerned about whether this is deterministically creating data or if its random?

Good luck!

openquery · on Aug 18, 2020

Thanks for the feedback. This is exactly what we're looking for.

> It is 100% easier for me to export a little production data to test on (and maybe sanitize), or to write a small script to generate a few users and those things I need to test.

In your case it may very well be. But when you are an organization with a schema which has 100+ tables and these tables have scattered sensitive information this can become a nightmare to manage. I've seen this first hand. Furthermore if you are trying to generate more than 'a little' data this can get more complex as you have to create factories and write a lot of code to make the whole thing coherent and tell a story. I think undertaking the added complexity of Synth is a trade-off one should consider depending on the sophistication of the testing data they require.

> If I did use it, i'd prefer it to be much easier to use

I think this misconception may be attributed to the fact that we use machine learning under the hood. We've spent a lot of time abstracting the developer away from this. In fact you can run the whole lifecycle with 1 line of code:

`synth model new --from-database <database-uri> --train --deploy`

> I would still be concerned about whether this is deterministically creating data or if its random?

At this point you can choose. You can either pick a seed with which the whole generation process starts (this may not be in production yet) or elect to randomly seed it.

Thanks for the great questions :)

treis · on Aug 18, 2020

>things I need to test

I think this is the biggest problem. I don't need a lot of random data in my database. I need a lot of specific scenarios set up. And a way to get those scenarios back after I test something.

I've definitely been in a lot of situations where test data is a problem. A particularly egregious one that comes to mind is the poor developer that had to develop the fraud functionality. Marking an account as fraud nuked it in the back end. Lots of angry testers/developers when their favorite test account got marked as fraud.

openquery · on Aug 19, 2020

> I need a lot of specific scenarios set up

Yes we've seen this quite a lot in the wild. The truth is this is not very well defined - how do you get your data to tell a story depends on the story you are trying to tell.

We are trying to come up with a more rigorous framework for abstract representations of 'scenarios' . It's on our roadmap so keep an eye out for this :)

lukeqsee · on Aug 18, 2020

As billed, this is good stuff.

I have a client who has millions of rows of data in production—and we have to run our test suite against production because they have no curated staging data set. This would allow us to save multiple minutes every dev pipeline and local test run (which are typically too slow to even run locally).

Looking forward to see you growing!

lukeqsee · on Aug 18, 2020

This same client is a bit of a penny-pincher.

Are there any plans to open this up so we could host the infrastructure and then pull a SQL import dump or something along those lines after running the CLI part? This would reduce your ongoing costs to reduce our monthly fee? ($130 would be a very tough sell, even though I think the business value is there.)

openquery · on Aug 18, 2020

Hey!

So we are soon introducing the Firehose API. Basically this allows you to point at an arbitrary database and fill it up with as much data as you need from the model.

The Firehose should work for use-case and be much more cost effective.

A more hacky solution for right now, you can spin up a database and run a `select * ...`.

lukeqsee · on Aug 18, 2020

That's perfect! I'll keep an eye out for that.

openquery · on Aug 18, 2020

If you can't wait, you can always run `synth model sample <model-id> --ouput <some-directory> --sample-size <number-of-rows>` which will generate synthetic data directly into your directory as CSV files. You can then ETL that into your database.

Hope this helps :)

joshAg · on Aug 18, 2020

For testing i care a lot about repeatability.

Specifically, i'm interested in testing a web dashboard/app. So if I use synth to populate my db, how would I know whether the backend's endpoints are giving me good data? Is there a way to guarantee a specific set of test data each time (so i can precompute what the values should be), or will i need to start a test run by querying the data base a bunch to see what's in it to figure out what i should expect the test results to be?

Also, is there a way to prepare data for import into an existing db? Right now for some of our testing we have a single staging instance and we deconflict multiple tests by including a randomized 8 character string in all the relevant IDs for precomputed data we insert as part of the testing initialization. For this testing it's not as important that the data is repeatable, but the testers have a few different scenarios they want to test, so I'd need a way to make a low-data, medium-data, and high-data test set where the backing data fit within some ranges.

openquery · on Aug 18, 2020

Hey!

> Is there a way to guarantee a specific set of test data each time

Absolutely. You can seed the model so that the data you get each time is completely reproducible

> For this testing it's not as important that the data is repeatable, but the testers have a few different scenarios they want to test, so I'd need a way to make a low-data, medium-data, and high-data test set where the backing data fit within some ranges.

This is a great use-case for Synth. With the upcoming Firehose API you can point it at an existing database and specify how much synthetic data you want to generate and pump into your db.

For now you can either create a database and write the ETL, or do `synth model sample <model-id> --ouput <some-directory> --sample-size <number-of-rows>` to sample directly from the model into a directory of CSV files and use that to load your database

Feel free to get in touch if you would like to learn more :)

Tarrosion · on Aug 18, 2020

This looks really cool. One question I have is about how much the synthetic data can protect privacy. For example, my company has geospatial event data from our customers. We're very protective of customer identities, and wouldn't want to expose which cities our customers are in. If a model trained on our database notices that the "longitude" column marginal distribution has a spike around (just as an example) -71 degrees (longitude of Boston, where we're located), then presumably the synthetic data would also include a bunch of longitudes near -71 degrees? But there aren't that many cities at longitude -71 degrees, so even the marginal distribution of the synthetic longitudes would reveal something private about our data.

Second question is whether y'all support geospatial data? Both in the sense of "the topology of latitudes and longitudes is not a plane" and "can the model be trained on databases which encode geometries as a single column?"

openquery · on Aug 19, 2020

That's a great question. I had to divert to my co-founder Damien who is spear-heading the research side of the company.

The gist of it is that if the the original data has a spike around -71, you will indeed see a spike in the synthetic copy as well. What it boils down to under the hood is a decision on the value of a continuous degree of freedom between two pieces of information:

- the information that you have a significant number of users located in Boston, and

- the information that any given particular user is located in Boston.

At a high level, we are taking the view that for your synthetic data to be realistic, it would need to spike around Boston if and only if most of your users are in Boston. This also means that you are not leaking information about any given individual user and that the behavior of the crowd is OK. Put more simply, if you have a single user located in Boston and all the others in, say, San Francisco, then your synthetic data should not end up having users in Boston at all.

Currently we do not have any bespoke support for lat/lon data, beyond it being like any other float of course. It is planned for the next release though! So check back in a couple weeks and it'll be there

silverlake · on Aug 18, 2020

I implemented a similar system a while ago, including differential privacy. The data at my firm was so messy the models failed miserably. You really need an analysis phase that can tell a customer whether their data will work or not. I.e. weird distributions, crazy foreign keys, difficult data types.

openquery · on Aug 18, 2020

Yes - you're absolutely right in that data is a messy business.

Even in the early days we've seen crazy data types and constraints that makes our job of completely automating the process hard. However, every instance of this makes the product better and this transfers to the next customer.

> You really need an analysis phase that can tell a customer whether their data will work or not

This is part of the roadmap, it's a non-trivial piece of engineering. In the meantime you can try it for free and see if it works for you :)

iforiq · on Aug 18, 2020

One use case I've seen for this is compliance. For SOC2 and other compliance standards, I think you aren't allowed to use production data for dev/staging environments. An automated way to generate a database with synthetic data would make life much better in such cases.

openquery · on Aug 18, 2020

Absolutely! We spent a bunch of time in the data privacy space before pivoting to Synth. Synth has utility as a dev tool but really does address exactly this issue.

This also ties into GDPR and CCPA compliance - we think that as regulations tighten (which seems almost inevitable) this sort of tooling will empower developers to go quicker and focus on their applications instead of compliance.

sqs · on Aug 18, 2020

Anyone know how this compares to https://www.tonic.ai/? Tonic lets you generate data for safe local dev/testing, and they're also open source and have some big customers.

graerg · on Aug 18, 2020

> Under the hood we use a combination of copulas and deep-learning models to model the distributions and correlations in your dataset (the intuition here is that it's much more useful for developers to have realistic data than just sample from a random number generator)

This is neat, but do users have the option of just doing vanilla RNG if they want?

openquery · on Aug 18, 2020

Hey - good question.

Not right now, but it shouldn't be hard to implement. Is there something some specific use-case this would address?

graerg · on Aug 18, 2020

> it shouldn't be hard to implement

Yeah it seems like it's just a flat/un-informed probability distribution and I'd guess your models are general enough to accommodate that.

A couple use cases come to mind:

1. If I have no data but want to test out various/arbitrary schemas with just a bunch of dummy data. Of course, I could generate it myself (either with ad hoc scripts or building a more general CLI that does this for me), but if Synth just makes it a one-liner in the command line, that's appealing.

2. If it's too burdensome to convince others in my org that you've "built it so that our servers never have to see any sensitive information". Even if I trust you, I then have to make arguments for others to also trust you, when really if all I need is some random data for an empty schema, then that's a whole can of worms I don't need to open.

tekkertje · on Aug 19, 2020

Congrats, looks great and quite useful for the cases you've mentioned.

Only thing is that I initially thought the pricing was a bit high, but that was because I thought there was no trial option. On the second visit found it at the bottom of the page. Maybe an idea for an A/B test in the future to put the trial option right below the pricing?

carlps · on Aug 18, 2020

I'm curious how the model handles text data. Does it use the actual input text from the source db to generate new synthetic data? If I have a column of a bunch of sensitive text that I need sanitized, how will that appear in the output? What is the risk of leaking something sensitive?

openquery · on Aug 18, 2020

Thanks for the question!

For now text data will be marked as `categorical` or `text`. When you have sensitive data you want to use `text` which will provide a lorem-ipsum type generator.

If the model has classified that column with the semantic type `text`, no information from the column should be leaked :)

sammyd56 · on Aug 19, 2020

Very interesting concept. A couple of initial observations:

* Creating models from a file borks on anything non-UTF8 (i.e. most legacy system outputs)

* `synth model inspect` output does not match the docs - how do I see the JSON?

openquery · on Aug 19, 2020

> Creating models from a file borks on anything non-UTF8 (i.e. most legacy system outputs)

Yes - this is a WIP. Thanks for pointing it out

> `synth model inspect` output does not match the docs - how do I see the JSON?

Ah yes this is a typo in the docs. We'll fix it up. What you're looking for is: `synth --format json model inspect <model-id> | jq`

Thanks for the feedback!

withinboredom · on Aug 18, 2020

Does this work with unstructured data (such as cosmosdb?)

openquery · on Aug 18, 2020

Not yet - but it's on our roadmap. Feel free to get in touch if you would like this to be accelerated and we can find out more about your use case :)

AznHisoka · on Aug 19, 2020

When you were working at the hedge fund, what type of datasets were you typically testing? Can you give me some broad examples?

openquery · on Aug 19, 2020

Unfortunately I can't go into specifics here.

What I can say is that these were alternative[0] datasets.

[0] https://en.wikipedia.org/wiki/Alternative_data_(finance)

cmdkeen · on Aug 19, 2020

Looks very interesting and would be a huge win if we were able to use it - any chance Oracle support is on your roadmap?

openquery · on Aug 19, 2020

We haven't looked into the logistics of supporting Oracle yet. The fact that Oracle is closed source makes everything a little bit harder. This was our experience when adding MsSQL Server support.

Feel free to get in touch if you would like to discuss more about your use case :)

hans_castorp · on Aug 18, 2020

Can this be installed on premise? Especially in the light of GDPR it might not be possible to do something like this with data stored "on the outside" (even if it's only a "model").

I know for sure, our customers wouldn't allow this.

openquery · on Aug 18, 2020

Hey - great question!

We've been careful to design Synth such that the model doesn't contain any sensitive information. That being said I completely understand where you're coming from.

We do offer the enterprise version for on-prem deployments. Basically, if you have a Kubernetes cluster you can run Synth on-prem :)

trulala · on Aug 18, 2020

How does it compare to Delphix?

openquery · on Aug 19, 2020

It's hard to say. Delphix is quite opaque on what they do exactly and getting through requires booking a demo.

From what we have seen, Delphix is very much focused exclusively on large enterprise, and by extension does not look like a tool which is focused on the developer experience (could be wrong here).

We are much more focused at addressing the engineers in businesses - at the end of the day it's developers who will be using this tooling.

vosper · on Aug 19, 2020

Offtopic: Was 2020 YC's Year of Developer Tooling or something? Seems like there have been lots of launches for YC-backed dev-tool startups in the past few weeks.

dang · on Aug 19, 2020

There have been, but YC has always funded lots of those. I suspect it's a random cluster in the startup stream. More are coming, too. This is Launch HN season because Demo Day is next week.

svsaraf · on Aug 18, 2020

For those of you who feel this solution is a bit too complex for your workflow, there are a couple of lightweight alternatives, including Sudopoint (https://www.sudopoint.com) which lets you specify what you need and download a CSV, in and out in a few seconds.

To the Synth team, awesome product! Great to see that more tools are getting built to help testing / QA workflows. I think this a huge area for the future. Welcome to the competition. :)

[Disclaimer] I'm the (solo, bootstrapped) founder of Sudopoint

sleepygardener · on Aug 18, 2020

Sorry to say this, but the name "synth" is terribly misleading and generic. Word "synth" is used widely for electronic musical instrument "synthesizer".

openquery · on Aug 18, 2020

I wouldn't say it's misleading but I see where you're coming from. I play the piano so this was what inspired the name.

It turns out that picking a name for a startup/product which is representative of what you do is hard!