Launch HN: Synth (YC S20) – Realistic, synthetic test data for your app
121 points by openquery on Aug 18, 2020 | 48 comments
Hey!

Christos, Damien and Nodar here, and we're the co-founders of Synth (https://getsynth.com). Synth is an API that lets you quickly and easily provision test databases filled with realistic data for testing your application.

We started our company about a year ago, after working at a quantitative hedge fund in London where we built models to trade US equities. Strangely, instead of spending time developing models or building the trading system, a large portion of our time was spent just sourcing and on-boarding datasets to train and feed our models. The process of testing and on-boarding datasets was archaic; one data provider served us XML files over FTP, which we then had to spend weeks transforming for our models to ingest. A different provider asked us to spin up our own database and then sent us a binary which was used to load the data. We had to whitelist their API's IP address and set up a cron job to make sure the dataset was never out of date. The binary took interactive input, so it couldn't be scripted; or rather it could be, but you needed something to mock the interactive params. All this took a junior developer on the team a good 3-4 days to figure out and set up. Furthermore, after our trial expired we decided we didn't actually need this dataset, so those 3-4 days were essentially wasted. Our frustration with the status quo in data distribution is what drove us to start our company.

We spent the first 6 months building a privacy-aware query engine (think Presto but with built-in privacy primitives), but software developers we talked to would frequently divert the topic to the lack of high-quality, sanitised testing data during the software development lifecycle. It was strange: most of us developers and data scientists constantly use some sort of testing data for different reasons. Maybe you want a local development environment which is representative of production but clean of customer data. Or a staging environment which contains a much smaller, representative database so that tests run faster. You could want the dataset to be much bigger, to test how your application scales. Maybe you want to share your database with 3rd-party contractors who you don't necessarily trust. Whichever way you put it, it's strange that for a problem most of us face every day, we have no idiomatic solution. We write bespoke scripts and pipelines which often break. They are time-consuming to write and maintain, and every time your schema changes you need to update them manually. Or we get lazy and copy/paste production.

We finally listened to all this feedback, dropped the previous product, and built Synth instead. Synth is a platform for provisioning databases with completely synthetic data.

The way Synth works can be broken into 3 main steps. You first download our CLI tool (a bunch of Python wrapped up in a container) and point it at your database to create a model (we host the models on the Synth platform). This model encodes your schema and foreign key relationships, as well as a semantic representation of your types. We currently use simple regular expressions to classify the semantic types (for example an address or a license plate). The whole model is represented as a JSON object; if the classifier gets something wrong, you can easily change the semantic type. Once the model has been created, the next step is to train it. Under the hood we use a combination of copulas and deep-learning models to model the distributions and correlations in your dataset (the intuition here is that it's much more useful for developers to have realistic data than data sampled from a random number generator). The final step is to use the trained model to generate synthetic data. You can either sample directly from the model, or we can spin up a database for you and fill it with as much data as you need. The generation step samples from the trained model to create realistic data, as well as utilising bespoke generators for sensitive fields (credit card numbers, names, addresses etc.).
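For intuition on the copula part, here's a minimal sketch in Python (an illustration of the general technique rather than our actual implementation; the toy columns are made up). The idea is to preserve each column's marginal distribution while also preserving the cross-column correlations:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # Toy "production" table: two correlated columns, e.g. age and income.
    age = rng.normal(40, 10, 1000)
    income = age * 1200 + rng.normal(0, 5000, 1000)
    data = np.column_stack([age, income])

    # 1. Map each column to normal scores via its empirical ranks.
    ranks = stats.rankdata(data, axis=0) / (len(data) + 1)
    scores = stats.norm.ppf(ranks)

    # 2. Fit the correlation structure on the normal scores.
    corr = np.corrcoef(scores, rowvar=False)

    # 3. Sample new correlated normals, then map back through each
    #    column's empirical quantiles to recover the original marginals.
    draws = rng.multivariate_normal(np.zeros(2), corr, size=1000)
    u = stats.norm.cdf(draws)
    synthetic = np.column_stack(
        [np.quantile(data[:, j], u[:, j]) for j in range(data.shape[1])]
    )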

You can run the entire lifecycle in a single command: you point the CLI tool at your database (currently Postgres, MySQL and MsSQL) and in ~1 minute you get an IP address and credentials for your new database with completely synthetic data.

We're long-time fans of HN and are eagerly looking forward to feedback from the community (especially criticism). We've made a free version available for this week so you can try it with no strings attached. We hope some of you will find Synth useful. If you have any questions, we'll be around throughout the day. Also feel free to get in touch via the site.

Thanks! ~ Christos, Damien & Nodar




Congrats on shipping, Christos, Damien, and Nodar! I really like this idea. I have this problem at my company.

Two questions:

First, we’re using Postgres and some of our tables use JSON. Would Synth be able to generate realistic JSON? Sometimes this is configuration (which would need to be straight copied) and other times it would be data (which would need to keep the same keys but have generated values). Is this use case supported?

Second, I’m concerned about giving Synth access to my data as much of it is sensitive. I understand that you need access to production data to offer the service. What can you tell me about your data security to help me feel more comfortable? (e.g. what kind of data would you have stored on your end? How does the CLI work? etc.)

Congrats again and good luck!


Thanks and great questions!

> First, we’re using Postgres and some of our tables use JSON...

We've seen this before when talking to a company we were considering for a pre-launch pilot; it's on our roadmap. Currently the JSON text is treated as a string, i.e. it is classified as a categorical type or text.

What we would want is for the classifier to traverse the JSON object instead of treating it like text. This feature is going to be implemented when we extend to NoSQL databases.
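To illustrate, here's a hypothetical sketch of the difference, written as Python dicts (this is the idea only, not our actual model format):

    # Today: a JSON column is classified as one opaque value.
    current = {"settings": {"type": "text"}}

    # Goal: traverse the object and classify each key separately.
    desired = {
        "settings": {
            "type": "object",
            "fields": {
                "theme": {"type": "categorical"},     # configuration: copied as-is
                "contact_email": {"type": "email"},   # data: regenerated by a bespoke generator
            },
        }
    }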

> Second, I’m concerned about giving Synth access to my data as much of it is sensitive.

Absolutely. This has been one of the guiding principles in building Synth. We've built it so that our servers never have to see any sensitive information. (That's why you use Synth via a CLI tool instead of an API.)

Also:

1) The CLI is soon to be OSS, giving full visibility into exactly what's happening when you use it. (Really, it's effectively OSS now, since you can just take a look at the source code running in the container; we just haven't had the time to make our repo public.)

2) The models are designed to be transparent. You can inspect them by running `synth model inspect <model-id>`. This gives you visibility into exactly what the model looks like. (Looking at the data which has been sampled is still a WIP)

3) If something goes wrong and sensitive information is uploaded to the Synth platform, you can easily purge all traces of it using `synth model rm <model-id>`


> We've built it so that our servers never have to see any sensitive information.

If true, this is a key selling point and should probably be somewhere near the top of the homepage. I didn't get that point from reading any of the copy.


Thanks for the feedback. I'll make sure this is clear.

Why is this important for you?


(not OP, but) from a European perspective, it means one less GDPR headache. At the company I work for I know having PII going through a 3rd party server for this kind of purpose would be a no-go.


This is almost identical to a project idea I've had banging around for...um...6 years now. :) Glad to see someone is running with it, and also that you have data privacy as a first-class citizen. One idea for the data model: domain-specific descriptors. For example, not just a date, but a human birthdate with specific parameters (think healthcare applications: pediatrics vs. general inpatient). These could be derived from sample/production data, but when designing a new application, one might need finer control over things like distribution (normal vs. skewed), min/max, etc. And if someone is designing a new report for an existing application but wants synthetic data for dev/testing and UAT, the report's "target data profile" may diverge from historical production data in very specific ways (e.g., introducing new types/classes of products).


Thanks for your comment :)

These are all very good points. We are in the process of figuring out a natural way to express user-specified semantic types. We have some ideas but more on this coming soon!


Hey guys, here's some critical feedback from a fellow dev. It's my n-of-1 perspective; of course things could look very different for, e.g., large enterprise companies struggling with this.

Feedback:

It seems overly complicated. You lost me when you said I have to train models. Are you assuming that software developers want to train machine learning models to do something as simple as creating some test data? In reality, I reach for tools that make things easier for me, which means not having to read a ton of documentation or download new external tools, and things that 'just work'.

It is 100% easier for me to export a little production data to test on (and maybe sanitize it), or to write a small script to generate a few users and whatever else I need to test. Plus, then I know exactly what I'm going to get. A lot of times, after I've done this once, it will work for a good while as well; if I do change the schema, I can add some additional data for that column and go from there.
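For reference, here's roughly what I mean; a minimal sketch assuming Python and the `faker` package:

    from faker import Faker  # pip install faker

    fake = Faker()
    Faker.seed(42)  # deterministic, so I know exactly what I'm going to get

    users = [
        {"name": fake.name(), "email": fake.email(), "address": fake.address()}
        for _ in range(10)
    ]
    for user in users:
        print(user)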

For those companies who have 'messy' fixture data: is the tool the issue? My take is that the difficulty of maintaining the data could contribute, but it's also more an issue of simply bad housekeeping, e.g. rushing and not tending the garden. While your system might handle this, it also seems to require a different skillset (specific training/knowledge) than the standard QA developer might have.

If I did use it, I'd prefer it to be much easier to use. If I could include a Ruby gem and incorporate it into the testing process, e.g. an 'after' hook that runs after migrating the db, that would be ideal. Then I don't really need to know much. However, I would still be concerned about whether this is deterministically creating data or if it's random.

Good luck!


Thanks for the feedback. This is exactly what we're looking for.

> It is 100% easier for me to export a little production data to test on (and maybe sanitize), or to write a small script to generate a few users and those things I need to test.

In your case it may very well be. But when you are an organization with a schema that has 100+ tables, with sensitive information scattered across them, this can become a nightmare to manage. I've seen this first hand. Furthermore, if you are trying to generate more than 'a little' data, this gets more complex: you have to create factories and write a lot of code to make the whole thing coherent and tell a story. I think undertaking the added complexity of Synth is a trade-off to consider depending on the sophistication of the testing data you require.

> If I did use it, i'd prefer it to be much easier to use

I think this misconception may be attributed to the fact that we use machine learning under the hood. We've spent a lot of time abstracting the developer away from this. In fact, you can run the whole lifecycle with one line:

`synth model new --from-database <database-uri> --train --deploy`

> I would still be concerned about whether this is deterministically creating data or if it's random

At this point you can choose: either pick a seed with which the whole generation process starts (this may not be in production yet), or elect to seed it randomly.
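For example, something along the lines of `synth model sample <model-id> --seed 42` (that flag name is illustrative, not final).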

Thanks for the great questions :)


>things I need to test

I think this is the biggest problem. I don't need a lot of random data in my database. I need a lot of specific scenarios set up. And a way to get those scenarios back after I test something.

I've definitely been in a lot of situations where test data is a problem. A particularly egregious one that comes to mind is the poor developer that had to develop the fraud functionality. Marking an account as fraud nuked it in the back end. Lots of angry testers/developers when their favorite test account got marked as fraud.


> I need a lot of specific scenarios set up

Yes, we've seen this quite a lot in the wild. The truth is this is not very well defined: how you get your data to tell a story depends on the story you are trying to tell.

We are trying to come up with a more rigorous framework for abstract representations of 'scenarios'. It's on our roadmap so keep an eye out for this :)


As billed, this is good stuff.

I have a client who has millions of rows of data in production, and we have to run our test suite against production because they have no curated staging data set. This would save us multiple minutes on every dev pipeline and local test run (tests are typically too slow to even run locally).

Looking forward to seeing you grow!


This same client is a bit of a penny-pincher.

Are there any plans to open this up so we could host the infrastructure ourselves and then pull a SQL import dump, or something along those lines, after running the CLI part? This would reduce your ongoing costs and, in turn, our monthly fee. ($130 would be a very tough sell, even though I think the business value is there.)


Hey!

So we are soon introducing the Firehose API. Basically, this allows you to point at an arbitrary database and fill it up with as much data as you need from the model.

The Firehose should work for your use-case and be much more cost-effective.

A more hacky solution for right now: you can spin up a database and run a `select * ...`.


That's perfect! I'll keep an eye out for that.


If you can't wait, you can always run `synth model sample <model-id> --output <some-directory> --sample-size <number-of-rows>`, which will generate synthetic data directly into your directory as CSV files. You can then ETL that into your database.
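For illustration, a minimal sketch of that ETL step (assuming Python with the `psycopg2` package, a Postgres target, and one CSV per table named after it; the connection string is made up):

    import glob, os
    import psycopg2

    conn = psycopg2.connect("postgresql://user:pass@localhost:5432/testdb")
    cur = conn.cursor()
    # Note: load order matters if you have foreign key constraints.
    for path in glob.glob("synth-output/*.csv"):
        table = os.path.splitext(os.path.basename(path))[0]
        with open(path) as f:
            cur.copy_expert(f"COPY {table} FROM STDIN WITH CSV HEADER", f)
    conn.commit()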

Hope this helps :)


For testing I care a lot about repeatability.

Specifically, I'm interested in testing a web dashboard/app. So if I use Synth to populate my db, how would I know whether the backend's endpoints are giving me good data? Is there a way to guarantee a specific set of test data each time (so I can precompute what the values should be), or will I need to start a test run by querying the database a bunch to see what's in it and figure out what I should expect the test results to be?

Also, is there a way to prepare data for import into an existing db? Right now for some of our testing we have a single staging instance and we deconflict multiple tests by including a randomized 8 character string in all the relevant IDs for precomputed data we insert as part of the testing initialization. For this testing it's not as important that the data is repeatable, but the testers have a few different scenarios they want to test, so I'd need a way to make a low-data, medium-data, and high-data test set where the backing data fit within some ranges.


Hey!

> Is there a way to guarantee a specific set of test data each time

Absolutely. You can seed the model so that the data you get each time is completely reproducible

> For this testing it's not as important that the data is repeatable, but the testers have a few different scenarios they want to test, so I'd need a way to make a low-data, medium-data, and high-data test set where the backing data fit within some ranges.

This is a great use-case for Synth. With the upcoming Firehose API you can point it at an existing database and specify how much synthetic data you want to generate and pump into your db.

For now you can either create a database and write the ETL, or do `synth model sample <model-id> --output <some-directory> --sample-size <number-of-rows>` to sample directly from the model into a directory of CSV files and use that to load your database.

Feel free to get in touch if you would like to learn more :)


This looks really cool. One question I have is about how much the synthetic data can protect privacy. For example, my company has geospatial event data from our customers. We're very protective of customer identities, and wouldn't want to expose which cities our customers are in. If a model trained on our database notices that the "longitude" column marginal distribution has a spike around (just as an example) -71 degrees (longitude of Boston, where we're located), then presumably the synthetic data would also include a bunch of longitudes near -71 degrees? But there aren't that many cities at longitude -71 degrees, so even the marginal distribution of the synthetic longitudes would reveal something private about our data.

Second question is whether y'all support geospatial data? Both in the sense of "the topology of latitudes and longitudes is not a plane" and "can the model be trained on databases which encode geometries as a single column?"


That's a great question. I had to defer to my co-founder Damien, who is spearheading the research side of the company.

The gist of it is that if the original data has a spike around -71, you will indeed see a spike in the synthetic copy as well. What it boils down to under the hood is a tunable trade-off between two pieces of information:

- the information that you have a significant number of users located in Boston, and

- the information that any given particular user is located in Boston.

At a high level, we are taking the view that for your synthetic data to be realistic, it should spike around Boston if and only if most of your users are in Boston. This also means that you are not leaking information about any given individual user: only the behavior of the crowd comes through, which is OK. Put more simply, if you have a single user located in Boston and all the others in, say, San Francisco, then your synthetic data should not end up having users in Boston at all.
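To make the intuition concrete, here's a toy sketch in Python using numpy (an illustration of the idea, not our actual model; the numbers are made up): resampling from a fitted marginal reproduces the crowd-level spike around -71, while no synthetic row corresponds to any particular real user.

    import numpy as np

    rng = np.random.default_rng(0)
    # Toy "production" longitudes: most users near Boston (-71), a few elsewhere.
    real = np.concatenate([rng.normal(-71, 0.1, 950), rng.uniform(-125, -67, 50)])

    # Fit a histogram as a crude marginal distribution and sample from it.
    counts, edges = np.histogram(real, bins=100)
    probs = counts / counts.sum()
    picks = rng.choice(len(probs), size=1000, p=probs)
    synthetic = rng.uniform(edges[picks], edges[picks + 1])

    # The spike near -71 survives at the population level.
    print(np.median(real), np.median(synthetic))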

Currently we do not have any bespoke support for lat/lon data, beyond treating it like any other float of course. It is planned for the next release though! So check back in a couple of weeks and it'll be there.


I implemented a similar system a while ago, including differential privacy. The data at my firm was so messy the models failed miserably. You really need an analysis phase that can tell a customer whether their data will work or not, e.g. weird distributions, crazy foreign keys, difficult data types.


Yes - you're absolutely right in that data is a messy business.

Even in the early days we've seen crazy data types and constraints that make our job of completely automating the process hard. However, every instance of this makes the product better, and that transfers to the next customer.

> You really need an analysis phase that can tell a customer whether their data will work or not

This is part of the roadmap, but it's a non-trivial piece of engineering. In the meantime you can try it for free and see if it works for you :)


One use case I've seen for this is compliance. For SOC2 and other compliance standards, I think you aren't allowed to use production data for dev/staging environments. An automated way to generate a database with synthetic data would make life much better in such cases.


Absolutely! We spent a bunch of time in the data privacy space before pivoting to Synth. Synth has utility as a dev tool but really does address exactly this issue.

This also ties into GDPR and CCPA compliance. We think that as regulations tighten (which seems almost inevitable), this sort of tooling will empower developers to move faster and focus on their applications instead of compliance.


Anyone know how this compares to https://www.tonic.ai/? Tonic lets you generate data for safe local dev/testing, and they're also open source and have some big customers.


> Under the hood we use a combination of copulas and deep-learning models to model the distributions and correlations in your dataset (the intuition here is that it's much more useful for developers to have realistic data than just sample from a random number generator)

This is neat, but do users have the option of just doing vanilla RNG if they want?


Hey - good question.

Not right now, but it shouldn't be hard to implement. Is there a specific use-case this would address?


> it shouldn't be hard to implement

Yeah, it seems like it's just a flat/uninformed probability distribution, and I'd guess your models are general enough to accommodate that.

A couple use cases come to mind:

1. If I have no data but want to test out various/arbitrary schemas with just a bunch of dummy data. Of course, I could generate it myself (either with ad hoc scripts or by building a more general CLI that does this for me), but if Synth just makes it a one-liner in the command line, that's appealing (a sketch of what I mean is below).

2. If it's too burdensome to convince others in my org that you've "built it so that our servers never have to see any sensitive information". Even if I trust you, I then have to make arguments for others to also trust you, when really if all I need is some random data for an empty schema, then that's a whole can of worms I don't need to open.
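To illustrate use case 1, here's roughly the kind of script I'd otherwise write myself (Python; the toy schema and types are made up):

    import random
    import string

    # Hypothetical schema: column name -> SQL type.
    schema = {"id": "integer", "email": "varchar", "balance": "numeric", "active": "boolean"}

    def random_value(sql_type):
        # Flat, uninformed distributions per type; no modelling of real data.
        if sql_type == "integer":
            return random.randint(0, 1_000_000)
        if sql_type == "numeric":
            return round(random.uniform(0, 10_000), 2)
        if sql_type == "boolean":
            return random.choice([True, False])
        return "".join(random.choices(string.ascii_lowercase, k=12))  # varchar fallback

    rows = [{col: random_value(t) for col, t in schema.items()} for _ in range(5)]
    print(rows)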


Congrats, looks great and quite useful for the cases you've mentioned.

Only thing is that I initially thought the pricing was a bit high, but that was because I thought there was no trial option. On a second visit I found it at the bottom of the page. Maybe an idea for a future A/B test: put the trial option right below the pricing?


I'm curious how the model handles text data. Does it use the actual input text from the source db to generate new synthetic data? If I have a column of a bunch of sensitive text that I need sanitized, how will that appear in the output? What is the risk of leaking something sensitive?


Thanks for the question!

For now, text data will be marked as `categorical` or `text`. When you have sensitive data you want to use `text`, which provides a lorem-ipsum-style generator.

If the model has classified that column with the semantic type `text`, no information from the column should be leaked :)


Very interesting concept. A couple of initial observations:

* Creating models from a file borks on anything non-UTF8 (i.e. most legacy system outputs)

* `synth model inspect` output does not match the docs - how do I see the JSON?


> Creating models from a file borks on anything non-UTF8 (i.e. most legacy system outputs)

Yes - this is a WIP. Thanks for pointing it out

> `synth model inspect` output does not match the docs - how do I see the JSON?

Ah yes this is a typo in the docs. We'll fix it up. What you're looking for is: `synth --format json model inspect <model-id> | jq`

Thanks for the feedback!


Does this work with unstructured data (such as CosmosDB)?


Not yet - but it's on our roadmap. Feel free to get in touch if you would like this to be accelerated and we can find out more about your use case :)


When you were working at the hedge fund, what type of datasets were you typically testing? Can you give me some broad examples?


Unfortunately I can't go into specifics here.

What I can say is that these were alternative[0] datasets.

[0] https://en.wikipedia.org/wiki/Alternative_data_(finance)


Looks very interesting and would be a huge win if we were able to use it - any chance Oracle support is on your roadmap?


We haven't looked into the logistics of supporting Oracle yet. The fact that Oracle is closed source makes everything a little bit harder. This was our experience when adding MsSQL Server support.

Feel free to get in touch if you would like to discuss more about your use case :)


Can this be installed on premise? Especially in the light of GDPR it might not be possible to do something like this with data stored "on the outside" (even if it's only a "model").

I know for sure, our customers wouldn't allow this.


Hey - great question!

We've been careful to design Synth such that the model doesn't contain any sensitive information. That being said I completely understand where you're coming from.

We do offer the enterprise version for on-prem deployments. Basically, if you have a Kubernetes cluster you can run Synth on-prem :)


How does it compare to Delphix?


It's hard to say. Delphix is quite opaque about what exactly they do, and finding out requires booking a demo.

From what we have seen, Delphix is very much focused exclusively on large enterprise, and by extension does not look like a tool which is focused on the developer experience (could be wrong here).

We are much more focused on addressing the engineers in businesses; at the end of the day, it's developers who will be using this tooling.


Offtopic: Was 2020 YC's Year of Developer Tooling or something? Seems like there have been lots of launches for YC-backed dev-tool startups in the past few weeks.


There have been, but YC has always funded lots of those. I suspect it's a random cluster in the startup stream. More are coming, too. This is Launch HN season because Demo Day is next week.


For those of you who feel this solution is a bit too complex for your workflow, there are a couple of lightweight alternatives, including Sudopoint (https://www.sudopoint.com) which lets you specify what you need and download a CSV, in and out in a few seconds.

To the Synth team: awesome product! Great to see that more tools are getting built to help testing/QA workflows. I think this is a huge area for the future. Welcome to the competition. :)

[Disclaimer] I'm the (solo, bootstrapped) founder of Sudopoint


Sorry to say this, but the name "Synth" is terribly misleading and generic. The word "synth" is widely used for the electronic musical instrument, the synthesizer.


I wouldn't say it's misleading, but I see where you're coming from. I play the piano, so that's what inspired the name.

It turns out that picking a name for a startup/product which is representative of what you do is hard!




