We have to work on the scaling. There is a lot of web scraping and many LLM calls that we need to make sure work under load. I'm sure there will be quality improvements as well.
Just came to say, I still think this is the best balance between the many factors of running a dev team. I keep trying to recreate it in every tool I use.
Airbyte acquired our Reverse ETL company, Grouparoo, 1.5 years ago. There is so much to solve in making just the Extract and Load work well, and so much value that comes from that, that we have been busy there. I'm excited to circle back to publishing next year.
I like how the article notes that the stuff we were talking about with Reverse ETL (mostly activating your data in SaaS systems like Salesforce, Zendesk, etc) is one important part of Publishing. But we are also seeing traditional use cases like file uploads and new fancy stuff like vector databases.
I built a system for TaskRabbit that scraped all the IKEA products from a variety of sources and ran algorithms to determine their category and predict how long they would take to be assembled. Then there was a Mechanical Turk sort of system for human input. When combined with real-world feedback from the Taskers, it was pretty good.
For better or worse, I've personally been through the entire catalog multiple times.
Maybe as Turtle or JSON-LD, but you need a format that encapsulates triples and various data types. Otherwise you're throwing away most of the utility of a semantic web knowledge base.
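To make the point about triples and data types concrete, here is a minimal sketch in plain Python (no RDF library) that writes typed triples as N-Triples lines, which are also valid Turtle. The `IRI` and `Literal` classes and the example.org identifiers are made up for illustration, and only a few XSD types and basic escaping are handled.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple, Union

XSD = "http://www.w3.org/2001/XMLSchema#"


@dataclass(frozen=True)
class IRI:
    """Hypothetical wrapper for a resource identifier."""
    value: str


@dataclass(frozen=True)
class Literal:
    """Hypothetical wrapper for a typed value."""
    value: Union[str, int, float, bool]

    def datatype(self) -> str:
        # bool is checked first because bool is a subclass of int in Python.
        if isinstance(self.value, bool):
            return XSD + "boolean"
        if isinstance(self.value, int):
            return XSD + "integer"
        if isinstance(self.value, float):
            return XSD + "double"
        return XSD + "string"


def term(t: Union[IRI, Literal]) -> str:
    """Render one term; only backslashes and double quotes are escaped."""
    if isinstance(t, IRI):
        return f"<{t.value}>"
    if isinstance(t.value, bool):
        lexical = "true" if t.value else "false"
    else:
        lexical = str(t.value)
    escaped = lexical.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"^^<{t.datatype()}>'


def to_ntriples(triples: Iterable[Tuple[IRI, IRI, Union[IRI, Literal]]]) -> str:
    """Serialize (subject, predicate, object) tuples, one statement per line."""
    return "\n".join(f"{term(s)} {term(p)} {term(o)} ." for s, p, o in triples)


if __name__ == "__main__":
    item = IRI("http://example.org/item/1")
    print(to_ntriples([
        (item, IRI("http://example.org/name"), Literal("Bookcase")),
        (item, IRI("http://example.org/heightCm"), Literal(202)),
        (item, IRI("http://example.org/inStock"), Literal(True)),
    ]))
```

A flat CSV or plain-text dump would lose both the subject/predicate/object structure and the datatype on each value, which is the utility being described.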
At Grouparoo, this is a primary use case. We have a UI that engineers use locally. This helps get things right. It outputs a JSON configuration that is checked in. When that is deployed, it does all the syncing.
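For a rough idea of what "checked-in JSON drives the sync" can look like, here is a small sketch. The config shape, the field names, and the `load_config` / `build_payload` helpers are all hypothetical and are not Grouparoo's actual schema or API.

```python
import json
from typing import Any, Dict

# Hypothetical config of the kind a local UI might write out and commit.
CONFIG_JSON = """
{
  "destination": "mailchimp",
  "mapping": {"email_address": "email", "FNAME": "first_name"}
}
"""


def load_config(text: str) -> Dict[str, Any]:
    """Parse the config and fail early if a required key is missing."""
    config = json.loads(text)
    for key in ("destination", "mapping"):
        if key not in config:
            raise ValueError(f"config is missing required key: {key}")
    return config


def build_payload(config: Dict[str, Any], record: Dict[str, Any]) -> Dict[str, Any]:
    """Map a source record to destination fields as described by the config."""
    return {dest: record.get(src) for dest, src in config["mapping"].items()}


if __name__ == "__main__":
    config = load_config(CONFIG_JSON)
    record = {"email": "ada@example.com", "first_name": "Ada", "plan": "pro"}
    print(config["destination"], build_payload(config, record))
```

Because the mapping lives in a file, a change to it can go through a branch, a PR, and CI like any other code change.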
Congrats on the launch! Hightouch looks great and this need is real. Things seem to be going well, so I don't think I'm taking too much away by mentioning that we have been working on Grouparoo, an open source alternative that solves similar pain points.
A few differences: git developer workflow focused (branches, CI, PRs, etc), ability to self host, segmentation in destinations (tagging people in mailchimp based on rules, for example)
Hightouch user here. HT actually has a lot of that - git integration [0], visual segmentation [1].
Not sure about self-hosting though. Open-source is cool, will check it out.
Haha thanks. Love some friendly competition :). In all seriousness, though we're focusing elsewhere, the OSS angle is cool.
If you're interested in self-hosted though, just reach out at hello@hightouch.io.
That said, IMO one of the coolest parts of our tech is our "hybrid architecture". Out of the box, no data is stored in Hightouch - it's all in your cloud (warehouse, S3 bucket). This is how fintech (Plaid, Blend, Betterment, + some banks now!) and healthcare brands like Headway use us. We've also done a ton of compliance work and have certificates for SOC2 Type II and whatnot.
There are probably some nuances one level down. Things our users have told us they can do in these areas that, to my knowledge, Hightouch doesn't do:
* Combine data from different sources to define a model. We've seen people use Postgres as a source of truth and supplement it with Snowflake data, for example.
* Add tags to contacts in mailchimp, zendesk or make lists of them in customer.io, Pardot, etc based on segmentation. I believe Hightouch Audiences is more like a filter.
* Full workflow with branches, PRs, test suite in a repo. I saw Hightouch added git syncing to a known branch yesterday and it looks cool, but it's not the full workflow yet.
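To illustrate the tagging bullet above, here is a minimal sketch of rule-based segmentation that turns into tag changes. The rule names, profile fields, and thresholds are invented for the example, and this is not how either product implements it.

```python
from typing import Any, Callable, Dict, Iterable, List

Profile = Dict[str, Any]
Rule = Callable[[Profile], bool]

# Hypothetical segment rules; each one maps a tag name to a predicate.
SEGMENT_RULES: Dict[str, Rule] = {
    "high_value": lambda p: p.get("lifetime_value", 0) >= 1000,
    "churn_risk": lambda p: p.get("days_since_login", 0) > 30,
}


def tags_for(profile: Profile, rules: Dict[str, Rule] = SEGMENT_RULES) -> List[str]:
    """Return the sorted tags whose rule matches this profile."""
    return sorted(name for name, rule in rules.items() if rule(profile))


def tag_changes(current_tags: Iterable[str], desired_tags: Iterable[str]) -> Dict[str, List[str]]:
    """Diff the destination's current tags against the desired ones."""
    current, desired = set(current_tags), set(desired_tags)
    return {"add": sorted(desired - current), "remove": sorted(current - desired)}


if __name__ == "__main__":
    profile = {"email": "ada@example.com", "lifetime_value": 1500, "days_since_login": 3}
    desired = tags_for(profile)
    print(tag_changes(["churn_risk"], desired))
```

In a real sync the "remove" side would need to be limited to tags the tool itself manages, so that tags added by hand in the destination are left alone.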
I'm certainly trying to keep it in the friendly-competition area, especially on this thread :-)
This probably isn't the best place for an extended comparison, but since it's our launch post, I'll try to close the thread with a couple corrections for factuality. If anyone is interested in a deep-dive, email hello@hightouch.io, and I'm happy to set one up personally. And, I'm sure the team at Grouparoo would be willing to do the same ("contact us" at bottom of their website).
* Add tags to contacts in mailchimp, zendesk or make lists of them in customer.io, Pardot, etc based on segmentation. I believe Hightouch Audiences is more like a filter.
With static mappings, audiences can be synced to destinations as tags :). The magic is in the abstractions, not features!
* Full workflow with branches, PRs, test suite in a repo. I saw Hightouch added git syncing to a known branch yesterday and it looks cool, but it's not the full workflow yet.
Lots more coming soon here. Our git integration is bidirectional so you can totally do that stuff in git, but UI support is on the way. We've found the UI is a much better experience than code for _most_ Reverse ETL workflows... so I see the value in this - I'll check it out
If I have to be honest, the biggest thing that customers love about our product is that it works and accomplishes their use cases. Platform features are cool, but from time to time, I have to remind myself that Fivetran has proven that integrations and actually working comes first, and it is volume but not _just_ volume... our philosophy (destinations as a product), design, and progress there is quite differentiated from the space. You can read more in our Series A announcement from a few months ago at https://hightouch.io/blog/series-a
PS: I haven't tried Grouparoo in a while. I do love the concepts, will give it a swing!
We just finished up the Open Source Data Stack conference, which is all about this topic. Feel free to check out the replay.
Specifically, it covered open source approaches to the modern data stack, where the trend is picking the right tools for the job, all revolving around the warehouse as the central data store.
The pieces discussed were around getting data in (Snowplow events, Meltano ELT), transforming it (dbt), reporting (Superset), getting it back into tools (Grouparoo Reverse ETL), and orchestrating things (Dagster).