Hey HN! Vlad here with Sergey, Ildar, and Kirill. We are building Jitsu, an open-source Segment alternative (https://github.com/jitsucom/jitsu, https://jitsu.com/). We help companies collect events from their apps, websites, and APIs and send them to databases.
I've been doing data engineering for more than ten years (half of that time, I didn't know that it was called "data engineering"). Before Jitsu, I was a co-founder and CTO of GetIntent, an ad-tech startup. Although it was ad-tech (I'm sorry for that!), we also built quite a fascinating technology platform. We processed up to 1 million events per second at peak, and all those events needed to be stored somewhere.
We churned through a few data warehouse platforms along the way. In 2013, we started with Hadoop's HDFS and a bunch of map-reduce jobs on top of it. Then, when we decided to allow our customers to run ad-hoc reports, we switched to BigQuery. BigQuery was great but expensive, especially with some customers obsessively clicking the refresh button. Finally, in 2017 we migrated to self-hosted ClickHouse, which, in my opinion, is still the best analytics database in the world.
All that time, we spent a fair amount of effort getting data into the database. When you're dealing with millions of events per minute, running an INSERT statement per event won't work. What if the DB is down for maintenance? How can you be sure that all 50+ edge nodes are aware of recent DB schema changes? Also, did you know that streaming data to BigQuery is costly while batching data is free?
We tried different approaches: first, we would write local log files, sync them to HDFS, and load data to BQ (or ClickHouse) with map-reduce jobs. To improve data freshness, we ditched HDFS and started to send data in batches to the DB directly from edge servers. We experimented with Kafka, but it felt too complex for that task at the time.
I always dreamed about a straightforward service, to which I'd throw JSON objects, and it would take care of the rest: queueing, retrying, updating database schema, etc.
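To make the "throw JSON objects at it" idea concrete, here is a minimal sketch of a buffer that batches events and retries failed writes. It is only an illustration, not how Jitsu works internally: `BatchingSink`, `write_batch`, and all parameters are hypothetical names, the queue is in memory only, and schema updates are not covered.

```python
import json
import time
from typing import Callable, Dict, List

Event = Dict[str, object]


class BatchingSink:
    """Buffers JSON events and writes them in batches, retrying on failure.

    Hypothetical sketch: `write_batch` stands in for whatever bulk-load
    call the target database offers.
    """

    def __init__(self, write_batch: Callable[[List[Event]], None],
                 batch_size: int = 1000, max_retries: int = 3,
                 backoff_seconds: float = 1.0) -> None:
        self.write_batch = write_batch
        self.batch_size = batch_size
        self.max_retries = max_retries
        self.backoff_seconds = backoff_seconds
        self.buffer: List[Event] = []

    def send(self, raw_json: str) -> None:
        """Queue one event; flush once the buffer reaches batch_size."""
        self.buffer.append(json.loads(raw_json))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> bool:
        """Write the buffered events as one batch.

        Returns True on success. On repeated failure the events stay in
        the buffer so that a later flush can try again.
        """
        if not self.buffer:
            return True
        batch = list(self.buffer)
        for attempt in range(self.max_retries):
            try:
                self.write_batch(batch)
            except Exception:
                # Exponential backoff before the next attempt.
                time.sleep(self.backoff_seconds * (2 ** attempt))
                continue
            # Drop only the events that were actually written.
            del self.buffer[:len(batch)]
            return True
        return False
```

A caller would pass a function that does one bulk insert per batch, call `send()` for every incoming event, and call `flush()` on a timer so that slow periods do not leave events sitting in memory.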
Then I discovered Segment. I liked it at first. It seemed very developer-friendly, with a nice API and excellent documentation. But the pricing model and data delays (an event gets to the DB 12 hours after it has been sent to Segment) killed the whole idea. And it was not open source. In my opinion, being open-source and self-hostable is a must for such a fundamental part of the architecture as data collection.
I left GetIntent and got accepted to YC with a different idea for the Summer 2020 batch. The idea was to build a churn prevention and BI tool for online retailers. It didn't take off, but in the process we made a component to collect customers' app events and put them into a DB. We tried to hack a solution on top of the ELK stack, but I was frustrated with Elasticsearch's lack of SQL support. Here I was, back to square one: there was still no good open-source event collection service, and we needed to build one, once again.
So we decided to focus solely on that problem. We ditched all the previous code, which was in Java, rewrote the data collection server in Go and hacked together what we called EventNative [1]. It was received very well, and we started to get users.
Over the last 11 months, we've been busy building the UI, adding Connectors (to pull data from external APIs), polishing data warehouse support, adding JavaScript support to transform incoming data, and implementing dozens of other features.
Now we're launching Jitsu, an open-source Segment alternative. With Jitsu, we make it easy to collect data and send it to databases (we support all major players: ClickHouse, Redshift, Snowflake, BigQuery and Postgres). We're deployed in production at a large gaming publisher, an eSignature service, and many other great companies. We're going for an open-core model. So far we don't have paid features, but soon we'll have some, presumably around things like authorization and data masking. We also run Jitsu.Cloud [2], which you can buy if you don't want to self-host.
Give it a spin: https://github.com/jitsucom/jitsu.
Thank you for reading this story - I hope it was interesting. I would love to read your feedback on Jitsu and answer questions!
[1] https://news.ycombinator.com/item?id=24120325
[2] https://cloud.jitsu.com
Thoughts:
1. You've got most major ads sources that I care about, but it seems that there is a higher bar to implementation. Segment lets me just plug in Google & FB ads and dump the entire shebang right into my data warehouse. A lot of marketing teams are going to have less time/resources to deal with implementation so smoothing this out is key.
2. Functions are an underrated and highly powerful feature of Segment. The ability to operate on data in transit, create custom connectors that "just run" (akin to CF Workers) and the like is a big selling point for more technically advanced marketing teams. It doesn't seem present here and that would hold a customer such as myself back on bigger scope projects.
3. I'd love to see a "compare us to your Segment usage" where I select my data sources and destinations to see what you cover vs. Segment in a specific use case (and possibly pricing advantages on a self-hosted vs. non). This would make it much easier to sell through procurement and devops for new customers that are switching.
4. There are going to be a lot of people like me that are soon to start fresh in terms of marketing stack, so going after people before they select Segment might also be a play.
Looking forward to seeing where you all take this. Good luck!