There are other reasons for duplicates in event streams, not just the dupes introduced by at-least-once processing in Kinesis or Kafka workers. We've done a lot of thinking about this (all open source) at Snowplow; this is a good starting point:

http://snowplowanalytics.com/blog/2015/08/19/dealing-with-du...

Our last release started to tackle dupes caused by bots, spiders and dodgy UUID algos:

http://snowplowanalytics.com/blog/2016/12/20/snowplow-r86-pe...
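A minimal sketch of one common approach to cross-pipeline deduplication: key each event on its ID plus a fingerprint of the stable payload fields, so dupes with reused or badly generated IDs still get caught. Everything below (field names, SHA-256, the in-memory set) is an illustrative assumption, not Snowplow's actual implementation.

    import hashlib
    import json

    VOLATILE_FIELDS = {"collector_tstamp"}   # fields excluded from the fingerprint (illustrative)

    def fingerprint(event):
        # Hash only the fields that should be identical across true duplicates.
        stable = {k: v for k, v in event.items() if k not in VOLATILE_FIELDS}
        return hashlib.sha256(json.dumps(stable, sort_keys=True).encode()).hexdigest()

    seen = set()   # in production this would be a persistent, windowed store

    def is_duplicate(event):
        key = (event["event_id"], fingerprint(event))
        if key in seen:
            return True
        seen.add(key)
        return False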
Hi, Jin here from Amplitude. You are absolutely right that there are other sources of duplicates. Our real-time data store sits behind an event processor (not covered in this blog) that handles all major event-duplication scenarios. This is why the real-time store focuses on duplicates introduced by message-bus replays, something that systems such as Druid do not address.
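A toy illustration (not Amplitude's code) of what absorbing message-bus replays at the store boundary can look like: recently ingested event IDs are kept for a bounded replay window, so a Kafka/Kinesis re-delivery of the same record is dropped instead of counted twice. The window length and in-memory dict are assumptions for the sketch.

    import time

    REPLAY_WINDOW_SECONDS = 6 * 60 * 60   # assumed replay horizon, illustrative only
    recent = {}                            # event_id -> ingest timestamp

    def ingest(event):
        """Return True if the event was written, False if it was a replayed duplicate."""
        now = time.time()
        # Evict IDs older than the replay window so the dedup state stays bounded.
        for eid, ts in list(recent.items()):
            if now - ts > REPLAY_WINDOW_SECONDS:
                del recent[eid]
        if event["event_id"] in recent:
            return False                   # re-delivery of an already-ingested record
        recent[event["event_id"]] = now
        # ... write to the real-time store here ...
        return True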
I would be curious to know if they evaluated any cloud-based data stores or streaming services from AWS or GCP before deciding to build this from scratch. It seems like a common set of requirements for event analytics pipelines.
Hi, Jin here from Amplitude. The real-time data store is part of a bigger columnar store we built last year called Nova (https://amplitude.com/blog/2016/05/25/nova-architecture-unde...). In designing Nova, we looked at many existing solutions, including Amazon Redshift and Google BigQuery, but none of them sufficiently supported all our use cases. You can read more in the linked blog.
I read that, and your motivations for building Nova align very well with BigQuery, e.g. immutability (BigQuery was append-only) and flexibility (break out of SQL with Dataflow).