There are other reasons for duplicates in event streams, not just the dupes introduced by at-least-once processing in Kinesis or Kafka workers. We've done a lot of thinking about this (all open source) at Snowplow; this is a good starting point:

http://snowplowanalytics.com/blog/2015/08/19/dealing-with-du...

Our last release started to tackle dupes caused by bots, spiders and dodgy UUID algos:

http://snowplowanalytics.com/blog/2016/12/20/snowplow-r86-pe...
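A minimal sketch of one common approach to cross-pipeline deduplication: key each event on its ID plus a fingerprint of the stable payload fields, so dupes with reused or badly generated IDs still get caught. Everything below (field names, SHA-256, the in-memory set) is an illustrative assumption, not Snowplow's actual implementation.

    import hashlib
    import json

    VOLATILE_FIELDS = {"collector_tstamp"}   # fields excluded from the fingerprint (illustrative)

    def fingerprint(event):
        # Hash only the fields that should be identical across true duplicates.
        stable = {k: v for k, v in event.items() if k not in VOLATILE_FIELDS}
        return hashlib.sha256(json.dumps(stable, sort_keys=True).encode()).hexdigest()

    seen = set()   # in production this would be a persistent, windowed store

    def is_duplicate(event):
        key = (event["event_id"], fingerprint(event))
        if key in seen:
            return True
        seen.add(key)
        return False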
Hi, Jin here from Amplitude. You are absolutely right that there are other sources of duplicates. Our real-time data store sits behind an event processor (not covered in this blog) that handles all major event-duplication scenarios. This is why the real-time store focuses on duplicates introduced by message-bus replays, something that systems such as Druid do not address.
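A toy illustration (not Amplitude's code) of what absorbing message-bus replays at the store boundary can look like: recently ingested event IDs are kept for a bounded replay window, so a Kafka/Kinesis re-delivery of the same record is dropped instead of counted twice. The window length and in-memory dict are assumptions for the sketch.

    import time

    REPLAY_WINDOW_SECONDS = 6 * 60 * 60   # assumed replay horizon, illustrative only
    recent = {}                            # event_id -> ingest timestamp

    def ingest(event):
        """Return True if the event was written, False if it was a replayed duplicate."""
        now = time.time()
        # Evict IDs older than the replay window so the dedup state stays bounded.
        for eid, ts in list(recent.items()):
            if now - ts > REPLAY_WINDOW_SECONDS:
                del recent[eid]
        if event["event_id"] in recent:
            return False                   # re-delivery of an already-ingested record
        recent[event["event_id"]] = now
        # ... write to the real-time store here ...
        return True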
I would be curious to know if they evaluated any cloud-based data stores or streaming services from AWS or GCP before deciding to build this from scratch. It seems like a common set of requirements for event analytics pipelines.
Hi, Jin here from Amplitude. The real-time data store is part of a bigger columnar store we built last year called Nova (https://amplitude.com/blog/2016/05/25/nova-architecture-unde...). In designing Nova, we looked at many existing solutions, including Amazon Redshift and Google BigQuery, but none of them sufficiently supported all our use cases. You can read more in the linked blog.
I read that, and your motivations for building Nova align very well with BigQuery, e.g. immutability (BigQuery was append-only) and flexibility (break out of SQL with Dataflow).