A Distributed Real-Time Data Store with Flexible Deduplication (amplitude.com)
46 points by prospero on Jan 20, 2017 | 8 comments



There are other reasons for duplicates in event streams, not just the dupes introduced by at-least-once processing in Kinesis or Kafka workers. We've done a lot of thinking about this (all open source) at Snowplow; this is a good starting point:

http://snowplowanalytics.com/blog/2015/08/19/dealing-with-du...

Our last release started to tackle dupes caused by bots, spiders and dodgy UUID algos:

http://snowplowanalytics.com/blog/2016/12/20/snowplow-r86-pe...


Hi, Jin here from Amplitude. You are absolutely right that there are other sources of duplicates. Our real-time data store sits behind an event processor (not covered in this blog post) that handles all major event duplication scenarios. This is why the real-time store focuses on duplicates introduced by message bus replays, something that systems such as Druid do not address.
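
To make the replay case concrete, here is a minimal sketch of one common way to drop replayed messages: track the highest offset already applied and skip anything at or below it. This is an illustration only, not Amplitude's implementation. The `ReplayDeduper` class, the `(offset, event_id, payload)` message shape, and the assumption of a single partition with monotonically increasing offsets are all hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical message shape: (offset, event_id, payload).
Message = Tuple[int, str, dict]


class ReplayDeduper:
    """Skips messages that were already applied before a bus replay.

    Assumes one partition whose offsets only ever increase.
    """

    def __init__(self) -> None:
        self.last_offset = -1  # highest offset applied so far
        self.events: Dict[str, dict] = {}  # event_id -> payload

    def apply(self, messages: Iterable[Message]) -> List[str]:
        """Apply new messages and return the event ids that were stored."""
        applied: List[str] = []
        for offset, event_id, payload in messages:
            if offset <= self.last_offset:
                continue  # already seen: this is a replayed message
            self.events[event_id] = payload
            self.last_offset = offset
            applied.append(event_id)
        return applied


store = ReplayDeduper()
first = store.apply([(0, "a", {}), (1, "b", {})])
# The bus replays offset 1 after a worker restart, then delivers offset 2.
second = store.apply([(1, "b", {}), (2, "c", {})])
```

In a real system the high-water mark would have to be persisted atomically with the data it covers; otherwise a crash between the two writes reintroduces the duplicate.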


I would be curious to know if they evaluated any cloud-based data stores or streaming services from AWS or GCP before deciding to build this from scratch. It seems like a common set of requirements for event analytics pipelines.


Hi, Jin here from Amplitude. The real-time data store is part of a bigger columnar store we built last year called Nova (https://amplitude.com/blog/2016/05/25/nova-architecture-unde...). In designing Nova, we looked at many existing solutions, including Amazon Redshift and Google BigQuery, but none of them sufficiently supported all our use cases. You can read more in the linked blog post.


I read that, and your motivations for building Nova align very well with BigQuery, e.g. immutability (BigQuery was append-only) and flexibility (break out of SQL with Dataflow).


It uses Amazon Redshift under the hood.


We don't use Redshift to run our queries. Nova, our customized columnar store, is designed to handle more specific use cases. You can read more here: https://amplitude.com/blog/2016/05/25/nova-architecture-unde....


It reminds me exactly of the common architecture pattern of KDB/Q. Still at this point, it's a marvel of tech.



