Do I understand correctly that when someone calls identify(anonymous_id, new_can...

bunsenmcdubbs · on Oct 13, 2020

Yes! You are correct. This ensures that we are able to maintain a cohesive view of the entire "user" across all their devices, browsers, etc. (so long as the Heap customer has a method for identifying their end user).

A classic example is pre- and post-signup behavior for a single user. When a user first lands on a page, they will be anonymous and lack a canonical identity. They may come from specific referrers (search, ad, social media, direct), land on a specific page, or engage with certain parts of the site. All of these actions are tracked and stored using an anonymous id. After the user creates an account and is assigned a canonical id (via the `identify` API call), we still want to associate all the previously tracked data with the canonical identity. This allows our users to perform analyses using events and data points from before and after identification.

> merge the entire set of anonymous ids

In the previous example the "set of anonymous ids" is just a single ID. There are use cases where a user may already have a canonical identity but we want to change/update that canonical id. In this case, we are merging all the data associated with both canonical identities (the set of anonymous ids associated with the existing canonical user and the set of ids associated with the new canonical identity) and creating a single combined user with a cohesive view of all actions on our customer's site/app etc.
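
For readers following along, the merge described above can be sketched as a small union-find. This is only an illustration under assumptions: the `IdentityGraph` class and the sample ids are hypothetical, and nothing here is taken from Heap's actual implementation.

```python
class IdentityGraph:
    """Toy union-find over user ids (hypothetical, for illustration only)."""

    def __init__(self):
        self.parent = {}

    def find(self, uid):
        # Ids we have never seen start out as their own root.
        self.parent.setdefault(uid, uid)
        root = uid
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited id directly at the root.
        while self.parent[uid] != root:
            next_uid = self.parent[uid]
            self.parent[uid] = root
            uid = next_uid
        return root

    def identify(self, anonymous_id, new_canonical_id):
        # Merge the whole set containing anonymous_id into the set
        # containing new_canonical_id, not just the single id.
        old_root = self.find(anonymous_id)
        new_root = self.find(new_canonical_id)
        if old_root != new_root:
            self.parent[old_root] = new_root
        return new_root


graph = IdentityGraph()
graph.identify("anon-1", "user-a")
graph.identify("anon-2", "user-a")
graph.identify("anon-3", "user-b")
# Re-identifying one id pulls every id under user-a into user-b.
graph.identify("anon-1", "user-b")
```

After the last call, `anon-2` resolves to the same root as `anon-3`, even though `anon-2` was never passed to `identify` together with `user-b`.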

georgewfraser · on Oct 13, 2020

It seems to me that under this scheme, if I make a single erroneous identify call, I will irreversibly merge two users. This is a surprising approach. Given that identify calls may occasionally be wrong, I would expect that

  identify(anonymous_id, new_canonical_id)

would map anonymous_id => new_canonical_id, but would leave the rest of the set find(anonymous_id) alone.
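
A minimal sketch of that alternative, assuming a plain per-id lookup table (the `DirectMapping` class and the sample ids are hypothetical names for illustration):

```python
class DirectMapping:
    """Each anonymous id points at one canonical id; no set merging."""

    def __init__(self):
        self.canonical = {}

    def find(self, anonymous_id):
        # Unidentified ids resolve to themselves.
        return self.canonical.get(anonymous_id, anonymous_id)

    def identify(self, anonymous_id, new_canonical_id):
        # Only this one id is remapped; its former siblings are untouched.
        self.canonical[anonymous_id] = new_canonical_id


mapping = DirectMapping()
mapping.identify("anon-1", "user-a")
mapping.identify("anon-2", "user-a")
# A possibly erroneous call only moves anon-1.
mapping.identify("anon-1", "user-b")
```

Here `anon-2` still resolves to `user-a`, so a bad call can be corrected by issuing one more `identify` for `anon-1`.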

KMag · on Oct 13, 2020

Yea, it seems like compared to all of the other data they're logging per user, separately preserving the parent id and canonical id in the tree would have little cost and allow them to fix canonicalization errors later.

Then there's a write throughput vs read latency trade-off for reading statistics aggregated by canonical ID, but my guess is that trade-off can be made in a way they're happy with in exchange for the ability to undo mistakes.
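
One way to picture this, as a rough sketch rather than a claim about how Heap stores anything: keep the raw `identify` calls as the source of truth and treat the canonical id as a derived cache that can be rebuilt. The `ReversibleIdentities` class, its `revoke` method, and the sample ids are all hypothetical.

```python
class ReversibleIdentities:
    """Stores raw (child, parent) links; canonical ids are derived from them."""

    def __init__(self):
        self.edges = []         # log of (anonymous_id, canonical_id) calls
        self._canonical = None  # derived cache; None means it is stale

    def identify(self, anonymous_id, new_canonical_id):
        self.edges.append((anonymous_id, new_canonical_id))
        self._canonical = None

    def revoke(self, anonymous_id, canonical_id):
        # Undo one earlier identify call; raises ValueError if it never happened.
        self.edges.remove((anonymous_id, canonical_id))
        self._canonical = None

    def _rebuild(self):
        parent = {}

        def root_of(uid):
            while parent.get(uid, uid) != uid:
                uid = parent[uid]
            return uid

        # Replay the surviving calls with set-merge semantics.
        for child, new_parent in self.edges:
            a, b = root_of(child), root_of(new_parent)
            if a != b:
                parent[a] = b
        ids = {uid for edge in self.edges for uid in edge}
        self._canonical = {uid: root_of(uid) for uid in ids}

    def find(self, uid):
        # Rebuilding lazily here pays the cost on read; rebuilding inside
        # identify() would pay it on write instead.
        if self._canonical is None:
            self._rebuild()
        return self._canonical.get(uid, uid)


ids = ReversibleIdentities()
ids.identify("anon-1", "user-a")
ids.identify("anon-2", "user-a")
ids.identify("anon-1", "user-b")   # suppose this call was a mistake
merged = ids.find("anon-2")        # the mistake merged both users
ids.revoke("anon-1", "user-b")
restored = ids.find("anon-2")      # the two users are separate again
```

Rebuilding from the full log is the simplest possible strategy; a real system would presumably need something incremental, which is where the throughput versus latency trade-off comes in.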