> I also heard one investor mention how Tumblr struggled with technical debt related to their feed
Not sure I'd agree with that, but I suppose it depends on the context and timing of the statement.
Tumblr's solution for its reverse-chrono activity feed is, at its core, <1000 lines of PHP and a few heavily optimized sharded MySQL tables. It is creaky and old, but its relatively small code footprint means it isn't terrible on the tech-debt scale.
Tumblr's feed is computed entirely at read-time; there's no write fan-out / no materialized inboxes. The key insights that make the system fast (under 10ms latency to identify which posts go in the feed) even at scale:
* If you are grabbing the first page of a feed (the most recent N posts from followed users), the worst case is N posts all by different authors. If you keep a lookup table of (user, most recent post time), you can quickly find the N followed users who posted most recently and only examine their content, rather than looking at content from all followed users. (See the first sketch after this list.)
* For subsequent pages, use a timestamp (or a monotonically increasing post id) as a cursor, i.e. the id or time of the last post on the previous page. To figure out which followed users to examine for the next page of results, you only need to look at followed users who have posted since that cursor timestamp (since they may have older posts that land on the current page), plus N more users (to satisfy the worst case of all N posts on the current page being by different authors who didn't appear on previous pages). (Second sketch below.)
* InnoDB clusters rows by primary key, which makes it very, very fast at PK range scans. This is true even for disjoint sets of range scans, e.g. with a PK of (a,b) you can do a query like "WHERE a IN (...long list of IDs...) AND b >= x AND b <= y" and it is still extremely fast. (Third sketch below.)
* You can optimize away the access-pattern edge cases by making the system's guarantees fuzzier. Most non-bot users only ever scroll so far; even power users who follow a lot tend to check the feed so often that they don't go deep on any single visit. At the extreme edge, users who follow thousands of blogs don't comprehensively try to view every piece of content in their feed anyway. This means you can measure real user activity to determine where to place limits that help query performance but are completely unnoticeable to users.
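To make the first-page trick concrete, here's a minimal SQL sketch. The schema (a `last_post_time` lookup table and a `posts` table) and all ids are hypothetical stand-ins, not Tumblr's actual tables:

```sql
-- Hypothetical schema (not Tumblr's real one):
--   last_post_time(user_id, last_posted_at)   -- one row per author
--   posts(author_id, posted_at, post_id)      -- PK (author_id, posted_at, post_id)

-- Step 1: of the users the viewer follows, find the N = 20 who posted most recently.
SELECT user_id
FROM last_post_time
WHERE user_id IN (101, 102, 103)   -- the viewer's (possibly long) follow list
ORDER BY last_posted_at DESC
LIMIT 20;

-- Step 2: the 20 newest posts in the feed can only come from those 20 authors,
-- so only their rows ever need to be examined.
SELECT author_id, post_id, posted_at
FROM posts
WHERE author_id IN (101, 103)      -- the author ids returned by step 1
ORDER BY posted_at DESC
LIMIT 20;
```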
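And a sketch of the cursor logic for later pages, using the same hypothetical tables. The cursor here is the `posted_at` of the last post on the previous page (a monotonic post id would work the same way):

```sql
-- Candidate authors for the next page are:
--   (a) followed users whose latest post is at or after the cursor: they may
--       still have *older* posts that fall on this page, plus
--   (b) the next 20 most recently active followed users below the cursor, to
--       cover the worst case of 20 posts by 20 previously unseen authors.
(SELECT user_id
 FROM last_post_time
 WHERE user_id IN (101, 102, 103)          -- viewer's follow list
   AND last_posted_at >= 1700000000)       -- cursor: time of last post on prior page
UNION ALL
(SELECT user_id
 FROM last_post_time
 WHERE user_id IN (101, 102, 103)
   AND last_posted_at < 1700000000
 ORDER BY last_posted_at DESC
 LIMIT 20);

-- Then fetch the page itself from just those candidates, strictly older than the cursor:
SELECT author_id, post_id, posted_at
FROM posts
WHERE author_id IN (101, 103)              -- candidate ids from the query above
  AND posted_at < 1700000000
ORDER BY posted_at DESC
LIMIT 20;
```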
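The reason the post fetches above are cheap is the clustered PK. A minimal illustration of the shape being leaned on (again, hypothetical DDL, not Tumblr's):

```sql
CREATE TABLE posts (
  author_id BIGINT UNSIGNED NOT NULL,
  posted_at BIGINT UNSIGNED NOT NULL,   -- epoch seconds, or a monotonic post id
  post_id   BIGINT UNSIGNED NOT NULL,
  PRIMARY KEY (author_id, posted_at, post_id)
) ENGINE=InnoDB;

-- Because InnoDB stores the row data in the PK's B-tree, this query resolves to
-- one short, contiguous leaf-page scan per author id: no secondary-index
-- lookups, no separate row fetches.
SELECT author_id, post_id, posted_at
FROM posts
WHERE author_id IN (101, 102, 103)      -- long list of followed author ids
  AND posted_at >= 1700000000
  AND posted_at <= 1700086400;
```

Each member of the IN list becomes its own short range scan over (author_id, posted_at), which is exactly the "disjoint sets of range scans" case described above.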