
Yeah, I don't mean to argue; here are some examples. It's difficult for any company to be competitive at scale without sufficient logging to support all of these things.

* Timestamps of events related to content loading and rendering. This is crucial for debugging and improving load times.

* Backfilling aggregated data so that ML models can be trained without waiting weeks for new streaming aggregation.

* Answering product questions of almost any kind that weren't asked when logging was built.

Concrete example from my recent experience: you may want to know how often people like a post and then later look at the comments, vs. look at the comments and then later like the post. That gives you information about cause and effect.
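
To make that concrete, a question like this is a couple of lines over a raw event log. Rough pandas sketch (the table layout, event names, and data here are made up for illustration):

```python
import pandas as pd

# Hypothetical raw event log: one row per user action on a post.
events = pd.DataFrame({
    "user_id":    [1, 1, 2, 2, 3],
    "post_id":    [10, 10, 10, 10, 10],
    "event_type": ["like", "view_comments", "view_comments", "like", "like"],
    "ts": pd.to_datetime([
        "2024-01-01 09:00", "2024-01-01 09:05",  # liked first, then read comments
        "2024-01-01 10:00", "2024-01-01 10:03",  # read comments first, then liked
        "2024-01-01 11:00",                      # liked, never opened comments
    ]),
})

# Earliest like / comment-view per (user, post), one column per event type.
first = (events
         .groupby(["user_id", "post_id", "event_type"])["ts"].min()
         .unstack("event_type"))

# Only pairs where both events happened, then compare which came first.
both = first.dropna(subset=["like", "view_comments"])
like_first = (both["like"] < both["view_comments"]).sum()
comments_first = (both["view_comments"] < both["like"]).sum()
print(f"like -> comments: {like_first}, comments -> like: {comments_first}")
```

The point is that because the raw events were logged with timestamps, the ordering question can be answered after the fact, even though nobody thought to ask it when the logging was built.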




The first one doesn't even need much historical data. Unless you have some very unoptimized periodic jobs, the last few days or something is plenty.
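
For load-time debugging, the whole analysis is basically a percentile computation over a few days of timestamped events. A rough sketch, with made-up file and column names:

```python
import pandas as pd

# Hypothetical event log for the last few days; columns: request_id, page, event_type, ts.
events = pd.read_parquet("events_last_3_days.parquet")

starts = events[events.event_type == "load_start"].set_index("request_id")
ends = events[events.event_type == "load_end"].set_index("request_id")["ts"]

# Pair start/end by request id and compute load time in milliseconds.
starts["load_ms"] = (ends - starts["ts"]).dt.total_seconds() * 1000

# p50 / p95 load time per page.
print(starts.groupby("page")["load_ms"].quantile([0.5, 0.95]).unstack())
```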

The second can be done simply on something like Dynamo, CosmosDB, or your cloud-hosted NoSQL of choice. Heck, it can even be done on Aurora or vanilla Postgres + partitioning if it's <64TB.
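
For example, backfilling a daily aggregate into a partitioned Postgres table is a few lines per partition. A sketch with psycopg2, where the table names, schema, and unique constraint are all hypothetical:

```python
import psycopg2
from datetime import date, timedelta

# Assumes a reachable Postgres/Aurora instance with a raw_events table and a
# daily_user_counts table that has a unique constraint on (day, user_id).
conn = psycopg2.connect("dbname=analytics")

day, end = date(2023, 1, 1), date(2024, 1, 1)
with conn, conn.cursor() as cur:
    # Backfill one day at a time so each statement touches a single partition.
    while day < end:
        cur.execute(
            """
            INSERT INTO daily_user_counts (day, user_id, n_events)
            SELECT %s, user_id, count(*)
            FROM raw_events
            WHERE ts >= %s AND ts < %s
            GROUP BY user_id
            ON CONFLICT (day, user_id) DO UPDATE SET n_events = EXCLUDED.n_events
            """,
            (day, day, day + timedelta(days=1)),
        )
        day += timedelta(days=1)
```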

The third can be done with any off-the-shelf cloud data warehouse, at multi-petabyte scale. And even then, I'm sorry, but I just don't believe that product clicks over some large timeframe are historically relevant if your software and UI change often.

All of the things mentioned have had extremely simple, boring solutions at petabyte scale for more than 10 years, and in some cases longer. If you add a batch workflow manager and a streaming solution like Spark, that's 3-4 technologies total to cover all these cases (and many more!)
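
The batch side really is close to a one-liner per aggregate in Spark. A minimal sketch, assuming (hypothetically) that the raw events live as Parquet under an S3 prefix and an orchestrator like Airflow schedules the job:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily_backfill").getOrCreate()

# Hypothetical layout: raw events stored as Parquet under s3://logs/events/.
events = spark.read.parquet("s3://logs/events/")

# Daily event counts per user.
daily = (events
         .groupBy(F.to_date("ts").alias("day"), "user_id")
         .agg(F.count("*").alias("n_events")))

# The workflow manager would parameterize the date range and schedule reruns.
daily.write.mode("overwrite").partitionBy("day").parquet("s3://warehouse/daily_user_counts/")
```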



