ariskk's comments

I like this and I really hope it takes off. +1 for filtering by skill, e.g. ZIO, cats, shapeless, etc. It might also be useful to have a directory of companies using Scala. Good luck!


Debugging those issues always feels like a Sherlock Holmes episode to me, especially when the errors are logical ones and, as in this case, hidden in an upstream dependency!


This is truly lovely. I went from ‘pip install’ to reproducing one of our internal dashboards in ~1 hour. One issue: auth and ACLs seem to be part of the paid/hosted version, so it needs extra work to become viable for most people.


Looks nice. It would be great if you could add search tags.


Thanks for the feedback, glad you like it. Yep, search tags are definitely on the roadmap!


This is exactly what we did initially. Really early on (before any rate limiting was in place), a few spam accounts followed 100K people, created lots of spam content, etc. Encapsulating those deletions started yielding messages bigger than the default max Kafka message size (1MB). Additionally, this method had a few side effects on the downstream processors. We could of course increase the limit, but we decided to deal with the problem at its core.
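For reference, the 1MB default is the producer's max.request.size (the broker's message.max.bytes and the consumer's max.partition.fetch.bytes cap it too). A rough sketch of the two options, raising the cap vs. emitting one small event per affected entity; the topic and event shape here are made up for illustration, not our actual schema:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
    import org.apache.kafka.common.serialization.StringSerializer

    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
    // Option 1: raise the producer-side cap, e.g. to 5MB (broker and
    // consumer settings need matching bumps).
    props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, (5 * 1024 * 1024).toString)

    val producer = new KafkaProducer[String, String](props)

    // Option 2 (what "dealing with it at its core" might look like): emit
    // one small event per affected entity instead of one message that
    // encapsulates the whole 100K-strong fan-out.
    def emitUnfollows(userId: String, followedIds: Seq[String]): Unit =
      followedIds.foreach { followedId =>
        producer.send(new ProducerRecord("unfollows", userId, followedId))
      }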


Hi, I am the author of the article. Thank you for taking the time to read it.

The combined reach of the co-founders is very large, so being able to demonstrably handle scale was an essential prerequisite. Additionally, the requirements of the platform extend well beyond a simple content server: content performance is tracked in real time and fed to multiple ranking and recommendation models. Those change frequently, so we need a way to retroactively reprocess our data. Flexibility is key when trying to build an intelligent platform, so we decided to invest time early on in the ability to quickly iterate and experiment on algorithms, in real time, over live data.

You are right that the API fleet could be implemented using the aforementioned technologies; we use Scala, so we went with Akka HTTP instead. The challenging part is how you manage the state behind it.
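For a sense of what that looks like, here is a minimal Akka HTTP endpoint of the kind the API fleet exposes; the route and payload are made up for illustration, since the article does not show our actual API:

    import akka.actor.ActorSystem
    import akka.http.scaladsl.Http
    import akka.http.scaladsl.model.{ContentTypes, HttpEntity}
    import akka.http.scaladsl.server.Directives._

    object ContentApi extends App {
      implicit val system: ActorSystem = ActorSystem("content-api")

      // Illustrative endpoint serving a post's engagement counters.
      val route =
        path("posts" / Segment / "stats") { postId =>
          get {
            // A real handler would read state materialized by the stream
            // jobs (e.g. from a key-value store), not return a constant.
            complete(HttpEntity(
              ContentTypes.`application/json`,
              s"""{"postId":"$postId","likes":0}"""
            ))
          }
        }

      Http().newServerAt("0.0.0.0", 8080).bind(route)
    }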


Don't get me wrong, I am not claiming that a CQRS/stream-processing style approach is never useful. Rather, it is unsuitable for this particular problem.

In my experience all these features sound nice on paper, but you quickly run into practical issues that are far easier to handle when you have direct, even approximate, access to the state.

E.g. developing a model? You might just want a subset/batch of the data. Doing BI/analytics? Are you going to continuously tax your server to recompute? The argument about recommender systems is also, honestly, flimsy; I have built and applied such systems to live traffic at very large scale (hundreds of millions of users). There is only a small advantage in being able to quickly reconfigure flows: in most cases you have a single baseline model which you compare against on a small fraction of the traffic (see the sketch below). The real complexity/gains in recommender systems lie in the choice of algorithm/hyper-parameters/features, not in continuous multi-armed bandits with 1000 different models applied simultaneously while you wait an infinite amount of time for a statistically meaningful answer. In fact, for a website like this one, recommender systems can only provide so much advantage.
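To make the baseline setup concrete, a small sketch of deterministic traffic bucketing; the helper and names are hypothetical, just to illustrate the idea:

    import scala.util.hashing.MurmurHash3

    object TrafficSplit {
      sealed trait Variant
      case object Baseline extends Variant
      case object Candidate extends Variant

      // Stable hash, so a given user always lands in the same bucket.
      def assignVariant(userId: String, baselinePercent: Int = 5): Variant = {
        val bucket = ((MurmurHash3.stringHash(userId) % 100) + 100) % 100
        if (bucket < baselinePercent) Baseline else Candidate
      }

      def main(args: Array[String]): Unit =
        // ~5% of users see the baseline model, the rest the candidate.
        println(assignVariant("user-123"))
    }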

There are actually several really good specialized use cases; e.g. Google's secmon-tools use a system like this one [1].

[1] https://web.stanford.edu/class/cs259d/lectures/Session11.pdf


You mention the word "batch" when talking about models, and also "BI/Analytics". Since Django/Rails applications support neither, another sort of system would be needed. This is the point where, having built everything on Django with no foresight whatsoever about future requirements, we would have ended up creating DataFrames from SQL tables in Spark. Our BI people have no experience with Spark, so we would need to load the data into a DW-like solution: BigQuery/Redshift/Impala/Presto/you-name-it. Instead of another sink in Flink, we would need to implement and schedule ETL jobs.

Even at our current load, computing counters (e.g. likes) at read time would be slow and inefficient, which means we would need a way to pre-aggregate them (rough sketch below). Maybe another service, possibly behind a queue? You can see where I am going: as requirements evolve, systems evolve, and with no planning beforehand people end up with spaghetti architectures.

We knew we were funded well enough to run for a couple of years. We knew the site would have traffic. We were tasked with delivering an algorithmically driven product, and this is the solution we came up with.
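To make the pre-aggregation point concrete, a rough sketch of keeping a running like counter in a Flink job; the event type is made up for illustration, and our actual jobs are more involved:

    import org.apache.flink.streaming.api.scala._

    // Hypothetical event type, for illustration only.
    case class LikeEvent(postId: String, userId: String)

    object LikeCounter {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        // In the real pipeline this would be a Kafka source.
        val likes: DataStream[LikeEvent] = env.fromElements(
          LikeEvent("post-1", "user-a"),
          LikeEvent("post-1", "user-b"),
          LikeEvent("post-2", "user-a")
        )

        // Maintain a running like count per post as events arrive, so
        // reads never have to COUNT(*) over raw rows.
        val counts: DataStream[(String, Long)] = likes
          .map(e => (e.postId, 1L))
          .keyBy(_._1)
          .sum(1)

        // A real deployment would write this to a sink (e.g. a KV store).
        counts.print()

        env.execute("like-counter")
      }
    }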

I really do not understand how such a strong set of conclusions can be drawn from so little information.


Can you share some details about numbers and volume? "Very large" does not really convey why going down this route makes sense.


Unfortunately, I am not allowed to. The problem is that you cannot predict the volumes beforehand. 1K requests per second? 10K? Maybe 50K on special occasions? It is difficult to tell, especially when high-profile personalities are involved.

PS: we do have lots of load.

