
Post author here. Let me know if you have any questions!


Is there anything you can say here about why you're running this query in Spark?

Supposing Spark is your ETL machinery... would it not make more sense to ETL this into a database?


Definitely. One of the primary benefits we get out of Spark is the ability to decouple storage and compute, and to very easily scale out the compute.

Our main Spark workload is pretty spiky. We have low load during most of the day, and very high load at certain times - either system-wide, or because a large customer triggered an expensive operation. Using Spark as our distributed query engine allows us to quickly spin up new worker nodes and process the high load in a timely manner. We can then downsize the cluster again to keep our compute spend in check.
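
A minimal sketch of what that elasticity can look like on the Spark side, assuming the cluster supports dynamic allocation (the application name, executor counts, and timeout below are illustrative placeholders, not Heap's actual settings):

    import org.apache.spark.sql.SparkSession

    // Dynamic allocation lets Spark request more executors during load spikes
    // and release them when they go idle. All values here are placeholders.
    val spark = SparkSession.builder()
      .appName("spiky-etl")
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
      .config("spark.dynamicAllocation.minExecutors", "2")
      .config("spark.dynamicAllocation.maxExecutors", "200")
      .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
      .getOrCreate()

The point is that worker capacity becomes a knob you turn per workload, rather than something baked into a single database cluster's size.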

And just to provide some context on our data size, here's an article about how we use Citus at Heap - https://www.citusdata.com/customers/heap . We store close to a petabyte of data in our distributed Citus cluster. However, we've found Spark to be significantly better at queries with large result sets - our Connect product syncs a lot of data from our internal storage to customers' warehouses.
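
For the large-result-set sync case, the rough shape of such a job is below. This is only a sketch under assumptions: the storage path, warehouse URL, credentials, and table name are made up, the real Connect pipeline isn't described in this thread, and the JDBC driver for the target warehouse is assumed to be on the classpath.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("warehouse-sync").getOrCreate()

    // Read a large slice of event data from object storage (hypothetical path).
    val events = spark.read.parquet("s3://example-bucket/events/date=2019-01-01/")

    // Stream the full result set out to a customer's warehouse over JDBC
    // (hypothetical connection details).
    events.write
      .format("jdbc")
      .option("url", "jdbc:postgresql://warehouse.example.com:5432/analytics")
      .option("dbtable", "synced_events")
      .option("user", "sync_user")
      .option("password", sys.env("WAREHOUSE_PASSWORD"))
      .mode("append")
      .save()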


Would be nice if title said Apache Spark instead of just Spark, since there are other programs like Spark/Ada also called Spark.


There is also a Java web application framework called Spark. Nowadays everyone just calls it Sparkjava.



