Hacker News new | past | comments | ask | show | jobs | submit login

Scala is still primarily used for data engineering workloads due to the fact it is a JVM language. (There's Java too, but no one wants to write Java code)

PySpark is often used for data science experimentation, but is not as frequently found in production pipelines due to the serialization/deserialization overhead between Python and the JVM. In recent years this problem is less pronounced due to the introduction of Spark dataframes which obviates the performance differences between PySpark and Scala Spark, but for UDFs, Scala Spark is still faster.

A newer development that may change all this is the introduction (in Spark 2.3) of Apache Arrow, a in-memory column store engine which lets Python UDFs work with the in-memory object without serializing/deserializing. This is very exciting as this lets Python get closer to the performance of JVM languages.

I've played around with it on Spark 2.3 -- the Arrow interface works but still not quite production-ready but I expect it will only get better.

Many folks are making strategic bets on Arrow technology due to the AI/GPU craze (and an in-memory standard enables multiple parties to build GPU-based analytics [1]), so there is tremendous momentum there.

At some point I expect the relative importance of Scala on Spark will decrease with respect to Python. (even though Spark APIs are Scala native)

[1] https://www.nextplatform.com/2017/05/09/goai-keeping-databas...




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: