epdlxjmonad's comments

In this article, we report the results of evaluating the performance of the latest releases of Trino, Spark, and Hive on MR3 using the 10TB TPC-DS benchmark.

Trino 476 (released in June 2025)

Spark 4.0.0 (released in May 2025)

Hive 4.0.0 on MR3 2.1 (released in July 2025)

At the end of the article, we discuss MPP vs MapReduce.


Performance Evaluation of Trino 468, Spark 4.0.0-RC2, and Hive 4 on MR3 2.0 using the TPC-DS Benchmark

In this article, we report the results of evaluating the performance of the following systems using the 10TB TPC-DS Benchmark.

Trino 468 (released in December 2024)

Spark 4.0.0-RC2 (released in March 2025)

Hive 4.0.0 on Tez (built in February 2025)

Hive 4.0.0 on MR3 2.0 (released in April 2025)


This article evaluates the performance of the following systems.

Trino 418 (released on May 17, 2023)

Spark 3.4.0 (released on Apr 13, 2023)

Hive 3.1.3 on MR3 1.7 (released on May 15, 2023)


Diablo where you control a large party -- I tried this idea myself (using a bit of a hack) and found it to be such great fun and extremely addictive. The only bad thing about it is that you can't go back to single-player Diablo any more :-) There is a webpage that I made about 15 years ago: http://pl.postech.ac.kr/~gla/diablo.html

I still play Diablo with 8 characters occasionally. I wish there was a game like this.


There is a common belief that SparkSQL is better than Hive because SparkSQL uses in-memory computing while Hive is disk-based. Another common belief is that Presto is better than Hive because it is based on an MPP design and was invented for the very purpose of overcoming the slow speed of Hive by the very company (Facebook) that created Hive in the early 2010s.

The reality is that nowadays both SparkSQL and Presto are way behind Hive, in terms of both speed and maturity. Hive has made tremendous progress since 2015 (with the introduction of LLAP), while SparkSQL still has stability issues with fault tolerance and shuffling. (Presto does not support fault tolerance.) So, IMO, SparkSQL is nowhere near ready to replace Hive.

If you are curious about the performance of these systems, see [1] and [2], which compare Hive, SparkSQL, and Presto. Disclaimer: We are developing MR3, mentioned in the articles. However, we tried to make a fair comparison in the performance evaluation.

[1] https://mr3.postech.ac.kr/blog/2019/11/07/sparksql2.3.2-0.10... [2] https://mr3.postech.ac.kr/blog/2019/08/22/comparison-presto3...


We have never been able to make Hive LLAP run reliably on our HDP cluster; queries sometimes just hang for no apparent reason.

On the other hand, our Presto cluster runs pretty much anything we throw at it, and when it fails, the failures are easier to anticipate and mitigate. It's also quite simple to deploy and operate.


Could you expand more on the reasons why Hive is faster than Spark? Aren't Hive joins also achieved via a MapReduce shuffle?


Query plans are heavily optimized, and map-side joins are used extensively. The use of optimizations that exploit memory makes the so-called in-memory computing of Spark no longer relevant, because Hive also uses memory efficiently. The Hive community is actively working toward Hive 4, so I guess the future version will be even faster.


Spark also has a query plan optimizer, and uses map-side joins (referred to as broadcast joins) whenever it makes sense. I'm just curious what other architectural differences, in your opinion, can result in a performance discrepancy.


While I cannot give a definitive answer because I am not an expert on Spark internals, my opinion is that the discrepancy results mainly from query and runtime optimization.

Apart from adding new features (e.g., ACID support), a lot of effort is still put into optimizing queries and the runtime. In essence, Hive is a tool specialized for SQL, so it tries to implement all the optimizations you can think of in the context of executing SQL queries. For example, Hive implements vectorized execution, whose counterpart in Spark was implemented only in a later version (with the introduction of Tungsten, IIRC). Hive even supports query re-execution: if a query fails with a fatal error like OOM, Hive generates a new query plan after analyzing the runtime statistics collected up to that point. The second execution usually runs much faster, and you can also update the column statistics in the Metastore.

In contrast, Spark is a general-purpose execution engine where SparkSQL is just an application. I remember someone comparing Spark to a Swiss army knife, which enables you to do a lot of things easily, but is no match for specialized tools developed for a particular task. My (opinionated) guess is that SparkSQL will be replaced by Hive and Presto, and Spark streaming will be replaced by Flink.
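For reference, a map-side (broadcast) join of the kind mentioned above can be requested explicitly through Spark's Scala API. This is only a minimal sketch; the dataframe and column names are hypothetical.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    object BroadcastJoinSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()
        import spark.implicits._

        val facts = Seq((1, 100L), (2, 250L)).toDF("dim_id", "amount")   // large side
        val dims  = Seq((1, "US"), (2, "KR")).toDF("dim_id", "country")  // small side

        // Hint that the small side should be broadcast to every executor,
        // so the join happens map-side without shuffling the large side.
        facts.join(broadcast(dims), "dim_id").show()

        spark.stop()
      }
    }

Without the hint, Spark's optimizer chooses a broadcast join on its own only when the small side is below the configured size threshold.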


Was SparkSQL ever intended to replace Hive? My impression was that it was supposed to supplement Spark for times when it was convenient. I kind of suspected at one point that they got caught up in the SQL-on-Hadoop race, but I always felt it was best to do SQL elsewhere, and save Spark for things that couldn't be easily expressed in SQL.


The original SparkSQL was pretty much modelled after the Hive flavour of SQL, down to the available UDFs. The compatibility was never complete and has somewhat diverged again with respective releases of the frameworks, but for the most part, Hive was the big data framework to beat at the time (2015-2016), and not everyone wanted to write Scala.

I think that now, maintaining that compatibility is less of a need for Spark and Hive has introduced a lot of goodies in the meantime, so there might not be a need for the SQL flavors to be in lockstep anymore.


SQL can be used to produce a dataframe, or a dataframe can be registered as a Hive temp view that other SQL can query. That gives the flexibility to mix and match SQL and programmatic logic within the same Spark app, as in the sketch below.
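A minimal sketch of that mixing in Spark's Scala API (the dataframe, view, and column names are hypothetical):

    import org.apache.spark.sql.SparkSession

    object SqlAndDataFramesSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("sql-and-df-sketch").getOrCreate()
        import spark.implicits._

        // Programmatic logic produces a DataFrame...
        val clicks = Seq(("alice", 3), ("bob", 7)).toDF("user", "clicks")

        // ...which is registered as a temp view so plain SQL can query it...
        clicks.createOrReplaceTempView("clicks")
        val heavy = spark.sql("SELECT user, clicks FROM clicks WHERE clicks > 5")

        // ...and the SQL result is again a DataFrame that code can keep transforming.
        heavy.withColumnRenamed("clicks", "click_count").show()

        spark.stop()
      }
    }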


I agree that Spark on Kubernetes will have a hard time fixing the problem of shuffling. If they choose to use local disks for a per-node shuffle service, a performance issue arises because disk caching is container-local. If they choose to use NFS to store shuffle files, a different kind of performance issue arises because local disks are not used for storing intermediate files. All these issues arise even before fault tolerance is properly implemented in Spark.

We are currently trying to fix the first problem in a different context (not Spark), where worker containers store intermediate shuffle files on local disks mounted as hostPath volumes. The performance penalty is about 50% compared with running everything natively. Besides, occasionally some containers almost get stuck for a long time. I believe that the Spark community will encounter the same problem in the future if they choose to use local disks for storing intermediate files.
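For the Spark case specifically, the analogous setup would point Spark's local scratch directory at such a mount. This is only a sketch under that assumption (the mount path is hypothetical, and the system described above is not Spark):

    import org.apache.spark.sql.SparkSession

    object ShuffleOnLocalDiskSketch {
      def main(args: Array[String]): Unit = {
        // spark.local.dir is where executors write shuffle and spill files;
        // on Kubernetes, /mnt/scratch would need to be backed by a hostPath
        // (or emptyDir) volume mounted into the executor pods.
        val spark = SparkSession.builder
          .appName("shuffle-on-local-disk-sketch")
          .config("spark.local.dir", "/mnt/scratch")
          .getOrCreate()
        import spark.implicits._

        // Any wide transformation below this point writes its shuffle files
        // under /mnt/scratch on each executor's node.
        spark.range(0, 1000000).groupBy($"id" % 10).count().show()

        spark.stop()
      }
    }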


Glad our post sparked some pretty deep discussions on the future of spark-on-k8s! The open-source community is working on several projects to help with this problem. You've mentioned NFS (by Google), but there's also the possibility of using object storage. Mappers would first write to local disks, and then the shuffle data would be asynchronously moved to the cloud.

Sources:

- end of presentation: https://www.slideshare.net/databricks/reliable-performance-a...

- https://issues.apache.org/jira/browse/SPARK-25299


For debugging a distributed system, it may be just okay to use the traditional way, consisting of testing, log analysis, and visualization, that everyone is familiar with. Yes, there are advanced techniques such as formal verification and model checking, but depending on the complexity of the target distributed system, it may be practically out of the question or just not worth your time to try to apply these techniques (unless you are in a research lab or supported by FAANG). In other words, there may be nothing inherently inefficient in sticking to the traditional way, because distributed systems are hard to debug by definition and there is (and will be) no panacea.

We have gone through the pain of testing and debugging a distributed system that has been under development for the past 5 years. We investigated several fancy ways of debugging distributed systems, such as model checking and formal verification. In the end, we decided to use (and are more or less happy with) the traditional way. The decision was made mostly because 1) the implementation is too complex (a lot of Java and Scala code) to allow formal techniques; 2) the traditional way can still be very effective when combined with careful design and thorough testing.

Before building the distributed system, we worked on formal verification using program logic and were knowledgeable about a particular sub-field. From our experience, I guess it would require a PhD dissertation to successfully apply formal techniques to debugging our distributed system. The summary of our experience is at https://www.datamonad.com/post/2020-02-19-testing-mr3/.


I wonder if the promise of cheap hardware using distributed systems is offset by the increased complexity and developer time. Stack Overflow scaled up rather than out, and I have never seen a problem with their site.


Is S.O. a good example of a complicated/large distributed system? I couldn't find any quick googleable results on how many people work on their site.

The reason I ask is I'm working on a product with 30-ish different pods/teams (maybe about 200-250 engineers) working on their respective modules, microservices, etc. My understanding, from talking to a lot of others at conferences, is that our distributed system is fairly small (in terms of functional modules/teams, transactions, etc.).

Anyway, even at our small-ish scale, I couldn't imagine running our platform as a single app that we were able to scale up with better hardware.

Also, I think how a company supports multi-tenancy would play a big role in deciding how this works, too, because you can have a scenario with a monolith and a DB, but partitioned by individual tenant DBs, app servers, etc., and you still have a huge pile of hardware (real or virtual) you're dancing around in.


My point is that Stack Overflow seems to have kept things as simple as possible, the opposite of a "complicated distributed system". It seems to be a classic relational-database-backed app with some additions for specific parts where it needed to scale. In the end I guess it is distributed, but it looks like it's based largely around a monolith.

https://stackexchange.com/performance


> Anyway, even at our small-ish scale, I couldn't imagine running our platform as a single app that we were able to scale up with better hardware.

Lots of 250+ engineer teams out there working on monoliths.


Distributed systems (and specifically microservices) are oftentimes solutions to organizational problems, not technical ones.


Their developers can't write good stuff on their resume though. How will those poor chaps get another job without writing Kubernetes, NoSQL, distributed databases, large-scale horizontally scalable systems? /s

AFAIK, writing "Used a large machine to solve customer problems quickly and efficiently" is not really taken well by a lot of people. The majority of companies would do better to scale up than out, but out is the new normal for various reasons.


They're using NoSQL and other things. It's mostly Microsoft C# stack though.


What NoSQL? According to their blog they use SQL server.


That’s the problem. Doing the reasonable thing is a career killer.


The notion that stack overflow is small and scaling up is long obsolete. It's running on more than a hundred servers now.


Not according to this:

https://stackexchange.com/performance

Where do you get your numbers from?


Yep, and if you look at average CPU load percentage, it's usually in single digits.


When I worked on supercomputer simulations, I always landed on just dumping intermediate results and looking at them, rather than figuring out how to compile something TotalView could look at.

When the latter was possible it saved time, but it was always breaking under me, and I eventually said fuck it.


> it may be just okay to use the traditional way consisting of testing, log analysis, and visualizing that everyone is familiar with.

Not at all. I spent quite a lot of effort debugging distributed deadlocks in a highly available system that was the best-selling product in its category at one of the most famous (and loved) software companies based in SV, and the amount of things that can (and will) go wrong is infinite, given that every piece of infrastructure has its difficult-to-find/reproduce bugs. Things like sockets that stop responding after a few weeks due to an OS bug, messing up your distributed log; unexpected sequences of waking up during a node crash and fail-over because some part of the network had its own issues, leading to split brain for a few moments that needs to be resolved in real time; a brief out-of-memory issue that leads to a single packet loss, messing up your protocol; etc. We used sophisticated distributed tests, and even that was absolutely inadequate. You are still in a naive and optimistic mode, though perhaps you as a designer will be shielded from the real-life issues your poor developers will experience, as the original architects usually move on toward newer, cooler things and leave the fallout to somebody else.


Check out FoundationDB's approach:

https://www.youtube.com/watch?v=4fFDFbi3toc


Using a single thread to simulate everything is cool (as stated in my previous comment on FoundationDB at https://news.ycombinator.com/item?id=22382066). Especially if the overhead of building the test framework is small.

In our case, we use a single process to run everything, in conjunction with strong invariants written into the code. When an invariant is violated, we analyze the log to figure out the sequence of events that triggers the violation. As the execution is multi-threaded, the analysis can take a while. If the execution were single-threaded, the analysis would be much easier. In practice, the analysis is usually quick because the log tells us what is going on in each thread.

So, I guess there is a trade-off between the overhead of building the test framework and the extra effort in log analysis.
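To make the idea concrete, here is a minimal sketch of what such an invariant plus per-thread logging can look like; the object and method names are hypothetical, and only java.util.logging from the standard library is assumed.

    import java.util.logging.Logger

    object SlotTracker {
      private val log = Logger.getLogger("slot-tracker")
      private var freeSlots = 4

      def acquire(taskId: Int): Unit = synchronized {
        // Every event is tagged with the thread name, so a violation can be
        // traced back to the per-thread sequence of events in the log.
        log.info(s"[${Thread.currentThread.getName}] acquire for task $taskId, freeSlots=$freeSlots")
        freeSlots -= 1
        // Strong invariant written into the code: free slots never go negative.
        assert(freeSlots >= 0, s"invariant violated: freeSlots=$freeSlots after task $taskId")
      }

      def release(taskId: Int): Unit = synchronized {
        freeSlots += 1
        log.info(s"[${Thread.currentThread.getName}] release for task $taskId, freeSlots=$freeSlots")
      }
    }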


Reading this book AND trying to follow its key lessons makes a huge difference in productivity, which I can attest to from my own experience. This book is often compared to the Bible of software engineering, suggesting that everyone knows about the book (e.g., 'no silver bullet'), some people read it, but only a few people abide by it. So, its key lessons are hard to follow in the real world, but for a good reason.

We started a project 5 years ago, after a few months of failed attempts. From the very onset of the project, we tried to adhere to the key lessons of the book. Examples are: recognizing the importance of minimizing communication overhead (the most important assets are not people but time), following the surgical model (key decisions should be made by a single individual), practicing effect-free programming whenever possible, allocating enough time for testing, and so on. I would definitely attribute the success of our project to the teachings of the MMM.

Like many books on software engineering (and self-help books in general), just reading a book and learning its contents may not make any difference in practice. Only when you seriously make conscious efforts to practice its teachings do you realize what the book is really about. This is also the reason why many university courses on software engineering are boring.


When writing code, we often think "according to the design of our system, this condition must be true at this point of execution." Examples are:

1. The argument x must not be 0.

2. The variable x must be smaller than the variable y.

3. The list foo must be non-empty.

4. The variable x should have value 'Success' if it had value 'Try' at the beginning of the function call.

These 'invariants', or assertions, can be extremely useful for testing the correctness of the code. Put simply, if an invariant is violated (during unit tests, integration tests, or system tests), it indicates that either the design or the implementation is wrong. An article on testing methodology would be more appealing if it had some discussion on exploiting invariants/assertions.
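A minimal sketch of the four examples above written as runtime checks in Scala (the function and variable names are hypothetical; require checks arguments, while assert can check a condition anywhere in the code):

    object InvariantSketch {
      // 1. The argument x must not be 0.
      def divideTotal(total: Long, x: Long): Long = {
        require(x != 0, "argument x must not be 0")
        total / x
      }

      // 2. The variable x must be smaller than the variable y.
      def pickRange(x: Int, y: Int): Range = {
        require(x < y, s"x=$x must be smaller than y=$y")
        x until y
      }

      // 3. The list foo must be non-empty.
      def firstTask(foo: List[String]): String = {
        require(foo.nonEmpty, "the list foo must be non-empty")
        foo.head
      }

      // 4. x should end up as 'Success' if it was 'Try' at the beginning of the call.
      def finishAttempt(x0: String): String = {
        val x = if (x0 == "Try") "Success" else x0
        assert(x0 != "Try" || x == "Success", s"x0=$x0 ended as x=$x")
        x
      }
    }

If any of these checks fails during unit, integration, or system tests, the failure points directly at the violated design assumption.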


I believe "Design by Contract"[0], or "DbC", is the concept you are describing. In the Wiki page Notes and External Links sections there are some good resources IMHO.

0 - https://en.wikipedia.org/wiki/Design_by_contract


Yes, closely related, but invariants can appear anywhere in the code (like loop invariants), and are less restrictive than pre-conditions and post-conditions, which must appear at the beginning and end of methods. So, invariants are more about testing than design. A small loop-invariant sketch appears at the end of this comment.

Arguably, invariants are especially powerful in testing distributed systems:

0 - https://www.datamonad.com/post/2020-02-19-testing-mr3/
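For the loop-invariant case mentioned above, a minimal sketch (binary search over a sorted array; the names are hypothetical) could look like this:

    object LoopInvariantSketch {
      def binarySearch(a: Array[Int], key: Int): Int = {
        var lo = 0
        var hi = a.length - 1
        while (lo <= hi) {
          // Loop invariant: the search window stays inside the array bounds,
          // and if key is present its index lies in [lo, hi].
          assert(lo >= 0 && hi < a.length)
          val mid = lo + (hi - lo) / 2
          if (a(mid) == key) return mid
          else if (a(mid) < key) lo = mid + 1
          else hi = mid - 1
        }
        -1
      }
    }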


Aren't these examples of unit tests?


No, as they're runtime asserts baked into the code itself, rather than existing elsewhere.


Testing a distributed system using a single machine may look like an unorthodox approach. From our experience, however, when building a test framework for a distributed system, everyone would naturally be led to build it on a single machine because of the many benefits, especially saving a lot of time. So, I would find it very inefficient to develop a distributed system using only system tests on a cluster of machines, without exploiting integration tests.

Nevertheless, using just a single thread to simulate everything seems like a great approach.

