
>Netflix observed a 20% increase of CPU usage on JDK 17 compared to JDK 8. This was mostly due to the improvements in the G1 garbage collector.

Help me here, why do GC improvements cause CPU increase?




I think this is a 20% improvement in CPU utilization: earlier the app was memory-bound and/or GC was consuming CPU, and now the app has 20% more CPU available, so it should be doing correspondingly more work. This could definitely have been written more clearly.


> Bakker provided a retrospective of their JDK 17 upgrade that provided performance benefits, especially since they were running JDK 8 as recently as this year. Netflix observed a 20% increase of CPU usage

Seems like it's exactly that; the OP cropped out the relevant bit where they note an overall performance benefit from that extra CPU time. Otherwise one could assume it just hogs more CPU to get the same result.


I haven't dealt with this side of Java in a while, but it reflects my experience poking at Java 8 performance. At some (surprisingly early) point you'd hit a performance wall due to saturating the memory bus.

A new GC could alleviate this by either going easier on the memory itself, or by doing allocations in a way that achieves better locality of reference.
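As a toy illustration of what locality of reference buys you (made-up sizes, nothing GC-specific): the two loops below do the same additions, but the second strides across cache lines and is typically several times slower.

    public class Locality {
        public static void main(String[] args) {
            int n = 4_096;
            int[][] a = new int[n][n];
            long sum = 0;

            long t0 = System.nanoTime();
            for (int i = 0; i < n; i++)        // row-major: walks each row sequentially, cache-friendly
                for (int j = 0; j < n; j++)
                    sum += a[i][j];
            long t1 = System.nanoTime();
            for (int j = 0; j < n; j++)        // column-major: strides across rows, far more cache misses
                for (int i = 0; i < n; i++)
                    sum += a[i][j];
            long t2 = System.nanoTime();

            System.out.printf("row-major %d ms, column-major %d ms (sum=%d)%n",
                    (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, sum);
        }
    }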


Most modern GCs trade off CPU usage against latency. Lower latency means the CPU has to do more work, e.g. on a separate thread, to figure out what can be garbage collected. JDK 8 wouldn't have had the G1 collector (I think, or at least only a really old version of it), and they would probably have been using one of the now-deprecated garbage collectors that collect less often but have a more open-ended stop-the-world phase. It used to be that this required careful tuning and could get out of hand and start taking seconds.

The new ZGC uses more CPU, but it provides some hard guarantees that it won't block for more than a certain number of milliseconds. And it supports much larger heap sizes. More CPU sounds worse than it is, because you typically wouldn't want to run your application servers anywhere near 100% CPU anyway, so there is a bit of wiggle room. Also, if your garbage collector is struggling, it's probably because you are nearly running out of memory, and more memory is the solution in that case.
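If you want to see which collector you're actually on and how much time it's spending, the standard management beans are enough; a minimal sketch (class name made up):

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class GcStats {
        public static void main(String[] args) {
            // One bean per collector, e.g. "G1 Young Generation" / "G1 Old Generation" under G1.
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("%s: %d collections, %d ms total%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
        }
    }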


The figure is about the overall improvement; not sure why it reads as an increase.

On JDK 8 we were already using G1 for our modern application stack, and we saw a reduction in CPU utilisation with the upgrade, with few exceptions (saw what I believe is our first regression today: a busy wait in ForkJoinPool with parallel streams; fixed in 19 and later, it seems).
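(For the curious, the pattern that exercises this is just an ordinary parallel stream on the common ForkJoinPool; a trivial sketch, not the actual service code:)

    import java.util.concurrent.ForkJoinPool;
    import java.util.stream.LongStream;

    public class ParallelSum {
        public static void main(String[] args) {
            // parallel() fans the work out across the common ForkJoinPool,
            // whose parallelism defaults to availableProcessors() - 1.
            long sum = LongStream.rangeClosed(1, 100_000_000).parallel().sum();
            System.out.println(sum + " (common pool: " + ForkJoinPool.commonPool() + ")");
        }
    }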

G1 has seen the greatest improvement from 8 to 17 compared to its counterparts, and you also see reduced allocation rates due to compact strings (20-30%), so that reduces GC total time.

It's a virtuous cycle for the gRPC services doing the heavy lifting: reduced pauses mean reduced tail latencies, fewer server cancellations, and less client hedging and retrying. Improvements to application throughput therefore reduce RPS, and further reduce required capacity over and above the CPU utilisation reduction from the efficiency improvements.

JDK 21 is a much more modest improvement over 17, perhaps 3%. Virtual threads are incredibly impressive work, and despite having an already highly asynchronous/non-blocking stack, we expect to see many benefits. Generational ZGC is fantastic, but losing compressed oops (it requires 64-bit pointers) is about a 20% memory penalty. Haven't yet done a head-to-head with Genshen. We already have some JDK 21 in production, including a very large DGS service.
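For anyone who hasn't tried virtual threads yet, the basic shape is roughly this (a minimal sketch against the JDK 21 API, names made up):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class VirtualThreadsDemo {
        public static void main(String[] args) {
            // One virtual thread per task; cheap enough to create tens of thousands.
            try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
                for (int i = 0; i < 10_000; i++) {
                    int id = i;
                    executor.submit(() -> {
                        Thread.sleep(100); // blocking parks the virtual thread, not the carrier thread
                        return id;
                    });
                }
            } // close() waits for submitted tasks to complete
        }
    }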


> G1 has seen the greatest improvement from 8 to 17

Yep. G1 in newer JDKs is very different from G1 in JDK 8, but Parallel GC has also seen very significant improvements: https://kstefanj.github.io/2021/11/24/gc-progress-8-17.html


> Virtual threads are incredibly impressive work,

Do you have an (un)informed opinion on minimum task sizes for the green threads?

My interest is refactoring Java code to reduce total wall-clock time, on large compute with plenty of memory/cache.


I don't think he meant that.


A somewhat common problem is to be limited by the throughput of CPU heavy tasks while the OS reports lower than expected CPU usage. A lot of companies/teams just kind of handwave it away as "hyperthreading is weird", and allocate more machines. Actual causes might be poor cache usage causing programs to wait on data to be loaded from memory, which depending on the CPU metrics you use, may not show as CPU busy time.

For companies at a much smaller scale than Netflix, where employee time is relatively more costly than computer time, this might even be the right decision. So you might end up with 20 servers at 50% usage, when 10 servers would take twice as long yet still appear to be at 50% usage.

If the bottlenecks and overhead are reduced such that the application can make fuller use of the CPU, you might be able to drop to, say, 15 machines at 75% CPU usage. Consequently the increased CPU usage represents more efficient use of resources.


>> while the OS reports lower than expected CPU usage

>> which depending on the CPU metrics you use, may not show as CPU busy time

If your userspace process is waiting on memory (be that cache, or RAM) then you’ll show as CPU busy when you look in top or whatever - even though if you look under the covers such as via perf counters, you’ll see a lack of instructions executed.

The CPU is busy in this case and the OS won’t context-switch to another task; your stalled process will be treated as running by the OS. At the hardware-thread level it will hopefully use the opportunity to run another thread thanks to hyper-threading, but at the OS level your process will show as user-space CPU-bound. You’ll have to look at perf counters to see what’s actually happening.

>> you might end up with 20 servers at 50% usage, when 10 servers would take twice as long yet still appear to be at 50% usage.

Queueing theory is fascinating: the latency change when dropping to half the servers may not be just a doubling. It depends on the arrival rate and the processing time, but the results can be wild, like 10x worse.
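A toy M/M/1 example with made-up numbers: say the service time is 10 ms, so mu = 100 req/s per server, and the average time in system is W = 1/(mu - lambda).

    at 50% load (lambda = 50/s): W = 1/(100 - 50) = 0.02 s = 20 ms
    at 90% load (lambda = 90/s): W = 1/(100 - 90) = 0.10 s = 100 ms

So pushing utilisation from 50% to 90% already makes the average 5x worse, and the tail blows up faster than the average.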


When you put it like that, yes. Hardware is cheap and all that. In practice I think that an organization that doesn't understand the software it is developing has a people problem. And people problems generally can't be solved with hardware.

If somebody knows how to make that insight actionable, let me know. No, hiring new people is not the answer; in all likelihood that swaps one hard problem for an even harder one.


IMHO, usually the people problem is that there are too many people working on the same machine. Sometimes that's unavoidable.

Sometimes, honestly, understanding the software it's developing isn't an important business goal. It makes me personally angry, but most businesses do right by not picking business goals to placate me.

Sometimes you just have too many people.

Sometimes you can restructure your software and systems so that fewer people are working on a system and they can understand it better. Sometimes that would also involve restructuring your organization, which has pluses and minuses.

If you can ensure the smaller teams run similar stacks, there can be some good knowledge transfer when one team figures out an underlying truth about the platform that could apply elsewhere. And sometimes you get a platform expert team that can help with understanding and problem solving throughout the teams.


To free memory. Also, a 20% increase is relative, not absolute: it's 20% when you go from 10 to 12 CPU usage, or from 50 to 60, for instance.


Well done.

I always appreciate numbers and the differentiation between relative and absolute numbers in this case.

"We doubled our workforce in one week!" - CEO's first hire... ;)


The CPU can do more tasks without being limited by memory pressure, perhaps?

I guess it depends on if they mean "we used 20% more CPU for the same output", or "we could utilize the CPUs 20% more".


It’s a 20% improvement. So less time spent on GC.


> Help me here, why do GC improvements cause CPU increase?

In Java 8 (afaik) there were pretty much no generational or concurrent garbage collectors, so garbage collection would happen in a stop-the-world manner: all work gets put on hold, garbage collection happens, then the work can resume.

If you have a better GC, you need shorter and less frequent stop-the-world pauses.

Hence the code can run on the CPU for more of the time, giving you higher CPU usage.

Higher CPU usage is often actually good in situations like this: it means you're getting more work done with the same CPU/memory configuration.


Java 8 was at least a decade into generational and concurrent GC. It does STW once in a while though, which may be what you meant.


There are two kinds of concurrent GCs: the ones where the marking phase is concurrent with the application, and the ones where the evacuation phase is also concurrent with the application.

G1 only does the marking concurrently; the evacuation is done in small pauses. A decade ago, there was only one concurrent-evacuation GC available in Java, C4 from Azul. Now we have Shenandoah and ZGC.


I read it as a good thing: GC improvements -> more available memory -> more work done by the CPU. But I'd still be interested in more detail.


Because the memory / I/O is not the bottleneck anymore, and the CPU can now run optimally.


I haven’t seen the specific profiling data, but it’s possible that the garbage collector is running a collection thread concurrently with the regular processing threads, thereby avoiding whole-world synchronization points that would idle processor cores.


Higher CPU usage paradoxically means better performance. When I last did ops, we used to watch the total CPU usage of all services, and if it was not 100%, we started looking for a bottleneck to fix.


Also interested! We saw basically the exact opposite. :-)


It's like hiring more workers to accomplish the exact same output as before. "See, I achieved 20% growth in my targets!", some recruiter will say!


No, it's like improving a form to minimize the need for follow-up questions to the customer, and now seeing your workers (the same you had before) processing 20% more forms instead of waiting for responses.





