"One way to get C-like performance is to use C via JNI for key sections of code. If you want to avoid using C or JNI there are still ways you can get the performance you want."
JNI has huge latency. Write code in Java to get "C-like" performance.
"One area Java can be slower is array access. This is because it implicitly does bounds checking."
Which gets optimized away by HotSpot.
"However this doesn't mean you can't pre-allocate your objects, use Direct ByteBuffers and Object recycling techniques to minimise your object creation."
Please don't write your own object pools, this isn't 1995.
In my experience, object pooling (and poor man's object pooling, using scratch objects and the like) can still make a major difference for certain applications. I worked on JBox2d, a physics engine (which creates a ton of tiny Vec2 objects for intermediate steps in calculations), and for most of its history the highest-ROI optimization has been to chip away at any and all unnecessary object creation. Sometimes this meant inlining simple operations that would have created intermediate vectors or matrices (this can be a huge win because you avoid both GC load and object-size overhead, which for a cache-miss-dominated application like a physics engine can be extremely important); sometimes it meant pre-allocating "scratchpad" objects; and so on. The end result was that we got more than a 4x speed increase by focusing exclusively on getting rid of small-object shenanigans. Recently there was a rewrite which included some more explicit pooling strategies; I haven't tested those against the unpooled variants, but I should, since I suspect there's still a huge benefit. The additional bonus is that if you're careful about garbage, it's a lot easier to use your code on Android or to push it through GWT, as those platforms don't have anywhere near the GC efficiency of the Sun JVM.
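For anyone curious what that looks like in practice, here's a toy sketch of the scratch-object idea (the Vec2 below is a stand-in, not JBox2d's actual code):

    // Reuse one pre-allocated vector for intermediate results instead
    // of allocating a new one per call.
    final class Vec2 {
        float x, y;
        Vec2 set(float x, float y) { this.x = x; this.y = y; return this; }
    }

    final class Body {
        final Vec2 position = new Vec2();
        final Vec2 velocity = new Vec2();
        // Scratch object: reused on every call, never handed out.
        private final Vec2 scratch = new Vec2();

        void integrate(float dt) {
            // The naive version would allocate a temporary Vec2 here on
            // every step, thousands of times per frame.
            scratch.set(velocity.x * dt, velocity.y * dt);
            position.set(position.x + scratch.x, position.y + scratch.y);
        }
    }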
I don't know about ByteBuffers, though; we never experimented with those because we didn't want to mess around with any unsafe code that might not be allowed in applets or on other platforms.
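For what it's worth, direct ByteBuffers are plain public java.nio API, no sun.misc.* involved, so as far as I know they shouldn't trip a security manager the way Unsafe does. A minimal example:

    import java.nio.ByteBuffer;

    public class DirectBufferDemo {
        public static void main(String[] args) {
            // The backing memory lives outside the Java heap, so it
            // adds nothing to GC workload.
            ByteBuffer buf = ByteBuffer.allocateDirect(1024 * 1024); // 1 MB off-heap
            buf.putFloat(0, 3.14f); // absolute put/get: no object churn
            System.out.println(buf.getFloat(0));
        }
    }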
Things like escape analysis and scalar replacement help, but not enough. The problem is that those optimizations are extremely fragile, and for a lot of real-world code the analysis required to apply them depends on knowledge of the overall design of the system (for instance, knowing that certain methods will not mutate fields of objects passed to them in certain situations).
Granted, a physics engine that's creating ~100,000 small objects per frame in some situations is an extreme edge case, and when I'm not doing that type of programming I'm much less concerned with garbage.
I believe it would do so automatically only when it can prove that the bound will not be violated. That wouldn't be very often in a program written in an imperative style.
Lest it be mistaken for a swipe at imperative style: functional languages don't have it much easier either. I think the problem is that, for the most part, array use patterns are not very amenable to this kind of analysis. So even in functional languages you have to explicitly ask the runtime or the compiler to avoid bounds checks.
If I remember correctly, there were a few languages designed to handle this case, i.e. using compile-time type checking to ensure that all the arrays are of the right shape and that none are accessed unsafely. Fish, I think it was called. Don't know what happened to it or the idea; it seemed pretty handy to have. I think the first place to check now would be SAC (Single Assignment C). But I have digressed far from the topic of this thread!
> Please don't write your own object pools, this isn't 1995.
The snark was unnecessary. It is frowned upon here.
On the contrary, for almost all types of common array code the compiler is really good at optimizing away the bounds checks (see http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.86.96... ). Things only get tricky when you have, for instance, arrays of indices into other arrays, or when the next array position to examine depends on the contents of the one being examined. Even there though (and even in array-heavy code), bounds checks don't typically take up a ton of program execution time.
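Roughly the distinction, sketched (this is my understanding of what the JIT typically manages, not something verified on every VM):

    class Loops {
        static long sum(int[] a) {
            long s = 0;
            // The JIT can prove i stays in [0, a.length), so the
            // per-access bounds check can be hoisted out of the loop.
            for (int i = 0; i < a.length; i++) {
                s += a[i];
            }
            return s;
        }

        static long gather(int[] a, int[] idx) {
            long s = 0;
            for (int i = 0; i < idx.length; i++) {
                // idx[i] could hold anything, so the check on a[] stays.
                s += a[idx[i]];
            }
            return s;
        }
    }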
For sparse matrix operations, though, that is the most common pattern of access. For linear algebraic operations on dense matrices, BLAS and LAPACK take care of most of the issues and are super fast. I think the times I do not use BLAS/LAPACK (directly or indirectly) are when I have to apply operations on matrices that are not the typical linear algebraic operations. This comes up fairly often too in my line of work, which is machine learning.
Interesting, I hadn't seen his article on LMAX. For more details (and the code) check out their google code page: http://code.google.com/p/disruptor.
The disruptor pattern itself was new to me, but the general principles (particularly the idea of running trading business logic in a single thread with all data in memory and no garbage collection) are very familiar.
It's a silly name for a Pattern. All the others (Command, Strategy, Visitor, etc.) are more-or-less meaningful. In another blog post, they say they named it after the weapon in Star Trek. They could have called it Ring, because at the end of the day it is just a ring buffer.
I also found that particular sentence really strange. "How to get C-like performance in Java? Buy more servers!" Yeah, sure, but that approach will work for any language. With async you could handle even more connections on that larger number of servers...
You can handle a much larger number of connections provided they don't do anything useful. However if you expect to do anything useful, you are likely to need some horizontal scalability once you have thousands of connections.
AdWords charges about $1 per click. Let's say your hardware budget is only $1 per connection. With 10K connections you still have enough money for 10 servers if you wanted.
"You can handle a much larger number of connections provided they don't do anything useful."
Define useful? I think you mean computationally complex, but that isn't the same thing as useful. At my last job I worked on a web crawler, and fetching a page, then persisting it in our grid, was useful work. It was useful to have more than 10k connections in that context.
One of my pet peeves is this idea of throwing servers at problems. We currently have enormous problems with power and heat dissipation in data centers. At my last job, where I did the web crawler, we couldn't put any more servers in our data center because we didn't have enough power. Doing more, with fewer servers can be very important.
I also wonder how much larger the carbon footprint of computing is because of such philosophies.
In your experience, have you ever seen or heard of a business or web site with more than 10K concurrent users who thought it was a good idea to place all those users on just one server?
Connections and users are not the same thing. Accomplishing a task with fewer resources is better, so when you say "Scalability beyond 10K users/server doesn't buy you anything", you are wrong.
Imagine if you have a giant datacenter like google, and heat and power are a serious problem. Accomplishing a task with fewer servers is a serious win.
A search I did suggests that Google has nowhere near 10K users per server. That is likely because the servers spend most of their time performing searches. The overhead of managing persistent connections is likely to be a small portion of what they do, as in most real applications.
You seem to be implying that since, in this one use case, it's not useful, then it's not useful at all. That's a non sequitur. I already gave one concrete example of where it was useful in my past experiences.
"Note: Most of these suggestions only work for standalone applications rather than applets."
Whoa! That made me double-take. Applets have been an effectively abandoned technology (by the browser makers, not necessarily Sun) for what, almost 10 years now? It's amazing this guy would feel the need to specify this.
Then again, had applets ever taken off... we might be coding client-side code in Lisp now (or your favorite JVM language :)). Crazy.
That's true. You're not encapsulated from JS, so you have to be aware of JS issues like null-undefined-false-0 and all the rest of them. So it feels like you're writing in two languages at once. On the other hand, that has some benefits too.
The article is from back in 2008, so I'd first dispute its accuracy with modern JVMs.
It's not just speed (the speed obviously changes as you change the number of connections; 100k threads isn't going to beat 100k NIO connections). It's also that it's more scalable, better code, more maintainable, less prone to bugs, and you don't have to worry as much about concurrency issues. I'd argue that the lines of code are likely to be fewer too. And memory usage is likely to be less.
For better or worse, JVMs don't change that fast. (In 2008, the latest official release was JDK6, reaching 'update 12' in December. In July 2011, until the official release of JDK7 next week, the latest official JVM is still JDK6, 'update 26'.)
Or do you know of specific NIO performance improvements between JDK6 in 2008 and JDK6u26, or in JDK7?
100k threads isn't going to beat 100k NIO connections
That may be true but I wouldn't assume it with certainty without evidence. Threading has been improving, too. Tyma's 2008 presentation mentioned a JVM limit at the time of 16000 threads, but that may have changed as well.
A single-threaded NIO implementation, using only one core, could very plausibly lose to a 100K-threaded implementation on a multicore machine. And once you split your NIO over a few threads to use cores effectively, you have most of the same concurrency worries as with connection-per-thread approaches.
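For reference, a single-threaded selector loop is not much code. Here's a bare-bones sketch (the port number and echo behaviour are just illustrative):

    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.*;
    import java.util.Iterator;

    // One thread, one Selector: every connection is multiplexed onto a
    // single core, which is exactly the trade-off discussed above.
    public class SelectorLoop {
        public static void main(String[] args) throws Exception {
            Selector selector = Selector.open();
            ServerSocketChannel server = ServerSocketChannel.open();
            server.socket().bind(new InetSocketAddress(8080));
            server.configureBlocking(false);
            server.register(selector, SelectionKey.OP_ACCEPT);
            ByteBuffer buf = ByteBuffer.allocateDirect(4096);
            while (true) {
                selector.select();
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    if (key.isAcceptable()) {
                        SocketChannel ch = server.accept();
                        ch.configureBlocking(false);
                        ch.register(selector, SelectionKey.OP_READ);
                    } else if (key.isReadable()) {
                        SocketChannel ch = (SocketChannel) key.channel();
                        buf.clear();
                        if (ch.read(buf) < 0) { ch.close(); continue; }
                        buf.flip();
                        ch.write(buf); // echo back
                    }
                }
            }
        }
    }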
>Terrible advice. Use NIO and do it properly. Creating 1 thread per connection is for unskilled programmers.
Depends on the application. If the connection is handling real-time financial data (for example), there might be enough heavy lifting on the data stream to warrant one thread per connection. A good programmer wouldn't have too many of these threads running on a given machine, however, because one thread can easily consume a full CPU core, and still need more.
For connections that aren't doing much work, it's definitely better to aggregate connections onto a much smaller number of threads to avoid context switches.
>In my experience once you get up to about 1k threads in Java you'll be wasting most of the CPU up switching between them.
I agree. Thread context switching is extremely expensive. I've seen 10 thread applications consume full CPU time with context switches alone. There are a lot of tricks that can be done to mitigate it, like binding threads to certain CPU cores, making sure that threads that are communicating with each other are on the same CPU to avoid L2/L3 cache thrashing, etc. It's a lot of work, and requires a lot of knowledge about the code and system, but the results are staggeringly good. We've seen performance boosts on the order of 25% with single threaded applications, and boosts around 30-40% for multithreaded apps. Almost all of this is done outside the application code, however.
I can't imagine any reason why you'd have 1k threads running on the same box in one process, to be honest. 1k+ threads seems to indicate a lot of threads doing very little work. If they're doing basically nothing, consuming less than 1% of the CPU each, it's better to try to consolidate the work they're doing into a much smaller number of threads. It's a lot of work, but will get pretty considerable performance gains.
> Terrible advice. Use NIO and do it properly. Creating 1 thread per connection is for unskilled programmers.
Sometimes the simple solutions are the best. Developing a complex and clever solution may make you feel more skilled. Using a dispatcher pattern often doesn't provide a real advantage, or the advantage it provides isn't needed.
I would agree with the suggestion that if you are going to use a dispatcher pattern with NIO, use a library like Netty which does this for you. (In any case, you don't write it yourself.)
I have, and if you use blocking IO or NIO it's like twenty lines of code just about anyone can read and maintain. Why make it more complicated than it needs to be?
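Literally, something like this (an echo server for illustration; error handling kept minimal):

    import java.io.*;
    import java.net.*;

    // Blocking IO, one thread per connection: the simple solution.
    public class EchoServer {
        public static void main(String[] args) throws IOException {
            ServerSocket server = new ServerSocket(8080);
            while (true) {
                final Socket s = server.accept();
                new Thread(new Runnable() {
                    public void run() {
                        try {
                            InputStream in = s.getInputStream();
                            OutputStream out = s.getOutputStream();
                            byte[] buf = new byte[4096];
                            for (int n; (n = in.read(buf)) >= 0; )
                                out.write(buf, 0, n);
                        } catch (IOException ignored) {
                        } finally {
                            try { s.close(); } catch (IOException ignored) {}
                        }
                    }
                }).start();
            }
        }
    }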
I've been working very hard to make Java use less memory using techniques like these and others more. The problem is, if you really want to specify exactly how non-trivial structured objects are packed into memory, you end up writing your own garbage collector and memory allocator, which is a rather large project all by itself. Also, I have found that cutting down on memory usage like that hurts performance a lot.
At the end of the day, going down that road makes you less productive than working in C++ and the result will still be subpar compared to a C or C++ solution. So I'm back with C++ right now even though I'm painfully aware of its weaknesses.
If you are coming back to C++ after a stint with Java, I would highly recommend D. I think it hits the sweetspot between C++ and Java. The not so nice thing about D is that its libraries are not as pervasive and mature. The Phobos and Tango split is currently being addressed. I think all of Tango's functionality will be incorporated into Phobos. With D you get a simpler C++ with optional automatic memory management, and some functional idioms and nice support for parallelism.
@down-voter: If you could explain what you found down-vote worthy that would be nice.
D looks interesting, but I'm a little concerned that the libraries may not be robust or comprehensive enough yet. It's an uphill battle for any new language.
I would say I am less concerned about how memory is packed, than how much is discarded. This is because memory is relatively cheap these days. A machine with 24 GB of memory costs around £1K.
However, you make a good point that C++ gives you better control over how memory is arranged, and you have to use hacks like the Unsafe class in Java (which is not as friendly).
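For the curious, the Unsafe hack looks something like this. Note it's HotSpot-specific (the "theUnsafe" field is an implementation detail) and exactly as unfriendly as described:

    import java.lang.reflect.Field;
    import sun.misc.Unsafe;

    public class UnsafeDemo {
        public static void main(String[] args) throws Exception {
            // Unsafe.getUnsafe() rejects normal callers, so everyone
            // reflects on the private static field instead.
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            Unsafe unsafe = (Unsafe) f.get(null);

            long addr = unsafe.allocateMemory(8); // malloc-style allocation
            unsafe.putLong(addr, 42L);            // raw write, no bounds check
            System.out.println(unsafe.getLong(addr));
            unsafe.freeMemory(addr);              // and the matching free
        }
    }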
The title is a little misleading; there are only a few techniques here, and it seems that one of the more interesting ones (the "Unsafe" class) is only available in some JVMs, one doesn't help with speed in many cases (compressed strings), and a couple of the others boil down to "use DirectByteBuffers".
For servers, memory is usually the long pole in the tent. For me it boils down to:
* Use less memory.
* Don't use a lot of threads.
* Use fewer objects.
Worst-case GC is a big issue. For just a 2 GB heap you are looking at up to 7 second GC pauses with the Java 6 collector. You can use concurrent GC but it adds significant CPU and memory overhead. Direct buffers don't live in the heap so they don't contribute to GC time.
Threads in Java are extremely heavyweight in memory, if not in CPU. A rule of thumb is about 1 MB per thread. You can make it better by decreasing the stack size and/or using 32-bit addresses.
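Both knobs look like this (the 64 KB figure is illustrative, and the JVM is free to treat the constructor argument as a hint):

    public class SmallStackThread {
        public static void main(String[] args) {
            // -Xss sets the default stack size for every thread, e.g.
            //   java -Xss256k MyServer
            // The four-argument Thread constructor requests a size for
            // one particular thread instead.
            Runnable task = new Runnable() {
                public void run() { /* per-connection work would go here */ }
            };
            Thread t = new Thread(null, task, "conn-worker", 64 * 1024);
            t.start();
        }
    }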
If you create little enough garbage and your eden size is large enough, you can avoid having any minor collections at all. I have a full GC which is triggered once per night as maintenance.
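The HotSpot flags for that strategy would be something along these lines (the sizes are illustrative, not a recommendation):

    java -Xms8g -Xmx8g -Xmn6g -XX:+UseConcMarkSweepGC MyServer

i.e. a large, fixed heap with most of it given to the young generation so eden almost never fills, and the concurrent collector picking up whatever old-generation work remains.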
This is one of the greatest efficiency advantages of Go over Java: for the same task, Go code uses only a tiny fraction of the memory that Java uses.
Most of these tips seem to be variations on the "flyweight" design pattern (http://en.wikipedia.org/wiki/Flyweight_pattern): avoid having many objects for the GC to keep checking, along with all the other overhead each individual object carries.
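A toy version of the pattern, for anyone who hasn't seen it (a hypothetical Glyph class, in the spirit of Integer.valueOf's small-int cache):

    import java.util.HashMap;
    import java.util.Map;

    // Intern a small immutable value so repeated occurrences share one
    // object instead of each carrying its own header and GC cost.
    public final class Glyph {
        private static final Map<Character, Glyph> CACHE = new HashMap<Character, Glyph>();
        public final char c;

        private Glyph(char c) { this.c = c; }

        public static Glyph of(char c) {
            Glyph g = CACHE.get(c);
            if (g == null) {
                g = new Glyph(c);
                CACHE.put(c, g);
            }
            return g;
        }
    }

    // Glyph.of('a') == Glyph.of('a') holds: the same instance every time.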
The answer to why you would still use Java is here:
"Java makes high level programming simpler and easier at the cost of making low level programming much harder. Fortunately most applications follow the rule of thumb that you spend 90% of your time in 10% of the code. This implies you are better off 90% of the time, worse off 10% of the time. ;)"
Why would you develop more than a small portion of your application in a low-level language unless every part of your application is performance critical?
That's a good link. The n-body and spectral-norm benchmarks were fastest in Fortran. The fannkuch benchmark was fastest in Ada. The Mandelbrot benchmark was fastest in ATS.
In each case the language has pointers; however, they were not used in the core part of the benchmark, i.e. they didn't get their speed advantage from pointers.
Haskell was the fastest for the thread-ring benchmark, and it doesn't have pointers.
If Java had pointers as it does in the sun.misc.Unsafe utility class some code would be faster, but I don't believe this explains why Java is not the fastest in each benchmark. ;)
Object orientated programming is used in a significant portion of all programs, it would be interesting how C/C++ performs in such benchmarks. ;)
Those measurements are for programs forced to use just one core on a quad-core machine; you may also be interested in the measurements made when the programs are allowed to use all the cores -
>>The n-body and spectral-norm benchmarks were fastest in Fortran.<<
Maybe we should wonder if the Intel C++ compiler would produce programs comparable to those from the Intel Fortran compiler ;-)
>>Object orientated programming is used in a significant portion of all programs, it would be interesting how C/C++ performs in such benchmarks.<<
Maybe you're suggesting that OO C++ programs wouldn't perform as well as Java programs? But doesn't that highlight the advantage of being able to write programs in different styles?
I agree. I am suggesting that "no pointer = no speed" is not demonstrated by the benchmarks provided. The developers who wrote the benchmarks which were faster than the C/C++ ones didn't appear to agree, i.e. when they had the option to use pointers, they didn't, and were still faster.
For most of his points you'd have to do the same things in Scala, right? How do you do arbitrary memory access in Scala? How do you use space-efficient strings in Scala? Etc., etc.
While the parent is generally wrong, it has been shown that Scala programs have a tendency to generate slightly (5-10%) less garbage. This is especially interesting on Dalvik.