"One way to get C-like performance is to use C via JNI for key sections of code. If you want to avoid using C or JNI there are still ways you can get the performance you want."
JNI has huge latency. Write code in Java to get "C-like" performance.
"One area Java can be slower is array access. This is because it implicitly does bounds checking."
Which gets optimized away by HotSpot.
"However this doesn't mean you can't pre-allocate your objects, use Direct ByteBuffers and Object recycling techniques to minimise your object creation."
Please don't write your own object pools, this isn't 1995.
In my experience, object pooling (and poor man's object pooling, using scratch objects and the like) can still make a major difference for certain applications. I worked on JBox2d, a physics engine (which creates a ton of tiny Vec2 objects for intermediate steps in calculations), and for most of its history the highest-ROI optimization has been to chip away at any and all unnecessary object creation. Sometimes this meant inlining simple operations that would have created intermediate vectors or matrices (this can be a huge win because you avoid both GC load and object-size overhead, which for a cache-miss-dominated application like a physics engine can be extremely important); sometimes it meant pre-allocating "scratchpad" objects; and so on. The end result was that we got more than a 4x speed increase by focusing exclusively on getting rid of small-object shenanigans. Recently there was a rewrite which included some more explicit pooling strategies; I haven't tested those against the unpooled variants, but I should, since I suspect there's still a huge benefit. The additional bonus is that if you're careful about garbage, it's a lot easier to use your code on Android or to push it through GWT, as those platforms don't have anywhere near the GC efficiency of the Sun JVM.
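For anyone curious what that looks like in practice, here's a toy sketch of the scratch-object idea (the Vec2 below is a stand-in, not JBox2d's actual code):

    // Reuse one pre-allocated vector for intermediate results instead
    // of allocating a new one per call.
    final class Vec2 {
        float x, y;
        Vec2 set(float x, float y) { this.x = x; this.y = y; return this; }
    }

    final class Body {
        final Vec2 position = new Vec2();
        final Vec2 velocity = new Vec2();
        // Scratch object: reused on every call, never handed out.
        private final Vec2 scratch = new Vec2();

        void integrate(float dt) {
            // The naive version would allocate a temporary Vec2 here on
            // every step, thousands of times per frame.
            scratch.set(velocity.x * dt, velocity.y * dt);
            position.set(position.x + scratch.x, position.y + scratch.y);
        }
    }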
I don't know about ByteBuffers, though; we never experimented with those because we didn't want to mess around with any unsafe code that might not be allowed in applets or on other platforms.
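For what it's worth, direct ByteBuffers are plain public java.nio API, no sun.misc.* involved, so as far as I know they shouldn't trip a security manager the way Unsafe does. A minimal example:

    import java.nio.ByteBuffer;

    public class DirectBufferDemo {
        public static void main(String[] args) {
            // The backing memory lives outside the Java heap, so it
            // adds nothing to GC workload.
            ByteBuffer buf = ByteBuffer.allocateDirect(1024 * 1024); // 1 MB off-heap
            buf.putFloat(0, 3.14f); // absolute put/get: no object churn
            System.out.println(buf.getFloat(0));
        }
    }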
Things like escape analysis and scalar replacement help, but not enough. The problem is that those optimizations are extremely fragile, and for a lot of real-world code the analysis required to apply them depends on knowledge of the overall design of the system (for instance, knowing that certain methods will not mutate fields of objects passed to them in certain situations).
Granted, a physics engine that's creating ~100,000 small objects per frame in some situations is an extreme edge case, and when I'm not doing that type of programming I'm much less concerned with garbage.
I believe it would do so automatically only when it can prove that the bound will not be violated. That wouldn't be very often in a program written in an imperative style.
Lest it be mistaken for a swipe at imperative style: functional languages don't have it much easier either. I think the problem is that, for the most part, array use patterns are not very amenable to this kind of analysis. So even in functional languages you have to explicitly ask the runtime or the compiler to avoid bounds checks.
If I remember correctly, there were a few languages designed to handle this case, i.e. using compile-time type checking to ensure that all the arrays are of the right shape and that none are accessed unsafely. Fish, I think it was called. Don't know what happened to it or the idea; it seemed pretty handy to have. I think the first place to check now would be SAC (Single Assignment C). But I have digressed far from the topic of this thread!
> Please don't write your own object pools, this isn't 1995.
The snark was unnecessary. It is frowned upon here.
On the contrary, for almost all types of common array code the compiler is really good at optimizing away the bounds checks (see http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.86.96... ). Things only get tricky when you have, for instance, arrays of indices into other arrays, or when the next array position to examine depends on the contents of the one being examined. Even there though (and even in array-heavy code), bounds checks don't typically take up a ton of program execution time.
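Roughly the distinction, sketched (this is my understanding of what the JIT typically manages, not something verified on every VM):

    class Loops {
        static long sum(int[] a) {
            long s = 0;
            // The JIT can prove i stays in [0, a.length), so the
            // per-access bounds check can be hoisted out of the loop.
            for (int i = 0; i < a.length; i++) {
                s += a[i];
            }
            return s;
        }

        static long gather(int[] a, int[] idx) {
            long s = 0;
            for (int i = 0; i < idx.length; i++) {
                // idx[i] could hold anything, so the check on a[] stays.
                s += a[idx[i]];
            }
            return s;
        }
    }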
For sparse matrix operations, though, that is the most common pattern of access. For linear algebraic operations on dense matrices, BLAS and LAPACK take care of most of the issues and are super fast. I think the times I do not use BLAS/LAPACK (directly or indirectly) are when I have to apply operations on matrices that are not the typical linear algebraic operations. This comes up fairly often too in my line of work, which is machine learning.
Interesting, I hadn't seen his article on LMAX. For more details (and the code) check out their google code page: http://code.google.com/p/disruptor.
The disruptor pattern itself was new to me, but the general principles (particularly the idea of running trading business logic in a single thread with all data in memory and no garbage collection) are very familiar.
It's a silly name for a Pattern. All the others (Command, Strategy, Visitor, etc.) are more-or-less meaningful. In another blog post, they say they named it after the weapon in Star Trek. They could have called it Ring, because at the end of the day it is just a ring buffer.
I also found that particular sentence really strange. "How to get C-like performance in Java? Buy more servers!" Yeah, sure, but that approach will work for any language. With async you could handle even more connections on that larger number of servers...
You can handle a much larger number of connections provided they don't do anything useful. However if you expect to do anything useful, you are likely to need some horizontal scalability once you have thousands of connections.
AdWords charges about $1 per click. Let's say your hardware budget is only $1 per connection. With 10K connections you still have enough money for 10 servers if you wanted.
"You can handle a much larger number of connections provided they don't do anything useful."
Define useful? I think you mean computationally complex, but that isn't the same thing as useful. At my last job I worked on a web crawler, and fetching a page, then persisting it in our grid, was useful work. It was useful to have more than 10k connections in that context.
One of my pet peeves is this idea of throwing servers at problems. We currently have enormous problems with power and heat dissipation in data centers. At my last job, where I did the web crawler, we couldn't put any more servers in our data center because we didn't have enough power. Doing more, with fewer servers can be very important.
I also wonder how much larger the carbon footprint of computing is because of such philosophies.
In your experience, have you ever seen or heard of a business or web site with more than 10K concurrent users who thought it was a good idea to place all those users on just one server?
Connections and users are not the same thing. Accomplishing a task with fewer resources is better, so when you say "Scalability beyond 10K users/server doesn't buy you anything", you are wrong.
Imagine if you have a giant datacenter like google, and heat and power are a serious problem. Accomplishing a task with fewer servers is a serious win.
A search I did suggests that Google has nowhere near 10K users per server. That is likely because the servers spend most of their time performing searches. The overhead of managing persistent connections is likely to be a small portion of what they do, as in most real applications.
You seem to be implying that since, in this one use case, it's not useful, then it's not useful at all. That's a non sequitur. I already gave one concrete example of where it was useful in my past experiences.
"Note: Most of these suggestions only work for standalone applications rather than applets."
Whoa! That made me double-take. Applets have been an effectively abandoned technology (by the browser makers, not necessarily Sun) for what, almost 10 years now? It's amazing this guy would feel the need to specify this.
Then again, had applets ever taken off... we might be coding client-side code in Lisp now (or your favorite JVM language :)). Crazy.
That's true. You're not encapsulated from JS, so you have to be aware of JS issues like null-undefined-false-0 and all the rest of them. So it feels like you're writing in two languages at once. On the other hand, that has some benefits too.
The article is from back in 2008, so I'd first dispute its accuracy with modern JVMs.
It's not just speed (the speed obviously changes as you change the number of connections; 100k threads isn't going to beat 100k NIO connections). It's also that it's more scalable, better code, more maintainable, less prone to bugs, and you don't have to worry as much about concurrency issues. I'd argue that the lines of code are likely to be fewer too. And memory usage is likely to be less.
For better or worse, JVMs don't change that fast. (In 2008, the latest official release was JDK6, reaching 'update 12' in December. In July 2011, until the official release of JDK7 next week, the latest official JVM is still JDK6, 'update 26'.)
Or do you know of specific NIO performance improvements between JDK6 in 2008 and JDK6u26, or in JDK7?
100k threads isn't going to beat 100k NIO connections
That may be true but I wouldn't assume it with certainty without evidence. Threading has been improving, too. Tyma's 2008 presentation mentioned a JVM limit at the time of 16000 threads, but that may have changed as well.
A single-threaded NIO implementation, using only one core, could very plausibly lose to a 100K-threaded implementation on a multicore machine. And once you split your NIO over a few threads to use cores effectively, you have most of the same concurrency worries as with connection-per-thread approaches.
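For reference, a single-threaded selector loop is not much code. Here's a bare-bones sketch (the port number and echo behaviour are just illustrative):

    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.*;
    import java.util.Iterator;

    // One thread, one Selector: every connection is multiplexed onto a
    // single core, which is exactly the trade-off discussed above.
    public class SelectorLoop {
        public static void main(String[] args) throws Exception {
            Selector selector = Selector.open();
            ServerSocketChannel server = ServerSocketChannel.open();
            server.socket().bind(new InetSocketAddress(8080));
            server.configureBlocking(false);
            server.register(selector, SelectionKey.OP_ACCEPT);
            ByteBuffer buf = ByteBuffer.allocateDirect(4096);
            while (true) {
                selector.select();
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    if (key.isAcceptable()) {
                        SocketChannel ch = server.accept();
                        ch.configureBlocking(false);
                        ch.register(selector, SelectionKey.OP_READ);
                    } else if (key.isReadable()) {
                        SocketChannel ch = (SocketChannel) key.channel();
                        buf.clear();
                        if (ch.read(buf) < 0) { ch.close(); continue; }
                        buf.flip();
                        ch.write(buf); // echo back
                    }
                }
            }
        }
    }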
>Terrible advice. Use NIO and do it properly. Creating 1 thread per connection is for unskilled programmers.
Depends on the application. If the connection is handling real-time financial data (for example), there might be enough heavy lifting on the data stream to warrant one thread per connection. A good programmer wouldn't have too many of these threads running on a given machine, however, because one thread can easily consume a full CPU core, and still need more.
For connections that aren't doing much work, it's definitely better to aggregate connections onto a much smaller number of threads to avoid context switches.
>In my experience once you get up to about 1k threads in Java you'll be wasting most of the CPU up switching between them.
I agree. Thread context switching is extremely expensive. I've seen 10 thread applications consume full CPU time with context switches alone. There are a lot of tricks that can be done to mitigate it, like binding threads to certain CPU cores, making sure that threads that are communicating with each other are on the same CPU to avoid L2/L3 cache thrashing, etc. It's a lot of work, and requires a lot of knowledge about the code and system, but the results are staggeringly good. We've seen performance boosts on the order of 25% with single threaded applications, and boosts around 30-40% for multithreaded apps. Almost all of this is done outside the application code, however.
I can't imagine any reason why you'd have 1k threads running on the same box in one process, to be honest. 1k+ threads seems to indicate a lot of threads doing very little work. If they're doing basically nothing, consuming less than 1% of the CPU each, it's better to try to consolidate the work they're doing into a much smaller number of threads. It's a lot of work, but will get pretty considerable performance gains.
> Terrible advice. Use NIO and do it properly. Creating 1 thread per connection is for unskilled programmers.
Sometimes the simple solutions are the best. Developing a complex and clever solution may make you feel more skilled. Using a dispatcher pattern often doesn't provide a real advantage, or the advantage it provides isn't needed.
I would agree with the suggestion that if you are going to use a dispatcher pattern with NIO, use a library like Netty which does this for you. (In any case, you don't write it yourself.)
I have, and if you use blocking IO or NIO it's like twenty lines of code just about anyone can read and maintain. Why make it more complicated than it needs to be?
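Literally, something like this (an echo server for illustration; error handling kept minimal):

    import java.io.*;
    import java.net.*;

    // Blocking IO, one thread per connection: the simple solution.
    public class EchoServer {
        public static void main(String[] args) throws IOException {
            ServerSocket server = new ServerSocket(8080);
            while (true) {
                final Socket s = server.accept();
                new Thread(new Runnable() {
                    public void run() {
                        try {
                            InputStream in = s.getInputStream();
                            OutputStream out = s.getOutputStream();
                            byte[] buf = new byte[4096];
                            for (int n; (n = in.read(buf)) >= 0; )
                                out.write(buf, 0, n);
                        } catch (IOException ignored) {
                        } finally {
                            try { s.close(); } catch (IOException ignored) {}
                        }
                    }
                }).start();
            }
        }
    }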
I've been working very hard to make Java use less memory using techniques like these and others more. The problem is, if you really want to specify exactly how non-trivial structured objects are packed into memory, you end up writing your own garbage collector and memory allocator, which is a rather large project all by itself. Also, I have found that cutting down on memory usage like that hurts performance a lot.
At the end of the day, going down that road makes you less productive than working in C++ and the result will still be subpar compared to a C or C++ solution. So I'm back with C++ right now even though I'm painfully aware of its weaknesses.
If you are coming back to C++ after a stint with Java, I would highly recommend D. I think it hits the sweetspot between C++ and Java. The not so nice thing about D is that its libraries are not as pervasive and mature. The Phobos and Tango split is currently being addressed. I think all of Tango's functionality will be incorporated into Phobos. With D you get a simpler C++ with optional automatic memory management, and some functional idioms and nice support for parallelism.
@down-voter: If you could explain what you found down-vote worthy that would be nice.
D looks interesting, but I'm a little concerned that the libraries may not be robust or comprehensive enough yet. It's an uphill battle for any new language.
I would say I am less concerned about how memory is packed, than how much is discarded. This is because memory is relatively cheap these days. A machine with 24 GB of memory costs around £1K.
However, you make a good point that C++ gives you better control over how memory is arranged, and you have to use hacks like the Unsafe class in Java (which is not as friendly).
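For the curious, the Unsafe hack looks something like this. Note it's HotSpot-specific (the "theUnsafe" field is an implementation detail) and exactly as unfriendly as described:

    import java.lang.reflect.Field;
    import sun.misc.Unsafe;

    public class UnsafeDemo {
        public static void main(String[] args) throws Exception {
            // Unsafe.getUnsafe() rejects normal callers, so everyone
            // reflects on the private static field instead.
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            Unsafe unsafe = (Unsafe) f.get(null);

            long addr = unsafe.allocateMemory(8); // malloc-style allocation
            unsafe.putLong(addr, 42L);            // raw write, no bounds check
            System.out.println(unsafe.getLong(addr));
            unsafe.freeMemory(addr);              // and the matching free
        }
    }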
The title is a little misleading; there are only a few techniques here, and it seems that one of the more interesting ones (the "Unsafe" class) is only available in some JVMs, one doesn't help with speed in many cases (compressed strings), and a couple of the others boil down to "use DirectByteBuffers".
For servers, memory is usually the long pole in the tent. For me it boils down to:
* Use less memory.
* Don't use a lot of threads.
* Use fewer objects.
Worst-case GC is a big issue. For just a 2 GB heap you are looking at up to 7 second GC pauses with the Java 6 collector. You can use concurrent GC but it adds significant CPU and memory overhead. Direct buffers don't live in the heap so they don't contribute to GC time.
Threads in Java are extremely heavyweight in memory, if not in CPU. A rule of thumb is about 1 MB per thread. You can make it better by decreasing the stack size and/or using 32-bit addresses.
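Both knobs look like this (the 64 KB figure is illustrative, and the JVM is free to treat the constructor argument as a hint):

    public class SmallStackThread {
        public static void main(String[] args) {
            // -Xss sets the default stack size for every thread, e.g.
            //   java -Xss256k MyServer
            // The four-argument Thread constructor requests a size for
            // one particular thread instead.
            Runnable task = new Runnable() {
                public void run() { /* per-connection work would go here */ }
            };
            Thread t = new Thread(null, task, "conn-worker", 64 * 1024);
            t.start();
        }
    }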
If you create little enough garbage and your eden size is large enough, you can avoid having any minor collections at all. I have a full GC which is triggered once per night as maintenance.
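The HotSpot flags for that strategy would be something along these lines (the sizes are illustrative, not a recommendation):

    java -Xms8g -Xmx8g -Xmn6g -XX:+UseConcMarkSweepGC MyServer

i.e. a large, fixed heap with most of it given to the young generation so eden almost never fills, and the concurrent collector picking up whatever old-generation work remains.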
This is one of the greatest efficiency advantages of Go over Java: for the same task, Go code uses only a tiny fraction of the memory that Java uses.
Most of these tips seem to be variations on the "flyweight" design pattern (http://en.wikipedia.org/wiki/Flyweight_pattern): avoid having many objects for the GC to keep checking, along with all the other overhead each individual object carries.
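A toy version of the pattern, for anyone who hasn't seen it (a hypothetical Glyph class, in the spirit of Integer.valueOf's small-int cache):

    import java.util.HashMap;
    import java.util.Map;

    // Intern a small immutable value so repeated occurrences share one
    // object instead of each carrying its own header and GC cost.
    public final class Glyph {
        private static final Map<Character, Glyph> CACHE = new HashMap<Character, Glyph>();
        public final char c;

        private Glyph(char c) { this.c = c; }

        public static Glyph of(char c) {
            Glyph g = CACHE.get(c);
            if (g == null) {
                g = new Glyph(c);
                CACHE.put(c, g);
            }
            return g;
        }
    }

    // Glyph.of('a') == Glyph.of('a') holds: the same instance every time.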
The answer to why you would still use Java is here:
"Java makes high level programming simpler and easier at the cost of making low level programming much harder. Fortunately most applications follow the rule of thumb that you spend 90% of your time in 10% of the code. This implies you are better off 90% of the time, worse off 10% of the time. ;)"
Why would you develop more than a small portion of your application in a low-level language unless every part of your application is performance critical?
That's a good link. The n-body and spectral-norm benchmarks were fastest in Fortran. The fannkuch benchmark was fastest in Ada. The Mandelbrot benchmark was fastest in ATS.
In each case the language has pointers; however, they were not used in the core part of the benchmark, i.e. they didn't get their speed advantage from pointers.
Haskell was the fastest for the thread-ring benchmark, and it doesn't have pointers.
If Java had pointers as it does in the sun.misc.Unsafe utility class some code would be faster, but I don't believe this explains why Java is not the fastest in each benchmark. ;)
Object orientated programming is used in a significant portion of all programs, it would be interesting how C/C++ performs in such benchmarks. ;)
Those measurements are for programs forced to use just one core on a quad-core machine; you may also be interested in the measurements made when the programs are allowed to use all the cores -
>>The n-body and spectral-norm benchmarks were fastest in Fortran.<<
Maybe we should wonder if the Intel C++ compiler would produce programs comparable to those from the Intel Fortran compiler ;-)
>>Object orientated programming is used in a significant portion of all programs, it would be interesting how C/C++ performs in such benchmarks.<<
Maybe you're suggesting that OO C++ programs wouldn't perform as well as Java programs? But doesn't that highlight the advantage of being able to write programs in different styles?
I agree. I am suggesting that "no pointer = no speed" is not demonstrated by the benchmarks provided. The developers who wrote the benchmarks which were faster than the C/C++ ones didn't appear to agree, i.e. when they had the option to use pointers, they didn't, and were still faster.
For most of his points you'd have to do the same things in Scala, right? How do you do arbitrary memory access in Scala? How do you use space-efficient strings in Scala? Etc., etc.
While the parent is generally wrong, it has been shown that Scala programs have a tendency to generate slightly (5-10%) less garbage. This is especially interesting on Dalvik.