
There are reasons why implementing a system in Java might be a questionable decision, but unless that system involves extremely intensive number-crunching or has hard real-time requirements, performance probably isn't one of them.


A messaging system really should be attentive to performance. If the messaging layer is itself a bottleneck, nothing performance-sensitive can be built on top of it. And by performance I mean latency constraints as well as throughput.


Read their paper before claiming that performance isn't important to Kafka:

http://research.microsoft.com/en-us/um/people/srikanth/netdb...


You're right, in principle. However, rewriting it in C++ instead of Java would maybe get you a factor-of-2 performance improvement, and that's only if it were completely CPU-bound, which isn't a reasonable assumption for a tool designed to manage large disk-resident datasets.

By "hard real-time" I was referring to latency, rather than throughput. Achieving very low latencies is difficult in Java because you don't know exactly when the GC will kick in, but nevertheless it's possible to get very high throughput.


But does Java allow easy management of memory, like .NET? In my experience building a moderately high-performance packet-processing system in F#, apart from a few specific encoding paths where the generated machine code was subpar, the biggest CPU cost seemed to be the GC. Removing a single allocation from a hot path gave a measurable improvement. Most of the gain came from being able to allocate memory directly (a 1GB managed heap on top of 16GB+ of unmanaged memory).

With Java, you don't have unmanaged memory or struct support, so doesn't that really add up? And if you go "native", isn't there significant overhead, since Java can't have pointers (right - the bytecode doesn't support them?)?

People pull it off, but it seems that GC overhead would be a killer.
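For what it's worth, Java does have a rough analog of unmanaged allocation even without structs: direct ByteBuffers. A minimal sketch (the buffer size here is an arbitrary example):

```java
import java.nio.ByteBuffer;

public class DirectBufferDemo {
    public static void main(String[] args) {
        // A direct buffer's payload lives outside the managed heap: the GC
        // tracks only the small wrapper object and never scans or copies
        // the 1 MB of data itself.
        ByteBuffer buf = ByteBuffer.allocateDirect(1 << 20);
        buf.putInt(0, 12345);
        System.out.println(buf.getInt(0)); // prints 12345
    }
}
```

You still pay JNI-style crossing costs on some access patterns, but long-lived buffers like this are a common way to keep bulk data out of the collector's way.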


Kafka uses Linux's zero-copy transfer to move bytes between the network and the disk without going through user-space, let alone the JVM.

There's still GC from objects allocated by Kafka in the JVM, but the actual message data doesn't even go through the JVM.
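The zero-copy path described above is exposed in Java as FileChannel.transferTo, which on Linux maps to sendfile(2). A sketch of the pattern (the method and class names here are my own; the target would typically be a SocketChannel):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopyDemo {
    // Stream a file's bytes to any writable channel. On Linux the kernel
    // moves the pages directly, so the data never enters JVM heap buffers.
    static void transfer(Path file, WritableByteChannel out) throws IOException {
        try (FileChannel in = FileChannel.open(file, StandardOpenOption.READ)) {
            long pos = 0;
            long size = in.size();
            while (pos < size) {
                // transferTo may move fewer bytes than requested, so loop.
                pos += in.transferTo(pos, size - pos, out);
            }
        }
    }
}
```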


The standard way to do this is to go off heap using the sun.misc.Unsafe API. Essentially you start managing your own memory, with all that entails.
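A minimal sketch of that Unsafe pattern (Unsafe is not public API, so it has to be fished out via reflection; the allocation size is an arbitrary example):

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class OffHeapDemo {
    public static void main(String[] args) throws Exception {
        // Grab the Unsafe singleton via reflection.
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        // Raw, GC-invisible allocation: you get back an address, not an object.
        long addr = unsafe.allocateMemory(8);
        unsafe.putLong(addr, 42L);
        System.out.println(unsafe.getLong(addr)); // prints 42

        // No safety net: forget this and you leak, double-free and you crash.
        unsafe.freeMemory(addr);
    }
}
```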


Once you are saturating disk and network, who cares about efficient CPU usage?


As was discussed in this thread, garbage collection can be a major problem.


It can be, but it tends not to be for Kafka. We run with the CMS collector and never see pauses long enough to care about.
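For context, a CMS setup along these lines is what Kafka's start scripts have historically supported; the flag names are real HotSpot options, but the heap sizes here are made-up examples (and note CMS has since been removed from modern JDKs):

```shell
# Illustrative Kafka broker JVM settings using CMS
export KAFKA_HEAP_OPTS="-Xms4g -Xmx4g"
export KAFKA_JVM_PERFORMANCE_OPTS="-server \
  -XX:+UseConcMarkSweepGC \
  -XX:+CMSScavengeBeforeRemark \
  -XX:+CMSClassUnloadingEnabled"
```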



