JEP draft: 64 bit object headers

ardel95 · on Oct 11, 2022

I'm glad folks are starting to look at memory efficiency of Java programs - there's just so much improvements to be had there. For a fun exercise, take a look at memory footprint of storing a million entries in a HashMap<Integer, Integer>. Between "object header" overhead, "everything is a pointer to a heap allocated object" overhead, "GC needs enough headroom to remain effective" overhead, the actual memory consumption can be as much as 10x of a comparable C, C++, or Rust implementation.

moring · on Oct 11, 2022

Others have mentioned IntMap, and in a more general way, I'd recommend treating those standard classes as a general-purpose implementation that should be replaced be specialized implementations -- to improve performance, memory footprint etc. -- WHERE NEEDED (avoid premature optimization!), and forget about it in the remaining 99% of cases where a HashMap contains less than 100 entries.

It's true that C/C++/Rust don't have this problem because they are optimized for this. OTOH if the problem you are solving is more suitable to garbage collection than manual or static-analysis based memory management then Java's high-performance GC will shine.

fulafel · on Oct 11, 2022

Object representation/memory layout optimization is an area that has been really missed in the PLT field, in no small part because of restrictions imposed by specified PL semantics. There are huge amounts of slack left on the table for nearly all languages as currently implemented. Both vs compactness of representation and placement in memory for efficiency of access. The bag of tricks that low-level programmers use to painstakingly optimize memory representations and placement should be available from compilers as automatic transforms, incorporated into profile feedback based optimizer passes, etc.

These are not just memory usage gains but also execution speed gains since memory access is frequently the main bottleneck and cpu cache effectiveness depends on how many of your objects they can hold.

Java is actually one of the few languages that does at least something in this field already before this JEP: Compressed object references (https://www.baeldung.com/jvm-compressed-oops).

whizzter · on Oct 11, 2022

It's a bit of a chicken-and-egg issue (compare to the Python GIL).

Doing these kinds of analysis and optimizations takes time (and might impact language design), most "popular" languages before Rust got traction seldomly took memory models into accounts (still in the mid/late 90s lookup tables was still faster than doing certain kinds of "primitive" calculations that CPUs/GPUs eats for breakfast now). Bakers's Linear Lisp (a kind of precursor to Rust) and some StandardML that had region allocations, but that research was still mostly about tradeoffs in memory liveness.

The "data-oriented-design" movement in games has led to ideas of allowing structure-of-arrays instead of arrays-of-structures to be used in identical ways in a couple of languages, sadly many people involved in these spheres are allergic to GC's so it's still a bit of a narrow design field whereas I'm kinda curious about where layered designs could be taken.

fulafel · on Oct 14, 2022

The Mike Acton & co C++ gamedev targeted data oriented design campaign is good and deserves credit but data representation/layout emphasis was old hat in other domains perf work before that. And since the "memory wall" has been talked about a lot much earlier, the fact that memory/cache is the bottleneck was inevitable. Thus the need to start programming around that absent hardware inventions to solve it.

Game programmers hit this on consoles due to facing platforms with relatively slow/tricky memory a bit earlier than PC devs (Cell processor[1]). But HPC programmers had by then been worryign and programming around the problem for longer, since the extinction of SSI vector supercomputers and transition to message passing clusters. (Also the data oriented design movement is only partly about this, and much about other C++ / programming / measurement etc ideas)

I was searching for older memory wall discussions and found at least this from 1994: https://www.eecs.ucf.edu/~lboloni/Teaching/EEL5708_2006/slid...

[1] in fact if you browse some of Mike Actons slides you can see his username on slideshare is "cellperformance" - eg https://www.slideshare.net/cellperformance/data-oriented-des...

TazeTSchnitzel · on Oct 11, 2022

Apple's Objective-C runtime does a lot of very interesting tricks to make more efficient use of memory. Despite being fundamental types in the language, NSString, NSNumber, NSDictionary and so on are only interacted with via dynamically-linked dynamic dispatch, which gives the runtime a lot of freedom to change their representations without affecting the ABI. For example, contents of strings are sometimes stuffed into tagged pointers rather than being heap-allocated.

exabrial · on Oct 11, 2022

I would use the Apache Collections int map and move on though… I see the Map specifically oriented towards objects, not basic types

winrid · on Oct 11, 2022

This is very true and the suck of Java today, but fyi you can improve this exact scenario by using a special IntMap etc, which there are many libraries for (LibGDX has some, and Caffeine has off-heap maps).

JanecekPetr · on Oct 11, 2022

Not Caffeine, that's the caching library. You probably meant one of FastUtil, Eclipse Collections, Koloboke, HPPC(-RT), they all provide primitive specializations of Maps and Collections. For off-heap maps, now that's more interesting! I know about Chronicle Map, MapDB, https://github.com/snazy/ohc and https://github.com/RuedigerMoeller/fast-serialization/wiki/O....

NovaX · on Oct 11, 2022

Another to add to your collection, https://github.com/yahoo/Oak

NovaX · on Oct 11, 2022

Caffeine doesn't offer off-heap maps or primitive collections (many other great libraries do). It does use a data sketch for compact frequency storage and optimizes against the cpu cache line to avoid unnecessary memory accesses. There are many dimensions to being hardware efficient.

winrid · on Oct 11, 2022

oh, you're right. Weird, wonder why I thought it was off heap. No wonder that service has higher memory usage than I was expecting. LOL

winrid · on Oct 14, 2022

coming back to this - amazing what you can accomplish with String.intern() and switching to Shenandoah GC. Memory usage is about 2.2gb now for 800mb of cache entries: http://media-server.fastcomments.com:8881/stats

About 500mb of that is netty SSL session cache...

lenkite · on Oct 11, 2022

Solved by libraries such as Eclipse collections: https://www.baeldung.com/java-eclipse-primitive-collections which has memory-optimized lists, sets, stacks, maps, and bags for all the primitive types

Someone · on Oct 11, 2022

I wouldn’t call that solved, but improved. Programmers will have to do extra thinking and work to get those benefits, code reviews may spend hours discussing whether making the switch is worthwhile.

And that’s not one-time work. In the ideal world, if the size of your data grows (or shrinks), you’ll have to revisit those choices.

Also, as a library author, you can’t know what choice to make.

Yes, you’ll always have that problem, but this change would make ‘just use the provided collections’ a much safer choice.

kjeetgill · on Oct 11, 2022

I disagree. In general once you have a primitive specialized library kicking around in your path, you can pretty much just always use it for anything with primitive keys no questions asked. You don't have to over-optimize by deciding which one each time.

Usually you just pick fastuil or kobloke and then use it every time.

Edit: as comments below corroborate.

rockwotj · on Oct 11, 2022

Also a lot of types like this fall back to boxing when accessed generically, which has its own performance issues

marginalia_nu · on Oct 11, 2022

java.util.collections.* isn't very optimized in that regard.

Both GNU Trove and fastutil offers primitive collections that are still Java, but much better memory optimized.

lichtenberger · on Oct 11, 2022

Fastutil is amazing. Specialized collection types for almost everything.

kwertzzz · on Oct 11, 2022

I would genuinely interested to see how much bytes a HashMap<Integer, Integer> would have in Java (or other languages) for a million entries. If I am not mistaken in julia, this would be 151 MB in julia or 88% more than just storing the keys and values.

     julia> k = Int32.(1:1000_0000);  v = rand(Int32,1000_0000); 
     julia> dict = Dict(zip(k,v));
     julia> Base.summarysize(dict) / (2 * 4 * 1000_0000)
     1.8874391
     julia> Base.summarysize(dict)
     150995128

sundarurfriend · on Oct 11, 2022

1000_0000 is 10 million, you want 1000_000. And k is better initialized as `Int32(1):Int32(1000_000)`, which is significantly faster (though it doesn't matter for the measurement here).

Checking with a million values, on my Julia v1.8.0, the dict takes up 18 MB, or about 130% more than the plain values (or an array of Pairs) would.

kwertzzz · on Oct 11, 2022

Thank you for correcting this! (I get the same numbers as you for a million elements).

adgjlsfhk1 · on Oct 11, 2022

note that the overhead will vary a ton based on the number of elements you use. I believe Julia's dicts currently resize to be 4x current size (to avoid hash collisions), so you should see anywhere from 30% extra to 300% extra depending on how many elements you have. There has been some effort recently to move to a more SwissDict-like approach which should reduce the memory overhead.

chrisseaton · on Oct 11, 2022

> HashMap<Integer, Integer>

I feel like that's a crazily-specific worst-case. Could have made it HashMap<Integer, Optional<Integer>> to really push the boat out.

grishka · on Oct 11, 2022

There's a reason Android offers SparseArray as a more efficient way of mapping integers to objects. Though Android is a whole different beast with its ART and register-based bytecode.

origin_path · on Oct 11, 2022

ART and Dalvik bytecode doesn't change the Java memory model in any way, though.

Thiez · on Oct 11, 2022

Just using C# and the nice generic `Dictionary<int, int>` will fix this. The memory overhead is really Java's fault for sticking with type erasure.

kjeetgill · on Oct 11, 2022

I don't want to rehash this here, but this is actually a less obvious choice than it can seem like at first blush! Erasure and primitive specialization have a very deep set of considerations and tradeoffs that touch a ton of language features. There's a ton written about it, value-types generally, monomorphization, etc.

Thiez · on Oct 11, 2022

What a nice and condescending reply. Do you believe that anyone who knows about these considerations and tradeoffs would automatically conclude that type erasure is superior, so someone who disagrees with it is obviously uninformed?

Since you didn't actually give any actual argument apart from "I disagree and I am smart" I don't think we have anything concrete to argue about, so let's stick to my original point: would you agree that in C# it is much easier to do "the right thing" when putting primitives in a datastructure such as a list, map, or set, because C# has value types and does not use type erasure for its containers, thereby preventing both a lot of overhead and the need for specialized container types for various (combinations of) primitives?

kjeetgill · on Oct 12, 2022

Hey, sorry if it came across as condescending, I really didn't mean it to be. All I wanted to say was I don't think the choice is obvious and leave a few googleable buzzwords for anyone interested to go read up more on it.

I don't think I took a stance pro/anti erasure other than to say I think it's way more complicated a topic than it might seem (maybe not to you, but likly to others!) and it's worth reading up on. AND I didn't want to dig up old HN comments/posts to confirm what I'd read about it, so I was necessarily vague.

Edit 1: For instance, I vaguely recall that Scala head trouble porting to C# because of erasure related stuff. But I'm not genius enough to remember the details so I didn't get into it.

Edit 2: So I dug one up: https://cr.openjdk.java.net/~briangoetz/valhalla/erasure.htm... And posted here https://news.ycombinator.com/item?id=33171832

Edit 3: Other good conversations on it: https://news.ycombinator.com/item?id=18679684 https://news.ycombinator.com/item?id=13051594

Again, my only stance is that it's a surprisingly deep topic — Not making any claims!

adgjlsfhk1 · on Oct 11, 2022

And this isn't even getting into special cases where it can be worse. There are lots of places where a better language can use an 8 bit integer, while if you want to use anything vaguely generic in Java, you have an 8 byte header (plus another 8 bytes for a pointer to it).

excsn · on Oct 11, 2022

Ah, Roman is the guy from RedHat who worked on Shenandoah GC. He has an excellent background on this and will be really glad to see when this is stable.

cowsandmilk · on Oct 11, 2022

Appears the owner of this JEP just started as a Principal Engineer at Amazon. Curious if that's the source of some of the research in the motivation and what the savings would be just for AWS and Retail operations.

dastbe · on Oct 11, 2022

if you look at the associated github he’s been working on this for a year, probably longer.

kichik · on Oct 11, 2022

How come a whole pointer is required for class pointer? Can't it be a small offset into a class table? Do many Java applications often require millions of classes?

dexwiz · on Oct 11, 2022

Definitely have seen a few enterprise stacks that reach this size. It’s not common, but it happens.

paulirwin · on Oct 11, 2022

I think I'm understanding correctly, but please correct me if I'm wrong... the class pointer is a pointer to the type (class) the object implements. So that would mean to exceed even an unsigned 32-bit pointer size, there would need to be more than 4 billion types in the program to exceed that, not objects. I can't imagine there are (m)any programs that hit that. And it definitely doesn't need to be a native 64-bit pointer. As pointed out in the proposal: "It may be possible to compress class pointers to less than 32 bits, at the expense of smaller addressable class space."

maxfurman · on Oct 11, 2022

As other comments have mentioned, in the JVM every lambda creates a new type. So there are way, way more "types" created in Java than in a comparable program in another language.

symaxian · on Oct 11, 2022

It does seem like some additional compression could be performed, if they wanted to keep it a pointer to avoid a table lookup I wonder if they could do something like ensure that all class definitions must exist in the certain contiguous chunk of memory that has a fixed size of 1-4 gigabytes which would let them shrink the pointer size down.

layer8 · on Oct 11, 2022

Application servers can run an arbitrary number of applications in parallel within the same JVM. Also, every lambda expression currently creates a new class. The interface-proxy mechanism also creates new classes at runtime. Classes are not expected to be a limited resource in JVM-land.

exabrial · on Oct 11, 2022

It’s uncommon for millions in an average program unless you’ve got a special use case. Typical usage would be a couple thousand.

remram · on Oct 11, 2022

You mean add a level of indirection to every class access?

Also there is already compression for that pointer, making it 32 instead of 64 bits.

saagarjha · on Oct 11, 2022

Turning a pointer loading into an addition and then a pointer load is not that much overhead.

remram · on Oct 11, 2022

That's how the current compression works, I guess I don't understand what you propose.

saagarjha · on Oct 11, 2022

I guess I am confused as to what the extra indirection is.

chrisseaton · on Oct 11, 2022

The current compression scheme doesn’t involve reading any memory. Indirection through a lookup table would.

But who cares? 99% of the time you just need class identity, not values from it.

saagarjha · on Oct 11, 2022

If you need identity you can just compare indices instead of pointers, no?

chrisseaton · on Oct 11, 2022

What's what I just said?

> 99% of the time you just need class identity, not values from it.

So no need to deference the class pointer, or the table index.

saagarjha · on Oct 11, 2022

Ok taking a step back I just realized you aren't 'remram :P But anyways my point was that in the old scheme if you wanted to know where the class was you would do some arithmetic on the compressed pointer to get a full pointer and now you can do a comparison, or if you want to read something do a dereference. Assuming the new layout is an array of classes you have a similar scheme where you have an index and you can compare those directly, and if you want data in the class you deref.

twic · on Oct 11, 2022

Also, i believe that 'classes' aren't fixed size, because they contain the class's vtable [1]. As such, you couldn't keep them in an indexable array anyway.

[1] in HotSpot they are instances of InstanceKlass: https://github.com/openjdk/jdk/blob/master/src/hotspot/share...

chrisseaton · on Oct 11, 2022

I think they mean keep references to them in an indexable array.

remram · on Oct 11, 2022

That is the extra indirection I was talking about.

chrisseaton · on Oct 11, 2022

> in the old scheme if you wanted to know where the class was you would do some arithmetic on the compressed pointer to get a full pointer and now you can do a comparison

Well also you didn't actually need to decompress the pointer first - compressed pointers are unique as well of course.

saagarjha · on Oct 11, 2022

Ah, yes, true.

twic · on Oct 11, 2022

> When the object is locked by a stack-lock or monitor, or if the object is forwarded by the GC (all of which is indicated by the lowest two bits of the header), the upper 62 bits of the first word of the header are interpreted as a pointer to the stack-lock, object monitor or the forwarded location of the object.

The term "stack-lock" is new to me. What's that? Is that just when the lock record is stored on the stack? In which case, what is being locked by a monitor, because i would expect the lock record to always be on the stack. Is this about thin vs inflated lock states?

maxfurman · on Oct 11, 2022

Why are there so many unused bits in the header in the first place? 12% heap size reduction across the board sounds really nice but I feel like some historical context on the original design is missing.

nayuki · on Oct 11, 2022

Probably for alignment. You want the class pointer (whether 32- or 64-bit) to be easy to dereference with little or no preprocessing.

It looks like the HotSpot JVM implementers decided to use 5 bits for lock and GC, and 31 bits for identity hash code (not sure why they didn't use 32). That adds up to 36 bits. You'd want to round this up to a multiple of 32 bits so that both the beginning of the object and the class pointer are aligned to 32 or 64 bits.

origin_path · on Oct 11, 2022

If you look at the tricks needed to shrink it to 64 bits then you'll see why this is hard. Consider that this object header has to contain:

- GC state bits.

- The class metadata, i.e. pointer to a Class<> object.

- Lock data, including possibly a pointer to a lock.

- Possibly, an identity hash code.

- Possibly, a GC forwarding pointer.

Several of these are pointers! If you're on a 64 bit machine, just one* of them implemented naively if immediately 64 bits. In other words Java object headers provide a lot of stuff.

To pack the whole thing down to just 64 bits requires a lot of clever implementation tricks, exploiting what exact combinations can appear, etc. Also note that this overhead isn't Java specific. In C++ you have malloc headers and then the C++ vtable pointer, in many cases. So just running "new SomeObj()" in C++ will immediately blow you past the heap overhead of Java after this JEP is merged, and if you need to add a lock and/or reference count to catch up with the Java feature set then you're at multiples of it.

GrumpySloth · on Oct 11, 2022

1. C++ doesn’t use vtables for classes which do not have virtual methods (in practice most of them). And vtables are shared between objects of the same type.

2. “Malloc headers”? Malloc usually stores metadata in freed data.

3. C++ developers don’t call “new” for every single object, the way Java devs do. Idiomatic code defaults to allocating fixed-size objects inline (on stack or in a pre-existing bigger allocation), not in a separate allocation.

4. In C++ there are other ways to allocate dynamic memory than “new”. Arena allocators, directly calling mmap/VirtualAlloc/whatever makes sense in a given case.

origin_path · on Oct 11, 2022

Yes, I know C++, I didn't say that's the only way to allocate objects in C++, just that if you do write a heap allocated object that has virtual methods (a pretty common thing to want to do), then the metadata for that will be larger than 64 bits. Your malloc storing metadata non-adjacent to the allocation doesn't change the fact that there is some, and it has to be stored.

twic · on Oct 11, 2022

The header traditionally contains:

1. a pointer to the class

2. the object's lock

3. some GC bits

4. the identity hash code

In a naive implementation, the class pointer is a machine word, the lock is multiple machine words, the GC bits are a few bits, which get rounded up to a word for alignment, and the hash is 32 bits.

Object headers used to be even bigger than 128 bits!

There have been improvements: compress the class pointer, use a lazily-inflated lock which takes up only one word in the header, make the identity hash smaller, and make the lock and the identity hash share space (when the lock is deflated, the identity hash lives in the shared slot; when the lock is inflated, the slot has a pointer to a lock structure which also contains the identity hash).

The change in this JEP is effectively to extend the inflation-dependent space sharing scheme to include the class pointer too. When the lock is deflated, the header contains the class pointer and the identity hash; when it is inflated, it is a pointer to the lock structure, which contains the class pointer and identity hash.

I suppose this means that virtual method calls through a locked object will involve an additional indirection - but this should be rather easy for a compiler to optimise, because it can safely hoist the loading of the class pointer.

chrisseaton · on Oct 11, 2022

What would you have put in them instead? If you can't think of anything... then probably nobody could think of anything either.

akokanka · on Oct 11, 2022

Wonder what the original author intended unused space for.. Also wonder if moving both hash and bit set will change performance characteristics.

thayne · on Oct 11, 2022

> Wonder what the original author intended unused space for

My guess is it is padding so that the class pointer is on an alignment boundary.

exabrial · on Oct 11, 2022

Is this part of the tck or jvm spec? I’m surprised such an implementation detail isn’t left up the vendors if so.

But in any case this will be yet another awesome perf improvement bringing jvm closer to cpp speed.

kjeetgill · on Oct 11, 2022

These enhancement proposals are often OpenJDK/hotspot specific. I do believe this left to vendors.

pjmlp · on Oct 11, 2022

It is, other JVM implementations tend to also have their own extensions, for example J9 had AOT for years, long before OpenJDK started to play with it.

It also had support for a special kind of value types (packed objects) with are seen as intrisics by its JIT.

https://www.ibm.com/docs/en/sdk-java-technology/7.1?topic=po...

I guess this was eventually dropped due to Valhala efforts.