Referencing Substrings Faster, without Leaking Memory (twistedoakstudios.com)
76 points by Strilanc on Dec 3, 2013 | 30 comments



The original behavior, where a substring pins the full original string, is also how Go's substring slicing works. I much prefer this behavior, as it lets you profile and add a copy yourself if you find yourself pinning large strings. The Java change to do a copy for every substring seems like it harms the general case to help out poorly reasoned code.

While the tracking of intervals with a tree is certainly impressive, it is a lot of complexity to again (in my opinion) support poorly written code. If you know you are going to only use 10 bytes of a 1 MB incoming HTTP request, make a copy.
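Under the old sharing behaviour that fix really is a one-liner at the call site. A minimal sketch (the 1 MB string here just stands in for a large request body):

    String body = new String(new char[1_000_000]); // stand-in for a ~1 MB request body
    String token = body.substring(0, 10);          // old behaviour: token pins all of body
    token = new String(token);                     // explicit trimmed copy; body can now be GC'd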


It previously wasn't clear that this is how substring operated. In fact, the Javadoc didn't even mention it. "new string" is rather misleading here.

http://docs.oracle.com/javase/6/docs/api/java/lang/String.ht...

Not documenting it in the Javadoc kind of makes sense, since whether substring returns a view onto the original char array or a copy is an implementation-specific detail. However, it is certainly surprising that what is apparently a 2-character string could be taking up gigabytes of RAM.

It's the kind of behaviour it would be useful to express and choose via the API, e.g. make people write .substring(n).view() or .substring(n).copy() or similar.


"new string" is rather misleading here.

But the string _is_ new; there was no object that was Object.Equals to it before the call. That new string just never forgets where it came from.

e.g. make people write .substring(n).view() or .substring(n).copy() or similar.

That's what people have been doing for years: .substring(n) versus new String(substring(n)) (and, as a third option: .substring(n).intern())
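Roughly, the three idioms under the old implementation (the large string here is just a stand-in):

    String s = new String(new char[1 << 20]);   // stand-in for a large input string
    String view = s.substring(10);              // old behaviour: shares s's backing char[]
    String copy = new String(s.substring(10));  // the classic idiom: forces a trimmed copy
    String pooled = s.substring(10).intern();   // canonical pooled instance (whether it trims is VM-dependent)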

I find it weird that this change was made because, as a side effect, they almost have to start optimizing "new String(s)" calls. With this change, the only thing that call can be useful for is creating a String that is equals() to another String but not identical (==) to it. I think that aspect of the call is rarely, if ever, used. If one can prove that it is never used, the next step on this path would be to make the call a no-op. If one cannot prove it, it would still likely be a good speedup, but it would also be extremely risky.
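Concretely, the only observable difference left (ignoring performance) is identity:

    String a = "hello";
    String b = new String(a);        // new object with the same contents
    System.out.println(a.equals(b)); // true:  same character sequence
    System.out.println(a == b);      // false: distinct identity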


I think the best answer is to take the size of the parent and child strings into account. If the parent is 1M and the child is just a few bytes then make a copy so the parent can be freed up. Otherwise use the old behavior.


Yeah, I don't understand why the JDK dev did not take this approach.

Even better, they could add a new API that allows developers to specify the implementation that fits best. Something like String.substr(..., boolean allocateSpace) that defaults to the original behavior of substr.


That'd be even worse. Now I have yet another variable to deal with when tracking down performance issues in my code. One that would be non-obvious to anyone not terribly familiar with how things are working behind the scenes.

Better would be to introduce methods that explicitly expose which kind of substring behavior is being used.


But you wouldn't have to track down performance issues in your code if it never was slow ;)


While you are not wrong that this is all to accommodate badly written code, I don't think a programmer should have to adapt their code for something as simple as substring() to work around a flaw in the language implementation (a memory leak).


It's not a flaw: nowhere is it specified that the substring doesn't retain a reference to the original string, and in any case the garbage collector doesn't make many guarantees in the first place.

The new behaviour is fine. But it's not a bugfix, and it causes nasty problems of its own; it probably causes more problems than it fixes, simply because it's a change in behaviour.

In order to write efficient string-processing algorithms in Java, you need to make assumptions about which operations are how fast, since the docs habitually don't specify efficiency. Making an operation this much slower - and it's really a huge difference - is pretty risky.

Also, note that memory use in the new model is also likely to be higher, and can even be catastrophically higher (e.g. a recursive descent parser might retain references to all suffixes it builds during descent, causing quadratic memory usage).
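A minimal illustration of that worst case, collecting every suffix of an n-character input:

    import java.util.ArrayList;
    import java.util.List;

    String input = new String(new char[10_000]);   // n = 10,000 characters
    List<String> suffixes = new ArrayList<>();
    for (int i = 0; i < input.length(); i++) {
        suffixes.add(input.substring(i));          // old model: n small views over one shared array
    }                                              // new model: ~n^2/2 chars of copies (~100 MB here)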

I'm sure there are good reasons for the change, but the way it was released is really shoddy: hiding it away in a minor (ideally just bugfix+security) release without a good warning well in advance means they didn't consider this to be a breaking change, and that's ridiculously naive for a project as mature as the JVM.


Why not just do the copying conditionally?

If it's at least half the size of the parent, reference the parent. Otherwise just make a copy...
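A sketch of what that heuristic could look like inside the old implementation; sharedView and copyOfRange are hypothetical stand-ins for the view-returning and copying code paths, and count is the pre-change length field:

    // Hypothetical helpers: sharedView(...) = old view path, copyOfRange(...) = new copying path.
    public String substring(int begin, int end) {
        int len = end - begin;
        if (len * 2 >= count) {
            return sharedView(begin, end);   // child is at least half the parent: share its char[]
        }
        return copyOfRange(begin, end);      // small slice of a big parent: copy so the parent can be freed
    }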


Part of the reason for the change was likely to allow better optimizations of strings. The new strings are basically equivalent to arrays and might be optimized as such; even without that, they certainly don't need an offset or length field anymore, since the underlying array provides the length.

If you conditionally used one implementation or the other, you'd lose the optimization opportunity. You could alternatively make String non-final, but then method resolution is more expensive, and that would be a pretty big change in any case.

I think Java made the right choice here, since it's trivial to wrap String in another structure whenever you need a fast substring; it's just that that custom string won't be passable to lots of other APIs, which makes this a potentially nasty surprise for existing codebases.
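A minimal sketch of such a wrapper: it implements CharSequence, so it can at least flow into APIs that accept one, but not into anything that insists on java.lang.String:

    // A cheap substring "view". Note it deliberately pins the base string.
    final class StringView implements CharSequence {
        private final String base;
        private final int offset, length;

        StringView(String base, int offset, int length) {
            this.base = base;
            this.offset = offset;
            this.length = length;
        }

        public int length() { return length; }

        public char charAt(int i) { return base.charAt(offset + i); }

        public CharSequence subSequence(int start, int end) {
            return new StringView(base, offset + start, end - start);
        }

        public String toString() { return base.substring(offset, offset + length); }
    }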


Couldn't the substring copying be deferred until the parent string is ready to be GC'd? In the common case of a short-lived substring, no string copies are necessary. Only if the substring outlives the parent string would a copy be necessary.


In that case, it would be difficult for the programmer to reason about the memory usage of their use of substring().

This change lets programmers stop worrying about the implementation of substring() and the resulting memory leaks, in return for substring() becoming linear in the length of the result.


How would it be difficult to reason? You know you're never off by more than a factor of 2 (or whatever factor is used), just like with an ArrayList.


This change is going to destroy the performance of a lot of apps that have not been modified in years.... There are lots of programs that might substring every word in a large block of text...

Letting a programmer use too much memory, and then have to learn why and when to copy, is much better than drastically changing the runtime characteristics of many existing programs. Not copying is way more efficient; you just need to be in the know.

There are even solutions that can do both, like conditional copying, or keeping track of the parent and doing a copy if needed at GC time.


I doubt it'll be that bad. This new behavior is how .NET has done it all along, and out of all the mountains of code I've written that loop over the words in a block of text I've found exactly one situation where benchmarks showed it was worth worrying about. Not for lack of looking, either.

I'm inclined to think that Java's old way of doing it is an optimization that largely became obsolete when the HotSpot VM came out over a decade ago. The nature of generational garbage collection means that the way this method ends up using memory in practice is rather different from what one might expect at first glance. Combine that with modern CPU architecture and it might even end up being faster because of better cache-friendliness.


Substringing every word in a block of text is actually the ideal situation for the change here...


I think they're referring to what Java's doing:

>Last month, Java changed how it computes substrings.

... [tl;dr: previously always kept whole string, substring just stored indexes]

Java now uses a different method: always copying the substring.


The author of the change in Java discusses it in more detail here (from the article): http://www.reddit.com/r/programming/comments/1qw73v/til_orac...


Alternative: on substring, reference the original string through a SoftReference registered with a global ReferenceQueue. Have one global system thread poll that reference queue and convert substrings to hard copies (requiring a reverse map from the parent to its children via WeakReference) when the parent is about to be collected.

Problem is, this is still reasonably complicated (though much less so than the article's approach). A lot of memory churn and possible OOMEs will occur at the time of GC instead of at the time of substring allocation. That makes the overall system less predictable and less debuggable, as will any deferred substring reallocation.


The JVM has a garbage collector, so if we are talking about modifying the JVM itself, we wouldn't need anything that works online: you could determine which parts are referenced in a simpler way during each GC.


The JVM garbage collector is not capable of tracking intervals nested inside other objects like arrays. At least, not that I'm aware of.


This is all very interesting, especially to see the end result, but as a rule I don't leak memory or use garbage collection.

Again I am left feeling that garbage collection causes more problems than it solves. Certainly every instance of debugging a memory 'leak' in that context is much harder than it needs to be (don't use gc or refcounting), and it's also a slap in the face considering that this is the very problem gc is meant to solve.

If you care this much about performance, your biggest bottleneck is the gc: not just in terms of performance cost, but in the amount of your time you will waste solving a problem like this instead of optimising your code.

This is a problem I don't need to solve, and pretty much everything I write is high-performance code... While I know this isn't targeted optimisation but a fun experiment, I would worry that people will read some kind of best practice here where there really isn't one.


It's not the gc that causes this problem. The problem is: you allocated a block of, say, 100KB. Then you needed two 10KB subsets of that block. While you're using those, you may stop needing the rest of the block. How do you create those subsets? Do you just point to the original memory? Then you can't free() the 80KB you don't need anymore. Do you copy those 20KB? Then you're potentially using more memory than necessary.

The problem holds even if you're not using a gc or refcounting at all.
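In Java terms, the same choice shows up with plain buffers, no strings involved (sizes mirror the numbers above):

    import java.nio.ByteBuffer;
    import java.util.Arrays;

    byte[] block = new byte[100 * 1024];                            // the original 100 KB allocation
    ByteBuffer view = ByteBuffer.wrap(block, 0, 10 * 1024).slice(); // shares block: keeps all 100 KB live
    byte[] copy = Arrays.copyOfRange(block, 0, 10 * 1024);          // 10 KB private copy: block can be freed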


This proposed new method seems way overcomplicated to me.


I took it to be more of a thought exercise than a serious change proposal.


The author appears to have rediscovered nested containment lists: http://www.ncbi.nlm.nih.gov/m/pubmed/17234640/


I haven't read the entire paper, but just based on the abstract it sounds like nested containment lists are for querying what's within a range.

In other words, they're just like interval trees in that they tell you where things are. Like I mentioned in the post, we want to solve the opposite problem: where things aren't.


Which only proves him right when he says

> I also dislike trying to find existing implementations of trees because, for some odd reason, trees tend to be named by abbreviations that only make sense in hindsight


I'd consider "nested containment list" a pretty good name, actually. Of course then they abbreviate it to NCList...



