Please grow your buffers exponentially (blog.mozilla.org)
231 points by AndrewDucker on Nov 4, 2014 | 156 comments



From the comments, the linked SO question is probably a better treatment of the title than the article: http://stackoverflow.com/questions/1100311/what-is-the-ideal...

I would have guessed by now it is common knowledge that you should grow your buffers exponentially. The only question being by how much?

For general usage it seems 1.5 is usually used, but another factor could be better if you have different constraints for reallocation and memory fragmentation, or value them differently.
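
For anyone who hasn't seen it spelled out, here is a minimal sketch in C of what "grow by a factor" means in practice (the names are made up and error handling is kept short):

    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        char  *data;
        size_t len;
        size_t cap;
    } buf_t;

    /* Ensure room for `need` more bytes, growing capacity by ~1.5x. */
    static int buf_reserve(buf_t *b, size_t need)
    {
        if (b->len + need <= b->cap)
            return 0;
        size_t newcap = b->cap ? b->cap : 16;   /* small initial capacity */
        while (newcap < b->len + need)
            newcap += newcap / 2;               /* 1.5x growth: n + n/2 */
        char *p = realloc(b->data, newcap);
        if (!p)
            return -1;
        b->data = p;
        b->cap  = newcap;
        return 0;
    }

    static int buf_append(buf_t *b, const char *src, size_t n)
    {
        if (buf_reserve(b, n) != 0)
            return -1;
        memcpy(b->data + b->len, src, n);
        b->len += n;
        return 0;
    }

The whole point is just that the growth step is proportional to the current capacity, not a fixed increment.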


Agreed. I thought everyone knew by now to do this. I'm somewhat surprised Mozilla decided to write a blog post on it. I guess it's always news to someone.


Well, it's not Mozilla writing the blog post, it's a guy who works on memory-related issues for Mozilla, blogging.


I think that we need smart realloc. Something that takes a buffer's history to determine the proper increase ratio.


I prefer an expressive API to a smart one. Which is to say that I wouldn’t mind specifying the expected “history” up front when allocating.

I’ve been thinking of writing a library for “roles”, which would let you describe how a value is expected to change over time, and thereby encode some of the intent about what a variable or bit of code is for, with the happy side effect of helping the runtime system select fast paths.

For example, the “accumulator” role would be for values that are expected to have a property that grows monotonically, like the size of a vector or the value of a counter. For a vector, you might additionally encode whether its growth should be linear or exponential, what its minimum and maximum sizes should be, and various growth factors.

Violating a role would not be an error, though it might be slow, and in debug mode you might get a diagnostic message to help you adjust the roles in your program.

Some role-like features are already present in languages, libraries, and hardware, such as “restrict” to encode the expectation of non-aliased buffers, “__builtin_expect” to encode likelihood, and “__builtin_prefetch” to encode the intent to access a value soon.
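
To make the idea concrete, here is a sketch of what the declarations might look like in C (entirely hypothetical; none of this is a real API):

    /* Hypothetical "role" hints -- not a real library. The declarations are
     * advisory: violating them is legal, just possibly slow. */
    enum growth { GROWTH_LINEAR, GROWTH_EXPONENTIAL };

    struct role_accumulator {
        enum growth growth;     /* expected growth pattern           */
        size_t      min_size;   /* preallocate at least this much    */
        size_t      max_size;   /* diagnostic if exceeded, in debug  */
        double      factor;     /* growth factor, e.g. 1.5 or 2.0    */
    };

    /* A vector that carries its role, so the runtime can pick a fast path. */
    struct vec {
        void  *data;
        size_t len, cap;
        struct role_accumulator role;
    };

    /* Usage sketch: declare intent up front instead of hoping the
     * allocator guesses it from the reallocation history. */
    struct vec tokens = {
        .role = { GROWTH_EXPONENTIAL, .min_size = 4096,
                  .max_size = 1 << 20, .factor = 1.5 }
    };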


Sounds like a type system to me.


Ideally you just preallocate what you're going to need.

There are circumstances where you genuinely need a growable buffer, but often you can determine either the exact size or a reasonable maximum.


Yep, I think this is the correct answer for the general case. With structures close to or larger than a page and on 64 bit systems you can go nuts with overallocation because the virtual address space is much larger than the amount of physical memory.


It's quite possible to go overboard even on a 64-bit system.

For example, I've seen people allocating 4GB buffers for everything, just because that made it impossible to have buffer overflows as long as they indexed with 32-bit integers. But on x86-64 you can only fit 65k such buffers in the address space, and it's not that hard to exhaust that.


I guess my definition of going nuts was a bit more conservative. Both x86-64 and ARM's AArch64 currently only use 48 bits for virtual addressing.

Still, even if you design for computers with 256GB (2^38) physical memory, you still have a virtual address space that's 2^9 times larger (assuming addresses with the MSB bit set are reserved for the kernel's memory space). This is opposed to 32 bit systems where the physical memory space is close to or larger than the virtual address space. E.g. high end smartphones sold in the last few years have 2GB RAM and only 3GB of virtual address space.


Probably easy enough to implement, but it'll take someone smarter than me to explain why it probably won't be an improvement.


Heap allocators already are ridiculously smart. But there's just too many divergent use cases for allocators.

I wrote a simple string heap allocator (grab a 16MB block and hand out 256-byte chunks) and the resulting code was 85x* faster than jemalloc ( http://pastebin.com/dAqa4dbN ) [* I know the way I am benchmarking is shit and probably way off from real world usage, it's hard to do benchmarking ideally though.]

But of course, mine wasn't thread safe, couldn't handle larger blocks (fallthrough to malloc), sucked at fragmentation, and couldn't detect double-frees.
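
For the curious, a stripped-down version of that kind of fixed-chunk pool looks roughly like this (a sketch, not the benchmarked code; single-threaded, no fallthrough to malloc for big requests):

    #include <stdlib.h>

    #define POOL_SIZE  (16u * 1024 * 1024)   /* one 16MB slab         */
    #define CHUNK_SIZE 256u                  /* fixed 256-byte chunks */

    static char  *pool;        /* the slab                          */
    static size_t pool_used;   /* bump offset into the slab         */
    static void  *free_list;   /* chunks returned by pool_free()    */

    static void *pool_alloc(void)
    {
        if (free_list) {                     /* reuse a freed chunk first */
            void *p = free_list;
            free_list = *(void **)p;         /* next pointer lives in the chunk */
            return p;
        }
        if (!pool)
            pool = malloc(POOL_SIZE);
        if (!pool || pool_used + CHUNK_SIZE > POOL_SIZE)
            return NULL;                     /* slab exhausted */
        void *p = pool + pool_used;
        pool_used += CHUNK_SIZE;
        return p;
    }

    static void pool_free(void *p)
    {
        *(void **)p = free_list;             /* push onto the free list */
        free_list = p;
    }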

Similarly, I found a massive speed boost for strings by using my own hand-rolled memcpy [[ while(l--) *t++ = *s++; ]] ... undoubtedly because all the trickery in setting up SSE and handling lengths not evenly divisible by the register sizes was more expensive than just dumbly transferring the data, when most strings are very small in size. Though I am sure memcpy would destroy me if I wanted to copy 2GB of data in one go.

All of the extra logic to be smart will help you when it actually works, but will make your worst case substantially more painful.

Let's say you had to create 100 objects. Would you rather create all 100 objects in 10ns each, or 95 objects in 1ns each, and the remaining 5 objects in 1000ns each? What if you were writing a particularly weird app that ended up needing 1000ns for every alloc it did?

One immediate concern I'd have with a smart exponential allocator would be when you're memory constrained and your application is a web server or SQL database.


Back when we used 32-bit machines, the exponential container would kill us: we couldn't use all of the memory without running out of RAM. When you're at 1GB of space, do you really want to double that on insertion? Yeah, yeah, demand paging; it doesn't work in practice. My instinct is that you want a doubling allocator with a max, some sigmoid that responds to free space and allocation rates.

For FF, they might be better off just making a couple memory pools per page and bump allocating. Throw the pool away on exit and be able to migrate a long running tab into a dynamically allocated pool.


I suspect a large portion of your performance improvement was just from code inlining. Your simple allocator was probably inlined, whereas malloc requires a function call (as the code lives in a shared library).


"but will make your worst case substantially more painful."

Not necessarily (see, for instance, your memcpy SSE example), but it is bound to make some case more painful.


It will absolutely be worse for small buffers. These kinds of optimizations make sense only above a certain size/lifespan.


For small allocations, memory allocators often cache power-of-two-sized memory blocks, so a 1.5x growth factor will likely be rounded up to the next bucket size and have unused space (internal fragmentation).


That, or don't make your buffers contiguous unless you absolutely have to.

I did one of those classic "100X" speedups once where the buffer was growing by a set value on every expansion. Could have gotten 10X or better improvement with exponential growth, but did even better by having an interface that didn't guarantee address-based access from one element to another, so all I needed was a tracking structure for multiple smaller buffers.

Don't use an array unless you need an array's cross-element addressing leakage.


It continues to baffle me that most standard libraries don't have collections which work like this. They seem like they'd be a much better general-purpose, worst-case-avoiding choice than either arrays or linked lists.

It's not like they're difficult. I wrote one in Java not long after 1.2 came out!


They don't because it's easy enough to make one using what's available (a linked list or tree of arrays), and then you get to decide all the little things that matter with composite structures like this: the size of each array, whether to resize an existing "segment" or to add a new one, how you'd like to access the data in them, etc.
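
A bare-bones version of such a segment list, with fixed-size segments so the index math stays O(1), might look like this (a sketch; real implementations tune the segment size and the access API):

    #include <stdlib.h>

    #define SEG_SIZE 4096          /* elements per segment */

    typedef struct {
        int   **segs;              /* small table of pointers to segments */
        size_t  nsegs;
        size_t  len;               /* total number of elements            */
    } seglist_t;

    static int seglist_push(seglist_t *s, int value)
    {
        if (s->len == s->nsegs * SEG_SIZE) {           /* need a new segment */
            int **p = realloc(s->segs, (s->nsegs + 1) * sizeof *p);
            if (!p) return -1;
            s->segs = p;
            s->segs[s->nsegs] = malloc(SEG_SIZE * sizeof(int));
            if (!s->segs[s->nsegs]) return -1;
            s->nsegs++;
        }
        s->segs[s->len / SEG_SIZE][s->len % SEG_SIZE] = value;
        s->len++;
        return 0;
    }

    static int seglist_get(const seglist_t *s, size_t i)
    {
        return s->segs[i / SEG_SIZE][i % SEG_SIZE];    /* elements never move */
    }

Only the small table of segment pointers ever gets reallocated; the elements themselves stay put, which is exactly the property the parent comments are after.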


I thought Firefox was mostly written in C++? Pretty surprised that raw realloc()s are widespread enough to warrant a note like this.

I agree that it's better to know how much space you need up front - even if it turns a one-pass algorithm into a two-pass one.

One of my happiest moments as a programmer came when reading a co-worker's code to render 2d parametric functions smoothly in screen space. It recursively divides down the parameter interval until it's small enough, then draws line segments. The code is written iteratively and manually manages a stack to simulate recursion. The stack is a fixed-size static C array instead of a vector. "Aren't you worried about overflow?", I asked. But he had computed the number of halvings it takes to get from a FLT_MAX-length interval down to an FLT_MIN-length interval, and made the array big enough to hold that many steps. Goodbye malloc(), hello one not-really-that-big static array that stays warm in the cache.
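
You can sanity-check that bound in a couple of lines; halving an interval of length FLT_MAX until it is shorter than FLT_MIN takes only a couple hundred steps (a quick back-of-the-envelope check, not his code):

    #include <float.h>
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* Number of halvings to shrink FLT_MAX down to FLT_MIN. */
        double steps = log2((double)FLT_MAX / (double)FLT_MIN);
        printf("about %.0f halvings\n", steps);   /* roughly 254 */
        return 0;
    }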


>I thought Firefox was mostly written in C++? Pretty surprised that raw realloc()s are widespread enough to warrant a note like this.

It is, but STL usage is very limited, because of all the headaches involved in linking against all the different STL implementations in all the different platforms. mozilla-central has its own containers.


The C++ variant most used in the real world is 'C with (some) classes'.


That, or people stay put in a framework such as Qt. I'd consider myself an experienced Qt developer. C++? Not so much.


I understand, but writing your own dynamic resizing array class is not hard.


> Pretty surprised that raw realloc()s are widespread enough to warrant a note like this.

Firefox is a large, old codebase. After working on it for a while you stop being surprised by almost anything :)


I wonder if a std::vector would have similar performance, if you preallocated the memory on creation but still used standard library functions for code readability?


This approach is still actually suboptimal from a spatial efficiency standpoint, wasting half or a third of memory (depending on your chosen factor) whenever you push that key Nth element onto the back.

The real question is, why do people use completely contiguous memory so much when they really don't need to? For I/O we have nice APIs like readv()/writev() which deal with segmented buffers just fine... and you really don't need humongous contiguous vectors to get all of the benefits out of your CPU cache...

This paper details another approach that has many of the same benefits if you don't need completely contiguous storage, and only wastes O(sqrt(N)) memory at a time, resulting in much smoother allocation while still maintaining O(1) append operations.

https://cs.uwaterloo.ca/research/tr/1999/09/CS-99-09.pdf


Not using exponential growth in buffers / vectors / other contiguously allocated collections is a rookie mistake. It shouldn't be made by anyone who's ever studied algorithm complexity, never mind studied CS at any level.

About the only exception is when you have global knowledge about how a buffer is going to be used. But that's rare in modern modular software.


What a remarkably misguided rant.

To each his own. What works for a general-purpose app on desktop hardware doesn't work for a resource-constrained firmware (that still has to deal with potentially unbounded inputs). Yes, there's exponential reallocation, but there's also a multitude of other strategies, each best fit for its particular application domain. "Rookie mistake".


You're making a pedantic special case correction to a generally correct truth. This is not an endearing trait.

Everybody knows that a statement such as the one the parent post made will have special case exceptions: The point is that exponential growth should be the default & any other choice requires justification.

Asserting this is not a "misguided rant".


> Everybody knows that a statement such as the one the parent post made will have special case exceptions

Thank you for reading with a spirit of intellectual generosity.

When I write posts like this I always wonder about how many qualifications and weasel words I should add just in case it makes Hacker News and the nitpickers come out in force. In this case I wrote things like "A strategy that is usually better is exponential growth" but still many of the comments are polite variations on "exponential isn't best in all circumstances, you idiot".


Instead of weasel words, [RFC 2119] should be sufficient (https://www.ietf.org/rfc/rfc2119.txt)

(Google also gave me http://tools.ietf.org/html/rfc6919, which I did not know about yet. The third month of the year often sees remarkable productivity, culminating in superb output in the beginning of the fourth month)


If you know you have special case constraints, you presumably will know enough to deal with them properly.

In my experience the far more frequent mistake is developers on "normal" systems that seems to think that system calls in general are "free" and are far too cavalier about doing things like small reads, or allocating small buffers.

As a more general advice, rather than growing buffers exponentially:

Developers should in general treat kernel space as a remote system in terms of performance, and remember that every system call is slow. You can typically afford to do quite a lot of extra bookkeeping in user space to cut the number of system calls and still come out on top (e.g. user space buffering of reads).

And learn to love strace/dtrace/systemtap/whatever mechanism to trace and profile system calls.

Paying attention to this can have a dramatic impact on performance for very little effort.


I've been burned far more often by O(n^2) algorithms being used inappropriately than by slightly too high memory usage. It usually comes up when you give a piece of software an order of magnitude more data than it's used to, and discover it becomes unusable.

Resource constrained firmware is a special case, because the package is tested as a unit, rather than individual software modules which may themselves be used in very different use cases. The default algorithm should still usually be exponential growth, with tuning back where necessary to meet memory usage goals.


I studied algorithmic complexity and software engineering. I don't think there was ever a mention of buffers or how to allocate them. I do remember there was a brief mention of dynamically allocated arrays and that the usual way to do those was to double up their space when needed. Anyways, there are tons of programmers working in high level languages and I wouldn't call them rookies for not being C programmers.


You should have learned that the exponential growth is the precise reason why appending to dynamic growth arrays is O(1) – a very surprising and unintuitive result in my opinion. Not knowing about exponential growth, I would have sworn that that must be impossible. If this didn’t leave a strong impression on you, I would say that whoever taught you algorithmic complexity simply failed in this point.


The O(1) complexity is achieved as an amortized result, taking the average time per append, which is exactly what we are allowed to assume for asymptotic analysis. However, I think we should be sensitive to the fact that real computers are machines in the real world that can run into circumstances differing greatly from the mathematical result.

For example, if you allocate a slightly too small array and then append just slightly too many items, you will trigger reallocation every time. Ideally nobody ever writes code that bad and we get to assume that the O(1) behavior is always true. In practice we need to know how the actual algorithm works and make sure we don't accidentally code perverse cases.


Nope, that is not correct. Insert is always amortized O(1), and that is a worst-case bound. There is no “bad case”. You only need to be aware that a single insert might take longer. But that is not relevant in an absolute majority of cases.

“Append to it just slightly too many items” – this “too many” makes sure that you appended enough items so that the expansion amortizes to O(1).

That’s the nice thing about a complete mathematical proof – it doesn’t leave any dodgy edge-cases.


Yes, sure.

You should know that in the real world basing everything on asymptotic behavior does not work very well. That's why we profile things, because reality is way more complex than CS classes.


Designing for asymptotic behaviour, provided the constant factor is acceptable, ought to be your default. This is because software tends to need to work with greater volumes of data over time, on machines with more memory, and more and faster CPUs. If your algorithms grow with greater than linear complexity as a function of input, your software will tend to get worse over time.


If you studied algorithmic complexity, you also know that doubling (or more) the size of a fixed array when you need more space is about the only way to ensure amortized constant inserts using contiguous arrays (ignoring the cost of allocating the memory). From there you ought to deduce that growing by any fixed amount is going to blow up, in terms of big-O.


A buffer is a dynamically allocated array. The next step in CS+SE education should be teaching students how widely the lessons learned from theory can be applied in practice.


I've personally tuned a lot of software that ran better with a constant growth factor than an exponential one.

The problem with exponential growth is that, if you're doing a thing that is (kn)+1 bytes, where k is the growth factor and n is a decently large number, you end up with k(kn) bytes, which leaves k^2 * n - kn - 1 bytes useless. Depending on the value of k that can be a big chunk.

Particularly in situations where you're memory constrained and creating long-running processes, adding some-large-percentage of your buffer in empty overhead just wastes resources. It's better to spend the extra allocations in the first minute, and then have it run for a day, than have the first minute be faster and have it take 36 hours because it can't fit the working set into memory.


In most cases the empty overhead won't be part of the working set, though.


But the system still has to commit for the amount of allocated memory, which may deny memory allocations to the rest of the system.


If you have arrays/buffers that take up a significant chunk of a 64 bit address space, you might make a 2nd pass to tune those. Otherwise, "just" make plenty of swap space for the idle memory chunks.

And yes, "swap space" is going to be a bit more constrained on a 32 bit mobile app. CPU cache is the real memory, and DRAM is the new swap :-)


Man, you really want people to know how dumb they are if they don't know this already.


Not even thinking about the buffers being used to actually store your data is more like it.


Please do grow your buffers exponentially: multiplying the size of your array by c at each reallocation yields amortized O(1) operations for any c > 1.

But please don't use c=2: c=1.5 is a much better constant.

The reason is that when you grow the array one element at a time, after i reallocations your memory is like this:

    <previous allocations>|c^i| |c^(i+1)|
                           old    new
Since the previous allocations are freed already, you would like to place the new allocation in the freed space. How much is that? Well that's 1 + c + c^2 + ... + c^(i-1), that is (c^i - 1)/(c - 1). For c=2 that's 2^i - 1, so the new allocation can never fit. It turns out that if c is smaller than the golden ratio (1.618...) then eventually new allocations will fit in the freed area. It is a nice exercise to see how the golden ratio shows up.

c=1.5 is a generally used constant also because it's easy to compute as n + n/2.
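
If you want to see it without the algebra, here is a toy simulation of the idealized model above (freed blocks coalesce and nothing else touches the heap; not how a real allocator behaves):

    #include <stdio.h>

    static void simulate(double c)
    {
        double freed = 0.0;        /* total size of previously freed blocks    */
        double old   = 1.0;        /* current block: c^i after i reallocations */
        for (int i = 0; i <= 48; i++) {
            double new_block = old * c;            /* the next allocation */
            if (freed >= new_block) {
                printf("c=%.2f: new block fits into freed space after %d reallocations\n",
                       c, i);
                return;
            }
            freed += old;          /* the old block is freed...              */
            old    = new_block;    /* ...and we now hold the bigger one      */
        }
        printf("c=%.2f: never fits (the freed total always trails the new block)\n", c);
    }

    int main(void)
    {
        simulate(1.5);   /* fits after a handful of reallocations   */
        simulate(2.0);   /* never fits: freed total is 2^i - 1      */
        return 0;
    }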


I don't understand; why hang on to previously freed array space to reuse for new iterations of the array? Isn't the whole point of freeing array space to allow arbitrary other data to use it as needed?


Take a look here: https://en.wikipedia.org/wiki/Memory_management#DYNAMIC.

No one is really hanging on to previously freed space (except for the allocator, or if you're using a custom object pool allocation scheme) specifically for new versions of the array.

But if your allocations look like this:

  array = malloc(<capacity>);
  // do stuff with array
  free(array);
  ...
  array = malloc(<new capacity>);
  // do stuff with array
  free(array);
with no other allocations in between then it is possible that the allocator might reuse previously freed array space.


I guess let me put my question another way: what's the advantage of being able to reuse previously freed array space for new iterations of the array, vs. having that space used for something else and using some other space for new iterations of the array?

It seems to me that, when reallocating, one ought to be able to say to the memory allocator "I want you to reallocate this array to a new size of [whatever]; go find a contiguous block of size [whatever], possibly overlapping the current array but not overlapping any other allocated memory, and copy/move the contents over there appropriately". (I believe that's what the "realloc" function does, no?). And, in that context, I can't see why any of the golden ratio stuff matters (though, of course, exponential resizing is still useful as noted).


In embedded apps I use a 'bucket heap' where freed memory goes on a list by size e.g. 64, 128, 256 etc. So allocations come from their bucket only; freed to their bucket. Never re-used.

Why use it? Because each bucket is separately managed; no unexpected garbage collection; hardly any collisions. And at runtime you quickly reach your working-set and fill buckets with enough pieces. Its amazingly fast, simple and has very predictable latency. Which matter in embedded environments.
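
A sketch of that kind of bucketed free list, assuming sizes are rounded up to a handful of power-of-two classes (not the poster's actual code; the caller is assumed to pass sizes that fit a bucket):

    #include <stdlib.h>

    #define NBUCKETS 6                 /* 64, 128, 256, 512, 1024, 2048 */
    static void *buckets[NBUCKETS];    /* one free list per size class  */

    static int bucket_index(size_t n)  /* smallest class that fits n    */
    {
        size_t sz = 64;
        for (int i = 0; i < NBUCKETS; i++, sz *= 2)
            if (n <= sz)
                return i;
        return -1;                     /* too big for the bucket heap   */
    }

    static void *bucket_alloc(size_t n)
    {
        int i = bucket_index(n);
        if (i < 0)
            return NULL;
        if (buckets[i]) {                       /* reuse from this class only */
            void *p = buckets[i];
            buckets[i] = *(void **)p;
            return p;
        }
        return malloc((size_t)64 << i);         /* grow the working set */
    }

    static void bucket_free(void *p, size_t n)
    {
        int i = bucket_index(n);                /* assumed >= 0 here    */
        *(void **)p = buckets[i];               /* back to its own bucket */
        buckets[i] = p;
    }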


O(1), or O(log N)? Small difference but I was just having this argument yesterday. Each reallocation takes O(1) time (we assume), and you need O(log N) reallocations to grow by a cumulative factor of N.


By "amortized O(1)", what's meant is essentially O(1) _per array element_; in other words, O(N) overall to reach a size of N. Note that this is when considering each reallocation to take time O(current length of array) [as one has to copy over all the current elements to the new space].


Ah yeah, I was double wrong. I see now that N insertions require a maximum of 2N-3 copy operations (with a growth factor of 2), so O(1) amortized. Thanks!


The argument in this article is mainly based on a naive realloc() that does allocate-copy-free, but on a system with virtual memory the memory allocator should be able to reallocate by modifying page table entries, which is far faster than copying the data around. In the (highly improbable) unfortunate case that there are no contiguous VAs on every reallocation, this method still has to move a quadratic number of PTEs, but for the given example of growing to 1M, assuming 4K pages, that's only 256 PTEs in total - and 1 + 2 + 3 + ... + 256 is 32896. Assuming each PTE is 4 bytes, that gives 131584 total bytes moved, in the pathologically worst case.

> And if you’re lucky the OS’s virtual memory system will do some magic with page tables to make the copying cheap. But still, it’s a lot of churn.

There is no actual copying of data, and as the numbers show, a little over 128K for zero-copy reallocations via VM is still 16x less than the 2M of a doubling, copying buffer.

Thus, a better strategy could be allocate-copy-free with doubling size for small buffers, and resizing in page-sized increments allowing realloc() to do zero-copy via VM for large buffers.
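
On Linux specifically you can get at this behaviour directly with mmap()/mremap() for big buffers; here is a sketch of a page-granular, possibly zero-copy grow (Linux-only, and not something realloc() is guaranteed to do for you):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <unistd.h>
    #include <stdio.h>

    int main(void)
    {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        size_t old_size = 256 * page;
        size_t new_size = 512 * page;

        void *buf = mmap(NULL, old_size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return 1;

        /* Grow the mapping; the kernel shuffles page-table entries, not data.
         * MREMAP_MAYMOVE lets it relocate the virtual range if needed. */
        void *bigger = mremap(buf, old_size, new_size, MREMAP_MAYMOVE);
        if (bigger == MAP_FAILED)
            return 1;

        printf("grew %zu -> %zu bytes, %s\n", old_size, new_size,
               bigger == buf ? "in place" : "moved (still no data copy)");
        munmap(bigger, new_size);
        return 0;
    }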

Related links:

http://stackoverflow.com/questions/16765389/is-it-true-that-...

http://blog.httrack.com/blog/2014/04/05/a-story-of-realloc-a... (discussed previously at https://news.ycombinator.com/item?id=7541004 )


This advice is not nearly as good as it may first seem. C++'s std::vector generally uses array-doubling for growth. A few years ago I had some code to create a graph for an automaton, backed by C++'s std::vector. Growing to a few million vertices from 0 would take minutes (on a machine with only 3GB of RAM). When I added a routine to guess the ultimate size of the graph and preallocate a large enough array, the time to construct the complete graph fell to a few seconds.

Array-doubling works fine on small–medium arrays, but once you're in the MB you really need to find a way to guess the right size, or switch to a better data structure, like a linked list or a rope, and rely on smart pointers, etc. As noted in the comments, a smart realloc() can also avoid copying data.


I think you found the bottleneck you were looking for and not the one that actually existed. 31 doublings gets you from 1 byte to an out of memory exception on your 3GB machine. I doubt that a pure alloc and copy 30 times took several minutes. You would have only moved 4GB total, worst case. Either your vector implementation was not using a doubling strategy or something else was at fault.

What was the structure of the object you were putting into the vector? Did you have an expensive copy constructor? Perhaps some deep copying was happening (possibly chasing lots of pointers)?


It was a small value type. Allocation never failed. Thanks for playing.

"Several" in this case may be "two", can't quite remember that detail very well, and we're talking about a white plastic MacBook with the first-gen Core2 Duo with a crappy 5400rpm drive. I suspect some fragmentation-induced paging. But you are right that I didn't get all dtrace on it; I chose a good guess for std::vector::reserve() and that was that.

Jon


> Did you have an expensive copy constructor?

This would be my bet. Copy constructors are one of the elements of C++ that are notoriously difficult to get right where the automatically implemented one doesn't cut it.


My experience with C++ was that the standard library data structures were pretty slow. Are you sure using push_back 3M times wasn't the problem? If you are doubling the buffer size every time, you only allocate 2x as much memory as you need overall. Nothing that should turn seconds into minutes.


It depends on the problem domain. Can you guess ahead of time how many child nodes that div will have, how many tabs the user will open, or how many functions this javascript file will create?

The classes in Firefox that he's talking about all start at some size larger than zero, either by default or by specifying a size when you construct them. For instance, if you know the string you're working with can live on the stack (it won't have to stick around in the DOM, or be sent to another thread, etc), you can use a string class that starts out as ~32 bytes on the stack and only allocates on the heap if the string grows larger than that. This saves an immense amount of allocation.
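
The general shape of that trick (inline storage with a heap fallback) is easy to sketch; this is just the pattern, not Mozilla's actual string classes:

    #include <stdlib.h>
    #include <string.h>

    #define INLINE_CAP 32

    typedef struct {
        char  *ptr;                 /* points at inline_buf or at heap memory */
        size_t len, cap;
        char   inline_buf[INLINE_CAP];
    } small_str;

    static void ss_init(small_str *s)
    {
        s->ptr = s->inline_buf;     /* no heap allocation for short strings */
        s->len = 0;
        s->cap = INLINE_CAP;
    }

    static int ss_append(small_str *s, const char *src, size_t n)
    {
        if (s->len + n > s->cap) {                  /* spill to the heap */
            size_t newcap = s->cap;
            while (newcap < s->len + n)
                newcap *= 2;
            char *p = malloc(newcap);
            if (!p) return -1;
            memcpy(p, s->ptr, s->len);
            if (s->ptr != s->inline_buf)
                free(s->ptr);
            s->ptr = p;
            s->cap = newcap;
        }
        memcpy(s->ptr + s->len, src, n);
        s->len += n;
        return 0;
    }

One gotcha with this layout is that the struct can't be moved around with a plain memcpy while ptr points into its own inline_buf.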


Preallocating beat reallocating? You don't say!

The context of this post is when you are implementing a growable abstraction like a string or vector. When you are the library, you have no reasonable way of guessing how big it will ultimately grow. At the application layer, of course you have better options available.


I was writing a library. I gave the user the option to provide me with the initial preallocation size, and a good method for guessing it.

When libraries deal with lots of things, it's generally important to expose special handling for dealing with lots of things.

Jon


copying the data was probably the bottleneck, not malloc.


A related puzzle - a robot is on an infinite line going both directions. Somewhere on this line there is a pebble which the robot has to find, which is only visible when robot crosses over it. The problem is that the robot doesn't know which way the pebble is.

Program the robot so that it finds the pebble, optimizing for the distance traveled as a function of the initial distance between the robot and the pebble. The robot can tell the direction (say east from west) and measure the distance traveled.


What's the statistical distribution for the location of the pebble? (Without that, the answer is "it depends"/undefined.)

Note that a uniform distribution isn't possible over an infinite line.


Agreed on the impossibility of the uniform distribution. Partially disagree on undefined: a function from the initial distance to the travel distance can be defined regardless of the distribution of the initial distance.


Go with any scale invariant distribution.


Great puzzle!

My strategy would be to move to positions 1, -2, +4, -8, 16, etc, then we get there in n + 2*(2^0 + 2^1 + ... 2^r) where 2^r < n.

(We always get there having moved n distance from the origin as our last step; otherwise we travel back to the origin spending distances 2, 4, 8, 16, etc. The sum (1, 2, ... 2^r) = 2^(r+1)-1; we double that again to get ~2^(r+2).)

An example of the worst case, we lie on a number of the form -(2^r + 1) we might round-trip to +16 before heading down to -9. In that case we'd travel 2,4,8,16,32 + 9 = 71 steps.

Best case, say we lie on a number of the form 2^2r, e.g. 64.

We travel 1, -2, +4, ... , -32, 64.

We've travelled 127 steps.

Overall this method travels approximately between 2n and 8n.

I'm not sure how this compares to 1, -1, 2, -2, 4, -4; That has a better worst case.


> Best case, say we lie on a number of the form 2^2r, e.g. 64.

> We travel 1, -2, +4, ... , -32, 64.

> We've travelled 127 steps.

I think it's 190 = 64+2*(2^0+2^1+2^2+2^3+2^4+2^5) = 64+2+4+8+16+32+64

> Overall this takes method travels approximately between 2n and 8n.

I think asymptotically the range is [3n,9n]
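
If anyone wants to check those constants, the strategy is easy to simulate; here is a quick sketch of the doubling spiral (turning points +1, -2, +4, -8, ...) that reports total distance traveled versus |pebble|:

    #include <stdio.h>
    #include <stdlib.h>

    /* Total distance traveled to find a pebble at position `target`
     * using the turning points +1, -2, +4, -8, ... */
    static long long search(long long target)
    {
        long long pos = 0, traveled = 0, turn = 1;
        for (int i = 0; ; i++, turn *= 2) {
            long long dest = (i % 2 == 0) ? turn : -turn;
            /* Does the pebble lie on the segment from pos to dest? */
            if ((target >= pos && target <= dest) ||
                (target <= pos && target >= dest))
                return traveled + llabs(target - pos);
            traveled += llabs(dest - pos);
            pos = dest;
        }
    }

    int main(void)
    {
        long long samples[] = { 64, -9, 65, -65, 1000000 };
        for (int i = 0; i < 5; i++) {
            long long n = samples[i], d = search(n);
            printf("pebble at %lld: traveled %lld (%.2f x |n|)\n",
                   n, d, (double)d / (double)llabs(n));
        }
        return 0;
    }

For 64 it reports 190 and for -9 it reports 71, matching the worked examples above; positions just past a turning point land near the 9n end.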


That's bad general advice in my opinion; instead, encourage programmers to reserve the expected capacity beforehand, or at least tweak growth attributes for the specific 'growth pattern'. At least in games (the only area where I have experience), dynamic containers will often grow very quickly to, and remain at, a fairly predictable size for a long time (several seconds to minutes). Reserving capacity beforehand to the expected size will always beat 'smart' growth strategies. If a lib can't predict the size on its own because it doesn't have enough information, provide API functions to configure it from the outside. As a fallback I use a 1.5x growth factor of the previous capacity, clamped against a per-container-configurable min and max growth value.


Back when I was writing parser/tokenizer stuff for a dev tool shop called "Morada" back in the early 90s, it was a bit hard to predict how many lines (parsed items) the input files would have, unlike a game with a constrained "arena". So, this double-upon-realloc was exactly the approach I used in some library calls I made up for dynamic array / "index" / list functions I put together in C. So, I would say that the article is very GOOD advice for many, if not all, usages.

. . .

To me, it seemed an obvious implementation of the "List ADT" stuff that I learned in uni in the 80s (even if it was in Pascal back in the day), but my bosses were minicomputer guys from the 70s, so they wanted to be sure I didn't "steal" the code from somewhere. (and this is pretty much the approach taken in the Java runtime for Vector & ArrayList)

It was my first job coding C on a regular basis. I think my employers may have confused my initial unfamiliarity with where to put my "asterisks and ampersands" (C gobble-de-gook syntax) with some kind of general ignorance :-)

(to their credit, they knew I wasn't stupid - I had done a sub-contract job for them in another language)


hints about how to decrease a dynamic structure at this page's bottom : http://c2.com/cgi/wiki?DoubleAfterFull #hysteresis


Doubling is actually pretty memory efficient in real-world scenarios, e.g. in libxml, using 2n buffers wasted less memory than an additional 10 bytes on the end of each.

https://mail.gnome.org/archives/xml/2012-May/msg00009.html


In one of our structures here we used to use exponential growth, which worked pretty well.

We improved it by taking some real-world measurements of the structure's use and using that to pre-allocate an estimate. If it still ran out we would use the old exponential growth. It's been some years since we made the change, so I no longer have the figures but I do remember the heuristic yielding orders of magnitude performance improvements.

As for the exponential growth factors, we arrived at those by gathering stats on real-world scenarios - with the value being the mode of all the scenarios.


I just addressed this the other day, when a pull request wanted to use exponential growth in the buffer. I told them not to. The difference is that this software is going to run on an embedded system for which I wrote the kernel and the realloc function. My realloc expands the block instead of moving data when possible, and I know that that's a likely scenario during normal operation. On top of that, there's only 32K of memory to go around, so wasting space on exponential growth is a bad idea.

Always be willing to question the standard doctrine.


There's only 32K of memory to go around, and you're doing dynamic memory allocation? It's been 20 years or so since I last touched such a limited memory system - and everyone I know who did switched to static allocations after being bitten by the unpredictability of dynamic allocations.


Going on 4 years and it's working out well so far.


This is great until you start vending an API to a higher-level client that makes several arbitrary mutable copies of your buffer object. There are still 32-bit machines, and it is not unlikely for a contiguous 4MB region of heap to be unavailable in a long-lived app. (One option to help is to ensure that the naive object duplication code always compacts the new copy. But malloc can (and will) fail.)


If you are handing out (long lived) references to your elements, then yes, you cannot relocate them. In that case, you need an array of pointers, rather than an array of inline values. In this case, your memory use will become even more fragmented, though.

Either way, make sure you have swap space, and you should be all right, especially if the swap is usually just holding idle slop. 2 GB (2^31 bytes) is still a pretty big address space for many purposes. It will hold about 500 4 MB buffers (512, less overhead).


P.S. - I never had to deal with 8-bit machines, but spent plenty of time on 16-bit machines. 32 bits is a big address space, despite Microsoft having pissed all of it (and a CPU core) away just to load the OS, leaving little-to-nothing for actual work.


Browsers need large contiguous areas all the time for things like image buffers and JS heaps. I doubt there are many apps that try to tolerate badly fragmented and cramped 32-bit VM spaces, at least while using malloc.


This is actually a pretty old idea:

http://en.wikipedia.org/wiki/Buddy_memory_allocation

The primary benefit of it is that it makes space reclamation more efficient.

(At least I think so. Might not be true with modern garbage collectors.)


There is no substitute to profiling and testing your code thoroughly under real world conditions (and worse, stress conditions).

Having this in mind you can have your rules of thumb, but trying to sell a problem like this as solved with a simple, universal silver bullet is not a good idea IMO.


Sure, but when writing a generic container implementation -- such as nsTArray -- you have to pick a single strategy. And generic containers generally use exponential growth. And in this case it made a clear improvement on memory usage in the general case, as the AWSY result showed.

Someone will probably now ask why Firefox doesn't use standard containers like std::vector. Several reasons: (a) having our own container lets us know that the implementation is the same on every platform; (b) having our own container gives us more control and allows us to add non-standard features like functions that measure memory usage; (c) Firefox doesn't use C++ exceptions (for the most part) but the standard containers do.


Yep but your post is about "fixing" all possible allocation strategies to exponential growth: "please grow your buffers exponentially", "if you know of some more code in Firefox that uses a non-exponential growth strategy for a buffer, please fix it, or let me know so I can look at it".

That sounds as if very little effort was put into the allocation strategies in the first place, since you're willing to override whatever thought was put into deciding an allocation strategy and fix it with exponential growth (that, admittedly, is often a good heuristic) or just have other people do it.

It's perfectly plausible that in other circumstances a different approach is better. That said, in both patches mentioned it makes sense (I think the minimum for XDR encoding buffers should be at least quadrupled from its current 8KiB if it's true that on startup it already gets bigger than 500KiB). One thing about exponential growth with rates as high as 2x each time is that picking a reasonably big initial size (even a slight overshoot) based on the expected buffer size is the conservative thing to do, because if you let the allocator do the growing it's often going to overshoot a lot more. If your buffers have, say, a normal distribution of maximum sizes over their lifetime, it's wise to preallocate to something like the 90th percentile of the expected size and then, instead of growing by 2x, perhaps grow by 1.5x. Something worth testing and tweaking because it makes a real difference.

Sorry if this sounded negative, it wasn't meant to.


Chances are that this scheme will be used to implement your language's dynamic array (or the standard library for that). So unless many people tend to use different dynamic array implementations, or fiddle with their regrowth-factor, then it is indeed being used as a "silver bullet".


That is one thing, and a different thing is recommending that people go through all their allocations ever and just make them all grow exponentially at a 2x rate without much more consideration of what was expected in each case.


If someone is not critically reasoning about what is needed in their specific case, exponential growth is going to give far better results on average than constant growth.

If someone is critically reasoning about their allocations, then presumably they'll be able to pick up the (fairly rare) cases where exponential growth is sufficiently detrimental to matter.


I wouldn't immediately assume that someone working in my organisation put zero thought into his allocations.


What is the rationale behind this? My guess is that when a buffer reaches its limit it's expected that it will need to grow by another factor of its current size, while a fixed-size rate does not incorporate previous or current growth needs.


This is a problem that occurs in lots of places, not just memory allocation. To pull an example I'm a little bit familiar with, check out CVE-2012-3444 from Django a couple years ago.

The tl;dr there is that determining the dimensions of an uploaded image file was done by reading 1024 bytes at a time until enough information had been read to figure the dimensions. Which works on a lot of common stuff, but falls over hard on other files, since there are plenty of cases where you have to read the whole file to get the dimensions. At 1024 bytes per read operation, that can be a lot of operations, enough that it was a denial-of-service vector.

The solution was exactly what the linked blog post advocated: we switched from reading a fixed size in bytes on each operation to doubling the size each time. The exponential growth of successive reads means that the pathological case (having to read the whole file) becomes far less pathological.

IIRC Python's dictionary implementation does a similar trick, doubling in size every time it needs to grow. Turning lots of small read or write or realloc (or whatever) operations into fewer but larger-each-time operations is almost always correct, in my experience, and when not done from the start you'll almost always end up noticing sooner or later when you wonder why you have degraded performance.
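
The pattern itself is language-agnostic; here is a generic sketch in C of the same doubling-read idea (just an illustration of the pattern, not the Django patch):

    #include <stdio.h>
    #include <stdlib.h>

    /* Read a whole stream, doubling the chunk size on each read so that a
     * pathological input costs O(log n) read calls instead of O(n). */
    static char *slurp(FILE *f, size_t *out_len)
    {
        size_t cap = 0, len = 0, chunk = 1024;
        char  *buf = NULL;

        for (;;) {
            if (len + chunk > cap) {
                char *p = realloc(buf, len + chunk);
                if (!p) { free(buf); return NULL; }
                buf = p;
                cap = len + chunk;
            }
            size_t got = fread(buf + len, 1, chunk, f);
            len += got;
            if (got < chunk)            /* EOF or error: stop        */
                break;
            chunk *= 2;                 /* next read is twice as big */
        }
        *out_len = len;
        return buf;
    }

In the image-dimension case you would also break out of the loop as soon as the parser has enough data, which is where the win over fixed 1024-byte reads comes from.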


You're absolutely right, and too few realise this.

In particular people doing IO with small buffers drives me crazy. People unfortunately don't seem to realise how expensive context switches are, and how brutal they are on throughput.

I've seen this so many places. MySQL's client library used to be full of 4-byte reads (reading a length field, and then doing a usually-larger-but-still-small read of the following data). I believe it's fixed, but I don't know when. I also remember with horror how t1lib - a reference library Adobe released for reading Type 1 fonts ages ago (90's) - spent 90%+ of its time on the combination of malloc(4) and read( ... ,4) - for tables of a size known at the outset (basically some small table with one entry per glyph, that stored a pointer to a 4 byte struct instead of storing it inline).

Currently I'm hacking on SDL_vnc now and again, and it's full of 1-16 byte reads (seems to make sense at first glance: After all the VNC protocol packets are of a size that depends on values of different fields; but for high throughput networks or local connections it makes the small read/writes totally dominate overall throughput even when the protocol overhead is a tiny percentage of the bitmap data being pushed)

Basically pretty much anywhere you want to read less than 4K-16K, possibly more these days, it's better to do buffering in your app and do non-blocking reads, so you can read as large a block as possible at a time...

But the general problem is not paying attention to the number of system calls. People not paying attention to stat()/fstat()/lstat() etc. is another common one (common culprit: Apache - if you use the typical default options for a directory, Apache is forced to stat its way up the directory tree; it's easy to fix, but most people don't seem to be aware how much it affects performance)


Let's say you start with a small array, then add N elements, one by one. When you run out of space in the underlying array, you allocate a new array, copy all the elements over, then free the old array.

If your new array is always a constant size bigger than the old one, then the total amount of work you do is O(n^2). If your new array is a constant factor bigger than the old one, the total amount of work you do is O(n).

This is ignoring memory allocator costs, which can change things a bit.
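
If you want numbers, here is a toy count of element copies under both strategies (ignoring allocator costs, per the caveat above):

    #include <stdio.h>

    /* Count how many element copies it takes to append n elements one by one. */
    static long long copies(long long n, int exponential)
    {
        long long cap = 16, total = 0;
        for (long long len = 0; len < n; len++) {
            if (len == cap) {
                total += len;                        /* copy everything over */
                cap = exponential ? cap * 2 : cap + 1024;
            }
        }
        return total;
    }

    int main(void)
    {
        long long n = 1000000;
        printf("constant +1024: %lld copies\n", copies(n, 0));  /* ~n^2 / 2048 */
        printf("doubling:       %lld copies\n", copies(n, 1));  /* < 2n        */
        return 0;
    }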


The article states that growing by a fixed size every time creates a lot more memory churn and moving bits around - the operation is to create a new block of memory and move the existing collection to the new block. Doing that a few times isn't too expensive, doing it every time for every small increment adds up quickly.


While I agree with the gist of the article, is the recommendation to use powers of 2 as the size correct? Every allocator adds a small overhead to the size, so one could be better off by using sizes like 2^n - 16 or so..


It isn't necessarily a power of two, since the initial capacity might not be a power of two.


This doesn't always work.

When you have 512MiB of memory and your current allocation is 256MiB, where do you go?

Also when your dataset allocates 64MiB up front, then chucks two bytes on the end, where do you go?

Know thy data and test test test is the only rule.


> When you have 512MiB of memory and your current allocation is 256MiB, where do you go?

Then you're out of luck no matter what strategy you use. Even allocating 256MiB + 1 byte won't work, because you can't deallocate the old buffer until the new one has been allocated.

> Also when your dataset allocates 64MiB up front, then chucks two bytes on the end, where do you go?

What I didn't say in the article is that for some of the Firefox examples where the buffers can get really large -- such as nsTArray -- it switches to a growth rate of 1.125 once it gets to 8 MiB. So in that case it would grow to 72 MiB.
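
For reference, that kind of capped policy is easy to express; here is a sketch of its shape (the thresholds mirror the numbers above, but this isn't the nsTArray source):

    #include <stddef.h>

    #define MiB (1024u * 1024u)

    /* Pick a new capacity: double while small, grow by 1.125x once large. */
    static size_t next_capacity(size_t cap, size_t needed)
    {
        size_t newcap = cap ? cap : 8;
        while (newcap < needed) {
            if (newcap < 8u * MiB)
                newcap *= 2;                    /* exponential while cheap */
            else
                newcap += newcap / 8;           /* 1.125x to limit slop    */
        }
        return newcap;
    }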


Well the first point depends on what platform and how the kernel memory allocator is configured. It's possible to overcommit on Linux for example which is acceptable for short windows. NT will most likely tell you to go away.

Now, that last point is the important one, and it's actually what I do, which I was hoping to hear somewhere :-)


You can reserve more address space than physical memory just fine on NT. Committing beyond the amount of physical memory available will mean paging at some point, should you try and use all of it.


> Even allocating 256MiB + 1 byte won't work, because you can't deallocate the old buffer until the new one has been allocated

An allocator can (sometimes) grow the allocation in-place if there is unused space following it.


Even better, as long as you have the address space available somewhere, if the allocation is done via mmap (as some allocators do for larger allocations), the physical page can be mapped to a different address in userspace and then allocate the new pages to be adjacent in the user address space. In this way, you don't even need the room after the current allocation.


You can certainly set reasonable limits. "After 512MB, start adding chunks of 64MB at a time." Similarly, you also really want a minimum bound so you aren't reallocating 12x to get to 4KB when you add a byte at a time (if you use realloc it may detect this, but not as easily as a max(4KB, size) in your allocation request.)

Plus at that object scale, you may want to start considering block pools for your data so you aren't copying all of that data by hand. It's very unlikely you have a >512MB object and also need to very frequently access all areas of it completely at random.

But as you said, it's always best to profile if you find yourself needing more performance.


> You can certainly set reasonable limits.

No, not really. Not unless you plan to review them every few years or so. Which no-one will ever do.

The point about exponential allocation is that it is scale-independent. It doesn't depend on measures of the underlying size, so the same strategy works in 2000 and 2020.


What if you query the total and available amount of RAM on the system to set your limits?


Why try and be so clever?

I'd hate to be tracing a problem and find code deep in a lib trying to do that. It's OK-ish to expose such things in explicit API, but then you put the burden of choosing (and updating) such constants on the app developer.

If you're trying to avoid too much slop (up to 50% waste), you could either:

- grow by a smaller factor (1.25x, 1.5x)

- trigger a shrinking realloc when certain usage/stability criteria are met (bonus points for exposing this as explicit API rather than implicit behaviour on access, since that could cause exactly the kind of "weird slowdown" people spend ages trying to find: http://antirez.com/news/84)

Even worrying about slop is probably wrong, since if it's a big alloc wasting space, your VM system will probably not even map any physical RAM until it's used. i.e. there would be literally no downside.

(And as to your specific idea - imagine for a sec that future systems might introduce new ways of restricting memory (e.g. linux cgroups) or adding memory (hot-swap RAM, emulated by VMs?). Then your code would be tuning with the wrong value. Not saying that that is likely, but imho you'd just be making things more fragile by trying to embed magic in the code.)


Well, consider it in the context of a web browser or other generic desktop application.

If you have (say) 4G memory, and you have to process 2G, then you more or less accept that things are going to be slow. (Unless you are in some kind of embedded system, but then I assume you actually measure exactly how many bytes you need anyway...) And if it really bothers you, you would probably add more RAM to your machine and the problem will go away.

If you extend arrays by a constant amount, you instead end up with 4MB of data that takes five minutes to process. Your users won't like it a bit. What is worse, this problem can't be solved by adding more RAM.


> When you have 512MiB of memory and your current allocation is 256MiB, where do you go?

I'm not sure if this is assuming that 2.0 is the only growth factor that is exponential.


Maybe this is a bad idea, but my thought is, assuming you are on a 64 bit system, virtual memory is basically infinite.

So, when you start getting close to available physical memory, reserve as much virtual memory as there is physical memory. Then commit pages as you need them.


A few months ago a similar story popped up with hacker news about jemalloc (can't find it now). Had very cool proofs why it was optimal as well :D


Pedant: Doubling is geometric, not exponential. Right?



I would think 'geometric' meant x^2 or x^3, while 'exponential' meant 2^x. Are there different words to describe these different growth rates?



Slightly confused: doesn't realloc() grow existing memory by n, avoiding the new/copy/free cycle?


realloc will grow if possible, but will malloc and copy if it cannot be grown. Some allocators will round up to a larger size (like a page size), so allocations up to the page boundary will be free. Most modern allocators use small block optimizations and other techniques, the net effect of which is that realloc will usually require a malloc/copy.

You should treat realloc as a semantic hint to the allocator, but for performance analysis you should always assume that it will allocate/copy.


More importantly, Firefox, please shrink your monstrous memory footprint exponentially!


I bet this never occured to them


I mean, I don't want to be the "640K oughtta be enough for everybody" person here, yet somehow I can't accept that it should take a quarter gigabyte to render a "hello world" HTML document. Every morning when I get into work, the first thing I have to do is click a button that I installed into Firefox to restart it so that my machine (4GB RAM) becomes responsive. (That this important button even has to be a third-party add-on indicates serious reality denial.)


This sounds like atypical behaviour. One option is to try resetting Firefox, which gives you a new profile while keeping your history, bookmarks, etc: https://support.mozilla.org/en-US/kb/reset-firefox-easily-fi.... It can fix a lot of weird problems.

If that still doesn't help, I'd be interested to see what about:memory says. Instructions are at the top of https://developer.mozilla.org/en-US/docs/Mozilla/Performance.... Please file a bug in Bugzilla or email me. Thanks.


i was under the impression that it was common knowledge to have a doubling strategy for buffer sizes to amortize the cost of each addition. fwiw, java uses a 3/2 strategy, and python uses a 9/8...


Shouldn't that be geometrically?


Technically yes but geometric growth becomes exponential growth in the continuous limit, so the two terms are often used interchangeably.


Although exponential is technically correct when the base is 2, I find it really masks what real exponential growth is like.

When I hear "exponential growth," I think 256, 65536, 4294967296, 2^64, 2^128


2^(2^n) is a double exponential function. (https://en.wikipedia.org/wiki/Double_exponential_function)


They're synonyms.


This is the same old story of time vs. space. There is no clear answer it really depends.


Is there a reason the link is directed to a comment?


Yes. I'm an idiot, and it's been more than 15 minutes since I posted it, so I can't trim the URL!

If there are any mods about, I'd appreciate their help...


Should be fixed now.


There is no circumstance under which Firefox should be allocating 4kb of heap.


https://areweslimyet.com/ seems badly named: all the curves are trending upwards, so it seems like it should be "wellneverbeslim.com"


haha, cue firefox going back to using ridiculous amounts of memory in 5, 4, 3, 2, 1.

seriously, there is no one size fits all solution and you cannot really predict who is going to end up using your code later on. or what manner they will use it in.

i've worked on defects created by exponential buffer growth, as you can imagine they can be a lot worse than malloc churn.


I was under the impression that modern kernels and memory allocators are clever enough to not initialise memory pages that have been allocated but not used yet, is that not the case? So unless you calloc() or memset() your newly allocated memory, overallocation won't really affect actual memory usage?


That's right. The untouched memory won't be committed, i.e. put into physical memory or swap. But it will take up address space, which can be a problem -- on Windows Firefox is a 32-bit process and running out of address space tends to happen more often than running out of physical memory, ironically enough.


That reminds me (as I now have windows 8.1 on one of my machines again) - does anyone know if there is a roadmap/plan for official 64bit builds of ff on win64?


It's being actively worked on. https://wiki.mozilla.org/Firefox/win64 seems to be an up-to-date planning document.


i think you will find memset 0 after most calls to malloc in the firefox code.


> haha, cue firefox going back to using ridiculous amounts of memory in 5, 4, 3, 2, 1.

I just had to force-kill Firefox (v. 33.0.2, on OS X v. 10.8.1 with 4GB of memory) when it decided to not respond anymore after I had tried (silly me) to open a 14 MB XML page stored on disk. Looking in the Activity Monitor, that XML file caused FF to load one of the CPUs at 100% and to use 1.5GB of memory (for opening a 14MB file, that's two orders of magnitude more than the file itself; I cannot even understand how this can still happen, in 2014).

Don't get me wrong, I've stuck with FF for more than 10 years now, but they need to get their s.it together.

Later edit: I also checked in Chrome (v. 38.0.2125.111) and saw the same behavior when trying to open that 14MB XML file. At least in Chrome's case killing the tab was much faster (in FF's case I had to kill the whole window from inside Activity Monitor). I've taken a quick look and found bug reports like this one (for FF: https://bugzilla.mozilla.org/show_bug.cgi?id=291643), opened in 2005 and still not closed, which mentioned something like "it all happens because of XML processing". My stupid question is: why in God's name does a browser need at least two orders of magnitude more memory in order to process XML? Isn't XML just marked-up text? I don't get it. What does the browser need to "process"? This is just stupid.


>I don't get it. What does the browser need to "process"? This is just stupid.

All of the code is open source; this question is amenable to research. It might be better to do a little bit of that research before calling present implementations stupid. Real people put a lot of work into them.


Most browsers will show a raw XML file as an element tree, allowing you to open and close branches/elements by clicking on a "+" or "-" symbol to the left of each element.

That requires either building a memory intensive tree model of the XML (DOM), or, repeatedly reparsing the XML every time the user wants to open/close part of the view.

The standard "DOM" methodology requires building a tree node for every element and text chunk between the elements, and maybe for the attributes on the elements (or at least attaching an associative array to each element for its attributes). All of these little pieces fragment memory, unless a very clever compaction and/or string interning setup is used.


> Real people put a lot of work into them.

I know this, I've been involved in open-source projects myself (some good years ago), but how does that address the fact that the project has what appears to be quadratic time "performance"?


...but they are stupid :)




I bet that if you work out how many nodes and attributes are in that XML file and multiply that by (what's a good size for a C++ object? 256 bytes?), it's a substantial fraction of that 1.5GB.


Exactly. If you have to have a "DOM" of an XML, it's going to take A LOT of memory.

I recently replaced some generalized Java code (my own, alas) that built a DOM of some XML with substring operations to find the element text in the only element in the XML that mattered. (fortunately, I could guarantee that the text had no markup related special chars in it) This caused about 6 GC cycles (in a 32 bit JVM) to disappear from this process.

However, if you have to display or navigate the document tree, you are stuck with the memory hogging DOM.


Nah, the guy is right; there's been better tech since 2010, just nobody has bothered to swap out the old-style DOM stuff for pugixml/rapidxml/vtd-xml.

http://pugixml.org/benchmark/

specifically: http://pugixml.files.wordpress.com/2010/10/dom-memory-compar...


I meant "DOM" as in the general "tree in memory" data structure. You did find an implementation that uses about 1/3 the memory of some of the piggier ones, though.


closer to 1/5th.

also, in actual use -> those are peak values, so while libxml will hog the memory until the DOM is freed, the streaming style parsers hold on to the smaller amount of memory for a shorter time.


In the problem above, I did try using the SAX parser that the WebLogic-JVM "factory factory factory" returned, but the element text was about 2 MB, and it wanted to return it in pieces by repeatedly firing the event handler.

Manually finding the index of the open/close elements and doing a substring to get the element text was SO incredibly much faster and smaller, albeit something that only worked for a VERY specific situation.


I wonder why those haven't got traction with things like Mozilla though. Presumably -someone- would at least have looked at them if they're that big a win?



