In some cases you can benefit by "struct of arrays" instead of "array of structs". This benefits packing, enables SIMD, and usually improves persistence speed too. Database people call this a "column store". So if you abstract it a little you can do per-column compression that is really hard to do with vanilla structs.
One thing I learned with Picasa (which is basically an in-memory database): if you're storing a lot of data in RAM, sometimes you should think of the data structures more as a database (with copy in/out, rather than reference). In multithreaded cases, this kind of approach unexpectedly gets you more parallelism and speed.
Column store is really just making sure values for a column are continuous in memory, instead of rows as it is done usually. Basic premise is that modern CPU's can chew through continuous memory locations at light speed, with single core being capable of scanning a gigabyte of RAM in few milliseconds[0].
This is extremely beneficial when your tables have large number of columns, but you only want to do computations on a few of them. Columnar stores usually implemented using dictionaries which allow for efficient duplicate handling and blazing fast lookups. Often those dictionaries also contain run length optimizations, so values 1 through 100 are only stored using two values start-range 1 and end-range 100.
All in all, column stores are efficient at analytic workload, but struggle when a lot of inserts and updates need to happen.
[0] Using prefetch, properly aligned data, and likely other optimizations I am not thinking of.
but struggle when a lot of inserts and updates need to happen.
I ran across this with Redis recently. Redis has an internal data structure called ziplist that's a doubly linked list with no pointers. It just uses byte offsets to the next element, so all elements are in one contiguous block of memory. It's great because you can store a lot of values compactly without needing 1 to 5 pointers (8 to 40 bytes) overhead per element in your data structure.
But, a problem shows up when you want to insert or delete items. Every insert or delete requires expanding or shrinking the entire solid-chunk-of-memory allocation, which isn't the fastest thing in the world when your allocation is large. Also, inserting into the HEAD requires copying the _entire_ ziplist up exactly one element position because the block of your memory layout just changed.
So, obviously these things are useful, but their usefulness degrades as your solid blocks of memory hit cache limits. Everything is super fast when your entire ziplist fits in L1, still about the same in L2, worse in L3, and horrible if your ziplist grows any larger.
But, what can we do? We can fix the entire problem by creating a traditional linked lists of solid memory block ziplists. Now, each ziplist can remain small (8 KB to 16 KB), and when we grow bigger, we just cut a new ziplist, pointer-attach it to the previous ziplist, and now we have a minimal-pointer, high-locality data structure with unlimited growth potential without performance defects. It doesn't buckle under inserting or updating arbitrary elements since each memory block is isolated to a maximum ~8 KB size (reminder: your L1 cache is 32 KB to 64 KB depending on architecture) and now your memory usage due to pointer overhead is now also _greatly_ reduced (assuming your data was small and dominated or matched by pointer overhead size in the first place).
As a double bonus, now when your data structure is a solid block of memory, you can compress individual contiguous memory blocks (usually with great compression ratios) because your contents will tend to be homogeneous in shape inside the same container. Result: huge reduction in pointer overhead + sequential access + compression = best of all possible worlds.
This isn't news. Arrays are fast until you need to insert/delete in them, then they get horrible. This is why we have B-trees. Every page in a B-tree is essentially a sorted array - you can find items in it quickly using binary search, and you can insert/delete in it relatively quickly because it has a small bounded size. When you need to grow beyond a single page you do a page-split and add a parent above it, etc. B-trees are the optimal implementation of dynamic arrays, end of story.
This reply is golden. It starts off completely dismissive with "this isn't news" and it ends completely dismissive with "end of story." I am quite sorry my free internet post failed to meet your quality guidelines.
A wise man once said: Things have now gotten to the stage where I flinch slightly as I click on the "comments" link, bracing myself for the dismissive comment I know will be waiting for me at the top of the page.
1. Name array strongly implies that it has an underlying continuous memory location associated with it. All of modern scripting languages have some form of dictionary, map, set, or hashtable which is implemented using insert/update friendly approach.
1a. Typical workload for array is to iterate over it doing something with each element, B-Tree is strictly worse in iterating over all elements compared to static array.
2. Most instantiated dynamic arrays rarely perform deletes. Most common operation that could change size of array is push_back, and amortized cost for inserts is probably way less than chasing pointers in B-Tree.
3. B-Trees were created and optimized for HDD performance. With current SSD and memory trends, there are simply better data structures to use.
4. Pointer chasing got a lot more expensive relative to continuous memory scanning. This is mostly due to majority of memory performance increases coming from increased throughput as opposed to lower latency.
This does not mean B-Trees are bad, in fact they are still heavily utilized in databases because they scale extraordinarily well with data size. They are commonly used for indexes because membership testing (single value lookup) and range lookups are O(1), and updates/deletes are generally very fast as well.
B-trees are “optimal” in the sense that they’re O(log N) for all the usual operations (insertion/deletion, appenditem, getitem, setitem, and even concatenation with some work) but simple exponentially-growing arrays are O(1) for getitem, setitem, and appenditem, which is asymptotically faster. Also, their constant factors are better, and their memory footprint is smaller.
hyc_symas’s comment seems a little confused because they are talking about doing binary search within B-tree pages. But if you’re using B-trees to implement dynamic non-sparse arrays, which seems to be the topic, you don’t need any binary searches. You can just subtract the beginning-of-page index from the index you’re indexing to in order to get the offset within the page.
That kind of comment, managing to be simultaneously ignorant and arrogantly dismissive, is why I don’t comment much on HN these days.
Sorry, but the only ignorance here is your own. You assumed that "storing into an array" means that the array index is the only thing relevant for locating a record. But the fact that you're talking about inserts/deletes means that the array index of a particular element is mutable and therefore insufficient for uniquely locating a record. Which means you must perform some type of search to retrieve a particular item. Your O(1) getitem is nonsense.
Yes, you can say "give me the 5th element of this array" but the discussion was about maps/dictionaries, where you're more often going to say "give me item foo that was inserted before".
When folks like you post while obviously not paying attention, yes, you elicit a dismissive response.
I would be unsurprised if it was for legacy reasons. Both of those programming languages were written over twenty years ago. Furthermore, arrays play nicely with C extensions.
So if you have fixed sized 'ziplists', why don't you do a memmove() instead of a realloc() on inserts, then split your array on overflow? That would be 1 malloc per 1024 inserts assuming 8 byte data in an 8K page instead of 1 malloc per insert.
This would be vastly more cache friendly, and you would be fragmenting your heap dramatically slower. You could even mmap() a scheme like this, but that's another story!
why don't you do a memmove() instead of a realloc() on inserts
Oh, it's actually worse than that because when inserting we do a realloc (which could require automatically copying the entire memory block), then (if inserting to the head) we memmove the entire contents down to the end of the new allocation. Fun!
Mainly we don't pre-allocate because it's built on top of existing components, and the existing components don't do that. The actual "max size" limit is configurable and we don't really want to allocate the full max size for each list up front. If you have 500,000 lists each with 3 40 byte elements (~60 MB total), it would be overkill to allocate 500,000 8 KB blocks (~4 GB total).
Ideally, we would automatically determine if the user is doing "big operations" then switch to a page-based approach versus smaller "store as much data as compactly as possible" approaches, but it's just not done yet.
Adam Rosenberg wrote a really interesting book advocating not just this but the entire Fortran programming style. He hasn’t persuaded me that this is a good idea in general, but it’s a really interesting way of looking at things: http://www.the-adam.com/adam/rantrave/st02.pdf. If you struggle to imagine how you would structure software without structs or objects, you should definitely read this book!
Indeed, this argument is speaks to a fundamental incompatibility between OOP and efficient code. I wonder how to write a compiler to perform this transformation.
Same tension occurs in Java games for phones, at least pre-android. Instead of an array of Sprite objects, turns out its more efficient to have parallel arrays of ints, which leads to less readable code.
This is my favorite example of where object oriented programming (and in particular C++) is an advantage.
OO as it is usually practiced (gross generalization I know) is often hell on memory layout and memory access which causes problems that never show up in unit testing but show up in deployment. Pattern fetishism exacerbates the problem.
But a data structure abstraction that hides the storage implementation is an ideal application of OO.
Interestingly Intel's ISPC compiler actually uses a hybrid of the two layouts to store its memory of arrays. I think it stores an array of vectors as xxxx, yyyy, zzzz, wwww, xxxx, ...
This was the cause of my very first "hard" C Bug, in 1991 or so. I had written a program in our Novell Netware labs to read in the printer control code definitions for our printer-release consoles - being "clever", and wanting to save a few lines of code, I read them into a memory structure, and overlaid a C-Struct on top of them that mapped precisely to the fields that I wanted. Everything worked fine, code compiled, and we were able to read all the job definitions until I handed it off to the team responsible for the rest of the release console - at which point the code just started breaking. No longer worked. For the life of me I couldn't figure out what was going, until our team leader took a glance, and made it clear that I needed to inform the compiler to byte align the C Structs (which was likely the default behavior in Turbo-C, but not Watcom C).
Really opened my eyes to the many, many things that I didn't know.
Per its revision history, the guide was first published on 1/1/14. This current revision is only a few months old, and besides articles on important C programming techniques by the big toads like Raymond are more or less evergreen content.
I'm not a native English speaker, so "big toad" might be term whose sarcasm escapes me, but there are probably about two dozen people in the world who actually believe that Eric S. Raymond is in any way an authority when it comes to programming (one of them being Eric S. Raymond). The following comic strip of "Everyone loves Eric S. Raymond" sums it up pretty well:
http://geekz.co.uk/lovesraymond/wp-content/images/ep013.jpg
John Romero was making shareware games for years before he met John Carmack. He wrote the level editors, much gameplay code, created many levels for, and significantly contributed to the game design of all the id games from Commander Keen to Quake. His absence is arguably a big part of why most of the later id games just aren't as fun as Doom. Daikatana bombed for a lot of reasons (zero experience as a manager, dotcom-era ridiculousness in hype and project scope, infamously horrible marketing campaign that he had no part in, etc), but not because John Romero was an incompetent programmer or game designer.
If you have the time, watch this series in which John Romero and (Bioshock level designer and apparent Doom fanboy) Jean-Paul LeBreton play through the first episode of Doom and analyze its level design in depth:
Thanks for linking to that video, it was fun to watch. I've looked at the 90s' id games differently ever since I read Masters of Doom 10 years ago (due for a re-read soon) - learning their history made me appreciate those games on a whole new level. Their story was a very big inspiration and motivation boost for my own modest projects.
No, I'm saying that an article by him on a (relatively simple) programming technique is in no way more notable than tens of thousands of other articles on various programming techniques, in contrast to what is being claimed above ("articles [...] by [...] Raymond are more or less evergreen content").
The piece is well written and informative. Lesser minds such as mine even consider it useful.
The content of the article is more or less evergreen. Raymond's celebrity in a constellation of small worlds [deserved or not] makes it more likely to get treated so.
The joke is funny. If Knuth and Ritchie we're the domain for panels one and two though, panel three would get a much larger range. John Skeet, even maybe?
Unless there's some explicit condemnation, I would just assume that "this was on HN before" with a link is just a helpful pointer to more interesting comments.
The conventional HN form is neutral, something along the lines of "Previous HN discussion: <link>."
In American conversational English "this was discussed like a year ago" carries the connotation that another discussion is redundant. It reads in the voice of a teenager's critique.
Just wanted to remark the time period in which it was published before, to give some context. I'm not a native English speaker, so excuse me for the phrasing mishap.
Decent article full of good information. One exception -- the C standard is vague on how bitfields are implemented and the compiler can rearrange them in any way it pleases. You are not guaranteed efficient packing, placement order, or size.
It's not quite that loose. Order of elements is guaranteed, same as any other structure members, and packing is guaranteed if the members fit. Whether bit fields can cross word boundaries, and endianness, are implementation defined, so bit fields are still not usable where there's any interoperability constraint.
The article goes into more detail on the "why", but the "art" is not really that complicated or lost: sort structure elements by size/alignment, biggest first.
Unfortunately it is not as simple when you have to account for different threads on different cores accessing the same structure, as mentioned in the article. [1] Or the performance implications of managing multiple instances of structures. [2] Or the alignment requirements for doing SIMD. [3]
Additionally, it sometimes makes sense to duplicate data to minimize cache misses. For example, translating/rotating/scaling entities in game engines. It makes sense to copy entity transforms (usually a matrix computed once) from a central manager to sub-managers for certain components. The memory overhead is traded for more efficient processing in sub-systems like rendering, animation, physics, etc.
You seem to have missed the whole section in the article that begins: "While reordering by size is the simplest way to eliminate slop, it’s not necessarily the right thing."
> sort structure elements by size/alignment, biggest first
This does not work if I'm using structure packing to align my structs with an existing protocol (i.e. a protocol that does not have the kind of "holes" that an unpacked struct might have).
I find this a very strange response: "this technique doesn't work if I have to comply with an external standard." Fine, then don't use the technique.
Although, I believe the more common approach is to define two structs: one using the packed standardized layout, and one using a layout more suited for whatever you're actually doing. In that case, you may wish to consider sorting the elements by size and alignment for the internal use-only layout.
Perhaps you aren't? Unlike struct packing, the knapsack problem has a fixed upper limit. There's no limit on number of struct fields, and indeed, every solution consists of all of them.
I'm not saying it's exactly the knapsack problem, but it's quite related. Ordering from largest to smallest does not yield the optimal solution - period.
The article says throughout that sorting the fields in order of decreasing alignment requirement (i.e., size) minimizes slop. The proof for this is pretty straightforward. The article also discusses why you might not do this (e.g., structures might try to match layout of memory mapped devices).
Rudeness is a choice you make, by the way. You can change your behavior.
I have no experience with the others, I've just briefly read about them and assume they are similar. If they don't work in C and you would find the dumps useful, try compiling as C++.
I take it back, the VC++ options are C++ only (faulty memory, it's been too long.)
There is a warning I added which does work in C (https://msdn.microsoft.com/en-us/library/t7khkyth.aspx). As I recall, this warning is sometimes useless (unless it's been improved upon), so it's best to turn it on only when you're investigating packing.
You make a good point about C code not compiling as C++ as much as it used to, but that's probably less true for VC code.
And I should point out - often, for the purposes of tuning, you can extract just what you need from some code, get that small chunk of code compiling as C++, and leverage your compiler's dumps.
Last thing - debuggers also know the layout of your objects. windbg's 'dt' command can show you the object layout - I'm sure other debuggers have a similar command.
If anyone wants to take a look at the struct offsets inside their own code, the article mentions pahole as being a useful tool but I found that it chokes on recent C++ - the dwarf libraries it relies on can't cope with lambdas and various other things IIRC.
However, helpful people have written an extension to gdb which does something roughly equivalent & lets gdb do all the heavy lifting of parsing the struct data from the binary debugging information.
pahole.py ships with Fedora, but since Debian doesn’t include it for some reason I’ve kept my own versions around based on something I grabbed from the gdb mailing lists some time ago:
pahole is fantastic. If it worked on modern clang-generated binaries, it would still be the best. This entire article could be replaced with "use pahole."
Anyway, the problem with pahole is that it is heavily dependent on libdwarves, a custom wrapper around libdwarf by the pahole author. libdwarves has bitrotted since 2010-2011 and newer dwarf extensions produce aborts. Hurray.
I started a similar tool in C just using libdwarf directly. It's not complete and/or perfect, but it works on some binaries that pahole does not. See https://github.com/cemeyer/structhole . (And it's < 500 LoC and BSD-licensed. Please use as you will and contribute patches if you are interested in improving it.) Cheers.
The following is not very useful since I never bothered to polish it up or write any documentation, but in case anyone is curious, it's a similar tool I wrote that focuses on performance, outputs JSON, and uses no external libraries. I needed to dump struct info from a binary I was trying to exploit, which had debug info available in DWARF format, but was huge. (gigabytes) pahole not only took like half an hour to get through it, it leaked memory such that it took up most of my 16 GB of RAM by the end, which was immensely frustrating; my tool does it in a few seconds. (There's also a Windows PDB version which is older and probably broken.)
In Pascal, you could just declare a structure as "packed", and this packing happened automatically. "packed array [0..n] of Boolean" is an array of bits. That feature has not been copied in more recent languages.
Interesting to discover that this is a "lost art".
I didn't graduate, but I did attend PUC-Rio for a while and I remember professor L. F. Bessa Seibel's lecture about that, as part of the INF1008 class (Introduction to Computers' Architecture), which is considered a basic first or second semester course for undergraduate CS students.
I am going to suggest that 'Structure Packing' isn't the lost art. I think the lost art is 'Variable Sized Structs', the emphasis on the last 's' of 'Structs'.
This is the situation where your solution calls for an array of structs that are contiguous in memory but aren't the same size. This happens oftenish in database, os, and compression tech. C99 and gcc have legalized the often used convention:
struct varb {
char *foo;
unsigned char age;
int number_of_bars;
char var[]; // it's something like var[0] in GCC, or var[1] in versions < c99
};
where the last element is actually the start of the variable length piece. In all the C supported implementations of var length structs, we have limitations:
- the variable part must be declared last
- the struct may not appear in other structs or arrays
which totally makes sense, but doesn't help us implement a contiguous array of variable length structs. Also, the restriction of having the variable part of the struct come last may waste a lot of space if we have a structure where we would rather optimize the order of the struct ourselves, but that's another post!
Assume we are OK with the limitations of our struct varb, we now need to declare and malloc (or mmap) a char* (to hold all of our data) and lay out our varb structs fairly manually. Once we have found the start of a particular varb struct, we can cast it's char* address to a struct varb* and the compiler then can help us populate or interrogate our data. Note that we will need to keep track of detailed memory usage 'so far' when we are populating as well track padding, alignment of adjacent structs, as well as any reallocation when our struct array needs to grow.
The field order of structs is unspecified by default, to allow, e.g. reordering to the smallest size, or even randomising for security/fuzz testing.
One can opt-in to a fixed in-order layout with `#[repr(C)]` and even avoid padding for such a struct with `#[repr(C, packed)]`.
NB. in general it is impossible to pack a struct perfectly manually, e.g. generic structs can have their order for minimal size change depending on what types they're used with. It can even be somewhat platform specific, with e.g. pointers changing size/alignment on 32-bit vs. 64-bit computers (although pointers probably aren't problematic, since they're switching between two adjacent classes).
I know better than the compiler how my data is used. It is really simple as that. I can optimize with the totality of the program in mind while the compiler optimizes on heuristic like `order largest to smallest`. So yes, the compiler is doing it sub-optimally. Not to mention the repetitious hell it creates when writing C bindings, or memory-mapping structures, and so on and so forth.
In the end, however, my reservations are mostly due to a "get off my lawn" mentality.
Only when the difference in performance is end-user discernible. I do make sure the code is well crafted in those bounds though. There's no _valid_ excuse for crafting ugly code.
Coincidently my stuffed toy that I have kept throughout my childhood was named "Mel the Pal." I just noticed the irony of that. :)
The compiler can't know exactly how I intend the data to be read or written to at runtime. You generally don't want to read and write to the same cache line at a time.
That guarantees the same layout that C would have for that struct. There is also #[repr(packed)] which puts things as close as possible without alignment, and then you can insert padding manually with fields of type [u8; N] where N is the number of bytes.
The article doesn't appear to mention it, but in the past I had a need for this type of information when interfacing C data structures from other programming languages.
If the structure you are connecting with is packed you have to be aware of that or else you'll end up with data soup :)
Is it really a lost art? I've recently been learning C and C++, and all three of the textbooks I've been studying have mentioned in, and two go very in-depth on it (one of those is a games programming textbook). This is a great resource though, definitely going to add it to my studies.
This was rather readable, more so than I've come to expect from that particular source.
Still, I must point out that using a signed 1-bit bitfield (there are lots of those in that article), is generally a bad idea. Consider for instance something like:
struct foo6 {
int bigfield:31; /* 32-bit word 1 begins */
int littlefield:1;
};
That "littlefield" member is signed but has a size of just one bit, which means it is limited to representing the two values 0 and -1 (in two's complement). This is very seldom useful. The general rule of thumb is to always make "boolean"-type bitfields have an unsigned type, for this reason.
> I'm guessing it's a lost art because available system memory is much larger
This is true, but locality also matters a lot more these days. Minimizing padding keeps more of your structs in a single cache line. That means doing this not only makes your program use less memory, but it can also be faster, in some cases significantly.
> many programs are written in other languages these days.
True, but most of those languages or their VMs are themselves written in C, so it still matters at some point. :)
nikic is an incredible asset to the PHP community. I feel slightly less bad about my own programming abilities now that he is no longer a teenager ;)
Everyone likes to rag on PHP's chain, which is fair enough in some regards, but it has become a far better language recently, with more useful features, faster development cycles etc. People like nikic seem to be at the forefront of this.
Since PHP is a major part of my day job this all makes me very happy!
That directly trades size for less efficient code. The structure packing described in the article keeps the same code efficiency while decreasing the size.
There are good reasons (as mentioned in the article) for this feature but it's not equivalent to manual structure packing.
#pragma(pack) or __attribute__((packed)) still won't re-order structure fields for optimal alignment. The directive/extension only prevents holes in the structure.
And then we might get fields that cross word boundaries, making every access unaligned or way worse on architectures that haven't got addressable bytes.
i definately agree with the 'lost art' part. i've been stunned to find programmers who don't understand this and make arguments like "the compiler optimises that for you"... so much so that i've written my own piece on this, and more than once (and in more or less detail in various places - i.e. bitfields):
How about the lost art of COBOL record packing where COBOL records are like a combination of C's union and structs. It was great you could sometimes squeeze a few extra bits out in one case to use better in another case.
Years ago I worked on a few projects that used complicated structure ordering, but even then it was my understanding that (with the right options) compilers could optimize these things better than most people could by hand.
This is something the compiler can't optimize very well. For one thing, the spec requires that they appear in the given order.
If this wasn't the case, a significant amount of code would break. I've seen a good bit of code where there's a struct A, and then struct B begins with the same fields as in struct A. This allows you to treat either of them as a struct A, a kind of polymorphism. The compiler doesn't know whether you're going to do that, so it can't arbitrarily rearrange the members of the structs.
That would be bad in all kinds of situations which depend on one specific ordering of fields. When you use structs for inter-process communication in shared memory or inside memory-mapped device memory you care a lot about defining your own order. When writing structs verbatim to network or disk (or reading that way) automatic reordering would quickly break things too.
The compiler doesn't care much how your structure element is named. The naming scheme is for the benefit of the developer. Once the program is compiled the variable is referenced by memory address, not variable name. Sorting them by name would not help there.
I don't think (s)he meant sorted alphabetically, but rather, sorted in such a way as to minimize memory usage, ie: packed.
This is obviously not the default in C, but can be enabled via compiler extensions. It's also important that the compiler doesn't automatically do this since structs of the same type can be of different sizes. In order for this to work, the memory layout needs to be as originally defined in order to be correct. For an example of this see the simple dynamic string library used in redis.
Yep, you're right, I was mistaken, it's been a while since I looked at all the extensions and I thought reordering was one of them, I should have rechecked before I made my comment.
I've actually had a question about this on my programming interview for a games development company. And yes, I have had to use it several times since. So no, the art is definitely not lost :-)
One thing I learned with Picasa (which is basically an in-memory database): if you're storing a lot of data in RAM, sometimes you should think of the data structures more as a database (with copy in/out, rather than reference). In multithreaded cases, this kind of approach unexpectedly gets you more parallelism and speed.