I stopped reading after the first example, which once again (as in the last post in the series) demonstrated that the author doesn't really know what he's talking about. The code would be written as:
std::list<person>
Not:
std::list<person *>
...which avoids the double malloc and presents a nicer API to boot. Not to mention that in that case, a C++ programmer would either use a struct or would need to write accessors.
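A minimal sketch of the difference (the `person` type here is mine, just for illustration; the article's actual type may differ):

```cpp
#include <list>
#include <string>
#include <utility>

// Hypothetical minimal person type, just for illustration.
struct person {
    std::string name;
    int age = 0;
};

// By-value storage: one heap allocation per node; the person is embedded
// in the node itself. Compare std::list<person*>, where each element costs
// a second allocation plus a manual delete.
inline person& add_person(std::list<person>& people, std::string name, int age) {
    people.push_back(person{std::move(name), age});
    return people.back();
}
```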
C++ has a lot of warts, and the author seems to have written some non-trivial software in it, but his critiques of it reveal that there are a lot of gaps in his knowledge of the language and a lot of his criticisms are based on falsehoods. (However, a totally valid criticism of C++ is the amount of idiosyncrasies the language has and how relatively difficult it is to master.)
I had to fight the urge to stop there too; it's also more idiomatic, and probably both more economical and more performant, to use a vector in this situation. "Lists" in C++, at least on the projects I worked on, tended to be a code smell.
I think this is a very big problem with C++: most of the funnel of new developers for C++ are C programmers, and transliterating C idioms to C++ (and, worse, the C++ standard library) usually produces pessimal C++ code.
I've been programming in C++ for about ten years, first as a hobbyist, then as a student and finally professionally and it is extremely rare that I've seen an std::list needed over an std::vector or std::deque.
The general rules I follow are roughly:

1. If the collection only grows one way and has lots of random access (or even lots of iterating over elements... linked lists aren't exactly the most cache-friendly data structures), use a vector.

2. If the collection needs to grow from both ends and has lots of random access, use a deque.

3. Only if neither of those is suitable should a list be considered, and the main reason, IMHO, to use an std::list is that inserting or removing elements in the middle is a common operation. But in the case of removing, if copying is cheap and order does not matter, a vector may still be better: removal can be done by swapping the element to remove with the last element and then removing the last element.

In general, if ordering is not important, a std::vector is probably a good choice.
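The swap-and-pop removal described above can be sketched like this (a generic helper I'm naming `swap_and_pop` for illustration):

```cpp
#include <vector>
#include <utility>
#include <cstddef>

// Unordered O(1) removal from a vector: swap the doomed element with the
// last one, then pop the back. Element order is NOT preserved, which is
// exactly the trade-off being described.
template <typename T>
void swap_and_pop(std::vector<T>& v, std::size_t index) {
    std::swap(v[index], v.back());
    v.pop_back();
}
```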
So really, I see the use case for std::list as a real but very rare one.
It was a bit of a shock in 1998 when I first started writing professional C++ to realize that lists and hash tables were unidiomatic in (what was then) STL. I was a lists-and-hash-tables C programmer at the time.
I moved from a heavy C++ role to a pure C role in 2002 and came to realize that this is something that C++ got right. It's not that C++ has crappy lists; it's that lists are usually a crappy solution. Not only that, but red-black trees, as long as you're not the person who has to implement and test them, are often a better option than hash tables.
I wrote a simple "vector.c" for our project, and (I never tire of bringing this up) ported STLport's red-black tree <map> template to C, specialized on void*. I still use them on C projects.
Any time I feel myself reaching for a deque, I recheck my assumptions and make sure I'm not headed in the wrong direction.
> red-black trees...are often a better option than hash tables
Why is this?
Trees have O(n*log(n)) element lookup time as opposed to O(1) for hashtable (assuming table size is allocated appropriately or grows with the data, and you have an effective hash function). If data structure accesses are on the hot path, a hash table is much faster.
Many times, when you need ordered data, you can actually get away with using either a heap or a sorted list, rather than a red-black tree. Both of these are much easier to implement than red-black trees, and have the advantage of no extra per-node storage for child pointers. That can make a real difference in memory usage if your objects are very small, and a real difference in time if it means the difference between CPU cache and main memory, or between main memory and swap.
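The sorted-list alternative is straightforward with the standard library; a sketch (helper names are mine) of a sorted vector used as an ordered set:

```cpp
#include <vector>
#include <algorithm>

// Sorted-vector "set": elements stay ordered, with zero per-node pointer
// overhead. Insertion is O(n) due to shifting; lookup is O(log n) via
// binary search - often a fine trade when reads dominate writes.
inline void sorted_insert(std::vector<int>& v, int x) {
    v.insert(std::lower_bound(v.begin(), v.end(), x), x);
}

inline bool sorted_contains(const std::vector<int>& v, int x) {
    return std::binary_search(v.begin(), v.end(), x);
}
```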
Red-black trees can be good if you can easily write a comparator for your objects but not a hash, really need capabilities like quickly determining subranges, aren't on a performance-critical path, or aren't concerned about CPU efficiency because the available hardware greatly exceeds your problem's requirements.
As with most data structure questions, there are no hard-and-fast rules; the data structure you choose is going to depend on your particular situation. I personally probably use each of the four solutions I mentioned (hashtable, red-black tree, heap, sorted list) less than fifty percent of the time.
As with most data structure questions, there are no hard-and-fast rules
As in everything in programming, there are always going to be trade-offs. At least C++11 gives you options: <map> is generally an RB Tree, <unordered_map> is generally a hash table.
Trees have O(n log(n)) element lookup time as opposed to O(1) for hashtable
According to [1], <map> (And <set> and <multimap> and <multiset>) lookups have O(log n) complexity, while <unordered_map> lookups have O(1) amortized and O(n) worst-case complexity.
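The practical difference between the two, in a small sketch (the helper name is mine): <map> gives you sorted iteration for its O(log n) lookups, while <unordered_map> gives O(1) average lookups but unspecified iteration order.

```cpp
#include <map>
#include <unordered_map>
#include <string>
#include <vector>

// std::map (typically a red-black tree): O(log n) lookup, and iteration
// visits keys in sorted order - something a hash table can't promise.
inline std::vector<std::string> sorted_keys(const std::map<std::string, int>& m) {
    std::vector<std::string> keys;
    for (const auto& kv : m)
        keys.push_back(kv.first);  // already in key order, no extra sort
    return keys;
}
```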
Again, lots of trade-offs to consider, which is why the C++11 stdlib, IMHO, does a great job, whereas other popular languages which provide "one true data structure" of each type are really not that great without third-party libraries, as they don't give you the option to choose the right data structure for the job. E.g., in Python, if you need a dict, you use a dict without thinking about whether it's implemented as a tree or a hash table or a heap or a... Similarly, if you need a list, you use a list without worrying whether it's a linked list, a skip list, a dynamic array, or something else entirely. Usually this is a good thing - you can just get on with your life - but sometimes you really do want to make sure that operation X will be O(1) or O(log n) or whatever, and if the built-in data structure does not guarantee this, you must use third-party libraries. At least in C++11, the stdlib gives you some options (and usually makes it clear what the complexity of each operation is). I really like that about C++11.
For a lot of languages, "the right data structure for the right job" means a choice between lists, vectors, dictionaries (however they are implemented), sets etc. In reality, that's only the first tier - once you decide that lists are the right tool, you still need to decide what kind of list, and a lot of languages don't give you that choice by default. (Though to be fair, a lot of people don't need that choice - premature optimization and all that)
C++11 has hash tables now. Check out unordered_set and unordered_map if you really require a container backed by them. I agree with you on the RB trees though.
Edit: I should add (and you probably know) that std::set and std::map are typically RB Tree based containers.
STLport, IIRC, always had them. If you picked a specific STL instead of using the c++stdlib bundled with your compiler, you could generally always have hash tables with iterator semantics. But <map> was usually the right choice; even apart from worst-case behavior, you'd always end up needing not just to look things up, but to traverse the collection.
That's true, I remember using <hash_map> back in 2003 or so for something, but since then reverted back to just using <map> and the vanilla C++ stdlib which has always met my requirements and performance needs.
Though, nowadays, I sometimes find myself using tbb::concurrent_unordered_map and tbb::concurrent_hash_map, not to mention tbb::concurrent_queue.
To the first part: I guess it's true that if you see someone using a STL list for "just a bucket of stuff" that this is bad form. But they're general purpose doubly-linked lists and absolutely should be there in the API for situations where they're needed (i.e. something where order is important, the size is likely to be large, and you need to mutate in the middle).
The second point I think is just isomorphic to "breadth is good". A Ruby or Python programmer coming to C++ is going to avoid a lot of the mistakes of the C programmer, but will likewise miss a lot of the core ideas and generate equally pessimal code, just for different reasons.
I'm not saying C++ shouldn't have a doubly linked list. I'm saying that C programmers are trained to use linked lists. They are the first variable-length containers most programmers come into contact with and so C programmers tend to be imprinted on them like ducklings. Linked lists are a poor general purpose container.
I've been putting myself through C++ "boot camp". I got a copy of Bjarne's "The C++ Programming Language" and I'm doing every exercise. I'm up to 302 at last count, and only a couple more chapters to go.
It's been grueling, but very well worth the time. It helps me avoid a lot of mistakes I would have made had I just jumped in from C and C#.
Unfortunately, the 4th edition (which presumably covers C++11) doesn't come out until Feb.
I recently assigned myself a project to learn C++ too, mainly in the interest of learning an entirely different kind of language (I only know python very well) but also to scratch a long pending itch of writing a few simple games and doing some graphics related work.
What I have found, while going through C++ Primer by Stanley Lippman and Barbara Moo (and having read most of Accelerated C++ earlier), is that I have not so far seen anything that I really dislike.
Maybe this is because I am not really experienced and so cannot see the obvious pitfalls, or maybe those things will come up later in the book. But so far I see a language which I can use in many places.
Also, how is Stroustrup's book for someone who finishes C++ Primer (which covers just the basics)?
I recommend Stroustrup's book, as it really leaves no area unexplored. It is better to read it after having read some other material, as you have. I would wait for the 4th edition, though, which covers C++11.
It's really tough to say, I've been working on it off and on for a few months. But I didn't really keep track of the time spent. I'll probably put it up on github when finished and see if anyone wants to rip apart my solutions and/or contribute better ones.
As I see it, the only use case where linked lists are superior to other types of lists, like, perhaps, ArrayList in java or vector/deque in C++, is if the following conditions are met:
1: You care about ordering - often you don't, and in that case there is no need for a linked list, because you can achieve O(1) insertion and removal anyway: insertion can always be at the end, and removal can be a "swap with the end element, remove the end element" operation.
2: Inserting and/or removing from the middle of the list is a common operation. If it is not, then the added cost of doing so with a vector may still be outweighed by a vector's other advantages.
3: You do not require random access - lists do not provide random access, and lookup is O(n). At the expense of additional implementation complexity and more memory overhead, you could reduce lookup to O(log n) by using a skip list.
4: You do not iterate through the list often - if you do, you are likely to blow the cache and defeat prefetching due to poor cache locality. Iterating through an array-based data structure can be much faster in this case.
I would say that for a list to make sense, you MUST have 1 and 2, and probably should have 3. 4 is optional, but if true, should make you consider whether there might not be a more suitable data structure. In my own personal experience this is rare; usually, code either does not require 1, or requires 1 but not 2 - either way, lists are not the appropriate data structure in those cases.
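When conditions 1-3 do all hold, what a list buys you is O(1) insertion in the middle, given an iterator; a small sketch (the helper name is mine):

```cpp
#include <list>

// With a saved iterator, std::list insertion anywhere is O(1): just a few
// pointer relinks, no element shifting, and no other iterators are
// invalidated - the property a vector cannot offer mid-sequence.
inline std::list<int>::iterator
insert_before(std::list<int>& l, std::list<int>::iterator pos, int value) {
    return l.insert(pos, value);
}
```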
1. Linked lists require extra per-node memory for pointers. Flat arrays (vectors in C++) don't.
2. Linked lists have poor cache locality. Each item is in a different place.
3. Using Person* instead of Person for the list elements results in twice the number of objects. In C++ you have this choice; in many higher-level languages like Java, you don't: generic linked lists only support the double-object approach. (You could of course manually add pointer fields to your objects to obtain a single-object solution, but this defeats the write-once-run-with-any-element-type idea of a reusable container library.)
4. You often only care about adding and removing from one or both ends. A vector or deque has O(1) removal in this case as well (assuming it's allocated with adequate space, or you don't care about transient CPU spikes due to reallocations early in your program's run, before it reaches its max size).
5. Large numbers of objects are more stressful on memory allocators and garbage collectors.
Is your point that programmers in other languages automatically make good choices about data structures and memory usage? Because that seems, respectfully, to be a dumb point. They do not.
I agree it's a problem with C++, but IMO it is an inevitable consequence of one of its strengths: namely, that it is regularly used across a wider range of domains than other languages. I've seen it used on 16-bit microcontrollers, where templates and exceptions were considered harmful, and in large applications, where those features were considered essential.
Perhaps I'm being dumb, but doesn't std::list<T> already do exactly this?
From SGI's STL docs:
"A list is a doubly linked list. That is, it is a Sequence that supports both forward and backward traversal, and (amortized) constant time insertion and removal of elements at the beginning or the end, or in the middle"
As others have said - yes, as long as you keep the iterator, not the object itself. From T itself you cannot find where it was stored: with std::list<something>, given only the something, you can't find where it sits in the list (you would have to search the list for it).
The idiomatic way to access and manipulate STL data structures is through iterators. In this case, an iterator for a std::list is basically a pointer/reference to the person container in the article, so if you're referring to the list objects via their iterators there's no need for O(n) list traversal.
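A sketch of the iterator-as-handle idea (the `registry` type and names are mine, just for illustration): keep the iterator you get back at insertion time, and later erasure is O(1) with no traversal.

```cpp
#include <list>
#include <string>
#include <utility>

// Hypothetical container that hands out std::list iterators as stable
// handles. Erasing through a handle is O(1), and all other handles
// remain valid - list iterators survive unrelated insert/erase.
struct registry {
    std::list<std::string> items;

    std::list<std::string>::iterator add(std::string s) {
        return items.insert(items.end(), std::move(s));
    }

    void remove(std::list<std::string>::iterator handle) {
        items.erase(handle);  // no O(n) search required
    }
};
```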
It's also not a problem in C++ if you don't care. Just because C++ gives you the means to enforce encapsulation doesn't mean you have to. And, as I said below, when your goal is performance and you need to break abstractions, it makes sense not to enforce encapsulation.
He never mentioned other languages, so no, I don't think that was his point. I think his point was that C++ makes it hard to make good choices about C++.
I'm saying Java programmers make inappropriate data structure choices about Java programs, and boy do Ruby programmers ever make bad decisions in Ruby.
I got that. Not very nice to say his point was dumb because you wanted to make a different one, though. I think we can all agree that C++ is annoying and leave it at that.
To an extent, they're forced to - when all you have are dynamic arrays and hashtables (I'm thinking of python here), it's hard to go very far wrong. (Isn't it? Am I being naive?)
Naive? I do not know, but your logic surely is :-)
Consider the similar "if you have only one kind of string, it's hard to go wrong". If you need Unicode, but your only type of string is ISO-8859-1, or if you need to cram many strings into memory, but your only string class uses UTF-32 internally, it's almost unavoidable that you go wrong.
The advantage of having only few data structures is that the developers of the system have only few places where they have to spend effort on optimization. So if, like Python, your platform is popular, chances are that those few basic structures have been optimized to death.
That does not necessarily make them optimal for every use case, though. The big advantage of C++ is that, if the need arises, you can (see below) make a special-case data structure that beats even the most optimized general-purpose one.
Of course, that 'can' is relative. It is really hard to make a robust special-case data structure.
From yesterday's bug hunting session: sets (okay, not exactly hash tables) are not ordered data types, so do not use a set for intermediate values when order matters. (My fault - I'm just pointing out that it's not that hard to get Python wrong either.)
It is rather hard (though I believe not impossible) to find a use case where dynamic arrays and hashtables are not within a constant factor of the best possible data structure.
On modern architectures, cache misses are performance killers, and a list is a cache-miss machine. Moving memory is cheaper than doing a linear search through a list. What used to be true 20 years ago isn't true anymore; let's stop using those lists already!
There's no such thing as 'O(n/2)'; big-O notation is about asymptotic complexity. O(n) means 'linear time in the number of elements', not 'n operations' or something.
But in practice, the amortized time complexity will indeed usually be half of the worst-case scenario when removing elements from a vector.
f(n) being O(n) means: there's an N and an x such that, for all n > x, N*n > f(n). By this definition, O(n/2) is the same as O(n), which is the same as O(n/42); you just choose a different N. But it's better form to write O(n).
The removal time seems to be O(1) for the last element and only slightly higher for the second-last
The removal time for the second-last element is still O(1). There is no "slightly higher" here. O() does not count steps or cost or anything like that - it describes asymptotic complexity. O(1) means constant time/space/whatever-you-are-measuring. Removing the last element of a non-empty vector always takes the same amount of time (since we are talking about time complexity in this case) no matter how many elements are in the vector. Similarly, removing the second-last element always takes the same number of steps regardless of how many elements are in the vector (as long as there are at least 2) - still O(1).
That doesn't mean that they take equally long - the second-last element does indeed take longer to remove (by some constant factor), but complexity does not care about that; it only cares about growth with respect to n (usually the size of the input), usually in the worst case (you can also look up Θ(..) and Ω(..) and others, but generally worst case is more useful than best case, though sometimes average case is good to know too).
Note also that an O(n) algorithm may actually execute faster than an O(1) algorithm, at least for small values of n. For example, finding an item in a hash table may be O(1), but finding an item in an array, O(n), may actually be a lot faster, e.g., if the entire array fits into L2 cache. On the other hand, if the array is so big that it swaps to disk, the hash table's O(1) will really shine.
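The point can be sketched like this (the helper name is mine; this only demonstrates that both approaches answer the same question - actually measuring the crossover point would require a benchmark):

```cpp
#include <vector>
#include <unordered_set>
#include <algorithm>

// O(n) membership test over a contiguous vector. For small n this linear
// scan often beats a hash lookup in wall-clock time, because the whole
// array sits in one or two cache lines and there's no hashing overhead.
inline bool scan_contains(const std::vector<int>& v, int x) {
    return std::find(v.begin(), v.end(), x) != v.end();
}
```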
O(..) notation for complexity is usually qualified with "worst case" or "average case". Either way, "best case" behavior of O(1) is not really interesting.
That works only for simple data types. Any non-trivial object (one holding large amounts of memory, handles, file descriptors, etc.) is non-copyable. Thus list<person> won't work (according to the STL, "person" has to model the Assignable concept).
Just the discussion of std::list<person> vs std::list<person *> that follows your comment makes my head hurt. It's like not only is c++ a language from a committee, any line of c++ code could be analyzed and argued over by a committee.
I think I'll tend to stick to C most of the time, thanks.
Even the C++ people could probably analyze and reason about the C implementation without much effort or dissent; apparently not so with C++.
This is the argument that comes up every time a C vs. C++ discussion occurs. It is faulty, because it assumes a false premise -- that the C and C++ implementations will be largely of the same complexity.
Unfortunately, in the C implementation of anything beyond a trivial application with no large-scale data structures, the majority of the "reasoning" being done will revolve around basic data structure implementations! The very containers, memory allocation structures, and such that require virtually NO reasoning in the C++ implementation, because they are proven.
After you've implemented the same basic structures again, and again for the hundredth time in large-scale C programs, C++ is a breath of fresh air.
Unfortunately, many people approach C++ before they have many miles on their programming chassis; this is like handing an infant a revolver. Bad things are likely to happen.
This is not a problem with the revolver; it is a parenting issue. Don't let a junior programmer use a language requiring significant reasoning skills, before they have earned the right -- by developing those skills, repeatedly implementing the very structures they'll be accessing in the C++ implementation!
We wouldn't let a first-year civil engineering student develop a bridge design, without first gaining many, many years of experience developing -- and failing -- on much simpler designs. Why should we expect wide-spread success, allowing junior programmers access to a spectacularly powerful and complex programming tool?
Before C++11 you would use <person *> to prevent additional allocations on insertion. C++11 is a little bit smarter about not wasting allocations for temporary objects, but requires you to have move operators for any classes you make. In C land no one cares, since there are no temporary objects to even worry about; you would be passing around the memory anyway.
> requires you to have move operators for any classes you make
Actually, in the specific case of std::list<T>, T can be both unmovable and uncopyable. Instead of using push_back() and insert() to insert elements, you use emplace_back() and emplace() to construct the elements in place.
Also, the compiler is pretty good about auto-generating move constructors when all the members of a class are safely movable, so classes you write will probably be movable without you needing to do anything.
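A sketch of the point about emplace (the `pinned` type is mine, just for illustration): even a type with every copy and move operation deleted can live in a std::list, because emplace_back() constructs it directly inside the node.

```cpp
#include <list>

// A deliberately non-copyable, non-movable type. push_back() would fail
// to compile for this; emplace_back() works because the object is built
// in place inside the list node, never copied or moved.
struct pinned {
    explicit pinned(int v) : value(v) {}
    pinned(const pinned&) = delete;
    pinned& operator=(const pinned&) = delete;
    pinned(pinned&&) = delete;
    pinned& operator=(pinned&&) = delete;
    int value;
};
```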
emplace_* is C++11 - which is rather new, with support on many platforms still lacking.
They started ZeroMQ in 2007 (or even earlier? can't quickly find a reference). So, yes - some of the problems that they faced in ZeroMQ have been addressed years later (although the solution is not yet widely available). I believe this is a point FOR his conclusion that he should have used C, rather than a rebuttal.
A reasonable response, although the problem with this is that it's impossible to manipulate persons without causing their values to be copied (which might not even be possible, given the other members of the type - which might not even be under the author's control, e.g. 3rd-party libs). Additionally, without extra complexity, simple pointer-swapping tasks like moving a person between lists suddenly start to involve memory allocation, etc.
Right, but the same trade-offs apply to C; C++ is actually superior here: if something shouldn't be copied, you can enforce that in C++, but not in C.
The author was arguing that writing in C++/STL meant your code was necessarily less efficient than C; that just seems incorrect.
Edit: If you're referring to copying the person values when they're returned from the list: a reference is returned; the compiler will optimize that "pointer" away. Most C++ compilers can even do this when the value itself is returned and avoid the copy. My rule with C++ is that the compiler is stupid, but never where you expect. You need to optimize based on a profiler.
std::list<T>::splice() lets you move elements between lists without copies or allocations.
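For example (the helper name is mine), splice relinks nodes directly, so moving an element between lists touches only pointers:

```cpp
#include <list>

// std::list::splice() moves a node from one list to another in O(1):
// no copy, no allocation, and iterators to the spliced element remain
// valid - they now simply belong to the destination list.
inline void move_front_to_back(std::list<int>& from, std::list<int>& to) {
    to.splice(to.end(), from, from.begin());
}
```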
Before C++11, it was a problem that inserting an element into a std::list required a copy (that's the only place though - once in the list there are no more copies). C++11 adds move semantics and also std::list<T>::emplace(), which eliminate the need to copy.
Actually, in this particular case that "copy" is allowed to be elided even in C++98, since the copy constructor of person produces no side effects. It would be hard to notice whether the optimizer is that smart either way, though, since it will be optimized by redundant load/store elimination even if it was initially generated without optimization.
I stopped reading your comment at "I stopped reading" too.
I've _never_ read an intelligent comment starting with "I stopped reading". These kinds of comments always try to show how "clever" the commenter is, and how he doesn't tolerate fools, etc., while 90% of the time they miss the point.
Had you continued, you would have seen:
"Assume that the object to be contained in the list is non-Assignable as is the case with any non-trivial objects, for example those holding large memory buffers, file descriptors, handles etc. If the object is Assignable simple std::list<person> would do and there is no problem."
And even if you read the post _before_ he added the above clarification, if you didn't have the knee-jerk reaction you could probably have seen that this (the object being non-assignable) could be a potential problem in your solution too.
He did add that clarification afterwards; that was before the first example. I read up to the first example.
However, his clarification, and your proviso are also invalid.
First, your proviso that it's non-assignable: his type was unusable, and that it's non-assignable simply seemed to be an error. In C++ the default access for a class is private (not so in a struct, as I hinted at in my post; in fact, that is the only difference between a class and a struct in C++).
Now, onto his clarification:
The baked-in assumption is that in a non-trivial C project one would not be using a standard set of container abstractions either. Such is generally not the case. See, for example, GList [1]. The replacement for a template-based container in C++ is generally going to be a macro-based container abstraction in C, whereas the replacement for a container of pointers in C++ is going to be a void*-based generic container implementation (like GList) in C. Both sets of generic implementations are going to have broadly the same performance characteristics (specifically, the same number of mallocs of about the same size).
In both languages, if one needs specific performance characteristics -- specifically if mallocs are so critical that they must be restricted as much as possible -- one will use a special purpose data structure. That however is a somewhat rare case for structs which are composed of non-POD types.
The fact that C++ is so complicated that three people knowledgeable about it can't agree on the best way to create a list of objects is a pretty good reason to avoid it.
C++ is a ridiculously complicated language. And I mean that in a bad way. I don't normally use C++ as my primary language, but there are certain cases where I need the speed or low-level access. And I feel like every darn time I do, I need to relearn all of the language's weirdness and idioms. From order of operator precedence, to best practices with classes, to what a "char" is on which platforms, to the various 101 meanings of "static" and "const".
It's an unnecessary time drain and I can't but help wonder how many man-hours a year are wasted by people learning all the intricacies of C++. (Heck, just check out this guy's FAQ on all the things you probably never knew about C++ http://www.parashift.com/c++-faq-lite/). I could almost memorize a list of x86 opcodes faster.
Complexity != powerful language. Clojure makes this point very well.
C++ is a ridiculously complicated language. And I mean that in a bad way.
What really gets my goat is the minimal differences between 'struct' and 'class'. Why do we need both again? If you're going to allow structs to have member functions and such, then you should never have had a 'class' keyword to begin with. Ugh.
I shudder to think how much app-specific code this would require in C, though. I am not a C programmer, but I don't think I'm missing something in saying that in C you have one of two choices:
1) You re-implement the data structure for every case (person). Which explains why you'd use a data structure like a linked list, because anything else would be way too much work.
Since you're likely to search through a "Person" list by name, a linked list would be close to the worst possible option. Even an unsorted array would be better, even if only for practical "doesn't destroy the cache" reasons, unless you have massive numbers of weird inserts.
You see that a lot in C programs: people using suboptimal data structures, not because they don't see why it's suboptimal, not because they don't know how to create the optimal one, but because they don't want to tackle the complexity of rewriting a real data structure for this specific case, and don't want to use macro-hell libraries.
2) You use macros to abstract over types, and hope that your string expansions work correctly. At this point I would argue that it's actually more complex than the C++ solution.
Both of these options suck.
I would also argue that for these sorts of problems (anything involving business rules pertaining to database objects) you'd want to use a more high-level language, or just a straight-up database schema, if you wish to guarantee correctness and constraints without coding for a week. I'd probably prefer Python, but there are plenty of options.
1) He uses std::list<person*>, but std::list<person> is the equivalent to his C example, and would produce identical results to C, with much less code. It would also be safer e.g. in terms of memory leaks.
2) Templates are the C++ magic, which get round some of those theoretical limitations of OO programming. The compiler can evaluate the composition of the various objects, to see if any optimizations are possible. With C, this would be have to be done by hand. Yes, it's slower than compiling C, but it's much faster than having a human do it.
3) Making use of #2, it is then easy to swap out the implementation. Want to try a pooled memory allocator (because # of mallocs seems to be the axis we're optimizing on): just change the declared type. Want to try a different data structure (e.g. sort by age for fast lookups by age): just change the declared type. The compiler "expands" the templates, substituting in the various types, and produces code that is reasonably optimized. For example, it will use inline integer comparison for sorting by age, rather than calling out to a comparator function.
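A minimal sketch of that "just change the declared type" point (the names here are mine, just for illustration): centralize the container choice in one alias, and everything written against the common interface recompiles unchanged.

```cpp
#include <vector>
#include <list>

// Hypothetical person type for illustration.
struct person {
    int age = 0;
};

// The single point of change: swap std::vector<person> for
// std::list<person> (or a type with a pooled allocator) and all code
// using the shared push_back()/front()/size() interface still compiles.
using PersonList = std::vector<person>;
```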
I'm sure there are valid reasons for preferring C to C++, but as a user of ZeroMQ, I wish they would spend their efforts documenting their protocol rather than bashing their tools.
Things that look like they use copying in C++ often don't use copying in the final output. Values are usually returned by reference, and many C++ compilers can often avoid a copy even when that's not the case (e.g. Return Value Optimization).
Even if the compiler doesn't optimize everything away, copying is sometimes faster, always safer and always easier than passing around pointers.
It's a big lesson I learned for C++ - write correct code first; profile; fix any real performance problems, rather than obsessing about my preconceived notions of what would be slow.
Yeah, I pretty much agree with everything you're saying, but trust me, std::list really does work by copying the items, so your point was not technically correct.
Well, if we're striving for technical correctness, std::list really works by invoking the assignment operator on the items. The original article is strictly correct when it says:
> EDIT: Assume that the object to be contained in the list is non-Assignable as is the case with any non-trivial objects, for example those holding large memory buffers, file descriptors, handles etc. If the object is Assignable simple std::list<person> would do and there is no problem.
Except that normal C++ programmers would deal with that by overriding the assignment operator on this class to copy buffer pointers or whatever as appropriate.
Try copying a complex object with threads running inside it, open file descriptors being used etc. What does it mean, for example, to copy a running DB engine? Some objects are just non-copyable by principle.
100% agree. It almost feels insulting to say that a C++ programmer will only use std::list and a C programmer will only use a hand-made doubly linked list. This almost feels like an interview question:
> Which is faster and uses less memory? A generic doubly linked list or a hand-made doubly linked list?
This is a data structure, optimization, software development problem. The reason we use any 3rd party generic (in C OR C++) doubly linked list is because it helps code the problem faster. Any time you use a component you need to be aware of its overhead. After profiling my C++ code, if I found that the std::list was taking up all my memory and all my cpu time, I would then evaluate the algorithms I am using and pick the data structure that best fits, be that a custom linked list or something else such as the above-mentioned Boost library.
You even have to nitpick the example about erase being expensive. When using std::list, why isn't the code passing around iterators rather than the person object? And as for the memory overhead of std::list, why didn't he use std::list<person>?
There is no problem using a hand-made C-style doubly linked list in your C++ code for the core part of your application for performance reasons. Being a C++ program doesn't mean you can't use C-style code or even have asm snippets. Following the same logic as the blog, ZeroMQ should have been written in asm and not C because it could have been faster, or, in reverse, it should have been written in Ruby because it would have been coded in a quarter of the time (but at the cost of a slower runtime).
As for the conclusion I would say that the inefficiency is still in ZeroMQ code as the author doesn't fully understand C++.
Why not pass iterators? Consider the case where the object is contained in two lists. You would have to pass a pair of iterators. What if it is in 3 lists? Etc.
Agreed, but for the example the object only needs to live in one list (as shown by the fact that it could have been a C struct containing the next/prev pointers), so passing an iterator would have been sufficient and would have solved the O(n) problem(s?) he was having.
Yup. It's possible to do that. My point was that if you want to do that you break the encapsulation, i.e. contained object would have to know about its container(s). This applies to Boost intrusive lists as well AFAICS.
Which is entirely true, and is often the case when performance is a goal - abstractions must be broken. But that is not an argument for or against C++.
class person
{
protected:
int age;
int weight;
};
template <class Base>
class people: public Base
{
private:
people *prev;
people *next;
};
people<person> plist;
That's what I thought at first, but friend is for when you want class A to be able to view private things of class B. That is, B needs to know ahead of time what A is interested in. He wants B to remain ignorant of what A wants - he wants A to be able to insert new fields into B. But we can accomplish the same thing with mixins.
In addition to that, he's effectively comparing apples and steaks in the case of std::list.erase() vs linked-list removal. If you want O(1) removal of a random element, you use a data structure that guarantees O(1) removal of a random element (caveat: not all lists are created equal, and not all lists have O(1) removal.)
That said, it's not even removal that's O(n) here, though he may characterize it this way: it's finding the element to remove, and list searches are nearly always[1] O(n) - even with doubly-linked lists.
[1]: Except in the case of CFArray/NSArray, which is occasionally a non-list masquerading as an array - this is, of course, not a true exception as it's not a true list. (http://ridiculousfish.com/blog/posts/array.html)
> If you want O(1) removal of a random element, you use a data structure that guarantees O(1) removal of a random element
Like std::list.
It's guaranteed that insertion and removal are done in constant time. It does mean that you need to already have an iterator pointing to the element you want to insert in front of (or erase).
If you want to be more specific, the guarantee is that, given a list of elements to insert, insertion and deletion will be linear in the number of elements inserted or deleted (i.e., independent of the size of the list being inserted into; the insertion or removal of each value passed to insert() or erase() is required to be O(1)).
(That's one thing that I like about the C++ STL: It usually guarantees a certain complexity for the data structures it provides)
Anything you can do in C you can do in C++. It is not that the author couldn't implement more memory efficient containers in C++ but that he chose not to even though he was aware of the drawbacks.
To get constant-time erasure of list elements, the author should be referring to the Person elements with a std::list<Person>::iterator instead of a Person* . Passing an iterator to erase gives you constant-time erasure.
An iterator is equivalent to a pointer (* and -> get you a Person), but it also lets the implementation access the linked list node.
This is a well known trade-off - your data structures affect the design and therefore run time profile of your algorithm.
A tightly coupled data structure is more efficient in time and memory, but harder to reuse, and makes it harder modify the code. A more general data structure is easier to reuse and makes it easier to modify the code, at the expense of time and memory at run time.
We use a combination of Moore's Law, and the cost of a programmer to justify the less efficient general solution in the near term, and explain that we'll optimize hot spots later. This mostly works. The Author has found himself lamenting this choice in piece of high-performance software.
If he has few lists, then coding the doubly-linked list pointers directly into his structs is the way to go. If he has many lists, I'd suggest using cpp or m4 (or some other templating tool) to statically generate doubly-linked lists at compile time.
My point was that the trade-off is not something unavoidable. A better language could allow you to syntactically decouple the tightly coupled objects. See the example at the end of the article.
I was actually agreeing with you, and elaborating on what you talked about.
I went back and read the bottom of the article. The Design Pattern people might call that a decorator. The ruby folks would use a mixin to monkey patch the person class. You're in good company.
The trade-offs are still code complexity + programmer time vs. memory use + run time. The more general you make something, the more overhead at run time.
What strikes me is the assumption that C++ is an object-oriented language. Yes, there is some support for the paradigm but it is neither great nor is it the main focus of the language. Developers have come up with reusable intrusive data-structure solutions, well knowing that they are bad for encapsulation and they are widely accepted. Most C++ developers wouldn't even shrug when you write a C-style list (in absence of a good library) precisely for the reasons stated in his article.
I was thinking the same after re-reading my post. I always thought it was a negative statement but it simply confirms what Stroustrup has always been saying about C++: "It's a multiparadigm programming language." Surely the support for one or the other paradigm could be better but this just reflects that things are shifting away a little from pure object-orientedness across languages.
Really? Because it started as "C with classes" IIRC. That you can use different paradigms, from procedural a la C to template-based programming etc., doesn't mean that it's not OO, or that it's not used for OOP 99% of the time.
Yes, that's how it started. But C with classes (or even C++ before standardization) is a really different language from what is there now.
As I said in another child of my post: It is a multiparadigm language. Of course a lot of people are using it for OOP. I don't dare to say what is the prevalent style of C++ nowadays. I don't know and I don't think anybody really does given the widespread use and different domains.
Okay, I hate C++ as much as the next guy, but this is ridiculous. There are dozens of ways to get similar code to what the author mentions.
So lets say a decent C++ programmer writes it with std::list<person *> then does some profiling and determines that heap fragmentation is becoming an issue. They can then refactor the code to use a more intrusive list type.
A crappy C++ programmer won't even know it's a problem, but presumably we are comparing programmers of similar expertise?
A relatively new C programmer would probably come up with the C code. From the discussion of this topic it appears the C++ solution needs quite a bit more expertise to get right (I would not have known what exact C++ method to use, but I don't pretend to be a C++ expert at this point).
Hmm, actually I would expect most of the C++ methods not to have a problem with heap fragmentation, if they are copying the list when manipulating. The C one doesn't know when it will alloc a new bit of memory, so you might potentially have a cache miss for every single node in the C list, making traversal the worst case scenario.
It might bloat up a code base, and the resulting binary (if you are using it just for the sake of it), and increase the build-time (if you aren't careful/don't know what you are doing), but it definitely doesn't bloat up the size of the code you are writing...
Having no experience with C++ the next statement will just further show my ignorance of the language.
I was under the impression that an optimizing C++ compiler would be able to inline the container object and the contained object into one when working with template code, so that you would end up with something exactly like the C version but without the manual bookkeeping.
At least I thought it could do this for certain types of classes, much like an optimizing C compiler can selectively inline functions based on heuristics.
You are on the right track here, but consider what the contained object is:
std::list <person*> people;
You're right that the instantiated list entry (what the article calls "helper") will directly include a value, but the value here is a pointer to a person, not a person.
I had overlooked that he was using a pointer to the struct, and so I was interpreting things to mean that C++ was actually doing under-the-hood allocations. I'm glad I asked though, cause I was really worried about C++ for a second there :)
So it seems like if the author had wanted he could have used C++ like a safer version of C macros, which if I ever need to use C++ will likely be my approach too.
It wouldn't even require an optimizing compiler just one that properly implemented templates. As megrimlock points out that would be a list of person not a list of pointers to person. Also his delete performance issue would not be an issue if done correctly. It would be a one liner of person_list.erase(person_iter).
So far as I can tell he is either trolling or has internalized procedural programming so much that he can't see any other path. I wonder what his thoughts on FP are.
If you always reference the person in it's people list (i.e. the object is "owned" by the list), you never need to pass around a ptr to the object of type:
person *
Instead you can pass an list iterator which references the "person in the list" of type
std::list<person *>::iterator
Then you don't need to do an O(n) walk of the list to find the entry for erasure, you can just people.erase(it) directly.
EDIT: A lot of people point out that iterator should be used instead of pointer. However, imagine the object is contained in 10 different lists. You would have to pass a structure containing 10 iterators around instead of the pointer. Moreover, it doesn't solve the encapsulation problem, just moves it elsewhere. Instead of modifying the "person" object every time you want to add it to a new type of container, you would have to modify the iterator tuple structure.
If the object is contained in 10 different lists, wouldn't the C version have 20 ptrs (10x prev and next)? That seems to me to be roughly equivalent complexity to your proposed "10x iterator" in C++, if I've understood correctly.
Basically, something needs to track your references if you have multiple lists. With C prev/next ptrs, the tracking is explicit in the object (well, the adjacent objects in the lists, to be more precise). With C++ containers, the tracking is iterator based.
With C++ you also have the option of using a smart pointer to do your usage tracking, which is probably simpler than either of the two approaches.
So you're saying one should use C over C++ because C++ doesn't provide an easy way do X and yet C doesn't even try? You seem to be complaining about a number of shortcomings in C++ and then preach that C is better for this even though the solutions you write in C have all the same shortcomings and more (not well-encapsulated). You're comparing apples to oranges. I don't quite understand the logic here at all.
The beauty of C++ is that you can use the features that you want and ignore the rest (and only pay for what you use). There is nothing stopping you from writing certain parts of your codebase in C-like ways where it makes sense and still benefit from other C++ language features.
The C version doesn't support having an object in multiple lists at all, so having it merely be awkward to have an object in multiple lists in the C++ version is still an advantage.
Incapsulation is a very questionable concept as well. However what C++ does provide is the generic programming: in C you will have to copy and paste implementation of List for every data structure as well as the algorithms. As suggested in the comment there are no inherent deficiencies in C++ and there is an implementation of intrusive lists in Boost. You don't have to use incapsulation if you don't want to.
Incapsulation is a very questionable concept as well.
One can live without it on a smaller scale. But for anything that will have any lifetime whatsoever it's absolutely essential. Distinguishing between a module's intended interface and its implementation details is what makes it possible to amend the implementation should the need arise. Without proper encapsulation every change must be assumed to be a breaking change, software maintenance costs go through the roof, module interactions become unpredictable, and that upstart kid who wears Chucks to work but knows how to write testable code eats your Wheaties.
The two are not separable. If people can monkey around with the data structures that your algorithm relies on, then the algorithm is not encapsulated.
Languages like Haskell rely on it as heavily as any other well-crafted system. Imagine Data.HashTable were originally implemented using linear probing, and someone wanted to modify (upgrade?) it to use cuckoo hashing or hopscotch hashing instead. It wouldn't be possible to do without breaking existing software if users were permitted to directly access the underlying data structures.
Thanks for correction. I am not a native English speaker and in Russian(my native language) the word Encapsulation starts with Cyrillic "и" to which the closest analogue in Latin is "i" - hence the confusion!
> I am not a native English speaker and in Russian(my native language) the word Encapsulation starts with Cyrillic "и" to which the closest analogue in Latin is "i" - hence the confusion!
I figured, I'm a native English speaker learning Russian, as it happens.
While I am an advocate of straight C programming, I wonder if his issues with his person list in c++ vs a person dlink list implementation was an incorrect comparison.
stl vector would have been the proper "optimized" way of implementing that list of people in c++, as vectors don't deal with the heap. If your vector's memory space can fit entirely in L2, we're looking at IMMENSE performance increases.
I think the comment was a bad phrasing of "a vector is a contiguous memory structure, whereas a list is a series of independent allocations at various points in the heap".
Sure, but also remember that while vectors are better at locality, both vectors and lists can trash the heap; both are making variable sized requests on demand from the allocator.
Didn't mean to imply it was always the case. Just a counterpoint to the conventional wisdom that fragmentation is only a problem with lots of tiny objects.
Let the underlying algorithm take care of it. No need to write out the loop. Also, this makes it easy to then switch to std::vector or a hash based list or something else where walking the list may not be required.
std::list (used correctly) is IMHO a great example of why you should write in C++, not in C.
Most large C projects I've seen end up with multiple implementations of lists. You will have some combination of:
void* for a generic list, performance issues because everything is a function call to another C file which won't get inlined, memory leaks, subtle bugs, thread safety issues because it's often unclear what guarantees the homebrew version provides.
Do you want to spend your time on reinventing the list or on things that provide value?
Hrm. Check out sys/queue.h in any BSD (Linux too - though Linux kernel code IIRC uses a different list implementation). You don't necessarily need void * for a generic list, and you most certainly don't have to reinvent the wheel.
queue.h is an improvement. You still need to somehow know this is in BSD and you'll need to copy it to build on different systems.
The thing is that in C, most people do reinvent the wheel, including the person in the original article and as you say, the Linux kernel. That's been my experience.
The level of disagreement and misunderstanding surrounding C++, its typing and implementation is a sign that the language and its community openly welcome extraneous cognitive load. Or it's a sign of my own confirmation bias, or both (or neither). Other languages have their quirks as well, but C++ stands out for me. I haven't done a study--perhaps there are studies on this, so my opinion amounts to so much line noise.
It has more to do with layout than with access control. Fields belonging to one class get stored and laid out with another object. They're accessed as part of the other object, but the locks they acquire are for the original object.
I see faint echoes of ruby's open classes if I squint..