GCC 4.7 adds transactional memory extensions for C/C++ (gcc.gnu.org)
178 points by randombit on Nov 21, 2011 | 32 comments



On a related note, the cilkplus branch of GCC 4.7 contains the Cilk work-stealing multithreading runtime and language extension that Intel has open-sourced. http://software.intel.com/en-us/articles/intel-cilk-plus-spe...

Exciting times ahead.


Cilk is definitely cool, but I don't think it makes designing parallel programs any easier; it just removes a lot of boilerplate. I have worked with similar infrastructures (without the language support) and still found it extremely difficult. I looked into Cilk and I don't think that the language support would have been a game changer. Does anyone have experience with Cilk? What have been your experiences?


How do you know all of this if you haven't tried it? I think you'll find Cilk a lot more polished and refined than you might at first realize. Having language-level support for fine-grained parallelism together with a provably efficient scheduler is a huge win.

But you'll have to be more specific with your complaints. What kind of parallel programs are you trying to design? What similar infrastructures have you worked with?


I was working with a small library that I implemented (with help) on a grad school research project. We were experimenting with various low-level techniques/approaches to how one would implement an operating system for a 1000+ core chip. We were experimenting with memory management and scheduling designs. In order to test our designs we wrote a number of toy benchmark programs. Basically the typical set: merge_sort, fft, mat_mult, kmeans.... We ended up with a programming model that is not dissimilar to the Cilk model, though without the compiler support. So we were manually managing continuation scheduling with latches on atomic variables. This was a somewhat annoying thing to do, but we were able to make it work. I looked at Cilk at the time and it definitely would have been nice to have compiler support, but I don't think it would have fundamentally changed the way we implemented our algorithms.


If you're parallelizing algorithms with divide-and-conquer behavior (recursive algorithms, or anything that forms a tree of tasks and sub-tasks), I think it's a very natural form of parallelism.

I did something similar as a C++ library for my Master's: http://people.cs.vt.edu/~scschnei/factory/ I felt it was an intuitive way of expressing task parallelism, but if the dependent tasks don't operate on strict subsets of each other's data, I agree it's not much help. If that's the case, then you've successfully expressed the parallelism, but the larger problem remains: synchronized access to data structures.

Of course, transactional memory could help there. (Hey, full circle.)


Cilk is definitely more than syntactic sugar.

How would you implement work-stealing in pure C?


I've done it. You need hardware support to do it without locks. Here is a paper describing an implementation using CAS: http://www.cse.chalmers.se/~tsigas/papers/JPDC-Lock-free-ski...

In order to do work stealing in general you just need to organize your computations into tasks. Having a language that supports closures (or something similar) makes it easier, but it is not necessary, because you can just use function pointers to define tasks (and some kind of struct to pass args), as in the sketch below.
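
Roughly, in C11 terms, a task is just a function pointer plus an args struct, and a thief claims one with a CAS. This is only a sketch with invented names (a real work-stealing deque like Chase-Lev is considerably more involved):

  #include <stdatomic.h>
  #include <stddef.h>

  /* A task is nothing more than a function pointer plus a pointer
     to an argument struct. */
  typedef struct task {
      void (*fn)(void *);
      void *args;
  } task;

  #define QSIZE 1024
  static task queue[QSIZE];        /* filled by the owning worker */
  static atomic_size_t head;       /* next task to claim          */
  static atomic_size_t tail;       /* one past the last task      */

  /* A thief claims the task at `head` with a compare-and-swap, so
     two workers can never run the same task.  Returns 0 if the
     queue looked empty. */
  static int try_claim(task *out)
  {
      size_t h = atomic_load(&head);
      while (h < atomic_load(&tail)) {
          if (atomic_compare_exchange_weak(&head, &h, h + 1)) {
              *out = queue[h % QSIZE];
              return 1;
          }
          /* CAS failed: h now holds the current head, try again */
      }
      return 0;
  }

  static void run_one(void)
  {
      task t;
      if (try_claim(&t))
          t.fn(t.args);
  }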


That is awesome.

I've been using g++-4.6 lately and with the new lambdas I'm migrating more and more to this async task application model.

One point that seems frequently understated amid the lamentations about the absence of a true double-pointer-sized DCAS is that on machines with a 64-bit CAS (e.g. x86 and x64) you could still implement something useful: CAS on pairs of 32-bit handles.

2^32 concurrent work items "ought to be enough for anybody", right? :-) Well, it might be useful to those of us with less than 100 GB of RAM.
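
For example (just a sketch, names made up): treat the two 32-bit handles as indices into a preallocated table, pack them into one 64-bit word, and a plain 64-bit CAS updates both at once.

  #include <stdatomic.h>
  #include <stdint.h>

  /* Two 32-bit handles (indices into a preallocated table, not real
     pointers) packed into one 64-bit word, so a single 64-bit CAS
     can update both halves at once. */
  static inline uint64_t pack(uint32_t hi, uint32_t lo)
  {
      return ((uint64_t)hi << 32) | lo;
  }

  static _Atomic uint64_t pair;    /* e.g. {next handle, version tag} */

  static int update_pair(uint32_t old_hi, uint32_t old_lo,
                         uint32_t new_hi, uint32_t new_lo)
  {
      uint64_t expected = pack(old_hi, old_lo);
      uint64_t desired  = pack(new_hi, new_lo);
      /* Succeeds only if both halves are still what we expect. */
      return atomic_compare_exchange_strong(&pair, &expected, desired);
  }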


Luckily these days people are designing algorithms/data structures that don't rely on DCAS :) So you can actually implement them on real hardware


Hmm, maybe they should.

In the paper you linked they don't seem to be able to perform a priority queue op in under about 1 us. Even when there are only 15 threads running on an exotic 29-core machine, they can only handle 750e3 ops/second (with no workload on each op). (This is consistent with a benchmark I did on Boost.ASIO a while back.)

So the tasks should be sized to take at least (thread cnt) * 9 us to complete, to keep the priority queue overhead under about 10%. The best cases for the lock-free algorithm might let you bring that down to about 3 us in theory. I'm not sure that justifies the additional complexity.

This survey paper http://www.zurich.ibm.com/pdf/sys/adv_messaging/shortConcurr... says a lot of stuff, and then at the end Figure 4 shows a simple coarse-locked single-threaded structure providing double the performance when there is actual load involved and significant preemption. That's the opposite of what I'd expected, which was that occasional preemption of the thread holding the coarse lock would cause a slowdown.


In my investigations I did find the overhead to be fairly high, and in most of my benchmarks I found that using simple fixed-size FIFO queues was significantly faster. However, my benchmarks were generally simple CPU-bound tasks; you would definitely want to explore more varied workloads. Also, since you are using a priority-based scheduling algorithm, presumably the priorities are significant and therefore sacrificing some throughput would be worth it.


I think this is the implementation (pdf article available): http://www.velox-project.eu/velox-transactional-memory-stack

They point to the Velox project, which has many published papers. But this paper has Ulrich Drepper of Red Hat as a co-author. Since Drepper is active in glibc, I can imagine he worked with them on integration. The notation in the article also looks like what's shown on the website.

There's plenty of other work that could have gone into this implementation: http://www.velox-project.eu/publications There's a full TM system that tries to use idle cores or SMT threads (also known as hyperthreads) for the transactions, called STM2. Then some papers on lock-free techniques, static analysis, and a benchmark suite. There's also what looks like a direct response to the infamous "STM: Why Is It Only a Research Toy?" (http://queue.acm.org/detail.cfm?id=1454466) article: http://www.velox-project.eu/why-stm-can-be-more-research-toy

I don't know for sure, of course. The STM2 paper published at PACT this year also looks interesting. Email me if you'd like to read it.

Edit: the paper I linked to at the top says it's implemented in gcc.


Is this the first step towards a GCC that would have all the features of Clojure? That would be incredibly useful to me for one - I love Clojure but just cannot make any sense of what the JVM tells me when I screw up.


The short answer is no: STM is only a small (and, some smart people suggest, overrated) part of the Clojure infrastructure.

But the Deep Thinkers in the Clojure community feel your stack trace pain, and now that 1.3 is in the can it seems to me that there was renewed enthusiasm at Conj for doing something about debugging clarity. You shouldn't give up hope yet.


In the meantime, I'd recommend installing clj-stacktrace as a leiningen plugin. It's far from perfection, but it's an improvement. There's a technomancy article describing how to do it.


http://nickclifton.livejournal.com/9501.html

"The support implements and tracks the Linux variant of Intel's Transactional Memory ABI specification document. Currently this is at revision 1.1, (May 6 2009). For more information see:

http://software.intel.com/en-us/articles/intel-c-stm-compile... "


I have a fundamental question regarding STM in general: for 'manual' locking we need to worry about deadlocks; for STM, I feel livelock would be more sinister, and extremely hard to debug/reason about. Not to mention the fact that it would make client code non-composable as the transaction size or the system load increases.

Or am I missing something? Thanks for your insights!


Two points to bear in mind:

1. If livelocks are a problem coming from load, rather than from bad interactions between components, you are likely to have a choice between (i) pessimistic, where most threads do nothing, and (ii) optimistic, where most threads do work that gets thrown away. In practice, optimistic tends to be faster: doing worthless work is no worse than doing nothing, and the committer is working with the results of successful computations, whereas the locking algorithm doesn't know in advance which computations will work out;

2. "Extremely hard to reason about" is just how it is with threads. I haven't done enough concurrent programming to really say, but the optimistic commit model seems more intuitive than the pessimistic locking model. Peyton Jones makes this point forcefully in Beautiful Concurrency: http://research.microsoft.com/pubs/74063/beautiful.pdf


I was hoping for more information on how they implement it... there's nothing in there about which hardware facilities they use, and they say that at worst STM is a global lock for the process.

Hopefully they're at least using some kind of compiler analysis to only use the same lock across transactions if it's touching the same memory addresses (pessimistically of course)?


GCC will probably not use any specific hardware facilities, which means this is probably going to be implemented with regular atomic operations.

Within a transaction block, the results of all reads are stored (in local, hidden variables). When the transaction is about to finish, all reads are repeated, and if any of them yields a different result, the transaction is restarted. When the transaction is committed, there will likely be some kind of global lock (held for a very short time).

As GCC probably doesn't want to depend on any particular threading or locking library, it's most likely that the write lock will be a spinlock using an atomic read-modify-write and some kind of yield instruction (monitor/mwait on newer CPUs, pause on older ones).

As far as I can see, there really aren't lots of other methods to implement STM, especially from within the C compiler.
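
Something like this, purely as a sketch of the scheme described above (all names invented; this is not the actual libitm code): log reads, buffer writes, and at commit time take a small spinlock, re-validate the read set, and either publish or restart.

  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stddef.h>

  typedef struct { int *addr; int seen; } read_entry;
  typedef struct { int *addr; int val;  } write_entry;

  typedef struct {
      read_entry  reads[64];   size_t nreads;
      write_entry writes[64];  size_t nwrites;
  } txn;

  static atomic_flag commit_lock = ATOMIC_FLAG_INIT;

  /* Returns true on success; on false the caller re-runs the whole
     transaction body from the start. */
  static bool txn_commit(txn *t)
  {
      while (atomic_flag_test_and_set_explicit(&commit_lock,
                                               memory_order_acquire))
          ;  /* spin; a real implementation would pause/yield here */

      /* Validate: repeat every read and compare with what the
         transaction saw the first time. */
      for (size_t i = 0; i < t->nreads; i++) {
          if (*t->reads[i].addr != t->reads[i].seen) {
              atomic_flag_clear_explicit(&commit_lock,
                                         memory_order_release);
              return false;               /* conflict: restart */
          }
      }
      /* Publish the buffered writes while still holding the lock. */
      for (size_t i = 0; i < t->nwrites; i++)
          *t->writes[i].addr = t->writes[i].val;

      atomic_flag_clear_explicit(&commit_lock, memory_order_release);
      return true;
  }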


The article hints that GCC uses a combination of HTM and STM, if HTM is available.


The paper I linked to claims they have an STM, and a hybrid HTM-STM system if hardware support is available.


Since you can't do all reads in one atomic instruction and you also need to make it atomic with the write (CAS), wouldn't that still require a lock for the whole operation?


As long as you write to separate parts of memory and are cautious with freeing memory you don't need locks for reads.


How can you make sure there are no writes to the memory you are reading from?


Don't write to the same location. Everything needs to be a pointer, but updating a pointer is an atomic operation. E.g., assume a is an integer:

  a->(0x00010001)->5
  a->(0x00030001)->6
You can keep reading 0x00010001 and getting 5 all day even as a is "actually" 6. This also works with strings or objects, etc.; the only downside is you tend to eat up a fair amount of memory, and you need to avoid freeing 0x00010001 while something still thinks a's value is stored there.
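
In C11 terms it looks something like this (illustrative only): readers atomically load the pointer and dereference whatever snapshot it pointed to, and the writer publishes a freshly allocated value by swapping the pointer.

  #include <stdatomic.h>
  #include <stdlib.h>

  static int *_Atomic a;           /* always points at the current value */

  void init(void)
  {
      int *p = malloc(sizeof *p);
      *p = 5;
      atomic_store(&a, p);         /* publish the initial value */
  }

  int reader(void)
  {
      /* A reader may keep seeing the old block (5) even after the
         writer has published a new one (6); that snapshot stays
         valid as long as the old block isn't freed too early. */
      return *atomic_load(&a);
  }

  void writer(void)
  {
      int *p = malloc(sizeof *p);
      *p = 6;
      atomic_store(&a, p);         /* the old block can only be freed
                                      once no reader still uses it */
  }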


You still can't read or write more than two pointers atomically on x86_64, so my question remains.


All you need to do atomically is read the pointer; the location it points to can then be read normally. So allocating a 50 KB string would work the same way, as long as you could store it in a specific process's memory.

  PX a=(0x00010001) //which points to 5
  P0 x0=a=(0x00010001) //which points to 5
  P1 y0=a
  P1 pointer y1=0
  P1 y1 = malloc(sizeof(int))
  P2 x1=a=(0x00010001) //which points to 5
  P1 *y1 = *y0 + 1 //aka 6
  P3 x2=a=(0x00010001) //which points to 5
  P1 a=y1 //you could take a lock to verify that a == (0x00010001), but if you don't care about dirty writes then you can also do this as an atomic operation.
  P3 x3=a=(0x00030001) //which points to 6
And once x0, x1, x2 stop pointing to (0x00010001) you can free that memory. The assumption is that each xN and yN is a process-specific local variable, preferably a register.


I couldn't really follow your example. I can't see how this solves race conditions where multiple threads can intermingle those instructions at will. That's the basic problem with lock-free algorithms.

Did you maybe suggest adding one more indirection, so that all variables in a transaction are in one memory block behind a single pointer, which can be atomically updated to point to a newly allocated block?

I wonder what the overhead of that would be. All the allocations plus either keeping track of references or doing garbage collection...


Sorry, I will try to be more clear. There are some good descriptions of this technique on the web, but I can't remember what it's called. Still, you have the basic idea and its downsides. Normally, though, you're dealing with larger objects, so the overhead is a little different.

Just think about how you normally work with objects. Normally you have a pointer to some allocated memory, locationObjectPointer = (x00100), which points to a chunk of memory the size of 3 floats (x, y, z). Now, normally when you update x and y you overwrite their memory locations, which is fast in single-threaded programming, but you need to worry about something reading after you update x at (x00100) but before you update y at (x00104).

However, instead of updating in that memory block, you could create a new object, locationObjectPointer2 = (x00600), copy what was in the original object, and then, when you're done, change the pointer from the old object to your new one. As you say, you have the overhead of keeping track of references or doing garbage collection, but at the same time you can do a very fast lock when you want to change the pointer and test whether the object was changed. That test is actually hard to do with most types of memory management systems and normally creates a lot of overhead, so it's a trade-off.
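
As a sketch (invented names), the copy-then-swap on that three-float object might look like this, where a CAS on the pointer doubles as the "was it modified?" test:

  #include <stdatomic.h>
  #include <stdlib.h>
  #include <string.h>

  typedef struct { float x, y, z; } location;

  static location *_Atomic locationObjectPointer;

  /* Copy the current object, modify the copy, then try to swing the
     pointer.  If another thread changed it in the meantime the CAS
     fails and we retry; that failure is the cheap "was it changed?"
     test. */
  static void move_xy(float dx, float dy)
  {
      for (;;) {
          location *old = atomic_load(&locationObjectPointer);
          location *copy = malloc(sizeof *copy);
          memcpy(copy, old, sizeof *copy);
          copy->x += dx;
          copy->y += dy;
          if (atomic_compare_exchange_strong(&locationObjectPointer,
                                             &old, copy)) {
              /* `old` must be reclaimed later, once no reader uses it */
              return;
          }
          free(copy);   /* lost the race; copy the newer object and retry */
      }
  }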


This is pretty cool. Is it done naively, using one global lock, or is it more intelligent? Can GCC identify what memory locations require locking for a given transaction, and lock just those? What is the granularity?


It's smarter than a global lock (come on!). I would suggest reading the article which describes the implementation [1].

[1]: http://www.velox-project.eu/velox-transactional-memory-stack (courtesy scott_s)



