Are there any languages that allow you to give the compiler a hint that you're about to grind over a gigantic dataset so don't bother to cache any of this data because it won't be accessed again for a long time? It seems like it could be helpful in keeping a big crunch from obliterating the cache constantly. You might also be able to apply other optimizations, like preloading the next data blocks so they're ready when the CPU rolls around. Maybe compilers already do this behind the scenes?
There are "stream" instructions available on some CPU. These instructions tell the CPU to store data, but not to load it into a cache line (it might actually load it into a temporary cache line that is not part of the normal cache lines, depending on architecture). This is useful is you are writing data to memory but don't read it - there's no point in loading that data to the cache line since you will not access it again.
One such instruction is exposed through the _mm_stream_si128 intrinsic:
void _mm_stream_si128(__m128i *p, __m128i a);
Stores the data in a to the address p without polluting the caches. If the cache line containing address p is already in the cache, the cache will be updated. Address p must be 16-byte aligned.
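For illustration, here's a minimal sketch of how that intrinsic might be used to fill a large buffer without dragging it through the cache. The function name stream_fill is made up for this example; it assumes SSE2, a 16-byte-aligned destination, and a size that's a multiple of 16:

    #include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_sfence */
    #include <stddef.h>
    #include <stdint.h>

    /* Fill n bytes at dst with a byte value using non-temporal stores,
       so the written data doesn't evict anything from the cache.
       Assumes dst is 16-byte aligned and n is a multiple of 16. */
    void stream_fill(void *dst, uint8_t value, size_t n)
    {
        __m128i v = _mm_set1_epi8((char)value);
        __m128i *p = (__m128i *)dst;
        for (size_t i = 0; i < n / 16; i++)
            _mm_stream_si128(p + i, v);  /* store, bypassing the cache */
        _mm_sfence();  /* non-temporal stores are weakly ordered; fence
                          before other cores may observe the writes */
    }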
Not the compiler per se, but you can remap memory to mark it as "non-cacheable". This is typically done in drivers, for instance when you access memory-mapped device registers that should most definitely never be cached.
You probably don't want to do that for regular RAM unless you access a lot of data only once and your access pattern is completely unpredictable (which is almost never the case).
Even if you only read the data once, having a cache means that the prefetcher can work its magic if it notices a pattern in your accesses, preloading what it thinks you'll read next into the cache ahead of you.
You can also help the prefetcher by explicitly adding preload instructions (if the ISA allows it, at least) when you know ahead of time where you'll read next but expect the prefetcher won't be able to guess it. That's useful if, for instance, you're navigating through some sort of linked-list data structure, where the access pattern looks random "from the outside" but your algorithm knows where it will go next.
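A rough sketch of what that looks like in C, using GCC/Clang's __builtin_prefetch. The node layout and sum_list are invented for the example; also note that a single node of lookahead is usually too little to hide memory latency, so real code tends to prefetch several hops ahead:

    #include <stddef.h>

    struct node {
        struct node *next;
        long payload;
    };

    long sum_list(const struct node *n)
    {
        long total = 0;
        while (n) {
            /* Hint: we'll want the next node soon. Second argument
               0 = prefetch for read, third 3 = high temporal locality. */
            if (n->next)
                __builtin_prefetch(n->next, 0, 3);
            total += n->payload;
            n = n->next;
        }
        return total;
    }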
Don't. Yes, there are instructions for this. No, don't use them unless you know exactly what you are doing and are optimizing for one specific, single µarch only; otherwise they will invariably hurt performance, not improve it.
Similarly, explicit prefetching usually doesn't improve performance but reduces it.
(Non-temporal stores are quite a good example here, since a game engine used them in a few spots until recently, causing not only worse performance on Intel chips but also heavily deteriorated performance on AMD's Zen µarch. Removing them improved performance across the board, for all chips. Ouch!)
I suspect you will find it's sort of unnecessary: most modern machines have stream detection built into the caches as well, and simple memcpys and the like won't actually roll the entire cache. They might roll one set, or simply stop filling after some number of cache lines. This is also why prefetch hints don't tend to help streaming operations.
That said, most modern machines also have stream-oriented instructions (SSE non-temporal stores, for example) that don't allocate a cache line on write, but you have to be careful, because workload after workload they have been shown to be slower. There are a couple of reasons, but the most important thing to remember is that microbenchmarks repeatedly copying the same piece of data over and over aren't really reflective of most workloads.
There are non-temporal loads and stores that bypass the cache, but last I checked, the non-temporal loads don't actually work with cacheable memory types - that may have changed by now.
The cache partitioning in Broadwell can help with this. If you have a very noisy background task that accesses a lot of data (garbage-collection scanning, for example), you can give it a small slice of the cache so that it doesn't evict the data the rest of your application uses. I got myself a Broadwell Xeon v4 for my desktop specifically so I could play with this feature.
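On Linux this feature (Intel's Cache Allocation Technology) is exposed through the resctrl filesystem. Here's a rough sketch of confining a process to a small L3 slice; the group name "noisy" and the two-way mask are made up for the example, and it assumes resctrl is already mounted at /sys/fs/resctrl and you're running as root:

    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    /* Put pid into a resctrl group that may only use 2 ways of L3
       on cache domain 0 (capacity bit mask 0x3). */
    int confine_noisy_task(pid_t pid)
    {
        FILE *f;
        mkdir("/sys/fs/resctrl/noisy", 0755);

        f = fopen("/sys/fs/resctrl/noisy/schemata", "w");
        if (!f) return -1;
        fprintf(f, "L3:0=3\n");  /* hex mask: two cache ways */
        fclose(f);

        f = fopen("/sys/fs/resctrl/noisy/tasks", "w");
        if (!f) return -1;
        fprintf(f, "%d\n", (int)pid);
        fclose(f);
        return 0;
    }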
The short answer is no, because the machine language doesn't allow general manipulation of the cache; the cache is pretty strictly controlled by the hardware. In general the hardware works very well for a wide variety of workloads. It can often do better than a compiler, because it works at runtime and is thus able to adapt to changes in workload. The preloading optimization you mentioned is essentially done already: it's called prefetching. The hardware prefetcher automatically pulls in memory near addresses you just accessed, because a lot of memory-intensive operations linearly scan a specific section of memory.
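A quick way to see the prefetcher at work is to sum the same large array twice, once in order and once in shuffled order. The toy benchmark below is only a rough illustration (buffer size and timing method chosen arbitrarily), but on most machines the shuffled pass is several times slower even though it touches exactly the same data:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 24)  /* 16M ints = 64 MiB, much bigger than L3 */

    int main(void)
    {
        int *a = malloc(N * sizeof *a);
        size_t *idx = malloc(N * sizeof *idx);
        if (!a || !idx) return 1;
        for (size_t i = 0; i < N; i++) { a[i] = (int)i; idx[i] = i; }

        /* Crude Fisher-Yates shuffle, fine for a demo */
        srand(1);
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
        }

        long long sum = 0;
        clock_t t0 = clock();
        for (size_t i = 0; i < N; i++) sum += a[i];      /* linear: prefetcher helps */
        clock_t t1 = clock();
        for (size_t i = 0; i < N; i++) sum += a[idx[i]]; /* random: it can't */
        clock_t t2 = clock();

        printf("linear %.0f ms, shuffled %.0f ms (sum=%lld)\n",
               (t1 - t0) * 1000.0 / CLOCKS_PER_SEC,
               (t2 - t1) * 1000.0 / CLOCKS_PER_SEC, sum);
        free(a); free(idx);
        return 0;
    }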