Are there any languages that allow you to give the compiler a hint that you're about to grind over a gigantic dataset so don't bother to cache any of this data because it won't be accessed again for a long time? It seems like it could be helpful in keeping a big crunch from obliterating the cache constantly. You might also be able to apply other optimizations, like preloading the next data blocks so they're ready when the CPU rolls around. Maybe compilers already do this behind the scenes?
There are "stream" instructions available on some CPU. These instructions tell the CPU to store data, but not to load it into a cache line (it might actually load it into a temporary cache line that is not part of the normal cache lines, depending on architecture). This is useful is you are writing data to memory but don't read it - there's no point in loading that data to the cache line since you will not access it again.
One such instruction is exposed through the _mm_stream_si128 intrinsic:
void _mm_stream_si128(__m128i *p, __m128i a);
Stores the data in a to the address p without polluting the caches. If the cache line containing address p is already in the cache, the cache will be updated. Address p must be 16-byte aligned.
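For illustration, here's a minimal sketch of how that intrinsic might be used to fill a large buffer without dragging it through the cache. The function name stream_fill is made up for this example; it assumes SSE2, a 16-byte-aligned destination, and a size that's a multiple of 16:

    #include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_sfence */
    #include <stddef.h>
    #include <stdint.h>

    /* Fill n bytes at dst with a byte value using non-temporal stores,
       so the written data doesn't evict anything from the cache.
       Assumes dst is 16-byte aligned and n is a multiple of 16. */
    void stream_fill(void *dst, uint8_t value, size_t n)
    {
        __m128i v = _mm_set1_epi8((char)value);
        __m128i *p = (__m128i *)dst;
        for (size_t i = 0; i < n / 16; i++)
            _mm_stream_si128(p + i, v);  /* store, bypassing the cache */
        _mm_sfence();  /* non-temporal stores are weakly ordered; fence
                          before other cores may observe the writes */
    }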
Not the compiler per se, but you can remap memory to mark it as "non-cacheable". This is typically done in drivers, for instance when you access memory-mapped device registers that should most definitely never be cached.
You probably don't want to do that for regular RAM unless you access a lot of data only once and your access pattern is completely unpredictable (which is almost never the case).
Even if you only read the data once, having a cache means that the prefetcher can work its magic if it notices a pattern in your accesses, preloading what it thinks you'll read next into the cache ahead of you.
You can also help the prefetcher by explicitly adding preload instructions (if the ISA allows it, at least) when you know ahead of time where you'll read next but expect the prefetcher won't be able to guess it. That's useful if, for instance, you're navigating through some sort of linked-list data structure, where the access pattern looks random "from the outside" but your algorithm knows where it will go next.
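A rough sketch of what that looks like in C, using GCC/Clang's __builtin_prefetch. The node layout and sum_list are invented for the example; also note that a single node of lookahead is usually too little to hide memory latency, so real code tends to prefetch several hops ahead:

    #include <stddef.h>

    struct node {
        struct node *next;
        long payload;
    };

    long sum_list(const struct node *n)
    {
        long total = 0;
        while (n) {
            /* Hint: we'll want the next node soon. Second argument
               0 = prefetch for read, third 3 = high temporal locality. */
            if (n->next)
                __builtin_prefetch(n->next, 0, 3);
            total += n->payload;
            n = n->next;
        }
        return total;
    }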
Don't. Yes, there are instructions for this. No, don't use them unless you know exactly what you are doing and are optimizing for one specific, single µarch only; otherwise they will invariably hurt performance, not improve it.
Similarly, explicit prefetching usually doesn't improve performance but reduces it.
(Non-temporal stores are quite a good example here, since a game engine used them in a few spots until recently, causing not only worse performance on Intel chips but also heavily deteriorated performance on AMD's Zen µarch. Removing them improved performance across the board, for all chips. Ouch!)
I suspect you will find it's sort of unnecessary: most modern machines have stream detection built into the caches as well, and simple memcpys and the like won't actually roll the entire cache. They might roll one set, or simply stop filling after some number of cache lines. This is also why prefetch hints don't tend to help streaming operations.
That said, most modern machines also have stream-oriented instructions (SSE non-temporal stores, for example) that don't allocate a cache line on write, but you have to be careful, because workload after workload they have been shown to be slower. There are a couple of reasons, but the most important thing to remember is that microbenchmarks repeatedly copying the same piece of data over and over aren't really reflective of most workloads.
There are non-temporal loads and stores that bypass the cache, but last I checked, the non-temporal loads don't actually work with cacheable memory types - that may have changed by now.
The cache partitioning in Broadwell can help with this. If you have a very noisy background task that accesses a lot of data (garbage-collection scanning, for example), you can give it a small slice of the cache so that it doesn't evict the data the rest of your application uses. I got myself a Broadwell Xeon v4 for my desktop specifically so I could play with this feature.
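On Linux this feature (Intel's Cache Allocation Technology) is exposed through the resctrl filesystem. Here's a rough sketch of confining a process to a small L3 slice; the group name "noisy" and the two-way mask are made up for the example, and it assumes resctrl is already mounted at /sys/fs/resctrl and you're running as root:

    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    /* Put pid into a resctrl group that may only use 2 ways of L3
       on cache domain 0 (capacity bit mask 0x3). */
    int confine_noisy_task(pid_t pid)
    {
        FILE *f;
        mkdir("/sys/fs/resctrl/noisy", 0755);

        f = fopen("/sys/fs/resctrl/noisy/schemata", "w");
        if (!f) return -1;
        fprintf(f, "L3:0=3\n");  /* hex mask: two cache ways */
        fclose(f);

        f = fopen("/sys/fs/resctrl/noisy/tasks", "w");
        if (!f) return -1;
        fprintf(f, "%d\n", (int)pid);
        fclose(f);
        return 0;
    }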
The short answer is no, because the machine language doesn't allow general manipulation of the cache; the cache is pretty strictly controlled by the hardware. In general the hardware works very well for a wide variety of workloads. It can often do better than a compiler, because it works at runtime and is thus able to adapt to changes in workload. The preloading optimization you mentioned is essentially done already: it's called prefetching. The hardware prefetcher automatically pulls in memory near addresses you just accessed, because a lot of memory-intensive operations linearly scan a specific section of memory.
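A quick way to see the prefetcher at work is to sum the same large array twice, once in order and once in shuffled order. The toy benchmark below is only a rough illustration (buffer size and timing method chosen arbitrarily), but on most machines the shuffled pass is several times slower even though it touches exactly the same data:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 24)  /* 16M ints = 64 MiB, much bigger than L3 */

    int main(void)
    {
        int *a = malloc(N * sizeof *a);
        size_t *idx = malloc(N * sizeof *idx);
        if (!a || !idx) return 1;
        for (size_t i = 0; i < N; i++) { a[i] = (int)i; idx[i] = i; }

        /* Crude Fisher-Yates shuffle, fine for a demo */
        srand(1);
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
        }

        long long sum = 0;
        clock_t t0 = clock();
        for (size_t i = 0; i < N; i++) sum += a[i];      /* linear: prefetcher helps */
        clock_t t1 = clock();
        for (size_t i = 0; i < N; i++) sum += a[idx[i]]; /* random: it can't */
        clock_t t2 = clock();

        printf("linear %.0f ms, shuffled %.0f ms (sum=%lld)\n",
               (t1 - t0) * 1000.0 / CLOCKS_PER_SEC,
               (t2 - t1) * 1000.0 / CLOCKS_PER_SEC, sum);
        free(a); free(idx);
        return 0;
    }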