Even at 1 GHz, a trillion (10^12) writes is only 1000 seconds of work for a modern CPU. OK, latency is a thing, so multiply by 100 and it takes about a day. This is for DRAM, where cells are individually addressed. For flash with wear levelling the numbers of course get bigger.
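To make the arithmetic concrete, here is a rough sketch in C. The function name and the `slowdown` parameter are made up for illustration; it assumes one write per clock cycle and then scales by a latency factor.

```c
#include <assert.h>

/* Hypothetical helper: seconds needed to issue `writes` stores at
   `clock_hz` stores per second, scaled by a latency factor `slowdown`
   (1.0 means one store per cycle with no stalls). */
double seconds_to_issue(double writes, double clock_hz, double slowdown)
{
    assert(clock_hz > 0.0 && slowdown >= 1.0);
    return writes / clock_hz * slowdown;
}
```

With 10^12 writes at 10^9 per second this gives 1000 seconds. A 10x latency penalty gives 10,000 seconds (under three hours), and a 100x penalty gives 100,000 seconds, which is a bit over a day.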
Volatile requires the compiler to emit instructions that access the object. So if the object is in RAM, it will emit memory access instructions. However, on modern CPUs, those accesses will still hit the cache. You need to either map the memory as uncached or flush the caches to force an actual memory access.
No, that won't work. You'd have to clflush after every store. And even then, the cache line might only ever get as far as the write pending queue (WPQ), and that you can't control.
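For reference, a store-then-flush loop might look roughly like the sketch below. It assumes x86-64 with GCC or Clang, where `_mm_clflush` and `_mm_mfence` come from `<immintrin.h>`; `hammer_cell` is a made-up name, and on other architectures the flush is simply compiled out. As noted above, even this only pushes the line out of the CPU caches. It says nothing about whether the data has left the memory controller's write pending queue.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#define HAVE_CLFLUSH 1
#else
#define HAVE_CLFLUSH 0
#endif

/* Hypothetical helper: store to *cell `count` times, flushing the cache
   line after every store. Returns the value finally read back. */
uint64_t hammer_cell(volatile uint64_t *cell, uint64_t value, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        /* volatile: the compiler must emit a store on every iteration */
        *cell = value + i;
#if HAVE_CLFLUSH
        /* write the dirty line back and evict it from the cache hierarchy */
        _mm_clflush((const void *)cell);
        /* fence so the flush is ordered before the next store */
        _mm_mfence();
#endif
    }
    return *cell;
}
```

Calling `hammer_cell(&x, 5, 3)` on a `uint64_t x` stores 5, 6 and 7 in turn and returns 7. Whether any of those stores physically reached a DRAM cell is exactly the part you can't observe from software.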