Article says:

> Obviously this 8 – 10GHz clock range would be based on Intel's 0.07-micron process that is forecasted to debut in 2005. These processors will run at less than 1 volt, 0.85v being the current estimate.
Intel introduced a 65 nm (0.065 micron) process in 2006. The "Cedar Mill" Pentium 4 ran at 3.6 GHz at a whopping 1.3V, although a small double-pumped part of the processor (the fast ALUs) ran at 7.2 GHz. It could be overclocked to 4.5/9.0 GHz at 1.4V.
The discrepancy between 0.85V and 1.3V was caused by the end of Dennard scaling: supply voltage was supposed to keep shrinking along with feature size, but it stopped. Basically, transistors require much more voltage than predicted and thus consume far more power than predicted. Although the transistors can technically run at 9 GHz, the resulting power density is very difficult to cool.
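To put rough numbers on that, here's a minimal sketch assuming the usual dynamic-power relation P ≈ C·V²·f and ignoring leakage (the capacitance and the 9 GHz figure are placeholders; only the voltage ratio matters):

```c
#include <stdio.h>

/* Rough dynamic-power sketch, P ~ C * V^2 * f (switching power only,
 * leakage ignored). The capacitance is an arbitrary placeholder;
 * only the ratio between the two scenarios means anything. */
int main(void) {
    double C = 1.0;                            /* arbitrary units */
    double f = 9.0e9;                          /* same 9 GHz clock in both cases */
    double p_predicted = C * 0.85 * 0.85 * f;  /* the 2000-era 0.85V forecast */
    double p_actual    = C * 1.30 * 1.30 * f;  /* what the 65 nm part really needed */
    printf("power at the same clock: %.2fx the prediction\n",
           p_actual / p_predicted);
    return 0;
}
```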
They are not used because, with transistors that keep getting smaller but that don't get much faster, it's better to have more add units in parallel than to run the ones you have faster. A modern Intel CPU effectively has 4 add units to the P4's 2.
> Although the transistors can technically run at 9 GHz, the resulting power density is very difficult to cool.
But nowadays we have processors with multiple cores, where sometimes you need only 1 core (and it needs to be fast). So would it be an idea to increase clock frequency for those cores, but multiplex them quickly to allow them to cool?
While the clock speed has largely stagnated, the actual work done per cycle, even on just one core, has gone up significantly. Consider: a double-precision fused multiply-add has a result latency of only 4 cycles today. The number of memory operations (and other instructions) in flight at any given moment has gone up dramatically, the number of execution units has gone up a bit (so the maximum instruction-level parallelism is higher), and so on.
It's not the rapid growth of the '90s and early 2000s, but it is still growth.
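To illustrate the latency-hiding point above, here's a sketch of my own (the ~4-cycle FMA latency and two FMA pipes are assumptions about a typical recent core; the function name and unrolling factor are just for illustration):

```c
#include <math.h>

/* Sketch of hiding FMA latency with independent accumulators. With
 * roughly 4 cycles of FMA latency and 2 FMA pipes, about 8 independent
 * dependency chains are needed to keep the units busy. */
double dot(const double *a, const double *b, long n) {
    double acc[8] = {0};
    long i = 0;
    for (; i + 8 <= n; i += 8)
        for (int k = 0; k < 8; k++)          /* 8 independent chains */
            acc[k] = fma(a[i + k], b[i + k], acc[k]);
    double sum = 0;
    for (int k = 0; k < 8; k++)
        sum += acc[k];
    for (; i < n; i++)                       /* leftover elements */
        sum += a[i] * b[i];
    return sum;
}
```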
I would argue that we pretty much reached peak single-core x86_64 scalar instruction stream concurrency ~6 years ago with Sandy Bridge. SIMD has gotten wider since then (AVX2, etc) and there are occasional new instructions for certain workloads (including FMA as you noted), but general purpose scalar (non-SIMD) workloads have not gotten much of an IPC boost. Actually, to the extent that those workloads have gotten faster, it's been mostly from the return of clock frequency scaling - from low 3GHz to low 4GHz on the desktop SKUs.
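For a sense of what that SIMD widening looks like in code, a minimal sketch (my example, not the commenter's; it assumes AVX2+FMA hardware, matching compiler flags such as gcc -mavx2 -mfma, and n divisible by 4):

```c
#include <immintrin.h>

/* Sketch of what "wider SIMD" buys: one 256-bit FMA handles four
 * doubles per instruction. dot_avx2() is just an illustrative name. */
double dot_avx2(const double *a, const double *b, long n) {
    __m256d acc = _mm256_setzero_pd();
    for (long i = 0; i < n; i += 4)
        acc = _mm256_fmadd_pd(_mm256_loadu_pd(a + i),
                              _mm256_loadu_pd(b + i), acc);
    double lanes[4];
    _mm256_storeu_pd(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```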
Nope. It is not just thermals but also memory latency. If you have four cores and each has two register files (i.e. two hardware threads each), you can get 8x the bandwidth at the same latency.
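One rough way to see that, using my own illustrative numbers rather than anything from the comment: by Little's law, sustained bandwidth is roughly requests-in-flight divided by latency, so more hardware threads issuing independent misses buys bandwidth without touching latency.

```c
#include <stdio.h>

/* Little's-law style estimate: achievable memory bandwidth scales with
 * the number of requests in flight when per-request latency is fixed.
 * The latency, line size, and misses-per-thread figures are
 * illustrative assumptions, not measurements. */
int main(void) {
    double latency_ns = 80.0;   /* assumed DRAM access latency */
    double line_bytes = 64.0;   /* one cache line per request */
    for (int threads = 1; threads <= 8; threads *= 2) {
        double in_flight = threads * 10.0;                  /* ~10 outstanding misses/thread */
        double gbps = in_flight * line_bytes / latency_ns;  /* bytes per ns == GB/s */
        printf("%d thread(s): ~%.0f GB/s\n", threads, gbps);
    }
    return 0;
}
```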
That 10 GHz talk was a lie on the part of Intel to intimidate people away from AMD: not only would a 10 GHz P4 melt down, it would also be stalled all the time by memory latency. So many things did not work that it could not have been an honest mistake.
Today there is talk of a big clock rate bump (to 200 GHz or so) if they go to a different semiconductor, but at that point you probably need a fiber-optic or terahertz-wave link to memory to keep the pipeline full.
> Today there is talk of a big clock rate bump (to 200 GHz or so) if they go to a different semiconductor, but at that point you probably need a fiber-optic or terahertz-wave link to memory to keep the pipeline full.
You talk as if there couldn't possibly be a benefit to an increase in speed without a corresponding increase in memory bandwidth. Whilst it wouldn't be an optimally efficient system, if we /could/ bump to 9GHz (or 200GHz), wouldn't it be worth doing so for at least some kinds of calculations, even if the memory can't keep up?
edit: Both responses were super-interesting. Don't wanna reply to both, but thanks all :)
There's a term for this: computational intensity, i.e. the ratio of useful compute operations to memory loads in an app.
Are there apps that have high computational intensity? Sure, matrix multiply is one of them. That's one of the reasons why dense linear algebra serves as the standard benchmark for ranking the top 500 supercomputers in the world.
But even in HPC (high performance computing), many if not most apps actually have relatively low computational intensity (i.e. in the range of one or so compute operations per word of memory loaded). In this regime, it really doesn't make sense to grow compute out of proportion with memory bandwidth because you'll just be idling the processors.
And while I have no proof, I'd expect HPC applications to generally be more computationally intense than general consumer computing tasks. So I'd expect that computational intensity goes mostly down from here.
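To make the ratio concrete, a back-of-the-envelope sketch with my own example size and the standard rough formulas (ideal cache reuse assumed for the matrix multiply):

```c
#include <stdio.h>

/* Back-of-the-envelope computational intensity, in flops per 8-byte
 * word of data: a dot product does 2n flops over 2n words, while an
 * n x n matrix multiply does 2*n^3 flops over 3*n^2 words (assuming
 * ideal cache reuse). The n below is an arbitrary example size. */
int main(void) {
    double n = 1000.0;
    double dot_intensity    = (2.0 * n) / (2.0 * n);             /* ~1 flop/word */
    double matmul_intensity = (2.0 * n * n * n) / (3.0 * n * n); /* ~2n/3 flops/word */
    printf("dot product:     %.1f flops/word\n", dot_intensity);
    printf("matrix multiply: %.1f flops/word\n", matmul_intensity);
    return 0;
}
```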
Maybe you could do cryptography more quickly, but for general-purpose computing, and even most specialized tasks, memory latency and bandwidth are critical. For instance, look at the use of GDDR5 and HBM with GPUs.
Most of the market is for things that are generalizable; maybe you could make some kind of hyper-DSP for millimeter-wave base stations or something like that, but then you have to spread the development cost across a small number of units.
That is a long-term target: you can definitely get transistors to switch that fast. 200 GHz would not be a short-term target, but could it happen in 20 years? Maybe.
When the engine did NOT have coolant, it would run only one bank of cylinders at a time and alternate between the two to let them cool.
The system was at least somewhat effective. There's a story from the time about a journalist who was testing the feature in the desert. He stopped at a truck stop after conducting the test and amazed the folks at the stop by opening the hood, filling the engine with coolant, and driving off like nothing was amiss.
That's only a bit different from Intel Turbo Boost, which has been widely deployed for a while: when fewer cores are needed, it will increase the clock speed on a small number of cores if there is work to do.
Some BIOSes have settings to completely disable a bunch of cores to enable more turbo boosting.
This isn't as thermally sophisticated as what you outlined, but it saves on cache coherence.
The general term for that is Dark Silicon [1]. It helps a little and, as others have pointed out, Intel CPUs already have a similar feature called "Turbo Boost"; NVIDIA processors have a similar "GPU Boost" feature as well. But I don't know whether that can enable a single core to run at 10GHz. Shutting off the other cores does not lower the local power density / thermal dissipation of the single core at 10GHz. You still have to work hard to cool that single core, in order to prevent the silicon from being damaged.
Right, but the suggestion is that when "one core" is running, you switch around which core it is that's actually hot, so they're taking turns generating the heat.
Obviously there's some cost there in sharing registers, cache, etc, but it's an interesting notion.
I see, that is an interesting notion. But how do you "share registers" at 10 GHz without massive stalls during context switches? The suggestion seems to require that the cores be separated by some considerable physical distance, in order to allow for heat dissipation. However, this distance also means that data sharing between the cores would be slow and also very power-intensive in itself. I'm not a chip designer though, just wildly guessing.
I don't think you'd need to. I expect you'd be able to run the core very fast for a number of milliseconds before having to switch.
If performed at the hardware level, you may also get some improvement by having the core push its register and low-level cache state directly to the next processor, rather than having it pull the data through shared cache or RAM. Unlike a typical context switch, the process is not being resumed from idle, but has an active cache state.
The Windows kernel already does this automatically, with a period of around 10-100ms. If you look at a single-threaded program, it will actually utilise all cores, not just one: by default the thread's affinity is not locked, so it skips around.
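For reference, a minimal sketch of opting out of that default by pinning the calling thread with the documented SetThreadAffinityMask call (my example; bare-bones error handling only):

```c
#include <windows.h>
#include <stdio.h>

/* Minimal sketch: pin the calling thread to core 0 so the scheduler
 * stops moving it around the package. The mask assumes core 0 exists. */
int main(void) {
    DWORD_PTR previous = SetThreadAffinityMask(GetCurrentThread(), 1);
    if (previous == 0)
        printf("SetThreadAffinityMask failed: %lu\n", GetLastError());
    else
        printf("previous affinity mask: 0x%llx\n", (unsigned long long)previous);
    /* ... run the hot single-threaded workload here ... */
    return 0;
}
```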
There's a limited form of this already - on many multicore processors if you're running a single-threaded workload then "turbo boost" or similar will allow the one running core to clock higher than normal. Not aware of the multiplexing idea being implemented, but I'm not sure how much value that would give - I suspect the overall amount of heat to be dissipated is the limiting factor more than where exactly it is in the package.
Introduced in ~2007 in Core 2 mobile CPUs (Penryns, I think). You can load the Middleton modded BIOS into your oldschool ThinkPad R61/T61 and it will enable dual IDA [1] on the top CPUs.
You can even load a further-modded (by Chinese enthusiasts) Middleton BIOS, solder one wire, and drop in an $8 T9550 CPU for 2.66GHz boosting to 2.8GHz. You could even use quads, but it's beyond uneconomical, with the CPUs costing more than a whole used ThinkPad T420.