Article says:

> Obviously this 8 – 10GHz clock range would be based on Intel's 0.07-micron process that is forecasted to debut in 2005. These processors will run at less than 1 volt, 0.85v being the current estimate.
Intel introduced a 65 nm (0.065 micron) process in 2006. The "Cedar Mill" Pentium 4 ran at 3.6 GHz at a whopping 1.3V, although a small double-pumped part of the processor (the fast ALUs) ran at 7.2 GHz. It could be overclocked to 4.5/9.0 GHz at 1.4V.
The discrepancy between 0.85V and 1.3V was caused by the end of Dennard scaling: supply voltage was supposed to keep shrinking along with feature size, but it stopped. Basically, transistors require much more voltage than predicted and thus consume far more power than predicted. Although the transistors can technically run at 9 GHz, the resulting power density is very difficult to cool.
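To put rough numbers on that, here's a minimal sketch assuming the usual dynamic-power relation P ≈ C·V²·f and ignoring leakage (the capacitance and the 9 GHz figure are placeholders; only the voltage ratio matters):

```c
#include <stdio.h>

/* Rough dynamic-power sketch, P ~ C * V^2 * f (switching power only,
 * leakage ignored). The capacitance is an arbitrary placeholder;
 * only the ratio between the two scenarios means anything. */
int main(void) {
    double C = 1.0;                            /* arbitrary units */
    double f = 9.0e9;                          /* same 9 GHz clock in both cases */
    double p_predicted = C * 0.85 * 0.85 * f;  /* the 2000-era 0.85V forecast */
    double p_actual    = C * 1.30 * 1.30 * f;  /* what the 65 nm part really needed */
    printf("power at the same clock: %.2fx the prediction\n",
           p_actual / p_predicted);
    return 0;
}
```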
They are not used because, with transistors that keep getting smaller but that don't get much faster, it's better to have more add units in parallel than to run the ones you have faster. A modern Intel CPU effectively has 4 add units to the P4's 2.
> Although the transistors can technically run at 9 GHz, the resulting power density is very difficult to cool.
But nowadays we have processors with multiple cores, where sometimes you need only 1 core (and it needs to be fast). So would it be an idea to increase clock frequency for those cores, but multiplex them quickly to allow them to cool?
While the clock speed has largely stagnated, the actual work done per cycle, even on just one core, has gone up significantly. Consider: a double-precision fused multiply-add has a result latency of only 4 cycles today. The number of memory operations (and other instructions) in flight at any given moment has gone up dramatically, the number of execution units has gone up a bit (so the maximum instruction-level parallelism is higher), and so on.
It's not the rapid growth of the '90s and early 2000s, but it is still growth.
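To illustrate the latency-hiding point above, here's a sketch of my own (the ~4-cycle FMA latency and two FMA pipes are assumptions about a typical recent core; the function name and unrolling factor are just for illustration):

```c
#include <math.h>

/* Sketch of hiding FMA latency with independent accumulators. With
 * roughly 4 cycles of FMA latency and 2 FMA pipes, about 8 independent
 * dependency chains are needed to keep the units busy. */
double dot(const double *a, const double *b, long n) {
    double acc[8] = {0};
    long i = 0;
    for (; i + 8 <= n; i += 8)
        for (int k = 0; k < 8; k++)          /* 8 independent chains */
            acc[k] = fma(a[i + k], b[i + k], acc[k]);
    double sum = 0;
    for (int k = 0; k < 8; k++)
        sum += acc[k];
    for (; i < n; i++)                       /* leftover elements */
        sum += a[i] * b[i];
    return sum;
}
```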
I would argue that we pretty much reached peak single-core x86_64 scalar instruction stream concurrency ~6 years ago with Sandy Bridge. SIMD has gotten wider since then (AVX2, etc) and there are occasional new instructions for certain workloads (including FMA as you noted), but general purpose scalar (non-SIMD) workloads have not gotten much of an IPC boost. Actually, to the extent that those workloads have gotten faster, it's been mostly from the return of clock frequency scaling - from low 3GHz to low 4GHz on the desktop SKUs.
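For a sense of what that SIMD widening looks like in code, a minimal sketch (my example, not the commenter's; it assumes AVX2+FMA hardware, matching compiler flags such as gcc -mavx2 -mfma, and n divisible by 4):

```c
#include <immintrin.h>

/* Sketch of what "wider SIMD" buys: one 256-bit FMA handles four
 * doubles per instruction. dot_avx2() is just an illustrative name. */
double dot_avx2(const double *a, const double *b, long n) {
    __m256d acc = _mm256_setzero_pd();
    for (long i = 0; i < n; i += 4)
        acc = _mm256_fmadd_pd(_mm256_loadu_pd(a + i),
                              _mm256_loadu_pd(b + i), acc);
    double lanes[4];
    _mm256_storeu_pd(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```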
Nope. It is not just thermals but also memory latency. If you have four cores and each has two register files (i.e. two hardware threads each), you can get 8x the bandwidth at the same latency.
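One rough way to see that, using my own illustrative numbers rather than anything from the comment: by Little's law, sustained bandwidth is roughly requests-in-flight divided by latency, so more hardware threads issuing independent misses buys bandwidth without touching latency.

```c
#include <stdio.h>

/* Little's-law style estimate: achievable memory bandwidth scales with
 * the number of requests in flight when per-request latency is fixed.
 * The latency, line size, and misses-per-thread figures are
 * illustrative assumptions, not measurements. */
int main(void) {
    double latency_ns = 80.0;   /* assumed DRAM access latency */
    double line_bytes = 64.0;   /* one cache line per request */
    for (int threads = 1; threads <= 8; threads *= 2) {
        double in_flight = threads * 10.0;                  /* ~10 outstanding misses/thread */
        double gbps = in_flight * line_bytes / latency_ns;  /* bytes per ns == GB/s */
        printf("%d thread(s): ~%.0f GB/s\n", threads, gbps);
    }
    return 0;
}
```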
That 10 GHz talk was a lie on the part of Intel to intimidate people away from AMD: not only would a 10 GHz P4 melt down, it would also be stalled all the time by memory latency. So many things did not work that it could not have been an honest mistake.
Today there is talk of a big clock rate bump (to 200 GHz or so) if they go to a different semiconductor, but at that point you probably need a fiber-optic or terahertz-wave link to memory to keep the pipeline full.
> Today there is talk of a big clock rate bump (to 200 GHz or so) if they go to a different semiconductor, but at that point you probably need a fiber-optic or terahertz-wave link to memory to keep the pipeline full.
You talk as if there couldn't possibly be a benefit to an increase in speed without a corresponding increase in memory bandwidth. Whilst it wouldn't be an optimally efficient system, if we /could/ bump to 9GHz (or 200GHz), wouldn't it be worth doing so for at least some kinds of calculations, even if the memory can't keep up?
edit: Both responses were super-interesting. Don't wanna reply to both, but thanks all :)
There's a term for this: computational intensity, i.e. the ratio of useful compute operations to memory loads in an app.
Are there apps that have high computational intensity? Sure, matrix multiply is one of them. That's one of the reasons why dense linear algebra serves as the standard benchmark for ranking the top 500 supercomputers in the world.
But even in HPC (high performance computing), many if not most apps actually have relatively low computational intensity (i.e. in the range of one or so compute operations per word of memory loaded). In this regime, it really doesn't make sense to grow compute out of proportion with memory bandwidth because you'll just be idling the processors.
And while I have no proof, I'd expect HPC applications to generally be more computationally intense than general consumer computing tasks. So I'd expect that computational intensity goes mostly down from here.
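To make the ratio concrete, a back-of-the-envelope sketch with my own example size and the standard rough formulas (ideal cache reuse assumed for the matrix multiply):

```c
#include <stdio.h>

/* Back-of-the-envelope computational intensity, in flops per 8-byte
 * word of data: a dot product does 2n flops over 2n words, while an
 * n x n matrix multiply does 2*n^3 flops over 3*n^2 words (assuming
 * ideal cache reuse). The n below is an arbitrary example size. */
int main(void) {
    double n = 1000.0;
    double dot_intensity    = (2.0 * n) / (2.0 * n);             /* ~1 flop/word */
    double matmul_intensity = (2.0 * n * n * n) / (3.0 * n * n); /* ~2n/3 flops/word */
    printf("dot product:     %.1f flops/word\n", dot_intensity);
    printf("matrix multiply: %.1f flops/word\n", matmul_intensity);
    return 0;
}
```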
Maybe you could do cryptography more quickly, but for general-purpose computing, and even most specialized tasks, memory latency and bandwidth are critical. For instance, look at the use of GDDR5 and HBM with GPUs.
Most of the market is for things that are generalizable; maybe you could make some kind of hyper-DSP for millimeter-wave base stations or something like that, but then you have to spread the development cost across a small number of units.
That is a long-term target: you can definitely get transistors to switch that fast. 200 GHz would not be a short-term target, but could it happen in 20 years? Maybe.
When the engine did NOT have coolant, it would run only one bank of cylinders at a time and alternate between the two to let them cool.
The system was at least somewhat effective. There's a story from the time about a journalist who was testing the feature in the desert. He stopped at a truck stop after conducting the test and amazed the folks at the stop by opening the hood, filling the engine with coolant, and driving off like nothing was amiss.
That's only a bit different from Intel Turbo Boost, which has been widely deployed for a while: when fewer cores are needed, it will increase the clock speed on a small number of cores if there is work to do.
Some BIOSes have settings to completely disable a bunch of cores to enable more turbo boosting.
This isn't as thermally sophisticated as what you outlined, but it saves on cache coherence.
The general term for that is Dark Silicon [1]. It helps a little and, as others have pointed out, Intel CPUs already have a similar feature called "Turbo Boost"; NVIDIA processors have a similar "GPU Boost" feature as well. But I don't know whether that can enable a single core to run at 10GHz. Shutting off the other cores does not lower the local power density / thermal dissipation of the single core at 10GHz. You still have to work hard to cool that single core, in order to prevent the silicon from being damaged.
Right, but the suggestion is that when "one core" is running, you switch around which core it is that's actually hot, so they're taking turns generating the heat.
Obviously there's some cost there in sharing registers, cache, etc, but it's an interesting notion.
I see, that is an interesting notion. But how do you "share registers" at 10 GHz without massive stalls during context switches? The suggestion seems to require that the cores be separated by some considerable physical distance, in order to allow for heat dissipation. However, this distance also means that data sharing between the cores would be slow and also very power-intensive in itself. I'm not a chip designer though, just wildly guessing.
I don't think you'd need to. I expect you'd be able to run the core very fast for a number of milliseconds before having to switch.
If performed at the hardware level, you may also get some improvement by having the core push its register and low-level cache state directly to the next processor, rather than having it pull the data through shared cache or RAM. Unlike a typical context switch, the process is not being resumed from idle, but has an active cache state.
The Windows kernel already does this automatically, with a period of around 10-100ms. If you look at a single-threaded program, it will actually utilise all cores, not just one: by default the thread's affinity is not locked, so it skips around.
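For reference, a minimal sketch of opting out of that default by pinning the calling thread with the documented SetThreadAffinityMask call (my example; bare-bones error handling only):

```c
#include <windows.h>
#include <stdio.h>

/* Minimal sketch: pin the calling thread to core 0 so the scheduler
 * stops moving it around the package. The mask assumes core 0 exists. */
int main(void) {
    DWORD_PTR previous = SetThreadAffinityMask(GetCurrentThread(), 1);
    if (previous == 0)
        printf("SetThreadAffinityMask failed: %lu\n", GetLastError());
    else
        printf("previous affinity mask: 0x%llx\n", (unsigned long long)previous);
    /* ... run the hot single-threaded workload here ... */
    return 0;
}
```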
There's a limited form of this already - on many multicore processors if you're running a single-threaded workload then "turbo boost" or similar will allow the one running core to clock higher than normal. Not aware of the multiplexing idea being implemented, but I'm not sure how much value that would give - I suspect the overall amount of heat to be dissipated is the limiting factor more than where exactly it is in the package.
Introduced in ~2007 in Core 2 mobile CPUs (Penryns, I think). You can load the Middleton modded BIOS into your oldschool ThinkPad R61/T61 and it will enable dual IDA [1] on the top CPUs.
You can even load a further-modded (by Chinese enthusiasts) Middleton BIOS, solder one wire, and drop in an $8 T9550 CPU for 2.66GHz boosting to 2.8GHz. You could even use quads, but it's beyond uneconomical, with the CPUs costing more than a whole used ThinkPad T420.