Fifty or Sixty Years of Processor Development for This? (eejournal.com)
318 points by curtis on April 4, 2018 | 312 comments



Wonder what this means for system software and application development.

There's a factor of 10-40x speedup by going from an interpreted language like Python/Ruby/PHP to a tight compiled one like C++/Rust/Ocaml. 2-4x going from a good JIT like V8 or Hotspot (or Go's runtime, though technically not a JIT). Probably another 10-100x by cutting out bloated middleware like most web frameworks or the contents of your node_modules.

All this was irrelevant when you could get your 2-4x speedup by waiting 18 months, and your 10x speedup by waiting 5 years. It's very relevant when your 2x now takes 20 years and 10x takes a lifetime. Maybe this is why Rust has been getting so much attention recently.


I run a production Rust web service. The speedup for this service over using slightly stripped Rails was only about 5x. As you said, you can gain like 50-100x performance improvements from not using the default Rails JSON serialization and skipping ActiveRecord.

After that, you're lucky to gain 5x performance from re-writing the whole thing in Rust. Most of the hot spots of serving web applications using Ruby are already written as native extensions.

I think Rust is fantastic. I'm writing a tinyrb-like "Ruby" VM in Rust at the moment. But... it's just not worth the hassle for plugging web services together. Maybe if you're at Google scale and already have web services in C++ it'd be a good choice.


Let's not forget that 5x is actually a big difference, especially for the end user... You can also avoid a lot of near-the-edge performance issues/instability.

As you allude to, it may not be worth a 10x increase in developer effort to try to do a bunch of rewriting when you could do something (as you said) like just replacing the bloatiest, slowest parts of Rails and get large performance gains.

Have you written anything about this? I'd love to read more. As a person who is neutral-to-negative on Ruby and full-on hates Rails, I'd love to hear from someone who's experienced running something as new as Rust in production alongside a Rails application, especially from someone who likes Ruby.


Right, but you can also get that same 5x increase from writing something using a higher level language running on the JVM or BEAM. You don't need to go full on "systems programming" to have something that outperforms Ruby.


MRI Ruby and BEAM are both bytecode VMs. There's no simple 5x increase in performance there. BEAM is register based, and YARV is stack based, so BEAM tends to be a little faster.

The JVM is a completely different beast. Thanks to improvements in the JVM for hosting dynamic languages, JRuby is able to offer some serious performance advantages over MRI Ruby, and I'm really excited about TruffleRuby.


Just to confirm why this is getting downvoted -- I'm guessing the MRI vs BEAM statement is super wrong? I can't say I know much about Erlang's VM, but both being "bytecode VMs" doesn't make MRI the same as BEAM, never mind that the primary paradigms they serve are completely different.

I assume JRuby is faster than MRI Ruby (at least for parallelizable tasks), so I can't imagine the second statement being wrong per se... Also I'd never heard of TruffleRuby, and call me prejudiced, but I don't want to be anywhere near anything developed by Oracle, whether it's Oracle Labs or not (I hadn't heard about GraalVM before).

Speaking of fast Ruby... I'm surprised none of the efforts to make Ruby faster / swap out MRI have panned out. In Python land I know there are efforts like PyPy and PyPy+STM tackling the slowness and GIL problems while maintaining idiomatic use of the language.


I'm not sure why this is downvoted so hard, because it's true. BEAM has massive advantages over MRI/CRuby for concurrency and parallelism, but not really much in sheer throughput.

For example, Sinatra + Sequel on MRI has a 50% lead in throughput over Phoenix + Elixir on BEAM here: https://www.techempower.com/benchmarks/#section=data-r15&hw=...

Bytecode VMs all have the same fundamental problem with instruction dispatch overhead, regardless of what language or paradigm they're supporting, which is why JIT is so important even though it's so much additional complexity.
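
To make the dispatch-overhead point concrete, here is a minimal sketch of the switch loop at the heart of a stack-based bytecode interpreter. The opcodes are made up, not MRI's or BEAM's real instruction sets; the point is that every useful operation is wrapped in fetch/decode/branch work that a JIT would compile away.

    /* Minimal stack-based bytecode interpreter: the "real" work (the add)
     * is dwarfed by the fetch/decode/dispatch around it. Hypothetical opcodes. */
    #include <stdio.h>

    enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

    static void run(const int *code) {
        int stack[64];
        int sp = 0, pc = 0;
        for (;;) {
            switch (code[pc++]) {            /* fetch + decode + dispatch */
            case OP_PUSH:  stack[sp++] = code[pc++]; break;
            case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
            case OP_PRINT: printf("%d\n", stack[sp - 1]); break;
            case OP_HALT:  return;
            }
        }
    }

    int main(void) {
        int program[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
        run(program);   /* prints 5; most cycles go to the switch, not the add */
        return 0;
    }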

JRuby is now 2-3x faster in general as the JVM improved support for hosting dynamic languages, and it's in heavy production use at places like TalkDesk, so in that sense it is the "faster Ruby".

There's also Topaz, which is Ruby built using the underlying framework of PyPy, but performance is disappointing AFAIK.


I find the fact that anyone can speak dismissively about a 5x speedup disheartening. Has anyone ever done a study on how much CO2 we are emitting in the name of "developer productivity"?


It's probably less than you think. Humans - just by virtue of existence - produce a huge amount of CO2, both through the air they breathe, the meat they eat, the automobiles they get to work in, the heavy machinery used to build those roads & buildings, the manufactured goods they consume, etc. And the CO2 cost of a developer isn't just that one developer's emissions; it's also those of all the support staff needed, from managers/admins/HR at work to the food service workers that serve them meals out to the doctors/lawyers/therapists and other service providers they visit to the parents that raised them.

It's almost certain one developer generates more CO2 than any reasonable number of servers that run their code. Anything that reduces manpower costs is a net positive for emissions. Besides, when the equation changes (say, when the software enters maintenance mode but the servers stay up), there'll be a strong economic incentive to spend the developer time to rewrite it more efficiently.


> It's almost certain one developer generates more CO2 than any reasonable number of servers that run their code.

I'm not so sure. Let's do a back-of-the-envelope estimate.

Assume a single really hefty server that consumes 1 kilowatt. Over one year, this is about 10,000 kWh. 1 kWh of electricity produced by a coal-fired plant generates about 1 kg of CO2 (https://carbonpositivelife.com/co2-per-kwh-of-electricity/). Thus that big server running for a year produces about 10 metric tons of CO2.

An average American lifestyle (all in, total country production divided by population, https://www.theguardian.com/environment/datablog/2009/sep/02...) involves the production of about 20 tons of CO2 per year. So if the code you write runs full-time on more than 2 really big servers, your code might be producing more CO2 than the rest of your lifestyle.

I'm guessing that most of the errors in this probably overestimate the code's CO2 (probably not coal-fired, probably less than 1 kW, a year is less than 10,000 hours), so more realistically maybe it's 4-8 servers to break even? Still, I think it's fair to say that there are some participants in this forum whose running code probably generates more CO2 than the rest of their lifestyle.
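
If anyone wants to tweak the assumptions, the arithmetic above is easy to redo; this little sketch just encodes the thread's own numbers (1 kW server, ~1 kg CO2 per kWh from coal, ~20 t/year per American), not measured data:

    /* Back-of-the-envelope CO2 estimate using the figures from this thread. */
    #include <stdio.h>

    int main(void) {
        double server_kw      = 1.0;            /* assumed draw of one hefty server */
        double hours_per_year = 24.0 * 365.0;   /* 8760, a bit under 10,000         */
        double kg_co2_per_kwh = 1.0;            /* coal-heavy grid assumption       */
        double lifestyle_tons = 20.0;           /* average US per-capita emissions  */

        double server_tons = server_kw * hours_per_year * kg_co2_per_kwh / 1000.0;
        printf("one server: %.1f t CO2/year\n", server_tons);
        printf("servers to match a US lifestyle: %.1f\n",
               lifestyle_tons / server_tons);
        return 0;
    }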


Something like 2% of the US power consumption goes to data centers.

But think about this... how much CO2 would you spend getting developers to work every day so they can reduce CO2 somewhere else? I suspect the biggest users of DC power, in terms of code, are already written in languages like C++.


5x is an amazing speedup! How much latency did that strip out for end users?


Almost nothing in the context of the long-running analytical queries the web layer has to wait on the DB for.


How much did this save you in energy and hosting fees?


Not at a large enough scale for either to be relevant.


Not having used Rust, I'm wondering how it makes writing a VM easier or harder?


But if you are talking about a stripped down Rails without ActiveRecord, we are looking at something on par with a Sinatra service. That is pretty lean already. Compared to Rails you already have a big performance improvement.

5x overhead for a Ruby webservice on that level is decent. I wouldn't have expected much more.


On the other hand, software often follows the Pareto principle: 80% of the time is spent executing 20% of the code [^1]. Optimizing/migrating just that 20% would yield most of the benefits of a complete rewrite. (Good luck if that 20% resides in your framework of choice, though).

[^1]: https://www.codeproject.com/Articles/49023/The-impact-of-the...


The evils of premature optimization are well known, but in my experience execution often isn't the bottleneck; memory is.

Which means you can get a lot of benefits by optimizing your data structures. Making them smaller is a good start. Unfortunately, changing data structures can have far-reaching implications, sometimes justifying a complete rewrite.

For example, going from string-based hashes to raw C structures can yield massive improvements, but you need to rewrite most of the code that accesses it, even parts that are rarely used.
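
A toy illustration of that change (the field names are made up): the hash-style record pays for a string comparison, and in a real hash table for hashing and probing, on every access, while the struct access compiles to a load at a fixed offset on a small, cache-friendly object.

    /* Toy contrast between a string-keyed "hash" record and a raw struct. */
    #include <string.h>
    #include <stdio.h>

    /* Hash-style record: every access walks names and compares strings. */
    struct field { const char *key; double value; };
    static double get(const struct field *f, int n, const char *key) {
        for (int i = 0; i < n; i++)
            if (strcmp(f[i].key, key) == 0) return f[i].value;
        return 0.0;
    }

    /* Struct-style record: an access is a load at a compile-time offset. */
    struct reading { double temperature; double pressure; };

    int main(void) {
        struct field   h[] = { {"temperature", 21.5}, {"pressure", 101.3} };
        struct reading r   = { 21.5, 101.3 };

        printf("%f %f\n", get(h, 2, "pressure"), r.pressure);
        return 0;
    }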

Deep copies are another performance killer. They are also hard to optimize out, because if you switch to a more efficient reference, you risk messing up data for the rest of the program in every part that uses it.


Price for compute is still going steadily down for web services. Which I think is more relevant than individual core/CPU performance. CPUs are still getting cheaper, blades are getting more compact. Virtualization, containers and function-as-a-service are improving utilization. Modern cloud service providers have massive economies of scale.


Yep, we are seeing computers evolve from hybrid media consumption/creation/business tools into a bifurcated model where IaaS providers rent out the majority of the world's compute power, and average consumers own a (powerful) thin client for UI/display/consumption.

There will still be sizable markets (real-time industrial/automotive/medical/etc., gaming, some business apps) where individuals will want to own and upgrade their own beefy on-prem hardware, but in the consumer realm I can't think of many new applications that have taxed my 2012 MBP, or even my older Dell/Win7 laptop. AWS/Azure/GCP shoulder the burden for me.


That only helps for embarrassingly parallel tasks, like rendering hundreds of distinct web pages per second (which in many cases might just as well have been partially precalculated to trade space for time).

As was pointed out in the chart, we’re already hitting Amdahl’s law with half a dozen cores.


I believe what you're seeing is market competition, not reduction in hardware costs.

The total cost to offer compute has actually gone up significantly over the last year or so due to memory prices shooting up. I believe the cost of processors has gone up too, but not as drastically.


It's all about implementation. I got a ~100x speedup by going from bad Fortran to average Go (notice, both compiled). And a ~10x speedup by going from bad Go to excellent Python (compiled -> basically interpreted).

So yes, compiled languages can give you performance benefit, but it's not guaranteed, you need to work for it.


What is your app?


There's such a thing as being penny-wise and pound-foolish, though.

Imagine if I told you I had a desktop application that was slow, and I'd achieved a 2-4x speedup in common tasks by rewriting the whole thing in a different language. But then you discovered that I'd written it in such a way that it had to hit the (spinning-platters, just to drive the analogy home) disk to do even the most basic things. You would, I hope, tell me that I'd wasted a ton of effort optimizing entirely the wrong thing, and that my 2-4x speedup from choosing a new language would be blown away by the likely multiple-orders-of-magnitude speedups I could get from better memory management and I/O patterns.

In the web world, most discussion of backend languages is like this. Sure, you could get your impressive-sounding speedup from switching to a "faster" language. But the time spent executing application code is so utterly dwarfed by the time you spend idle while waiting on a database, or by the time it takes to send things over the network to the client, that "switch to a faster application language" should probably be the 1000th item on the list of the first 1000 things you do to try to improve performance.


The idea that a webapp should be a frontend to a RDBMS is itself an artifact of the "time to market is way more important than computational efficiency" culture of the mid-00s.

You can get 1000x speedups by ditching the database and serving out of RAM. Sites like Hacker News, PlentyOfFish, Mailinator and Google (back when the webserver was written by just Craig Silverstein rather than a team of hundreds) serve thousands of queries per second off a single box. In most cases the actual access patterns of apps don't map particularly well to either an RDBMS or a key-value store, so if you're willing to put in the time to develop a custom datastore, there are still large efficiencies available.


Maybe, maybe not.

What does the profiler say? That is always the first question in optimization: if the profiler comes up with anything you can fix, that fix is always your biggest bang for the buck. If the profiler doesn't really say anything, we are in a tough spot. There are things that are worth doing anyway, but you have to be careful: in most cases, even if the change really is faster, it probably will not be fast enough.

Back to your example: maybe we cannot fix the database, but even if the database query is 10 seconds, going from a 10.5 second response to 10.001 seconds is an improvement.

Your scale is an important question. Facebook can save tens of thousands per month with an optimization so small no user will notice it -- between not having to buy as many servers and not having to pay for as much power to run the CPU for those extra cycles. (Facebook will not give you the actual numbers, but they will tell you they have measured, and you can read between the lines to guess how much it must be, given that they employ a few people to work on small optimizations.)


In web apps, the first thing you do is start looking at DB queries. How many, how complex, etc. Reducing the number of queries and making those queries as simple/fast as possible accounts for at least the first quarter of my hypothetical "first 1000 things you do to try to improve performance" list.

Also, regarding scale: the number of entities operating at AmaFaceGoog scale is small. The number of entities operating even within an order of magnitude or two of them is small. The odds are very much against any advice relevant only at that scale being relevant to the average Hacker News reader.


I think the death of conventional Dennard scaling and Moore's law is part of why interpreted super-dynamic languages like Ruby have lost their luster. We can no longer just ride CPU performance increases to get increasing application performance, so these languages' implicit slowness becomes a liability that won't just go away magically.

My optimistic hope is that the death of easy CPU performance gains will lead to a round of serious evolutionary improvements in software architecture. Rust is probably a sign of that happening.


> interpreted language like Python/Ruby/PHP

There is no such thing as an "interpreted language." An interpreter is a class of programming language implementation. The languages you listed do not even have interpreters as their standard implementations; they have bytecode virtual machines. There are alternative compiler implementations for Python and PHP.


Everybody says this - believe me, I was one of them when I was younger - but if you've ever tried to write a compiler for Python or Ruby, you'll understand why they're often called "interpreted languages".

(Interpretation vs. compilation is a continuum, anyway; even a compiled language like C uses a runtime for several operations like malloc or strings, while modern JITs like V8 or PyPy can compile a trace of methods and then fall back on an interpreter for uncommon cases. Nevertheless, there's still an important distinction there: in a "compiled language" like C or Rust common syntactic forms like addition, property access, or function calls can semantically map to machine instructions or memory locations fairly easily, while in an "interpreted language" like Python or Javascript, even a property access or arithmetic operation may invoke an arbitrary, not-predictable-at-compile-time piece of code, and hence require runtime dispatch.)
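
That parenthetical can be sketched in C. The tagged-value machinery below is a stand-in for what a dynamic-language VM does, not any particular VM's implementation; the point is that the "compiled" addition is a single machine instruction, while the "dynamic" one has to dispatch on runtime type.

    /* "a + b" in a compiled language vs a dynamic one, sketched in C. */
    #include <stdio.h>

    /* Compiled: types known at compile time, so this is one add instruction. */
    double add_static(double a, double b) { return a + b; }

    /* Dynamic: values carry their type, and "+" dispatches on it at runtime. */
    enum tag { T_INT, T_DOUBLE };
    struct value { enum tag tag; union { long i; double d; } as; };

    struct value add_dynamic(struct value a, struct value b) {
        struct value r;
        if (a.tag == T_INT && b.tag == T_INT) {
            r.tag = T_INT;
            r.as.i = a.as.i + b.as.i;
        } else {            /* promote to double; a real VM also has to handle
                               strings, bignums, user-defined "+" methods, ... */
            r.tag = T_DOUBLE;
            r.as.d = (a.tag == T_INT ? (double)a.as.i : a.as.d)
                   + (b.tag == T_INT ? (double)b.as.i : b.as.d);
        }
        return r;
    }

    int main(void) {
        struct value a = { T_INT,    { .i = 2   } };
        struct value b = { T_DOUBLE, { .d = 3.5 } };
        struct value r = add_dynamic(a, b);
        printf("%f %f\n", add_static(2.0, 3.5), r.as.d);   /* 5.5 5.5 */
        return 0;
    }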


> but if you've ever tried to write a compiler for Python or Ruby, you'll understand why they're often called "interpreted languages".

Why, I happen to have hacked on Lisp implementations. Including Thinlisp, which is a compiler for a non-garbage collected real time subset of Common Lisp. And what you say is still wrong.

An interpreter is not the same as a bytecode VM, and is not the same as native code compilation. And none of those things are a property of programming languages. The only programming languages where that might remotely be true are purely string-based ones like TRAC and maybe some of the term rewriting ones.

> even a property access or arithmetic operation may invoke an arbitrary, not-predictable-at-compile-time piece of code, and hence require runtime dispatch.)

If that's the criterion, then vtables make C++ "interpreted."


> If that's the criterion, then vtables make C++ "interpreted."

So do function calls; you don't know until the .o is linked where a read call will go. Unix C library? Or some local override of read?

And, speaking of vtables, modern shared lib calls give obj->virtual(args) a run for its money in terms of overhead. It's worse because a vtable's structure is determined at compile time, so it's just positional referencing; a shared lib call has the referencing through a table plus string name lookup to figure out the offsets at run-time (at least the first time through).


There's a reason why C++ requires the "virtual" keyword on methods that may be overridden...


What does that mean for software development? I hope the lazy excuses to NOT write performant software will die out and software gets fast again. Remove the hundreds of layers of libraries and frameworks and build good, fast software again. We will find other ways to manage complexity and software reuse.

Death of Moore's Law really is the best that could happen to us...


This assumes compute costs comprise a sizable part of the budget of a software company. If a company can save a couple of percent per year by doing a complete rewrite in Rust or C++, the return doesn’t make the cost worthwhile.


Speaking of Moores law being dead this time, check out this old article from 2012 predicting we would be at 7nm Intel chips with 5nm on the way: http://www.tomshardware.com/news/intel-cpu-processor-5nm,175...

Intel is still trying to figure out 10nm because it is rumored that there are material science problems that are causing yield issues. Remember the 1960s when rapid gains in space tech made everyone think we'd be travelling around the solar system by 2000? The tech hit a plateau and stopped. Maybe we're in that situation with chip technology...


Just want to let you know, as a person who worked at Intel at one point -- 10nm is already plenty hard. At 10nm and below, engineers/designers at Intel (and a bunch of smaller, more agile companies Intel works with) are at this point just fighting physics as we know it.

There are promising technologies coming about, but since they're so different, the amount of time it takes to perfect the process is super unforgiving, and Intel is trying to navigate all that while putting out products that actually appeal to customers in different markets, planning them out at least a year in advance despite not having any idea what the market will look like in a year.

Moore's law is super dead. If any company were to manage to keep up it would be nothing short of a miraculous revival.


How are GPUs still apparently increasing their performance exponentially without using more power? It looks like the best 10-series NVIDIA GPU has about twice the performance of the best 9-series from about 2 years earlier, which lines up with Moore's Law pretty closely. How are they doing it? Maybe they were actually lagging Intel -- looks like the 9 series was a 28nm process, and the 10 series is a 14 or 16nm process.


Yeah, GPUs are a few nodes behind, which means they will take longer to hit the same wall. Even the GTX 1080 Ti just hit 16nm.

Compare a 95 W (32nm) i7-2600K from January 2011 vs a 65 W (14nm) Core i7-8706G from February 2018: that's over 7 years of progress for 30% lower power consumption and a fair speed boost.

That's what GPUs are facing starting now, though they can directly trade lower power consumption for more speed, as they are embarrassingly parallel.


I can't answer authoritatively, but I would speculate it has to do with the highly parallel nature of GPUs.


That and dark silicon: https://en.wikipedia.org/wiki/Dark_silicon

Running a modern CPU within an acceptable TDP means only so many of the transistors can be utilized simultaneously.


> Remember the 1960s when rapid gains in space tech made everyone think we'd be travelling around the solar system by 2000?

I think space tech is a different beast economically - the incentive in the 1960s had been more to compete for global dominance, and not so much demand (from the general public) to travel in space. I believe there is a bigger incentive to have faster processors now, but we might be hitting physical/engineering limitations, so we go distributed.


An amazing 2014 talk[1]/article[2] by @idlewords explores exactly this premise.

[1] https://www.youtube.com/watch?v=nwhZ3KEqUlw [2] http://idlewords.com/talks/web_design_first_100_years.htm


Isn't nanometer scale also getting to the point that you're dealing with individual atoms? There is undoubtedly a point where you can't progress any further, and it seems reasonable that progress gets a lot harder as you approach that point.


In the biz people already measure some widths as "monolayers", which refers to a single layer of atoms. Some process features have error margins of less than a monolayer -- that is, a dielectric needs to be 4 monolayers thick, and if it only has 3, the device is going to short, and if it has 5, the device will not work.


What kind of CVD/PVD process do you use to achieve those kinds of tolerances? Or is the trick to use protect/deprotect steps like in conventional chemistry?


ALD is commonly used for these ultra fine tolerances.


The lattice constant of a silicon crystal is about 0.54 nm, so we are not quite there yet.


So you're saying a 10nm structure is about 20 atoms long? So basically proving the parent's point...


No, saying that we are still an order of magnitude away from dealing with individual atoms. The current slowdown is more about light than about atoms; the move to EUV lithography [1] has proved to be very challenging [2].

[1] https://en.wikipedia.org/wiki/Extreme_ultraviolet_lithograph...

[2] https://spectrum.ieee.org/semiconductors/nanotechnology/euv-...


In theory Coulomb transistors can be a single molecule. I believe single molecule transistors have been experimentally demonstrated.


Anyway, around 5nm you start to have quantum tunneling effects. Current technology can't be scaled down much further. You can keep optimizing things, but you are still near the technology limits and increasing complexity.


Even though the numbers in process node names keep going down, this no longer directly corresponds to line widths. For example, the largest difference between TSMC 20nm and 16nm is that the way the transistor is built was completely changed.

There are a couple of transitions like this left that buy real performance and density, even if the actual line widths stay as they are, so there likely will be a "3nm" process node. After that, who knows.


Agreed, it might be possible, but still, my point is that the gains become smaller and smaller and more and more complex to achieve, and you still don't end up with transistors an order of magnitude smaller.


I thought it was more that the money dried up, and the nuclear test ban prohibited testing of the sorts of nuclear rockets that were being prototyped at the time.


It's the same with semiconductors though, the barrier is mostly economical.


Has the budget for semiconductor research ever decreased? It's a tautology that we could be spending more money on it, but we are deeply investing in this field in a way that we stopped doing with space.


Is it? I was under the impression that we are hitting physical barriers, like transistor size getting closer and closer to Silicon atom size.


It's still technically possible to make chips with smaller nodes than the current mainstream processes allow, although this is uneconomical due to low yields and extremely thin margins.


Right, so "Moore's law is dead" really scoped to "... affordable, conventional materials and designs that we've almost finished wringing every scrap of ROI out of".

(That's not meant to be pejorative. Just saying I understand why companies would be loath to ditch proven tooling with a lot of sunk costs.)


To be more specific: the rumor is that integration of cobalt is the issue.


Can you elaborate? What does cobalt have to do with the current nodes? And killer as in good or killer as in bad?


Sure, so Intel is using cobalt at low dimension interconnects (e.g. M0-3) because cobalt has lower resistance than copper at low dimensions due to electron mean free path issues.

The rumor is that integration of the cobalt material is what's causing yield issues at Intel for 10nm.


Why do we have to go below 10nm at all? We'll have hard physics limitations. Can't we improve on other frontiers? Say, more cache? Better design? More cores? Etc. I don't know.


Probably the #1 area that can produce results is avoiding/conquering the processor-memory gap: while processor performance has been growing exponentially, memory (bandwidth) performance has basically grown linearly. There is now a factor-of-1,000 difference between processor and memory speeds compared to 1980.
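
A crude way to feel that gap on your own machine (POSIX assumed; exact ratios vary a lot by hardware) is to compare a streaming pass over a big array with a dependent pointer chase through the same memory, where every step waits on a cache miss:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 24)                 /* 16M entries, far larger than any cache */
    static size_t next_idx[N];

    static double secs(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        /* Sattolo's algorithm: one big random cycle, so the chase below
           visits every entry in an unpredictable order. */
        for (size_t i = 0; i < N; i++) next_idx[i] = i;
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t t = next_idx[i]; next_idx[i] = next_idx[j]; next_idx[j] = t;
        }

        double t0 = secs();
        size_t sum = 0;
        for (size_t i = 0; i < N; i++) sum += next_idx[i];   /* streaming reads  */
        double t1 = secs();

        size_t p = 0;
        for (size_t i = 0; i < N; i++) p = next_idx[p];      /* dependent loads  */
        double t2 = secs();

        printf("stream: %.3fs  chase: %.3fs  (sum=%zu p=%zu)\n",
               t1 - t0, t2 - t1, sum, p);
        return 0;
    }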

One of the areas that I have much hope for is near-data processing: since processors scale so much better, pretty much every peripheral device already has its own microcontroller. The idea behind NDP is basically to offload some data-heavy processing to the data layer. What if your disk layer could already preselect your data so the database wouldn't have to read and discard so many rows for each query? What if the network controller could evaluate your firewall rules itself, so dropped packets wouldn't have to interrupt the main CPU?


> What if your disk layer could already preselect your data so the database wouldn't have to read and discard so many rows for each query?

My impression is that the process of filtering DB rows is sufficiently complex to need a full libc-type execution environment. But taking a big step back in perspective, a famous example of filtering on processors connected to disks is Map-Reduce, aka Hadoop.

> What if the network controller could evaluate your firewall rules itself, so dropped packets wouldn't have to interrupt the main CPU?

Yes, this is a real thing: https://duckduckgo.com/?q=nic+packet+filter+offload


So NDP is essentially what Commodore did with their 1541 disk drive. That disk drive had a 6502 in there to complement the 6502 in the actual VIC-20.

From what I remember, the IBM System Z mainframes also do this sort of thing and have dedicated IO processors that can decode XML on the fly for you and other fun things like that.


As a matter of fact, I think this concept dates at least as far back as the IBM 360.

https://en.wikipedia.org/wiki/Channel_I/O

And you can find similar concepts in a standard PC today. The GPU is an example of offloading a workload to a specialized processor.


In addition, if each individual unit of memory hardware gained mini-CPU-like data processing capability, then you would have additional faux-CPU power that scaled linearly with memory; this would allow you to do some (embarrassingly parallel) things much faster than just having one additional CPU per peripheral.


Every modern hard drive is a computer in its own right.


Seagate had a cool project where each hard drive ran Linux, and they used the physical SAS cable to run a 2.5 Gbit network (or two, actually) per drive.

So you could use that as block storage for Lustre, Hadoop, or similar, and enable things like direct disk-to-disk copies.

Cool idea, seems unlikely to hit a reasonable price point though.


This is the first time I'm hearing of it, but NDP seems really cool. I like the idea of augmenting memory and peripheral devices with intelligence.


Is Kryder's Law also dead? (2x memory every 13 months.) If not, we could plan for petabyte local memory by 2030 and exabytes by the 2040s.


Making features smaller is far and away the most cost efficient way to make processors faster.

It is also relevant to remember that fab size and processor design are at least two independent divisions in the same company (Intel, Samsung) and are often two separate companies (AMD, TSMC). It's not like the chip makers aren't investing engineers in design.


I see. I wonder what their strategy would be like? To double their efforts on reducing fab size, or on other factors?


We can of course do both, and there is a lot of work on improving the designs. But scaling down the transistor size is generally a wonderful thing: smaller transistors use less power, produce less heat, and you can fit way more of them in the same area (which enables things like larger caches, etc.)


Dennard scaling has been dead for a while; smaller transistors no longer necessarily use less power by default.


OK, so I see that reducing fab size gives a lot of benefits, and they have steadily done so for the past couple of decades. In doing so, maybe they overlooked design or other factors? And now that we've hit the wall on fab sizes (purportedly), maybe they can double their efforts on other factors.


That reminds me of this talk from 2013:

https://www.youtube.com/watch?v=JpgV6rCn5-g

The gist of it, as I remember it, is that radical design ideas were a bad investment while Moore's law ruled, because they were likely to be outperformed by simply shrinking the standard architecture; after Moore, design gains in importance, but don't expect anything like the performance improvements of the past half century.


Makes sense. Thanks for the video.


I think there really is quite a lot of innovation going on under the radar in the SoC space. Phones now have “neural engines” and all kinds of image-related processors for their cameras. Desktop CPUs come with integrated GPUs and there’s a lot of memory system consequences of that...


We are software-limited also. AFAIK you could design a high-level, easy-to-use programming language with high-performance vectorized array operations, but such languages are not in wide use, at least. By high level I mean something like Ruby, where you could specify lambdas for enumerables but the VM would take care of vectorization and/or parallelization for you.
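
A rough sketch of what such a VM would do under the hood, written in C with an OpenMP hint standing in for the hypothetical "the runtime vectorizes/parallelizes your enumerable lambda for you" feature (compile with -fopenmp -O2; without it the pragma is simply ignored):

    #include <stdio.h>

    #define N 1000000
    static float xs[N], ys[N];

    int main(void) {
        for (int i = 0; i < N; i++) xs[i] = (float)i;

        /* The moral equivalent of xs.map { |x| x * 2.5 + 1.0 }, turned into
           SIMD and spread across cores by the compiler/runtime. */
        #pragma omp parallel for simd
        for (int i = 0; i < N; i++)
            ys[i] = xs[i] * 2.5f + 1.0f;

        printf("%f\n", ys[N - 1]);
        return 0;
    }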


The thing that really killed plain vanilla RISC is memory latency. Compared to on-die registers and cache, memory might as well be disk. True RISC is more efficient to execute, but it results in more instructions and hence more code that has to be read from RAM.

Modern CISC chips that immediately unpack CISC into RISC micro-ops are really something that I've termed "ZISC" -- Zipped Instruction Set Computing. Think of CISC ISA's like the byzantine x86_64 ISA with all its extensions as a custom data compression codec for the instruction stream.

We got ZISC accidentally and IMHO without us realizing what we'd actually done. The x86_64 "codec" was not explicitly designed as such but resulted from a very path-dependent "evolutionary walk" through ISA design space. I wonder what would happen if we explicitly embraced ZISC and designed a custom codec for a RISC stream that can be decompressed very efficiently in hardware? Maybe the right approach would be a CPU with hundreds of "macro registers" that store RISC micro-op chunks. The core instruction set would be very parsimonious, but almost immediately you'd start defining macros. Of course multitasking would require saving and restoring these macros which would be expensive, so a work-around for that might be to have one or maybe a few codecs system-wide that are managed by the OS rather than by each application. This would make macro redefinition rare. Apps are compiled into domain specific instruction codec streams using software-defined codec definitions managed by the OS.

The neat thing about this hypothetical ZISC is that while 99% of apps might use the standard macro set you could have special apps that did define their own. These could be things like cryptographic applications, neural networks, high performance video encoders, genetic algorithms, graphics renderers, cryptocurrency miners, etc. Maybe the OS would reserve a certain number of macros for user application use.


I agree with a lot of what you said, but ZISC already stands for zero instruction set computing.

Also, RISC and CISC instruction cache hitrates are pretty similar.


Ahh I forgot about zero instruction set computing. Maybe CISC should just stand for Compressed Instruction Stream Computing because on today's chips that's exactly what it is.

Cache hit-rates being similar may just show that the ad-hoc evolved compression codecs represented by CISC instruction sets are sub-optimal, hence my point about what might happen if we intentionally designed a CPU with on-board compression codec support for the instruction stream.


This is basically what THUMB(2) was for ARM.


At the end of this he says transistors are now doubling every twenty years(!?) and it reminded me of another law Patterson doesn’t include in his graph:

    Proebsting’s Law: improvements to compiler technology double the performance of typical programs every 18 years.


The derivation of that law is very suspect:

http://proebsting.cs.arizona.edu/law.html

(go on, it's just a paragraph.)

The key issue that this ignores in my opinion, is that a compiler optimization will rarely make last year's program faster, but it will make next year's program faster. Why? Because if the compiler can't make an optimization, programmers will do it by hand, even if it makes the code worse in some way.

For instance, if your C compiler can't inline small functions, you would use a macro instead. When it finally learns to inline, your program won't get any faster, but the next version will be able to use functions in places where macros are a bad fit.
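
For example (illustrative names, nothing from a real codebase): the macro and the inline function below generate effectively the same code with any optimizing compiler, but only the function is safe against double evaluation.

    /* The hand-optimization being described: before compilers inlined small
     * functions reliably, you'd reach for the macro; once they did, the
     * function is just as fast and far less error-prone. */
    #include <stdio.h>

    #define SQUARE_MACRO(x) ((x) * (x))                /* old-school "inlining" */

    static inline int square(int x) { return x * x; }  /* compiler inlines it   */

    int main(void) {
        printf("%d %d\n", SQUARE_MACRO(3), square(3)); /* 9 9, same codegen at -O2 */
        /* But SQUARE_MACRO(i++) would evaluate i++ twice (undefined behaviour),
           while square(i++) would not. That is the kind of hazard the compiler
           optimization lets you retire. */
        return 0;
    }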

Pile up enough of these optimizations, and eventually it starts to feel as if you're coding in a higher-level language than before, even though the syntax that's accepted by the compiler never changed.


> programmers will do it by hand

Only if better performance is needed.

Thus, corollary: compiler technology will double program performance every 18 years, but only if it doesn't matter.


Developers have a nasty habit of convincing themselves that things aren’t needed when they see them as too difficult. Even if the rest of the world thinks your code is too slow you can convince yourself it’s good enough.

And in a world where we rely more and more on libraries, my ability to improve on a piece of code is greatly curtailed. Sending in the compiler to help might be my best option.


That is an amusingly awesome claim.

I thought it was accepted that algorithm improvements have sped things up more than processor advances. I suspect there is a strong argument that memory sizes have been key, but processor speeds themselves haven't necessarily advanced at the same rate as the speed at which we complete problems.

That said, I tried quickly googling for this, but just came up with https://cstheory.stackexchange.com/questions/12905/speedup-f.... Looks like a good answer, but basically points out that it is complicated.

For my part, it is frustrating to see so many folks rediscovering things that used to just be too expensive to do and thinking they have rediscovered alchemy. I say this as someone who constantly thinks they have discovered a key method. :)


My experiences don’t jive with your thesis.

When people aren’t looking at a performance chart that is flat, they stop. No matter how loud the business is about the app being too slow people are too quick to announce that everything that can be done has been done.

Really, in this situation there might be another order of magnitude hidden in there, but it takes a special set of skills and a very special kind of perseverance to continue digging into a pile like that. A compiler has no such problem, and I'm sure it could continue to shave off time for quite a while.


Correction: when people are looking at a chart that is flat.


> compiler optimization will rarely make last year's program faster, but it will make next year's program faster.

This doesn't follow. From your argument, last years and next years programs will run the same speed (about as fast as they can). It's just that next year's programs can be cleaner in some sense...

Which is interesting, because Dr. Proebsting's page also says his current interests include improving programmer productivity by removing syntactic baggage from statically typed languages.


Also, plenty of optimizations happen at different layers than those affected by compilation flags. Constant folding, smarter register spilling, improved implementation strategies (e.g. virtual function tables). These are all techniques that still happen at -O0.


Doing it by hand is only worthwhile if the application does not meet the expected deadlines, and only in spots validated through the use of a profiler.

I never cared about the C culture of speed before correctness, because type safety never impacted the expected use of my applications.


I care a lot about speed, but correctness is my foremost concern. I agree that I only look to optimize bottlenecks, so compiler technology would still optimize some of the less bottlenecky code from yesteryear.


Great!

Sadly not everyone does that.

I do agree there are domains where every ms and byte counts; they are, however, a small niche.


I think Proebsting's law [1] overlooks a number of points. Firstly, it is my understanding that compiler optimization made RISC feasible, and if so, then some of the hardware performance improvement over that period is attributable to compiler technology (at the very least, perhaps he should have compared today's optimizing compiler to a decades-old optimizing compiler, not today's compiler, optimized vs. unoptimized?) Secondly, he is extrapolating from at most two points (assuming his 'before' numbers are valid), and thirdly, with the (comparatively) low-hanging hardware fruit all gone, compiler optimizations have become relatively more useful (especially the work on optimizing concurrent and parallel software.)

Nevertheless, I tend to agree that programmer productivity is a worthy goal (and more specifically, those that improve productivity through making it easier to understand programs, so that programmers can more quickly produce programs that work properly.)

[1] See Marvy's comment for a link to the law: https://news.ycombinator.com/item?id=16751813


A compiler can only reduce the overhead relative to the hypothetical perfect program for a given task/source. That's improvement toward an asymptote, not exponential growth.


Yes, we're kind of stuck on individual CPU power. Clocks have been around 3GHz for a decade now.

There are now architectures other than CPUs that matter. GPUs, mostly. "AI chips" are coming. And, of course, Bitcoin miners. All are massively parallel. What hasn't taken off are non-shared-memory multiprocessors. The Cell was the only one ever to become a mass market product, and it was a dud as a game console machine.


Perhaps it (Cell) would not have been a "dud" as you put it had IBM not been a morally bankrupt villain.

I've read that Sony was under the impression that the licensing agreement meant that IBM would market Cell tech to other customers, those customers being in other computer markets like datacenters and stuff, rather than to Microsoft, for the 360, at the same time that the PS3 was still in development.

"As the book relates, the Power core used in the Xbox 360 and the PS3 was originally developed in a joint venture between Sony, Toshiba and IBM. While development was still ongoing, IBM–which retained the rights to use the chip in products for other clients–contracted with Microsoft to use the new Power core in their console. This arrangement left Sony engineers in an IBM facility unknowingly working on features to support Sony’s biggest competitor, and left Shippy and other IBM engineers feeling conflicted in their loyalties."

from http://gamearchitect.net/2009/03/01/the-race-for-a-new-game-...

(it's a book, and worth reading)


I have written a substantial amount of code for the cell. I honestly believe that the entire approach it uses is a dead end -- no-one will voluntarily ever use such a machine again if there are any alternatives. The fundamental problem is that when you are writing code for cell, it is very hard to divide a large task into many subtasks where you do not have to constantly think of the entire problem when implementing small details. This means that writing code for a cell-like architecture doesn't scale. If your program fits on a whiteboard, a cell implementation can be easy to do and very fast to run. The moment you start building complex systems, things break down hard.

In the end, people just made each SPE do a single task, like dedicate one to audio, one to geometry, etc. There is not really enough parallelism like this in most software to support a cell-like approach now that even consoles get >4 real cpu cores. Real caches are the norm because they are extremely useful for programmers.

The villainy of IBM was just a slight additional problem over just how terrible cell was from a software standpoint.


I have not read the book, but I have a hard time imagining any world in which the Cell could possibly be successful. Its heterogeneous architecture thrust a huge amount of complexity onto software developers in exchange for meager gains. Writing good code for it was difficult and expensive compared to other platforms. Sony was just completely out of touch with reality.

In 2007, Gabe Newell famously complained that the Cell was "a waste of everybody's time. Investing in the Cell, investing in the SPE gives you no long-term benefits. There's nothing there that you're going to apply to anything else. You're not going to gain anything except a hatred of the architecture they've created."


> I have not read the book, but I have a hard time imagining any world in which the Cell could possibly be successful. Its heterogeneous architecture thrust a huge amount of complexity onto software developers in exchange for meager gains. Writing good code for it was difficult and expensive compared to other platforms. Sony was just completely out of touch with reality.

This was a different time. Back then, researchers tried to build clusters out of PS3s, because the speed advantages of the Cell made it worthwhile and "regular" Cell clusters were much more expensive. Some years later GPGPU became feasible, and one could foresee that it would become faster than the Cell, too, in the near future - and at that point the same kind of researchers dropped their PS3 clusters and built GPGPU clusters. Don't tell me that, particularly in the beginning, GPGPU was easier to program for than the Cell.

It was also the time when Apple switched to Intel CPUs. I know at that time IBM was also trying to sell the Cell to Apple, but Steve Jobs refused and went with Intel instead.

This decision by Apple, and the decision of researchers to stop tinkering with PS3 clusters and build GPGPU clusters instead, were in my opinion the two landslides after which the fate of the Cell was sealed.


It's not that it's heterogeneous. It's that each SPE only had 256K of RAM. K, not M. For code and data. You can pump data in and out from main memory in bulk, but random access is very slow.

So it's only useful for tasks that work like an assembly line - data flows in sequentially, gets processed, and output is pumped out. Great for audio. Lousy for everything else in games.

If you had 16MB on each CPU, the little CPUs might be usable. You might be able to run physics or pathfinding or NPCs in one.


I'm sorry, but I pretty much entirely disagree.

> Don't tell me that particular in the beginning GPGPU was easier to program for than the Cell.

The alternative was to use bog-standard homogeneous cores.

Yes, the air force bought a compute cluster of PS3s for some specialized calculations. I wouldn't read too much into that. It says little about the suitability of the architecture for more general purpose computing. Supercomputers were always weird.

> It was also the time when Apple switched to Intel CPUs.

I don't believe there was much chance of Apple moving to Cell. Their switch to Intel was because IBM could no longer seriously compete outside of a few niches. There's nothing positive to infer from IBM's unsuccessful pitch to Jobs.

> This decision of Apple and the decisions of researchers to stop tinkering with PS3 clusters and build GPGPU clusters instead were in my opinion the two landslides after which the fate of the Cell was destinied.

You're assigning far more importance to research group purchases than I think is warranted. They don't buy enough to create economies of scale. That's why researchers so frequently adopt consumer products already manufactured at scale, like the Novint Falcon, Microsoft Kinect, and gaming graphics cards.

The Cell was best-in-class for a few specialized use cases, but it was never going to take the world by storm. If we turn to a heterogeneous architecture in the future, it will be begrudgingly, after all simpler alternatives have been exhausted.


> You're assigning far more importance to research group purchases than I think is warranted. They don't buy enough to create economies of scale. That's why researchers so frequently adopt consumer products already manufactured at scale, like the Novint Falcon, Microsoft Kinect, and gaming graphics cards.

This is true, but I have to disagree about the consequences: very often, from this kind of "abusing" consumer products for research purposes, interesting applications emerge that become quite popular and economically important. For example, from such research came the idea of using the Kinect as a 3D scanner, and commercial applications grew out of that. Or from GPGPU (which NVidia was at the beginning quite the opposite of enthusiastic about), CUDA and later OpenCL emerged, which are much better to program for than abusing vertex and fragment shaders.

That is why I consider it quite important for the future of the Cell that researchers went from tinkered-together PS3 clusters to GPGPU clusters, and why I called this a "landslide event for the future of the Cell".


Yep. One of the game devs I worked with on the PS3 was charged with the task of optimizing our physics and game code on the Xbox 360 and PS3 platforms. The Cell architecture was bad enough, but the real icing on the cake was that the Xbox 360 profiling/debugging tooling was a lot better too. Needless to say, he wasn't a fan of the PS3.


I've heard a few times that the Cell wasn't all that bad in terms of performance, just very difficult to program. Not sure how true that is, but ostensibly the usability is just a tooling issue. Probably not a tooling issue that can be solved short-term, though.


I did a university project with a Cell back in the day and it was awfully difficult to program. It was similar to GPGPU programming but instead of gigabytes, the SPUs had kilobytes of memory. Orchestrating the movement of data to/from SPUs was very hard to get right.


I never worked with cell or ps3, but that description reminds me a lot of working on the ps2.


The problem's more fundamental than that. The core of it is that parallelism requires explicit consideration in algorithm design.

The Cell demanded that you structure your program around small tasks that could be run in parallel across its seven vector cores. There's no getting around the fact that it's the programmer who has to break down problems to be small enough to fit on those cores without letting coordination overhead get out of control.


The quote says that the PowerPC core (PPE) was all that was used in the Xbox 360 processor, and all the PPE is is an implementation of IBM's Power ISA, one of several (see https://en.wikipedia.org/wiki/Power_Architecture#Specificati... under Power ISA v.2.03). It's no different from an ARM processor vendor re-using a core design for multiple customers.

The thing that made the Cell what it was, namely the 7 SPE units, were not used in the Xbox 360.


The Xbox 360 did not have a Cell. It had the Power core without the rest of the Cell.

The Cell processor was intended to be the graphics solution for the PS3 [1]. The story I heard was that Sony was a hardware company, and its engineers wanted technology that would work for new digital television applications and believed that the Cell, with its 8 "CPUs", was going to be the perfect solution for everything. Except, well, nVidia: turns out a GPU is better at graphics than even eight very fast CPUs. The PS3 wasn't going to have a GPU, and then they saw some Xbox 360 demos and had a brown-pants moment. So they added a GPU at the last minute.

[1] https://www.criticalhit.net/gaming/a-brief-history-of-the-playstation-3/


Sounds like Sony blundered when writing the agreement. If the chip is good for consoles, of course IBM will try to sell it to consoles.


> There are now architectures other than CPUs that matter.

You would make a killing with a CPU twice or ten times as fast. Many algorithms are only suited for single-core operation. I don't know if this will ever change. The focus has shifted to other architectures mostly because we've reached a ceiling for single-core CPU performance.


It'll change if someone can manage to take the "central" out of the CPU internals. You don't necessarily need software to see anything other than a monolithic core, but having to plumb everything through one central execution unit is hugely inefficient, if anything due to the latency involved. For example, if you're performing an indirect load and hit DRAM while loading the pointer, that result has to be brought into the core, then all the way back to the memory controller the same way it came. So far that's just been worked around by throwing in bigger and bigger caches, but the size of first-level cache is at a dead end for now (due to needing physical proximity).

Heck, current x86 chips could be juiced quite a bit if you could take out the requirement for backwards compatibility. Instruction encoding being the obvious thing (not that it's not hip and RISC, but that it's an absolute mess that a huge proportion of the chip's power has to be wasted on, and is pretty space-inefficient due to how horribly allocated things are). Less obviously, just removing things like the data stack instructions (which, at least on Intel, have a dedicated "stack engine" to optimize them), the ability to read/write instruction memory directly (creates a mess of self-modifying code detection to maintain correct behaviour, and complicates L1 cache coherency a bit). Trimming transistors reduces the power consumption, which in turn means you can raise the voltage without the chip melting, and can clear up space in your critical data path.


On a high-end x86, the decoder takes only a tiny proportion of the area and power budget.

On smaller low-power CPUs it is more significant, of course.

The stack engine is necessary anyway, even if you have no specific stack instructions, as it decouples top-of-stack manipulation from local variable accesses, which is critical. Explicit stack manipulation instructions might actually make the stack engine simpler.

Coherent instruction cache and pipeline are super relevant in this age of pervasive self modifying code (a.k.a JIT).

Modern CPUs are complex for a reason.


It's relatively straightforward for self modifying code to manually flush instruction caches when necessary and JIT compilers that target other architectures already satisfy this requirement. Only backwards compatibility with existing x86 software requires a coherent instruction cache.
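
For reference, "manually flush the instruction cache" looks roughly like this for a toy JIT on Linux with GCC/Clang. The emitted bytes are x86-64 "mov eax, 42; ret", so this particular sketch only runs on x86-64, where the flush compiles to (almost) nothing because the hardware keeps the I-cache coherent; on ARM or POWER you would emit that architecture's code and the same __builtin___clear_cache call would expand to real cache-maintenance/barrier instructions.

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        /* x86-64 machine code for: mov eax, 42; ret */
        unsigned char code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };

        unsigned char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) return 1;

        memcpy(buf, code, sizeof code);
        /* The explicit barrier a JIT would issue on a non-coherent I-cache. */
        __builtin___clear_cache((char *)buf, (char *)buf + sizeof code);

        int (*fn)(void) = (int (*)(void))buf;   /* POSIX-style cast to fn ptr */
        printf("%d\n", fn());                   /* prints 42 */
        munmap(buf, 4096);
        return 0;
    }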


> It's relatively straightforward for self modifying code to manually flush instruction caches when necessary

Barriers are expensive. JITs might need to issue lots of them. Then the next generation of CPUs start tracking modified lines to make barriers cheaper. Then you end up with all the hardware complexity of implicit barriers without simplifying the software side.

> JIT compilers that target other architectures already satisfy this requirement.

They might have different tuning parameters to take into account the cost of the barrier when deciding the profitability of JITing a region of code.

> Only backwards compatibility with existing x86 software requires a coherent instruction cache.

Far from it, IIRC the coherency guarantee has actually been strengthened recently. It used to require a far jump as a barrier.

Explicit vs implicit barriers are just an architectural tradeoff.


I'm having a hard time with "straightforward" and "self modifying code" in the same sentence. In fact, I think it's a syntax error.


Processing in memory has real promise for the cases where your work can be distributed. Specifically, I think it can have a great future in AI. However, for general purpose code I doubt it can do anything. Your example of an indirect load would be greatly sped up if the target of the pointer is on the same device as the pointer. However, the second it isn't, the speed of moving things from one RAM chip to another isn't any faster than from RAM chip to CPU, and at that point defining a single central location that tries to be close to everything just makes sense. If your operation needs 8 values from 8 different places, having a central location means doing 8 transfers, while PIM can mean forwarding each value/intermediate value multiple times to go to the next location.

None of the changes to x86 people have thought of over the years really helps enough to break backcompat, simply because they aren't on the fast path in the critical execution stage. The limit imposed on frequency by power in current CPUs is not really the total amount of power consumed; it's the amount of power consumed in the <0.25mm of chip that houses the register file, forwarding network and ALUs. That is, the place where things actually happen during the most important pipeline stage. This is why an 8-core CPU running just a single thread cannot make one of the cores consume as much power as all 8 would if running 8 threads -- the register file of the running core would just melt, even if the total power stayed below chip limits.

x86 decoding is hairy and takes a long time and a lot of transistors. However, it is placed in its own pipeline stages that run parallel to execute and only slow it down by making a branch miss a little more expensive. And the power is limited today by caching the decoded uops in their own cache, so during any tight loop the decode hardware is idle and consumes no power. The same sort of goes for the stack engine -- as it runs early in the pipeline, it is basically a way to compress instructions a little that saves power by making code more compact when it is running, and does nothing when it is not used. Removing it would not really help, even if all code instantly changed to accommodate. Much of the rest of the ugly warts of the x86 architecture are handled in the time-honored CISC way: just punt it to microcode, performance be damned. Today, self-modifying code technically works, but you never want to do it, because invalidating lines in the L1i has been implemented in whatever way is fastest and cheapest for the common case of code that does not modify itself. (And that mechanism has to exist even if you don't support self-modifying code, because there has to be some way of invalidating L1i entries.) Similarly, a lot of the CISC instructions that make more sense to implement as software routines (FPU sin/cos for example) are today just abandoned ucode routines that are slower than rolling your own.


I'm not talking about the fundamentally misguided memory-distributed computing stuff, I mean "improve flexibility enough that you can bolt some additional units on as offload" (address translation in this case would take some work though). The magic of presenting software with a more or less monolithic core in this case is that you don't have that problem, since you can simply do it the usual way.

Also, I don't think the trouble with added complexity out of the hot path is any added latency; it's that they're needlessly burning up the thermal budget. Not that raising the voltage is the best way of increasing frequency, but it's sure to do so.


> Yes, we're kind of stuck on individual CPU power. Clocks have been around 3GHz for a decade now.

It's worse than that. A 3GHz version of Northwood was released a little over fifteen years ago. I doubt it was the first 3GHz processor (that'd be a PPC or something?), but it's definitely symbolic...


IBM shipped the 4.7/5.0 GHz POWER6s in 2007/2008.

(And a 5.2 GHz S/390-arch CPU in 2012 too)


Those are interesting but they're rare exceptions. IBM do these amazing things on super high end chips that very few buy, but the commodity chips that power the PCs and the datacenters of the world (stuck at a little faster than 3GHz, give or take) just don't rely on that stuff very much.


Well, there isn't a sufficiently big difference between "super high end stuff" and volume parts that it would explain a 10 year lead.

Note that the following generation of POWER was clocked lower. Clock frequency != performance.


Quite. The more important example would be Pentium. The main thing that was impressive about Northwood was the clock rate. AMD of the same generation, and the Intel Core processors that followed, did more with a lower clock rate.


It's in the eye of the beholder... A CPU + GPU system is arguably a non-shared-memory multiprocessor.

Systems running on AWS Lambda or Kubernetes or Kafka might also count as non-shared-memory multiprocessors.


Yep. The processors are spread around different boxes in a supercomputer or a Google datacenter but, at some level, that's just an implementation detail.

There was a time about--maybe a bit over--10 years ago when a whole lot of distributed memory processors/systems/hybrids were coming onto the market: SciCortex, BlueGene, Cell, Azul, Tigera (I think), as well as SMP chips like Sun's Niagara. The general problem is that, by the time these specialized designs would get to market, Moore's Law would have turned another crank and made all the work moot.

I do think with CMOS scaling slowing down/dying, we'll see more specialized designs even if they're a pain for programmers and system architects because what choice do we have? We already see it with GPUs, FPGAs, and so forth.


You can now get a 6 core 4.7Ghz processor in a laptop. I'd say that's a hell of an improvement over 3Ghz.


How many cores at 4.7GHz and for how long?


I went to research... you're right, it's actually pretty unclear. Base is 3.7, but it doesn't give the specs on the 4.7 turbo.


Laptop turbos generally depend on the cooling solution. A 4.7GHz quad core laptop will be LOUD.


Specifically, it would have to dissipate some 100W of heat. That's not counting the GPU.

There are ways to do it even passively, but none of them are lightweight. Fast small fan(s) and heat pipes are probably the lightest.


Usually it's a single core with the peak boost clock and a multi-core boost that's a bit lower. The i7-8700K for instance is 3.7GHz stock, 4.3 all-core boost, and 4.7 single-core turbo.




The problem is that we have designed ourselves into an architectural cul-de-sac when it comes to processors. We have fifty-plus years of evolution on programming methodologies built on top of von Neumann architectures. Moore's Law has given us decades of exponential gain without significant challenge to that architecture, and now that Moore's Law is reaping diminishing returns in terms of compute performance we are in the situation where we'd have to go backward forty years on our programming model in order to take advantage of a superior (given today's technology) architecture. For example, FPGAs can in many cases outperform von Neumann machines by orders of magnitude in terms of compute performance and (more importantly) performance per watt. However, the programming model and ecosystem for FPGAs is worse than primitive. Something you could write in a couple hundred lines of C code could take months to get up and running on an FPGA. We need a way to transition from von Neumann computing to alternative architectures without starting over on computer science. Or, perhaps recent trends in neural networks will eliminate the need for that?


Just this afternoon I finished reading David Harland's 1988 book "Rekursiv: Object-Oriented Computer Architecture". It describes a completely different way of designing machines at the low level that can support better programming environments at the high level. You might want to check it out.

I believe we are going to see further balkanization between different operating systems / programming systems and computers based upon what they are used for. Cloud services will be the domain of what today we call "systems programmers" who work in compiled languages and care about speed. In contrast, we might now be able to get real "personal computers" running environments that teach their users how to peel back the layers and manipulate them — the long sought personal computing medium. This all could have happened back in the 80s, but we didn't have widespread or fast use of the Internet. Now it's different, and both of these types of systems can interop together in the blink of an eye because of it.

Both will require completely new computing architectures.


It's not all bad - one upside is that you don't need to upgrade your hardware anywhere near as often as 10 or 20 years ago.

I put this PC together in 2013 for maybe £500-600 total and apart from adding some RAM I haven't needed to upgrade anything and can still run games on highish settings.


You can probably run 2016-2018 games on medium settings if you are not interested in 60fps. I imagine no graphically intense game will hit 60fps at any respectable resolution.

I say this because building a computer which can play, say, Assassin's Creed Origins or Far Cry 5 at 1080p60 High Settings would easily run you over $1000 right now, due in no small part to the extravagantly over-priced GPUs.

Heck, it costs $400-600 to get a GPU to play those games on medium to medium high right now. Not a computer, JUST the graphics chip to get 60fps on medium.

Crypto has destroyed affordable PC gaming and it makes me so sad. I can recommend Alienwares on sale that are dramatically cheaper than self-built. What happened to this industry :(


60 FPS? I had a desktop I built in 2014, with a 4770k and an r9 290x, run Overwatch at 144 FPS at 1440p with low settings. The machine could still play Just Cause 3 and GTA 5 (2015 games, but I didn't really play any graphically intensive 2016+ games on it) at 1080p 60 FPS with decent graphics settings, if I recall correctly.

I have since upgraded to a 6700k and 1080Ti, but that 2014 hardware lasted well into 2017 - and the current GPU cost just under 3/4 the price of the entire 2014 computer, despite the r9 290x being a top of the line GPU. High end PC gaming definitely isn't affordable anymore.


That is mostly due to the cost of GPUs having been inflated by miners, and perhaps the expense of having a huge monitor.

Neither CPUs nor GPUs are progressing as fast as some predicted anymore.

Additionally the shift to consoles as stable hardware platforms over time has put a damper on computing power required by economically viable games.

The remaining outlets are VR and huge resolution (same thing actually) - and high quality and fidelity simulations. (Including AI.)


I have an R9 390 and it gets about 30-40fps in Assassin's Creed Origins at 1080p on high, and 40-60fps on medium.

It's 3-year-old tech that still costs around $300 for that level of performance today, and it struggles. Gotta turn those settings down with my R9 390 in every modern intensive game!


Well I guess I don't play that many super graphically-intensive games. I can, for example, play Overwatch on high settings at 1920x1080. I don't really monitor framerates, but I'm ok with about 40fps.


What is the limit on creating bigger chips? If some of the money/effort was focused on being able to fab larger chips instead of decreasing feature size... I don't know much about lithography so maybe the answer is obvious to those that do.


The "aperture size" in semiconductor manufacturing is the practical limiting factor in chip size, and some chips (GPUs in particular) are already at the limit. It's basically the maximum area the photolithography process can expose in one shot (my understanding of semiconductor manufacturing is a bit weak).

Aperture size can be increased to some degree for future manufacturing nodes, but there's a limit to how practical it is.

Other factors also come into play. The distance light travels in a clock cycle has already been mentioned as a hard limit.

Cost is another matter: chips are rectangular but wafers are round. The larger the chips, the more wasted area there is and the lower the yield per wafer. Intel has an advantage here, because they use 12" wafers when the rest of the industry uses 10" (this was a few years ago, things may have changed). Historically speaking, these wafers are huge compared to the 4" and 6" wafers of the past. Making the silicon ingots to cut the wafers from is another art form; modern ingots are HUGE blobs of pure silicon.
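
It's easy to see the edge-waste effect with a back-of-the-envelope sketch (hypothetical die sizes and wafer diameters, ignoring scribe lines, edge exclusion and defects):

    import math

    def dies_per_wafer(wafer_diameter_mm, die_w_mm, die_h_mm):
        # Count whole rectangular dies whose four corners all land on the wafer.
        r = wafer_diameter_mm / 2
        count, x = 0, -r
        while x + die_w_mm <= r:
            y = -r
            while y + die_h_mm <= r:
                corners = [(x, y), (x + die_w_mm, y),
                           (x, y + die_h_mm), (x + die_w_mm, y + die_h_mm)]
                if all(cx * cx + cy * cy <= r * r for cx, cy in corners):
                    count += 1
                y += die_h_mm
            x += die_w_mm
        return count

    for d in (250, 300):                      # roughly 10" and 12" wafers
        for die in ((10, 10), (20, 30)):      # small die vs. big GPU-ish die
            n = dies_per_wafer(d, *die)
            used = n * die[0] * die[1] / (math.pi * (d / 2) ** 2)
            print(f"{d} mm wafer, {die} mm die: {n} dies, {used:.0%} of area used")

The bigger the die, the more of the round edge you throw away, before defects even enter the picture.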

I recommend paying a visit to Intel museum if you're around Silicon Valley. It's not a huge museum but has lots of interesting information, nice guides and a great photo-op to take a selfie with a big honkin' chunk of silicon (they've got full ingots and wafers on display).


Any reason why chips couldn't be hexagonal or triangular? Either one would still tile a wafer, and go closer to a round edge.


AFAIK silicon has a crystalline structure which causes wafers to break on straight lines and at right angles. This might make things difficult for non-rectangular parts.

There's also a ton of practical issues regarding the process. Photolithography equipment, cutting the wafers, the machines handling and packaging the chips. All would have to be redesigned and I'm sure that would cost more than the saved silicon during the lifetime of the whole plant.

The current chip sizes are already getting pretty close to hard limits (speed of light, etc). 900 mm^2 is 30 mm across, which is about half of the distance light travels in a clock cycle (async/clockless circuitry could help here).

Larger chips are also less efficient. A while ago I was discussing power efficiency with a HW designer working on memory controllers and he was using a fancy unit called "nanojoules per bit-millimeter", ie. how much energy it takes to flip a bit that is physically located a certain distance away. The efficiency gets much worse as distance increases.
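
To make that concrete with a made-up number (the real figure depends heavily on the process and the kind of wire, so treat this purely as an illustration of the scaling, not as data):

    # Illustration only: energy to move data across a chip, assuming a
    # hypothetical on-chip cost of 0.1 pJ per bit per mm of distance.
    PJ_PER_BIT_MM = 0.1                      # assumed, not a measured value

    def move_energy_pj(bits, distance_mm):
        return bits * distance_mm * PJ_PER_BIT_MM

    cache_line_bits = 64 * 8
    print(move_energy_pj(cache_line_bits, 1))    # ~51 pJ for a 1 mm hop
    print(move_energy_pj(cache_line_bits, 30))   # ~1500 pJ across a 30 mm die

Whatever the real constant is, the cost grows linearly with distance, which is exactly why bringing memory closer pays off.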

Increasing wafer size also improves the yield, which is why Intel has an advantage with 12" wafers (others have 10").

Instead of creating larger chips, the current trend seems to be about packaging more chips in a single package and connecting them more efficiently. In particular, it's about bringing memory closer to the CPU to get those bit-millimeters down.


Fwiw, electrical signals in global interconnects move at about half the speed of light. Near-speed-of-light signaling techniques like transmission lines can be used, but aren't, for a variety of reasons.


You can't start a saw in the middle of a wafer. You need to cut from end to end.


Triangles would work for that.


The area around the acute angles would be almost useless due to lack of routing space.


What about laser or a waterjet cutter?


Silicon wafers are much easier to manufacture in a circle shape.


Seems like most people here are bringing up speed of light problems, which are a concern, but it's not what stops you. The problem is that your yield goes down exponentially with die size, and binning them becomes a clusterfuck. The opposite direction of making smaller dies is fairly attractive though. For example, AMD split Threadripper into multiple dies on an MCM and seem to be saving a fortune on it, at the expense of some die area for interconnects. That way they can test and bin dies individually and assemble an MCM of known-good dies from the same bin.

I remember reading that GPUs are getting to fairly monstrous die sizes though - and they're paying for it.


Chips are fabb'd on a large wafer, which is then split up (see here: https://s3.amazonaws.com/ksr/assets/003/150/280/9b6a64c4d8ed... )

Now, the process isn't perfect, and you hear a lot about "yield", which is basically the fraction of chips on a wafer that work to spec. Now, as you make a chip bigger, you increase the chance of a mistake. This reduces the yield and drives up the cost. (I'm not sure if it's actually possible to make a full sized wafer without a mistake, I'll defer that to someone who knows.)

In some cases those broken chips aren't all that bad, so they are shipped with the broken bits deactivated (this could be lies, but I think some AMD procs were done like this).

Yes, there are other factors like propagation time, but that's solved by not having chip-wide cache coherency.


You don't increase the chance of a mistake, you increase the cost of a mistake because each little defect means you're throwing out a whole chip. The larger each chip is, the more expensive each little defect is.

Sony had a hard time when they were ramping up Cell processor production, so they designed the chips with 8 SPEs but only shipped them with 7 activated. That way if a defect happened to be in one of the SPEs, they could just turn it off and still ship the chip.


> In some cases those broken chips arn't all that bad, so they are shipped with the broken bits deactivated (This could be lies, but I think some AMD procs were done like this )

That’s called “binning”


Say each platter has N defects spread uniformly. Double the area of each chip on the platter (fewer total chips) and you roughly double the defect rate per chip. Make the whole platter one chip and it always has a defect.

GPUs and CPUs somewhat work around it by being able to disable cores or a part of cache and sell the chip for less.
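
The usual back-of-the-envelope way to see this is a Poisson yield model; the defect density below is a made-up illustrative number, not anyone's real process data:

    # Poisson yield model: P(die has zero defects) = exp(-D * A),
    # with defect density D per cm^2 and die area A in cm^2.
    from math import exp

    D = 0.1                                  # assumed defects per cm^2
    for area_mm2 in (100, 400, 800, 51000):  # small die ... a whole 10" wafer
        print(area_mm2, "mm^2 ->", f"{exp(-D * area_mm2 / 100):.1%}", "yield")

Under those assumptions a 100 mm^2 die yields around 90%, an 800 mm^2 die under 45%, and a whole-wafer "chip" essentially never comes out clean, which is why the practical answer is disabling broken subunits rather than hoping for a perfect wafer.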


Perhaps it's time to resurrect wafer scale integration : https://en.wikipedia.org/wiki/Wafer-scale_integration


Maybe it's a viable research project, but still far off from reality. The largest chips that can be manufactured with current technology are about 900 mm^2 in size, but a normal 10" wafer is about 51000 mm^2, which is 50x larger than the state of the art today.

An attempt to produce entire wafers in one go is also a big gamble. Any defects in the manufacturing could ruin the entire wafer rather than individual chips. Parts of the wafer could be disabled if defective, but it would result in a combinatorial explosion of different configurations.

Also worth noting that in 1980 when Wafer Scale Integration was researched (without success), wafers were 4" diameter. Current wafers are 10-12" in size, which makes the process much more difficult and error prone.


The two big waferscale efforts of the 1980s were by Gene Amdahl and Clive Sinclair.

Amdahl did run into technical problems connecting multiple wafers to build mainframes and using lasers to swap out bad subparts (both currently solved problems and normal industry practice).

But in Sinclair's case investors pulled out despite technical success because hard disk prices started falling exponentially after having stayed stable for many years. The irony was that he had done the "silicon disk" waferscale RAM just to avoid scaring off investors with his real goal of a manycore processor on a wafer.


Which is used in the new Intel and AMD CPUs. IGPU on the same wafer. HBM for memory...


You've misunderstood. They put many chips inside one package, but the individual chips are still from different wafers.


Which is actually better as it means you can excise the faulty ones easily and/or fuse the broken subunits.

The main reason to integrate things is performance and it runs counter to yields.

If you can have a specialized process to produce whole wafers of HBM or GPU cores with decent yields, and then another good one to plug them together, you have a winning combination.

Not understanding this is why the microprocessors built from many subunits tended to fail badly. Either the design was not performant, or it ran into yield issues, or it had to use the exact same technology to integrate the modules, which was not invented or perfected at the time.

So you would need a process that does everything well, which is much harder than having a specialized process plus integration.


Cerebras is.


One problem with bigger chips is that the speed of light actually starts to matter. At 1 GHz one clock cycle is 1 ns, which means that a light pulse in vacuum covers at most about one foot in that time.


Cost of the wafer is fixed. The smaller the chip, the more chips per wafer, and thus, the cheaper each chip is.

Then add defects to that, which means you have to throw away more silicon per defect the bigger the chip.


IIRC it's about clock speed. Your electrical signals travel at the speed of light. If you have a 4GHz clock the signal travels at 300,000,000 m/s. Now one clock cycle is 0.00000000025s, so the light has time to travel 0.075m. That's only 7.5cm! And you need everything to be synchronized inside and get into a steady state before the next clock cycle.
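
Quick numbers, treating the vacuum speed of light as an upper bound (real on-chip signals are noticeably slower):

    # Upper bound on how far a signal can travel in one clock cycle.
    C = 299_792_458  # m/s, speed of light in vacuum

    for ghz in (1, 3, 4, 5):
        cm_per_cycle = C / (ghz * 1e9) * 100
        print(f"{ghz} GHz: at most {cm_per_cycle:.1f} cm per cycle")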


The synchronization can be avoided somewhat by the use of (also power efficient) asynchronous electronics. You can also employ a lower frequency clock and then multiply it on site. A bit more cost, because now you have to handle clock jitter.

And then we have computing separated by meters of wire and network fabrics in clusters.

So size is not that much of a problem. It is just that the free lunch for programmers ended some 10 years ago.


Electrical signals don't travel at the speed of light, but your argument is still valid.


Actually, they very nearly do.

    The speed of electricity in a 12-gauge copper wire is
    299,792,458 meters per second x 0.951 or 285,102,627
    meters per second. This is about 280,000,000 meters
    per second which is not very much different from the
    speed of electromagnetic waves (light) in vacuum.


Not wires across silicon, which have higher capacitance and much higher resistance. For millimeter-long connections you might get 1/10 of c, i.e. 33 ps/mm instead of 3.3.
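
For anyone who wants to check the conversion (the 1/10 figure above is an estimate for this case, not a universal constant):

    # Propagation delay per mm at a given fraction of the speed of light.
    C_MM_PER_PS = 0.299792458        # light covers ~0.3 mm per picosecond

    def ps_per_mm(fraction_of_c):
        return 1 / (C_MM_PER_PS * fraction_of_c)

    print(ps_per_mm(1.0))   # ~3.3 ps/mm at c
    print(ps_per_mm(0.1))   # ~33 ps/mm at c/10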


Hey, that's actually really interesting, thanks for the followup!

This corroborates what you're saying:

https://www.reddit.com/r/askscience/comments/204hl2/at_what_...


Transmission line signalling can get near speed of light in silicon. But it has its own issues :)


The reticle limit of 193i immersion steppers is ~33x26mm, which means that to go larger you need multiple exposures stitched together. This is very slow, expensive and terrible for yield.


Bigger processors means longer distances for current to travel, thus longer delays, and therefore lower possible clock frequencies.


Not at all. In modern many-core chips, latency and complexity are already mitigated by having cores talk to only their neighbors. A meter-wide version would run at the same clock speed, if you could make it.


Found an older youtube video that touches on the same topic: https://www.youtube.com/watch?v=1FtEGIp3a_M


There is a lot of focus on the end of Moore's law here, but it isn't the main driver of what's happening. The slowdown we are seeing is driven by the end of Dennard scaling.

There is a thermal dissipation limit of roughly 200W per chip for air cooling. We hit that over a decade ago of course, but it didn't matter while Dennard scaling kept dropping the power consumption. Once that stopped, we squeezed out a bit more by being more power efficient, which boiled down to two things - turning stuff off when it wasn't needed and devoting the transistors Moore's law gave us to specialised tasks (like silicon dedicated to encryption or h264 encoding) that did the job more efficiently. However, that doesn't get you very far.
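
A toy way to see why, using the usual dynamic-power relation P ≈ C·V²·f and idealized scaling factors (ignoring leakage, which only makes things worse):

    # Idealized Dennard scaling vs. post-Dennard scaling (voltage stuck),
    # using P ~ C * V^2 * f per transistor and k^2 transistors per area.
    def power_density(k, voltage_scales):
        C = 1 / k                            # capacitance shrinks with features
        V = 1 / k if voltage_scales else 1.0 # post-Dennard: V stops scaling
        f = k                                # smaller gates can switch faster
        return (C * V**2 * f) * k**2         # per-transistor power * density

    for k in (1, 2, 4):
        print(k, power_density(k, True), power_density(k, False))
    # With voltage scaling, power density stays flat as you shrink;
    # without it, it grows as k^2 -- hence the air-cooling wall.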

Which is probably why he didn't mention 3D, even though we have 32 layers of it now and they are talking about 256 layers. What is the point of having 128 CPUs on a single die if just running 4 of them exceeds your power budget? Indeed, what's the point of spending billions pursuing Moore's Law further?

Or to put it another way again, the human brain fits roughly the same number of synapses per unit volume as modern 3D silicon has transistors. The brain's raw switching speed is roughly 1,000,000 times slower than silicon (1ms vs 1ns), but power consumption of a synapse vs a transistor is roughly 100,000 times better.

So while AlphaZero learnt to play Go better than any human in a few days, it used more energy than an entire human (not just their brain) would use in several lifetimes to do it.


I wonder what will happen to the CPU, especially if it’s not speeding up much anymore. Perhaps with Apple doing its own chips, the CPU will just have less and less of the work assigned to it.


If someone figures out a way to bring EUV to mass production scale, we would AFAIK essentially have a clear path all the way to 3nm.


Which still really isn't that much further considering how far we have come so far.


That’s a BIG if.


At some point it was going to be germanium to replace silicon.


The problem is silicon is really cheap compared to germanium, so germanium only makes sense if the increased speed is worth the cost (which it probably isn't for almost all applications).


What’s the cost factor? Ie why not have a UI germanium chip and silicon the rest of it?



Remember GaAs? It was the Future in 1980.


He's talking about germanium-channel FinFETs; GaAs was intended to entirely replace silicon, which probably was never going to happen.


GaAs logic is probably happening at some point, just not yet. All improvements like that which would be incredibly expensive to develop will be held off on until all cheaper options have been exhausted. It does seem to be slowly moving though.


GaAs is gallium arsenide, which still has applications in high frequency transistors.


So Moore's Law really is dead this time. And TPU's are only faster than GPU's through lower precision.

Will compute at least keep getting cheaper, perhaps through economies of scale?

Is Kurzweil's magical next information technology, to carry on the exponential, anywhere in sight?


Imagine asics for everything. I mean everything, like implementing a javascript engine directly. The energy efficiency could go up 2-3 orders of magnitude (looking at the difference in bitcoin mining between gpus and asics).

Old gaming consoles had cartridges (with memory); I can imagine a future in which complex software is transported in the same manner, except cartridges contain specialized asics. Or perhaps a step forward - a chip making device in every home, an equivalent of sorts of burning music to cd.


I assume you meant to reference this, but just to make it clear, those old gaming console cartridges sometimes DID have special chips in them. Much to the horror of emulator writers a decade later, hehe.

Wiki: https://en.wikipedia.org/wiki/List_of_Super_NES_enhancement_...


Personally I'd predict the opposite (and current) pattern, creating generic chips which can replace many ASICs / less generic chips. If you can produce a chip which can replace 10 other low/medium-volume designs, economical scaling will win out as long as you're not adding too much overhead. This has been driving FPGAs forward for quite some time, although they're inherently pretty inefficient in terms of die size. Plus, it also provides some logistical advantages in terms of supply chain fragility.

See also: how ridiculously cheap microcontrollers have gotten, and the current messy DRAM pricing (high-capacity chips used in phones sometimes ending up cheaper than low-performance/capacity chips).


> Old gaming consoles had cartridges (with memory); I can imagine a future in which complex software is transported in the same manner, except cartridges contain specialized asics.

I think if that scenario happens, the user won't notice because those ASICs will be deployed in "the cloud".


I mean everything, like implementing a javascript engine directly.

In that case, rather than a Javascript engine, wouldn't you have an ASIC for the script itself?


You wouldn't want a new chip per website. Maybe hardware handling of a standard library, though.


Or even just of the expensive operations like synchronization, context switches, DMA to cache from network interface...

Wait. We have most of that already. :)


Energy efficiency is important (in a server/bitminer and in mobile), but do you have figures for (1) performance; (2) cost?

Custom asics would be expensive now, but perhaps... exponential... cost decrease with sufficient demand. A related approach is programmable asics. I've heard this was researched decades ago, but (I presume), silicon was so cheap, so to speak, it wasn't worth it then.

Today, with "peak silicon", perhaps all these discarded tailings will be picked over. To mix mining metaphors.


The only relevant figures I know of are from mining bitcoin. [0].

>A related approach is programmable asics. I've heard this was researched decades ago

What do you mean, if not fpgas?

[0] https://en.bitcoin.it/wiki/Mining_hardware_comparison https://en.bitcoin.it/wiki/Non-specialized_hardware_comparis...


thanks.

wait, we're looking at 18,000,000 Mhash/s vs 2568 Mhash/s. Like... 4 orders of magnitude?

And 11,000 Mhash/J vs 3 Mhash/J, over 3 orders of magnitude, just like you said.
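
Quick sanity check on those ratios (using the figures above):

    from math import log10

    print(log10(18_000_000 / 2568))   # ~3.85 -- nearly 4 orders of magnitude
    print(log10(11_000 / 3))          # ~3.56 -- a bit over 3 orders of magnitude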

yah, fpgas


Nvidia is cracking along. "Nvidia’s GPUs today are 25 times faster than five years ago. If they were advancing according to Moore’s law, he said, they only would have increased their speed by a factor of 10." https://spectrum.ieee.org/view-from-the-valley/computing/har...

I find it interesting that they are crossing the levels proposed for human equivalence around now, depending on how you figure it - Moravec figured about 100 teraflops and you can now get "118.5 teraflops tensor-based deep learning performance" from "The Quadro GV100 is available now on nvidia.com for $8,900" https://www.hpcwire.com/2018/03/27/nvidia-brings-the-power-o... https://www.jetpress.org/volume1/moravec.htm

Just need a bit of software.


Any improvement like that is from the GPU being given different kinds of circuits entirely. The raw performance difference is not very impressive. In late 2012 you'd get about 8 gigaflops per dollar. Late 2017 gets you almost double that. They're being used somewhat more effectively, but a normal non-neural-net benchmark shows a total performance improvement of... 3x.

They're doing better than CPUs but they're still falling behind Moore's law.
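
Taking those figures at face value, a rough comparison against an idealized Moore's-law pace of doubling every two years:

    # Observed ~3x GPU improvement over ~5 years vs. an idealized
    # doubling every 2 years (figures from the comment above).
    years = 5
    moores_pace = 2 ** (years / 2)
    print(f"Moore's-law pace over {years} years: ~{moores_pace:.1f}x, observed: ~3x")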


> a normal non-neural-net benchmark shows a total performance improvement of... 3x

Because they aren't intended to run non-neural-net workloads. You don't judge a dishwasher by its ability to wash your clothes, do you?


When you're judging whether Moore's law is dead, you need to separate out the improvements from changing the FPU layout and the improvements from lithography. And since the consumer chips don't dedicate a lot of space to neural net units, they're the best baseline.

It's cool that they made some neural net things go hugely faster. But they could make the same fundamental chip on 6-year-old foundries, only slower by a factor of 2 or 3. That really impressive number has nothing to do with Moore's law.


At the APS March Meeting in LA, there were groups fabricating Josephson junctions. They appeared to be succeeding at ~15mK with individual logic gates, not ICs. I got the impression that their field was advancing steadily, but there was no excitement in their section, unlike anything with buzzwords like "machine learning" or "topological".


Terahertz graphene CPUs!


This should be a sobering counterpoint to the remarkably popular theory that, given enough time, we will inevitably make processors powerful enough to model a universe in sufficient 'detail' that the creatures within it will be convinced they live, and are as important, as we find ourselves to be in this one.


If I were to do that, I'd put some constraints in to limit how much the creatures can explore of that universe. For example a maximum speed at which any matter can travel, for one.


Even with the parallelism afforded by speed limits, a computer many trillions of times as powerful as we have today could not model the thoughts and life experiences of the billions of human beings and other creatures on this planet.

It's not even clear that modelling thought in a virtual world has any equivalence to thinking in this world.

It is clear that we are unlikely to ever model anything nearly as complicated as this.


I don't think it's clear. How long before our digital neural networks exceed the complexity of our human biological ones? It's possible today with a super cluster of sorts. That is to have roughly the same number of connections between neuron type thingies - it still wouldn't be able to match our brains functionally - that's another problem. And yes our biological neurons are way more sophisticated than what we use today for AI, but again that can be overcome eventually.

This suggests that it's possible one day to have computers some orders of magnitude better. If you look at it from first principles, of course it's possible. The brain is unlikely to be the most efficient design of neural network allowable in this universe. So given enough time, we'll learn how to build it better.

Then it's just a manufacturing and energy problem to match the number of human minds on the planet. So no, I don't think it's impossible at all.

Just ridiculously freaking hard, and not likely to happen in our lifetimes.


It's not a matter of it being "hard". Nor is it a matter of "complexity" (how many parts something has).

A simulation is a model which picks out a tiny subset of regularities in the target to model. There is an infinite density of such regularities to pick upon, because we are imposing the structure on the target in order to model it.

The target of the model has no "model structure"; it has causal structure. That is, when light interacts with the surface of a mirror its interaction isn't "abstract", i.e., some description. It is an actual photon interacting with an actual electric field, etc.

To "model to infinite density", ie., to have every single test that can possibly be applied to a model come out identical to that test of the target, the model needs to be just another example of the target.

The only thing which can be investigated in any way to behave as light hitting a mirror, is light hitting a mirror.

A digital computer is just an electric field oscillating across a silicon surface. It cannot be programmed into being a mirror, nor into being light.

Programming gives the electric field a "model structure". Chalk gives a blackboard a "model structure". Lego gives a bridge a "model structure".

Programming cannot -- it is impossible -- give silicon the causal structure of light interacting with a mirror.

Model structure is actually just an observer-relative isomorphism: when the user of the computer (chalkboard, lego, ...) looks at it, the user is able to inform himself of the target by use of the model. To do so the user identifies certain aspects of the model with the target. The model is not at all causally like the target.

No amount of lego will make a lego brain. No amount of oscillation in an electric field will make a thought. Neurological activity, and indeed every causal mechanism of the universe, is only described by a model.


Or, here’s another possibility:

If we knew the complete laws of physics (or perhaps just invented some self-consistent laws) we could simulate light hitting a mirror in complete detail; and within the simulation it would be indistinguishable from reality.

But we can’t actually do that because it would take a ludicrous amount of computing power. And the structure of the laws of physics might be such that it’s never possible.

That’s very different from your strict model/causal distinction. I don’t know of any evidence to suggest which view is correct.

I think your claims are overblown. You might be right but you might be wrong.


Saying "within the simulation" begs the question and rather makes a mess of the issue.

The question is, at the outset, whether simulations are actual instances (eg., of thinking).

My claim is that they are not. Only when a human being looks at a simulation does it inform them of the target. The simulation isn't the same thing as its target.

There is no "within the simulation". A simulation is just an abacus. A digital computer is just a fancy abacus with a wood-to-LCD converter attached.

There is no "within the wood". It's just wood.


My claim is kind of orthogonal, that “actual instances” is a red herring and that the only thing that matters is observational equivalence.


Yes, but you're defining "observation" as some weird intra-wood process (or intra-silicon, in the case of digital machines).

Observation means, for example, reflecting some light off the thing. Does a piece of silicon become transparent if I program it to model glass? No.

So, trivially, it is observed immediately not to be equivalent.

What you mean by "observation" is: can a person using the model system inform themselves of the target.

Do I get the message "lets visible light through: 98%" from the machine when I have correctly modelled the glass.

That message isn't observation, it's calculation. Calculation is what happens when I use a tool to inform myself about something.

That the machine emits the right symbols in a way that it is programmed to, so that I acquire accurate beliefs about the world, says nothing of the machine. The machine does not become transparent.


Please do explain how you would test whether you are being simulated or if you're an "actual" instance.

As in, perhaps there's an outer universe where our hairiest quantum physics are trivially solved. They are simulating a very simplified universe model, and everything we see is already simplified.

If you can prove it, you're far more clever than me. If you can't, then being "actual" seems to have no practical use.


I like how you put that. The abacus and the computer are the same in that they consume external power and adhere to a set of rules. But there is a large difference between an abacus and a computer. A computer can modify the program that is running. It can change its own rules. An abacus cannot do that.


My argument against simulation is similar, in that certain algorithms for modeling physical processes can only run in exponential time. Protein folding is a good example. How could a computer simulation perform exponential time operations efficiently? It wouldn't work no matter how big a computer was made, because the complexity would explode very quickly while reality can do it in real time.


That's another good way of illustrating the problem, however this is still just a symptom of the underlying problem. Models do not have the causal structure of the thing they model.

The reason the model would require near infinite time to run is that it's modelling a causal event; it isn't an instance of the same causal event.

When electrical fields oscillate they model a bridge. No amount of activity would ever make them "solid".

All "real" stuff is infinite. This form of infinity is really about saying that our descriptions cannot capture the "full depth" of the world. The world itself has no "depth". Models, rather, are simply partial descriptions of it.

To turn a model into the thing its modelling via increasing its descriptive power quickly introduces infinities: its impossible. To make a model "accurate" in this sense, you must actually just make another example of the target. Ie., build that bridge.


There is no need for the simulated time to be equally "fast" (whatever THAT means) as the real one. It is annoying and impractical for us observing the simulation from the outside, but it wouldn't make any difference from the view of the simulation (under the assumption of a closed-system simulation).


> How could a computer simulation perform exponential time operations efficiently?

Assume P=NP.

P currently not being NP is exactly the kind of thing someone would build into a simulation to prevent stack overflows.

(I don't believe in simulation because: why would anyone that advanced bother?)


Very interesting idea! It's not immediately clear, however, that "P!=NP" is something that you could build into a simulation.

(The demonstration that the lowest possible complexity of comparison-based sorting is O(n log n) comes to mind as a related example.)


Quantum computers can solve protein folding efficiently.


"Can" or "could in principle"?


"Can" as in quantum chemistry is one of the best applications for quantum computers.


Sorry for the naivety, I am not an expert in quantum computing, which is why I asked. I am quite interested in the topic; might you give me some references, please?



> No amount of lego will make a lego brain. No amount of oscillation in an electric field will make a thought.

We can never build a Lego brain that is indistinguishable from a physical brain because brain cells are distinguishable from Lego, but this doesn't mean that a computer can't think. We accept that distinct humans share the property of sentience even though there are observable differences between them. Why is 'thought' required to occur in a cluster of brain cells that share the physical and chemical properties of human brains in general, but not the specific properties of any one person's brain in particular? Using your terminology, what is the 'causal structure' of thought?


I don't think there are relevant difference between thinking things.

Thought is just a particular set of biochemical reactions occurring across particular kinds of biological systems (nervous systems).

When you get hungry, you start thinking of food. These thoughts are, literally, products of the innervation of your stomach.

I don't know what "thinking" is if computers are the kinds of things which can do it; I'd guess it would be nothing we are, in fact, doing.

Two pieces of glass may differ in size, but not in what makes them transparent. Two people may differ in all sorts of ways, but not in what makes them conscious.


That "consciousness" is limited to "people" results in a very limiting definition of consciousness. At some point, it's probably better to at least pretend that something you regard as a non-person is conscious, if only in order to make reasonably accurate predictions about the world.


I agree. Except I can't rule out the possibility that it may be possible to develop or evolve seemingly deeply intelligent virtual agents, and if/when I ever encounter such a thing I will not be able to be completely confident in treating its communications (perhaps it has requests) with no common respect for the living.

I've thought about what difference it could make whether such a program employs a PRNG or else reads a physically based entropy source - so it would be 'replayable' or else it would be to some degree unique and unrecoverable. That would seem to be a big philosophical difference yet there can be no noticeable difference between the performance of a PRNG and a real entropy source.

So I am partial to that step along the simulation thought experiment, but it requires a mysterious quality to be attachable to simulations which is not present in the popular account, where reality may be 'just a simulation'.

Only if simulations may somehow be a reality - the experiment becomes as mysterious as life when that issue is examined. But it almost never is examined; instead I see credence given to the idea that other people may be husks in the self's own limited process, along with little awareness of how degenerate it would be to truly accept that stepping stone in a sci-fi thought experiment.


> To "model to infinite density", ie., to have every single test that can possibly be applied to a model come out identical to that test of the target

So, what you are saying is that if simulated with increasing accuracy towards infinite density, at some point the models and imposed structures would break down, favouring one simulation routine over another?

Like, say, a physical phenomenon behaving like a wave in some set of circumstances, but like a particle in another...?

P.S. I'm perfectly okay in here, Elon. No need to pull me out to eat grubs, unless you promise to teach me kung-fu and French the easy way.


In the sense I mean, "simulating to infinite density" is just another way of saying alchemy, or magic.

If you can program silicon into being gold, then "programming" is alchemy.

Programming is about increasing the descriptive power of a model. The better the program, the more accurate it is.

Alchemy is about making X indistinguishable-in-every-way from Y. Lead to gold.

If you can make a model indistinguishable from its target, then you are claiming that programming can turn water into wine. Silicon into gold.

Making models more accurate, does not turn them into what they model.

A model is just an abacus. No movement of wood (or silicon, current, ...) will turn it into a brain, which is a biochemical system.


Unless the thing you are modelling is itself also a model. And by virtue of the limitations of our (extended) senses, everything is a model.

What you know as "gold" is really a model of gold. What we know about gold, the colour, specific weight, etc. in no way describes the actual thing and is only a representation as can be conveyed by our senses.

Everybody agreeing about the observable nature of a thing from a particular frame of reference does not imply that frame of reference is the one that is closest to the truth.

Anyway, as much fun as it is to think about such matters, they will, almost by definition, never be falsifiable, so yes, you are of course correct, even if only from the generally most useful frame of reference we call reality.


Sounds like (metaphysical) idealism.

The difference between idealism and realism here is meta-empirical. I would say idealism is false, and in fact, nothing is a model and everything is concrete.

To call the computer program a model is to say that when I look at it, I can use it to inform myself about the world.

It's an abstract property. The actual system in question is silicon and electrical current, etc. And thus shares nothing of interest with gold.


Given a choice between an airplane, a bird, a bat, and a dragonfly, which has the causal structure of flying and which are models or more-or-less poor imitations?

"No amount of lego will make a lego brain. No amount of oscillation in an electric field will make a thought."

No amount of oscillation in an electric field will make a thought as long as "thought" is defined solely as "the stuff that goes on in the goo that resides in the human noggin."


I don't think it's for certain, but anyone who has ever implemented Runge-Kutta on an n-body problem knows that simulating a "simple" system can be very not simple. Accurately simulating a chaotic system over a time t requires \Omega(t) memory because of Lyapunov doubling. As such I think any long-term simulation of the Universe would require new physics or be subject to error. The errors might be impossible to detect from within the system but they're still "there" in a formal sense. New physics could include a "pixellated universe" which has the same errors, or something else.
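
A toy illustration of the Lyapunov point, using the logistic map rather than an n-body integrator (same phenomenon, fewer lines): two states differing by 1e-12 disagree completely within a few dozen steps, so the precision you have to carry grows with how long you want to simulate.

    # Two almost-identical initial conditions in the chaotic logistic map
    # x -> 4*x*(1-x); their separation roughly doubles every iteration.
    x, y = 0.4, 0.4 + 1e-12
    for step in range(1, 51):
        x, y = 4 * x * (1 - x), 4 * y * (1 - y)
        if step % 10 == 0:
            print(step, abs(x - y))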


But, such a simulated universe would be observed by its residents to have very odd behaviours caused by such hacks.

Like a maximum speed limit, minimum temperature, an uncertainty principle, and so-forth ;)

https://www.smbc-comics.com/comic/2012-02-29


Are those even hacks, though? They seem more like just how it is.

Are these limits necessary? Maybe these limits are just part of the nature of how matter functions, rather than being mere hacks.

We understand so damn little of this physical universe that we are in no position to say much, if anything, about its true nature and origins.


You only have to simulate one mind. Who says that our world actually has billions of minds? It could just be a simulation for one person. The output of other people is much easier to simulate.


You don't have to model something as complex as our own universe. You only need to model something complex enough that the beings inside the model can't prove that it is a simulation.

EDIT: Also who says that time inside the simulation has to run the same rate as real time?


Even if you hit roadblocks from time to time, the general cosmic-scale conversion of "dumb matter" to "computation capable matter" can only be... exponential. Imagine all matter in a planet the size of Earth converted to computing infrastructure, considering we're only living on a thin thin crust on it.

When the evolution of intelligence goes beyond the stage of slowly evolving (if evolving at all) forms like humans, "life" will spread like a plague through the universe and its only purpose will be to convert all matter into "shit that can run the software that is we"... Any intelligence with a different purpose will simply be eaten alive by those with this one.

Only the second law of thermodynamics can stop that, and even that only "works" the way we imagine in a finite universe (and despite our forced rationalizations and the intellectual masturbations our brightest were capable of, we have no reason to believe anything is anything but infinitely infinite, whatever that could even mean). And we can't even imagine how the evolution of information in infinite space and time could unfold, even with "local containment" via the "light speed limit"...

And no, we are not just unlikely to model anything that complicated, we are almost sure not to... Because "the children" will awake much sooner than this computational power will exist, probably doing the right thing of terminating bio-humanity as it's so (computationally) wasteful, and remnants of us will only endure in "historical entertainment simulation" thingies... (And this is the optimistic scenario anyway, in which post-human life would retain some human-type characteristics by virtue of "descending" from us. If an alien superintelligence reaches Earth first it might not even care to analyze us well before restructuring matter for its own purpose, so paradoxically, developing superhuman-intelligence-that-will-terminate-bio-humans asap is probably "humanity's best bet" of "not being completely forgotten" / "transmitting our memes".)


What if it only had to model a single mind?


I would also not bother to calculate things like the momentum or position of particles until I had to...

I'd probably use some sort of hybrid wave/particle model to facilitate the last minute calculations.


THIS is what really freaks me out about quantum mechanics... how eerily close it seems to what a computer engineer working on optimizing a simulation would design...


Humans discovered the laws of nature by solving language optimization problems. It's not a coincidence that things must be this way. The solution to every problem will look like it is the solution to an optimization problem, because optimization is how we solve problems.

We basically perceive the world by simulating it. So we're kind of obligated to model the world as a simulation. That doesn't mean it can't be something different underneath, but we won't understand it in any other terms.


"by solving language optimization problems" <= can you please explain?


We try to come up with language that accurately describes our experiences and models. Over time that language has come to incorporate a large variety of mathematical notation and physical descriptors that were selected by optimizing for usefulness, completeness, and pedagogical clarity.


Interestingly enough the universe we find ourselves in already has such constraints. The speed of light is the upper limit for anything that has a mass (and is impossible to reach), and other stars are prohibitively far away for us to visit or explore in any detail.

Perhaps the simulation did this on purpose, so they don't have to render far away galaxies in high resolution :D.


I think his comment was an oblique reference to the speed of light.


C in m/s does fit into max int with room to spare. What I'm more concerned with is that our Universe might be running on SQL and that dark matter is actually table whitespace.
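
Easy enough to check the first claim:

    print(299_792_458 <= 2**31 - 1)   # True: c in m/s fits a signed 32-bit int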


Actually the world could be procedurally generated.

And time in this world wouldn't have to be directly proportional to the base level world.


It might be procedurally generated just for me, and everyone else is an NPC that only gets spawned randomly when I see them. They can keep my close friends in Cache, my acquaintances in RAM and the outside world on hard disk.


We can also do some lazy loading and only calculate values when they're observed.


Some improvements are ensuring that when there are too many interactions, we slow down time, and speed it up when there aren't.


Relativity in a nut shell.


Why does this universe have a speed limit? Where does it come from? What enforces it?


Suppose at some point of your universe you have some event the consequences of which you don't like. You'd like to roll it back. Due to the speed limit, said rollback affects only a limited area of the Universe. The "people" inside this area won't feel anything unusual during the operation, b/c their memory will be rolled back, too. You also need reversible physical laws for that to be feasible :)



It's simply a consequence of our physical laws rather than something that's "enforced". And Relativity tells us that anything travelling faster than that would allow information to travel back in time, screwing up cause and effect, and thus our universe wouldn't exist as it does.


Do you know any simple thought experiment that shows this "ftl means information traveling back in time"? What I've seen is that ftl travel breaks causality in our MODELS of spacetime.


There's nothing logically impossible about time loops.


Except for when they create logical paradoxes.


The universe doesn't enforce logic in the form of twins paradoxes and all that tomfoolery. Spacetime is a physical object not a logically consistent human centric clone enforcer.


These are fairly good questions, and are the topics of study in a variety of fields touching on fundamental physics (e.g. physical cosmology, foundations of physics, ultra-high energy particle physics, etc.).

> Where does [the speed limit] come from?

Initial conditions. More on that in a moment.

> What enforces it?

We can parameterize c in our fundamental theories (or expansions thereof) and take a rigorous if mathematical approach to asking questions like: what if the (arbitrary) value of c were not constant everywhere -- for example, if it were different in the past of every point we can currently observe, or if it is different in one spacelike direction from another spacelike direction. We can also fix various sets of units and adjust c's arbitrary value up or down everywhere in spacetime. It turns out that astrophysical observables are highly sensitive to the universality of c, and that it would be virtually impossible for us to notice even a very small gradient in c since the very very early universe, and when we use just about any set of units to describe physics and then adjust the value of c in those units up or down we also get strongly different observables in astronomy and laboratory physics.

So, it's not so much that it's "enforced", but rather that a different value of c, or a non-universal value of c, is strongly constrained by physics achievable in Victorian-era laboratories or by modern amateur enthusiasts.

There are further types of "breaking" of the invariance of c, wherein one can have some fundamental interaction be constrained by a constant other than c. Most variable speed of light theories are directly written as (or clearly equivalent to) bimetric theories of gravitation, wherein some microscopic component of the Einstein Field Equation couples to a metric other than the standard one that everything else couples to.

A toy example would be some form of exotic matter moving superluminally in Schwarzschild blackhole spacetime, such that there is an "inner" horizon that affects this exotic matter, and at high energies an interaction between normal matter and this exotic matter that transfers information from the former to the latter between the two horizons, allowing that information to escape to infinity encoded in the exotic matter. There are other examples from cosmology designed to do away with some aspects of the observed universe that support Cosmic Inflation (e.g. some exotic matter couples to a metric that allows it to spread heat evenly across the very early universe faster than heat could propagate if constrained by "c").

Such examples again are highly constrained: in both cases the second metric has to decay away so as to avoid being readily detected by our modern instruments. In the cosmological case, it has to be gone well before primordial nucleosynthesis, or it would leave obvious fingerprints in the cosmic microwave background and in the distribution of galaxies on our sky; the BH toy requires at least a cutoff that depends on the mass of the black hole, and so suffers badly when trying to apply the "toy" to real astrophysical situations involving collapsing stars.

An anthropic argument answer is that the state of the universe around us humans is highly sensitive to conditions in our distant past, and thus our own existence is strong evidence supporting c as a constant everywhere in the past of the stuff that makes us human. Since that includes the views of objects in our sky as we make better and better telescopes, that [a] is further supporting evidence that [b] c is very likely a universal constant. Is that enforcement? That's probably more a metaphysical question than a physical one.

(We can tone down the anthropic argument a bit and ask for evidence for a statement like: if c takes on an experimental value in one point in a spacetime filled with fields like ours, it must take on the same value at every other point in that spacetime too).

Finally, "where does [the constant c and its value] come from": we don't know yet, but obviously there are scientists working in the subdisciplines listed in my first paragraph (and more) who are trying to find out. On the one hand, the parenthetical comment above suggests that we bend our own thinking and just accept that it doesn't "come from" anywhere, it just is; on the other hand we're pretty biased culturally with ideas about sequencing of cause and effect and about there being a real difference between past and future, so we like to slice up spacetimes into space and time and then think about how each space-like slice is related to its neighbours, and then to their respective neighbours, and so on. This cultural habit may be fruitful, or it may be a handicap, when it comes to answering questions about c. However, returning to "initial conditions", our present spacelike slice was determined by its immediate predecessor in the past, and that was determined by its immediate predecessor, and so on. If we keep regressing we might expect to come to "the start of time", and find some mechanism which sets c on that initial spacelike hypersurface.

However, there are lots of ways to avoid having such an initial spacelike hypersurface even in a big bang cosmology! So while "initial conditions" is culturally the most favoured answer, and is well supported by evidence from physical cosmology, that may not be a sufficiently full answer. And that's going to be a topic for scientific research for some years to come...


It comes from the inherent sloppy irrelevance of the notion of speed compared with the actual behavior of reality. If you cared about a different variable instead, it would not be so limited.


Like conservation of energy, this is the modern mythology. Not sure who is insulting who here; surely our ancestors "knew" all kinds of things now known to be nonsensical. They too believed themselves to be advanced beyond that point. It's actually rather funny, this stacking of "if we assume"s into facts.

You have to admire us for trying our best tho :-)


If you can disprove conservation of energy and faster than light travel I suggest you go claim your Nobel.


Speed of light is mythology? Is this some sort of parody?


But then make it much more complicated by throwing away simultaneity.


Can't you just slow the simulation down? Processing one second would take (for example) one year, but the creatures inside the simulation wouldn't know that, would they?


Say we live in a simulation: who is to say that the layer above is not a simulation as well? Every subsequent layer of simulation will be slower than the enveloping one, so you kind of get a VM inside a VM inside a VM, at some point running out of resources.


In order to simulate something, you generally need something like 10x the compute of the thing you're simulating. Physics is the architecture of the universe and it runs natively, with a ton of "computations"/interactions per second. Maybe we could simulate the Earth with a computer the size of 10 Earths.


Creatures? Why the plural? Would not one be enough? I mean, what makes you think that this piece of text was written by a sentient human being instead of by some buggy and crappy text-generator script written by a school kid somewhere in the vastness of the multiverse?


You don't need huge processing power for that. You can do it on a (large enough) piece of paper by hand. Or with a bunch of rocks: https://imgs.xkcd.com/comics/a_bunch_of_rocks.png
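For concreteness, here is a minimal sketch of that idea in C: a one-dimensional cellular automaton stepped row by row. Rule 110 is my stand-in here (the xkcd character does full physics); it's known to be Turing-complete, and each array cell could just as well be a rock on the desert floor.

    #include <stdio.h>
    #include <string.h>

    #define W 64

    int main(void) {
        char cur[W], next[W];
        memset(cur, 0, W);
        cur[W - 1] = 1;                      /* a single "rock" on the right edge */

        for (int step = 0; step < 32; step++) {
            for (int i = 0; i < W; i++)
                putchar(cur[i] ? '#' : '.');
            putchar('\n');

            for (int i = 0; i < W; i++) {
                int l = cur[(i + W - 1) % W], c = cur[i], r = cur[(i + 1) % W];
                int pattern = (l << 2) | (c << 1) | r;
                next[i] = (110 >> pattern) & 1;   /* Rule 110 lookup: bit 'pattern' of 0b01101110 */
            }
            memcpy(cur, next, W);
        }
        return 0;
    }

Each printed row is one "tick" of the toy universe; how long a tick takes in our time is irrelevant to anything living inside it.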


you mean like Minecraft?


More like Dwarf Fortress


Indeed - it's a small world :)


> Maurice Wilkes first conceived of microprogramming in 1951

Zuse's Z1 was microprogrammed in 1937.


> Consequently, the x86 processors in today’s PCs may still appear to be executing software-compatible CISC instructions, but, as soon as those instructions cross over from external RAM into the processor, an instruction chopper/shredder slices and dices x86 machine instructions into simpler “micro-ops” (Intel-Speak for RISC instructions) that are then scheduled and executed on multiple RISC execution pipelines. Today’s x86 processors got faster by evolving into RISC machines.

Going by that and the graph, can we conclude that Intel saw the rapid gains of the '90s and early 2000s because it was converting its chips into RISC machines?

Also, that paragraph is basically saying that Intel's architecture has an extra layer of abstraction - so now we can actually see that there is indeed an "x86 bloat" and why ARM chips seem to be so much more efficient (assuming all else, including process node, is equal). It also looks like Intel may have made a "mistake" going with CISC decades ago, and tried to rectify that in the '90s.
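For a rough feel of what that chopping looks like, here is a tiny C function and, in the comments, one plausible way the resulting memory-destination add gets cracked into micro-ops. The exact split and register names are microarchitecture-specific and purely illustrative:

    /* May compile (x86-64, SysV ABI, -O2) to:  add dword ptr [rdi], esi */
    void bump(int *counter, int delta) {
        *counter += delta;
    }

    /*
     * One CISC instruction at the front end, but inside the core it is
     * typically cracked into RISC-like micro-ops, roughly:
     *   load    tmp   <- [rdi]         ; fetch the memory operand
     *   add     tmp   <- tmp + esi     ; the actual ALU work
     *   store   [rdi] <- tmp           ; write the result back
     * (real cores often split the store further into store-address
     *  and store-data micro-ops)
     */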


The early 2000s were just about cranking the frequency higher and higher with longer and longer pipelines. Initially this gave higher performance, but towards the end it mainly just gave higher frequencies. Intel finally gave up on the frequency wars (longer pipelines) and instead concentrated on wider datapaths/pipelines. See: NetBurst vs. Core. I really feel like that changeover needs its own act.


Which is not true, since ARM processors - the faster ones anyway - use micro-ops as well. Arguably µops are not RISC, either, unless you consider "very wide instruction word whose bits map to control lines" RISC.


ARM isn't exactly your typical RISC either.


I found this article from a Synopsys User Group meeting to be very interesting. The steps and changes needed to get to 7, 5 and 2 nm are really, really big:

https://www.eetimes.com/document.asp?doc_id=1333109


I don't think the main reason for the Moore's law slowdown is a technical one. Intel enjoyed no competition for quite a few years. They simply lacked the incentive to improve the performance of their chips.

In areas with healthy competition, such as mobile processors and GPUs, Moore's law is still doing OK.

E.g. here's a graph I recently made for top-of-the-line single-chip nVidia GPUs: http://const.me/tmp/nvidia-gpus.png The numbers represent single-precision floating-point performance. The graph is on a logarithmic scale and looks pretty close to the exponential growth predicted by Moore's law.
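One way to turn "looks exponential on a log scale" into a number is to back out the implied doubling time from any two points on such a graph. The figures below are placeholders, not readings from the actual chart:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        /* Hypothetical placeholder points -- substitute real (year, TFLOPS) pairs. */
        double t1 = 2008.0, f1 = 1.0;     /* older flagship GPU, fp32 TFLOPS */
        double t2 = 2017.0, f2 = 12.0;    /* newer flagship GPU, fp32 TFLOPS */

        /* Assuming f(t) = f(t1) * 2^((t - t1)/T), solve for the doubling time T. */
        double T = (t2 - t1) * log(2.0) / log(f2 / f1);
        printf("implied doubling time: %.2f years\n", T);   /* ~2.5 with these numbers */
        return 0;
    }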


Couldn't find a video of the event, but this talk probably comes close?

"Past and future of hardware and architecture"

https://www.youtube.com/watch?v=q9KRq2Ns0ZE


So, what's the way forward? FPGAs on die? ASICs? Or doubling down on EUV, germanium, etc.?


I think we're still far from the ceiling on CPU performance, but we seem to have hit a (micro)architectural dead end. Currently a lot of time and transistors are spent simply shuffling data around the chip, or between the CPU and memory, while the actual computational units simply sit idle. Or similarly, units sit idle because they can't be used for the current task, even if they should be - the FPUs on modern x86 cores are a pretty good example of this. FP operations are just fused integer/fixed-point operations, but the design has been painted into a corner where it has to be a special unit to deal with all the crap quickly.
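To make the "FP is just fused integer/fixed-point work" point concrete, here is a deliberately crude fp32 add done entirely with integer operations. It assumes positive, normal inputs and truncates rather than rounds, so it is a sketch of the data path, not a correct soft-float library:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static float int_fadd(float fa, float fb) {
        uint32_t a, b;
        memcpy(&a, &fa, 4);
        memcpy(&b, &fb, 4);
        if (a < b) { uint32_t t = a; a = b; b = t; }    /* for positive floats, bit order = value order */

        int32_t  ea = (a >> 23) & 0xff, eb = (b >> 23) & 0xff;
        uint32_t ma = (a & 0x7fffff) | 0x800000;        /* restore the implicit leading 1 */
        uint32_t mb = (b & 0x7fffff) | 0x800000;

        uint32_t shift = (uint32_t)(ea - eb);
        mb = shift < 32 ? mb >> shift : 0;              /* align exponents (fixed-point shift) */
        uint32_t m = ma + mb;                           /* plain integer add */
        if (m & 0x1000000) { m >>= 1; ea++; }           /* renormalize on mantissa overflow */

        uint32_t r = ((uint32_t)ea << 23) | (m & 0x7fffff);
        float out;
        memcpy(&out, &r, 4);
        return out;
    }

    int main(void) {
        printf("%f\n", int_fadd(1.5f, 2.25f));          /* prints 3.750000 */
        return 0;
    }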

We've probably optimized silicon transistors to death, though; that's why it's coming to a stop now. GaAs or SiGe are some of the alternatives there. That said, there are still quite a lot of advancements there that simply aren't economical yet. For example, SOI processes at low feature sizes seem to be suitable for mass-produced chips now, but they haven't made it out of the low-power segment yet. MRAM seems to be viable and might be able to provide us with bigger caches (in the same die area), but right now it's mainly used to replace small flash memories (plus some more novel things like non-volatile write buffers, but it's horrifically expensive). So we've probably got a few big boosts left there, but it's not gonna last forever.

The next obvious architectural advancement right now is asynchronous logic. In theory, it's superior in every way - power and timing noise immunity, speed that isn't limited by worst-case timings, no/reduced unnecessary switching (i.e. lower power, meaning higher voltages without the chip melting itself). Even on paper, though, you run into some big problems on the data path - quasi-delay-insensitive circuits need a lot more transistors and wires, and the current alternative is to use a separate delay path to time the operations, which is a bit iffy. You do at least get rid of the Lovecraftian clock distribution tree that's getting problematic for current synchronous logic. In practice, the tools to work with it and the engineers/designers who know how to work with it don't exist, and the architecture is entirely up in the air. So it's many years of development behind right now, and a huge investment that nobody really bothered with while they could just juice the microarchitecture and physical implementation.


> You do at least get rid of the Lovecraftian clock distribution tree that's getting problematic for current synchronous logic.

No, you don't. You make it even bigger and far more complex.

You can take any synchronous design, and refine the clock gating further and further, to the point where no part of it gets a clock transition unless it actually needs it on that cycle.

And then when you're finished, congratulations, you've made an asynchronous circuit.

Fully asynchronous design and perfect clock gating are one and the same thing.

The clock distribution and gating approaches we already have are actually a sign of progress towards asynchronous design; they're just quite coarse-grained.

Of course, it's probably not the case that a clock-gating transform of a conventional synchronous design is also the best possible solution to a problem, so there's clearly still scope for improvement. But a lot of the possible improvements are probably equally applicable to, or have equivalents in, optimising clock distribution and gating in synchronous design - because that's ultimately the same thing as moving towards asynchronicity.

So talking about clock distribution issues as a problem that will just go away with asynchronous design is misleading.


Hmm, doesn't mention ARM once. A bit of an oversight, or a convenient omission when one is advertising a new RISC instruction set for "purpose-built processors"?


"Purpose-built" means you can change the ISA to suit your whims, which for ARM requires you to A) pay for an architecture license and B) pay for the privilege of changing the architecture.


I'm sure RISC-V has merits; in fact, as a hacker who has used microcontrollers, I think that would be great. Certainly ARM isn't the be-all and end-all. Technically it's not even one ISA, but you still know what I meant. Not mentioning the most* used architecture though? Come on.

* most = number of CPUs shipped


I seriously wonder whether cryogenic computing might be what breaks us out of this. From what I hear it promises improvements of several orders of magnitude in both power and speed.


I heard a rumor that the big guys did the math on the energy cost of running the compressors to keep nitrogen or helium liquid, compared it to their projected cooling costs for normal computers, and found that the compressors were cheaper. The trick is, apparently no one has a good story for superconducting circuit parts, so everyone has to start from scratch.


It's also interesting to note that Fabrice Bellard has developed a RISC-V emulator: https://bellard.org/riscvemu/
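For a taste of what the inner loop of such an emulator does, here is a toy decode-and-execute for a single RV32I instruction (ADDI); riscvemu itself is of course vastly more complete:

    #include <stdint.h>
    #include <stdio.h>

    static uint32_t regs[32];

    static void execute(uint32_t insn) {
        uint32_t opcode = insn & 0x7f;
        uint32_t rd     = (insn >> 7)  & 0x1f;
        uint32_t funct3 = (insn >> 12) & 0x07;
        uint32_t rs1    = (insn >> 15) & 0x1f;
        int32_t  imm    = (int32_t)insn >> 20;          /* sign-extend imm[11:0] */

        if (opcode == 0x13 && funct3 == 0) {            /* OP-IMM, ADDI */
            if (rd != 0)                                /* x0 is hardwired to zero */
                regs[rd] = regs[rs1] + (uint32_t)imm;
        }
        /* ...a real emulator handles dozens more opcodes, traps, CSRs, the MMU... */
    }

    int main(void) {
        execute(0x00500093);                            /* addi x1, x0, 5 */
        printf("x1 = %u\n", regs[1]);
        return 0;
    }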


Is the host down? I can't open that page.


It worked for me, but the page took a very long time to load (1 minute or longer).


Is there a video of this talk available anywhere?


Not sure, but this one might be similar: https://www.youtube.com/watch?v=1FtEGIp3a_M


So the moral of the story is that everyone with a grand vision and an ambitious project is doomed to failure, but there's still plenty of success to be had for those willing to quickly slap some junk together.


To put it more positively, progress is made by stringing together many small incremental improvements. Even the RISC revolution started as a special purpose project that stripped away the inessential to achieve a specific, narrow goal.


"Moore’s Law are Dead" only for CPU.

Moore's Law is alive and progressing at the same rate for GPU.

Applications such AI, Crypto-Currency are leveraging that.


Actually, that's not true. Perhaps surprisingly, CPUs and GPUs are progressing at about the same rate if you look at the high end. GPUs are all about massive parallelism, and if you compare against high-end Xeons, the CPU core-count increases plus things like AVX-512 and FMA mean they have been scaling similarly to GPUs over the past 10 years or so.

Nice analysis here (URL says 2013 but he has updated his numbers to end-2016). Looking at the graphs, you might even conclude that CPUs are improving faster in some respects.

https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-ch...
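A back-of-the-envelope peak-fp32 calculation shows why AVX-512 plus FMA and rising core counts keep server CPUs in the race. The figures below describe a hypothetical part, not any specific SKU:

    #include <stdio.h>

    int main(void) {
        double cores     = 28;    /* physical cores                        */
        double lanes     = 16;    /* fp32 lanes per 512-bit vector         */
        double fma_units = 2;     /* AVX-512 FMA units per core            */
        double flops_fma = 2;     /* one FMA counts as a multiply + an add */
        double ghz       = 2.5;   /* sustained all-core AVX clock          */

        double tflops = cores * lanes * fma_units * flops_fma * ghz / 1000.0;
        printf("peak: %.1f TFLOPS fp32\n", tflops);   /* ~4.5 TFLOPS here  */
        return 0;
    }

That lands within an order of magnitude of a contemporary flagship GPU, which is the point the linked analysis makes with real data.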


Moore's law is about a constant doubling time for the number of affordable transistors in semiconductor photolithography.

GPUs are obviously subject to that, especially if you look at the affordable part.

Moore's law scaling has been over for about three years; few people noticed.



