Fifty or Sixty Years of Processor Development for This? (eejournal.com)
318 points by curtis on April 4, 2018 | 312 comments



Wonder what this means for system software and application development.

There's a factor of 10-40x speedup by going from an interpreted language like Python/Ruby/PHP to a tight compiled one like C++/Rust/Ocaml. 2-4x going from a good JIT like V8 or Hotspot (or Go's runtime, though technically not a JIT). Probably another 10-100x by cutting out bloated middleware like most web frameworks or the contents of your node_modules.

All this was irrelevant when you could get your 2-4x speedup by waiting 18 months, and your 10x speedup by waiting 5 years. It's very relevant when your 2x now takes 20 years and 10x takes a lifetime. Maybe this is why Rust has been getting so much attention recently.


I run a production Rust web service. The speedup for this service over using slightly stripped Rails was only about 5x. As you said, you can gain like 50-100x performance improvements from not using the default Rails JSON serialization and skipping ActiveRecord.

After that, you're lucky to gain 5x performance from re-writing the whole thing in Rust. Most of the hot spots of serving web applications using Ruby are already written as native extensions.

I think Rust is fantastic. I'm writing a tinyrb-like "Ruby" VM in Rust at the moment. But... it's just not worth the hassle for plugging web services together. Maybe if you're at Google scale and already have web services in C++ it'd be a good choice.


Let's not forget that 5x is actually a big difference, especially for the end user... You can also avoid a lot of near-the-edge performance issues/instability.

As you allude to, it may not be worth a 10x increase in developer effort to try to do a bunch of rewriting when you could do something (as you said) like just replacing the bloatiest, slowest parts of Rails and get large performance gains.

Have you written anything about this? I'd love to read more. As a person who is neutral-to-negative on Ruby and full-on hates Rails, I'd love to hear from someone who's experienced running something as new as Rust in production alongside a Rails application, especially from someone who likes Ruby.


Right, but you can also get that same 5x increase from writing something using a higher level language running on the JVM or BEAM. You don't need to go full on "systems programming" to have something that outperforms Ruby.


MRI Ruby and BEAM are both bytecode VMs. There's no simple 5x increase in performance there. BEAM is register based, and YARV is stack based, so BEAM tends to be a little faster.

The JVM is a completely different beast. Thanks to improvements in the JVM for hosting dynamic languages, JRuby is able to offer some serious performance advantages over MRI Ruby, and I'm really excited about TruffleRuby.


Just to confirm why this is getting downvoted -- I'm guessing the MRI vs BEAM statement is super wrong? I can't say I know much about Erlang's VM, but both being "bytecode VMs" doesn't make MRI the same as BEAM, never mind that the primary paradigms they serve are completely different.

I assume JRuby is faster than MRI Ruby (at least for parallelizable tasks), so I can't imagine the second statement being wrong per se... Also I'd never heard of TruffleRuby, and call me prejudiced, but I don't want to be anywhere near anything developed by Oracle, whether it's Oracle Labs or not (I hadn't heard about GraalVM before).

Speaking of fast Ruby... I'm surprised none of the efforts to make Ruby faster / swap out MRI have panned out. In Python land I know there are efforts like PyPy and PyPy+STM tackling the slowness and GIL problems while maintaining idiomatic use of the language.


I'm not sure why this is downvoted so hard, because it's true. BEAM has massive advantages over MRI/CRuby for concurrency and parallelism, but not really much in sheer throughput.

For example, Sinatra + Sequel on MRI has a 50% lead in throughput over Phoenix + Elixir on BEAM here: https://www.techempower.com/benchmarks/#section=data-r15&hw=...

Bytecode VMs all have the same fundamental problem with instruction dispatch overhead, regardless of what language or paradigm they're supporting, which is why JIT is so important even though it's so much additional complexity.
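
To make the dispatch-overhead point concrete, here is a minimal sketch of the switch loop at the heart of a stack-based bytecode interpreter. The opcodes are made up, not MRI's or BEAM's real instruction sets; the point is that every useful operation is wrapped in fetch/decode/branch work that a JIT would compile away.

    /* Minimal stack-based bytecode interpreter: the "real" work (the add)
     * is dwarfed by the fetch/decode/dispatch around it. Hypothetical opcodes. */
    #include <stdio.h>

    enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

    static void run(const int *code) {
        int stack[64];
        int sp = 0, pc = 0;
        for (;;) {
            switch (code[pc++]) {            /* fetch + decode + dispatch */
            case OP_PUSH:  stack[sp++] = code[pc++]; break;
            case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
            case OP_PRINT: printf("%d\n", stack[sp - 1]); break;
            case OP_HALT:  return;
            }
        }
    }

    int main(void) {
        int program[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
        run(program);   /* prints 5; most cycles go to the switch, not the add */
        return 0;
    }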

JRuby is now 2-3x faster in general as the JVM improved support for hosting dynamic languages, and it's in heavy production use at places like TalkDesk, so in that sense it is the "faster Ruby".

There's also Topaz, which is Ruby built using the underlying framework of PyPy, but performance is disappointing AFAIK.


I find the fact that anyone can speak dismissively about a 5x speedup disheartening. Has anyone ever done a study on how much CO2 we are emitting in the name of "developer productivity"?


It's probably less than you think. Humans - just by virtue of existence - produce a huge amount of CO2, both through the air they breathe, the meat they eat, the automobiles they get to work in, the heavy machinery used to build those roads & buildings, the manufactured goods they consume, etc. And the CO2 cost of a developer isn't just that one developer's emissions; it's also those of all the support staff needed, from managers/admins/HR at work to the food service workers that serve them meals out to the doctors/lawyers/therapists and other service providers they visit to the parents that raised them.

It's almost certain one developer generates more CO2 than any reasonable number of servers that run their code. Anything that reduces manpower costs is a net positive for emissions. Besides, when the equation changes (say, when the software enters maintenance mode but the servers stay up), there'll be a strong economic incentive to spend the developer time to rewrite it more efficiently.


> It's almost certain one developer generates more CO2 than any reasonable number of servers that run their code.

I'm not so sure. Let's do a back-of-the-envelope estimate.

Assume a single really hefty server that consumes 1 kilowatt. Over one year, this is about 10,000 kWh. 1 kWh of electricity produced by a coal-fired plant generates about 1 kg of CO2 (https://carbonpositivelife.com/co2-per-kwh-of-electricity/). Thus that big server running for a year produces about 10 metric tons of CO2.

An average American lifestyle (all in, total country production divided by population, https://www.theguardian.com/environment/datablog/2009/sep/02...) involves the production of about 20 tons of CO2 per year. So if the code you write runs full-time on more than 2 really big servers, your code might be producing more CO2 than the rest of your lifestyle.

I'm guessing that most of the errors in this probably overestimate the code's CO2 (probably not coal-fired, probably less than 1 kW, a year is less than 10,000 hours), so more realistically maybe it's 4-8 servers to break even? Still, I think it's fair to say that there are some participants in this forum whose running code probably generates more CO2 than the rest of their lifestyle.
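
If anyone wants to tweak the assumptions, the arithmetic above is easy to redo; this little sketch just encodes the thread's own numbers (1 kW server, ~1 kg CO2 per kWh from coal, ~20 t/year per American), not measured data:

    /* Back-of-the-envelope CO2 estimate using the figures from this thread. */
    #include <stdio.h>

    int main(void) {
        double server_kw      = 1.0;            /* assumed draw of one hefty server */
        double hours_per_year = 24.0 * 365.0;   /* 8760, a bit under 10,000         */
        double kg_co2_per_kwh = 1.0;            /* coal-heavy grid assumption       */
        double lifestyle_tons = 20.0;           /* average US per-capita emissions  */

        double server_tons = server_kw * hours_per_year * kg_co2_per_kwh / 1000.0;
        printf("one server: %.1f t CO2/year\n", server_tons);
        printf("servers to match a US lifestyle: %.1f\n",
               lifestyle_tons / server_tons);
        return 0;
    }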


Something like 2% of the US power consumption goes to data centers.

But think about this... how much CO2 would you spend getting developers to work every day so they can reduce CO2 somewhere else? I suspect the biggest users of DC power, in terms of code, are already written in languages like C++.


5x is an amazing speedup! How much latency did that strip out for end users?


Almost nothing in the context of the long-running analytical queries the web layer has to wait on the DB for.


How much did this save you in energy and hosting fees?


Not at a large enough scale for either to be relevant.


Not having used Rust, I'm wondering how it makes writing a VM easier or harder?


But if you are talking about a stripped down Rails without ActiveRecord, we are looking at something on par with a Sinatra service. That is pretty lean already. Compared to Rails you already have a big performance improvement.

5x overhead for a Ruby webservice on that level is decent. I wouldn't have expected much more.


On the other hand, software often follows the Pareto principle: 80% of the time is spent executing 20% of the code [^1]. Optimizing/migrating just that 20% would yield most of the benefits of a complete rewrite. (Good luck if that 20% resides in your framework of choice, though).

[^1]: https://www.codeproject.com/Articles/49023/The-impact-of-the...


The evils of premature optimization are well known, but in my experience execution often isn't the bottleneck; memory is.

Which means you can get a lot of benefits by optimizing your data structures. Making them smaller is a good start. Unfortunately, changing data structures can have far-reaching implications, sometimes justifying a complete rewrite.

For example, going from string-based hashes to raw C structures can yield massive improvements, but you need to rewrite most of the code that accesses it, even parts that are rarely used.
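
A toy illustration of that change (the field names are made up): the hash-style record pays for a string comparison, and in a real hash table for hashing and probing, on every access, while the struct access compiles to a load at a fixed offset on a small, cache-friendly object.

    /* Toy contrast between a string-keyed "hash" record and a raw struct. */
    #include <string.h>
    #include <stdio.h>

    /* Hash-style record: every access walks names and compares strings. */
    struct field { const char *key; double value; };
    static double get(const struct field *f, int n, const char *key) {
        for (int i = 0; i < n; i++)
            if (strcmp(f[i].key, key) == 0) return f[i].value;
        return 0.0;
    }

    /* Struct-style record: an access is a load at a compile-time offset. */
    struct reading { double temperature; double pressure; };

    int main(void) {
        struct field   h[] = { {"temperature", 21.5}, {"pressure", 101.3} };
        struct reading r   = { 21.5, 101.3 };

        printf("%f %f\n", get(h, 2, "pressure"), r.pressure);
        return 0;
    }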

Deep copies are another performance killer. They are also hard to optimize out, because if you switch to a more efficient reference, you risk messing up data for the rest of the program in every part that uses it.


Price for compute is still going steadily down for web services. Which I think is more relevant than individual core/CPU performance. CPUs are still getting cheaper, blades are getting more compact. Virtualization, containers and function-as-a-service are improving utilization. Modern cloud service providers have massive economies of scale.


Yep, we are seeing computers evolve from hybrid media consumption/creation/business tools into a bifurcated model where IaaS providers rent out the majority of the world's compute power, and average consumers own a (powerful) thin client for UI/display/consumption.

There will still be sizable markets (real-time industrial/automotive/medical/etc., gaming, some business apps) where individuals will want to own and upgrade their own beefy on-prem hardware, but in the consumer realm I can't think of many new applications that have taxed my 2012 MBP, or even my older Dell/Win7 laptop. AWS/Azure/GCP shoulder the burden for me.


That only helps for embarrassingly parallel tasks, like rendering hundreds of distinct web pages per second (which in many cases might just as well have been partially precalculated to trade space for time).

As was pointed out in the chart, we’re already hitting Amdahl’s law with half a dozen cores.


I believe what you're seeing is market competition, not reduction in hardware costs.

The total cost to offer compute has actually gone up significantly over the last year or so due to memory prices shooting up. I believe the cost of processors has gone up too, but not as drastically.


It's all about implementation. I got a ~100x speedup by going from bad Fortran to average Go (notice, both compiled). And a ~10x speedup by going from bad Go to excellent Python (compiled -> basically interpreted).

So yes, compiled languages can give you performance benefit, but it's not guaranteed, you need to work for it.


What is your app?


There's such a thing as being penny-wise and pound-foolish, though.

Imagine if I told you I had a desktop application that was slow, and I'd achieved a 2-4x speedup in common tasks by rewriting the whole thing in a different language. But then you discovered that I'd written it in such a way that it had to hit the (spinning-platters, just to drive the analogy home) disk to do even the most basic things. You would, I hope, tell me that I'd wasted a ton of effort optimizing entirely the wrong thing, and that my 2-4x speedup from choosing a new language would be blown away by the likely multiple-orders-of-magnitude speedups I could get from better memory management and I/O patterns.

In the web world, most discussion of backend languages is like this. Sure, you could get your impressive-sounding speedup from switching to a "faster" language. But the time spent executing application code is so utterly dwarfed by the time you spend idle while waiting on a database, or by the time it takes to send things over the network to the client, that "switch to a faster application language" should probably be the 1000th item on the list of the first 1000 things you do to try to improve performance.


The idea that a webapp should be a frontend to a RDBMS is itself an artifact of the "time to market is way more important than computational efficiency" culture of the mid-00s.

You can get 1000x speedups by ditching the database and serving out of RAM. Sites like Hacker News, PlentyOfFish, Mailinator and Google (back when the webserver was written by just Craig Silverstein rather than a team of hundreds) serve thousands of queries per second off a single box. In most cases the actual access patterns of apps don't map particularly well to either an RDBMS or a key-value store, so if you're willing to put in the time to develop a custom datastore, there are still large efficiencies available.


Maybe, maybe not.

What does the profiler say? That is always the first question in optimization: if the profiler comes up with anything you can fix, that fix is always your biggest bang for the buck. If the profiler doesn't really say anything, we are in a tough spot. There are things that are worth doing anyway, but you have to be careful: in most cases, even if the change really is faster, it probably will not be fast enough.

Back to your example: maybe we cannot fix the database, but even if the database query is 10 seconds, going from a 10.5 second response to 10.001 seconds is an improvement.

Your scale is an important question. Facebook can save tens of thousands per month with an optimization so small no user will notice it -- between not having to buy as many servers and not having to pay for as much power to run the CPU for those extra cycles. (Facebook will not give you the actual numbers, but they will tell you they have measured, and you can read between the lines to guess how much it must be, given that they employ a few people to work on small optimizations.)


In web apps, the first thing you do is start looking at DB queries. How many, how complex, etc. Reducing the number of queries and making those queries as simple/fast as possible accounts for at least the first quarter of my hypothetical "first 1000 things you do to try to improve performance" list.

Also, regarding scale: the number of entities operating at AmaFaceGoog scale is small. The number of entities operating even within an order of magnitude or two of them is small. The odds are very much against any advice relevant only at that scale being relevant to the average Hacker News reader.


I think the death of conventional Dennard scaling and Moore's law is part of why interpreted super-dynamic languages like Ruby have lost their luster. We can no longer just ride CPU performance increases to get increasing application performance, so these languages' implicit slowness becomes a liability that won't just go away magically.

My optimistic hope is that the death of easy CPU performance gains will lead to a round of serious evolutionary improvements in software architecture. Rust is probably a sign of that happening.


> interpreted language like Python/Ruby/PHP

There is no such thing as an "interpreted language." An interpreter is a class of programming language implementation. The languages you listed do not even have interpreters as their standard implementations; they have bytecode virtual machines. There are alternative compiler implementations for Python and PHP.


Everybody says this - believe me, I was one of them when I was younger - but if you've ever tried to write a compiler for Python or Ruby, you'll understand why they're often called "interpreted languages".

(Interpretation vs. compilation is a continuum, anyway; even a compiled language like C uses a runtime for several operations like malloc or strings, while modern JITs like V8 or PyPy can compile a trace of methods and then fall back on an interpreter for uncommon cases. Nevertheless, there's still an important distinction there: in a "compiled language" like C or Rust common syntactic forms like addition, property access, or function calls can semantically map to machine instructions or memory locations fairly easily, while in an "interpreted language" like Python or Javascript, even a property access or arithmetic operation may invoke an arbitrary, not-predictable-at-compile-time piece of code, and hence require runtime dispatch.)
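
That parenthetical can be sketched in C. The tagged-value machinery below is a stand-in for what a dynamic-language VM does, not any particular VM's implementation; the point is that the "compiled" addition is a single machine instruction, while the "dynamic" one has to dispatch on runtime type.

    /* "a + b" in a compiled language vs a dynamic one, sketched in C. */
    #include <stdio.h>

    /* Compiled: types known at compile time, so this is one add instruction. */
    double add_static(double a, double b) { return a + b; }

    /* Dynamic: values carry their type, and "+" dispatches on it at runtime. */
    enum tag { T_INT, T_DOUBLE };
    struct value { enum tag tag; union { long i; double d; } as; };

    struct value add_dynamic(struct value a, struct value b) {
        struct value r;
        if (a.tag == T_INT && b.tag == T_INT) {
            r.tag = T_INT;
            r.as.i = a.as.i + b.as.i;
        } else {            /* promote to double; a real VM also has to handle
                               strings, bignums, user-defined "+" methods, ... */
            r.tag = T_DOUBLE;
            r.as.d = (a.tag == T_INT ? (double)a.as.i : a.as.d)
                   + (b.tag == T_INT ? (double)b.as.i : b.as.d);
        }
        return r;
    }

    int main(void) {
        struct value a = { T_INT,    { .i = 2   } };
        struct value b = { T_DOUBLE, { .d = 3.5 } };
        struct value r = add_dynamic(a, b);
        printf("%f %f\n", add_static(2.0, 3.5), r.as.d);   /* 5.5 5.5 */
        return 0;
    }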


> but if you've ever tried to write a compiler for Python or Ruby, you'll understand why they're often called "interpreted languages".

Why, I happen to have hacked on Lisp implementations. Including Thinlisp, which is a compiler for a non-garbage collected real time subset of Common Lisp. And what you say is still wrong.

An interpreter is not the same as a bytecode VM, and is not the same as native code compilation. And none of those things are a property of programming languages. The only programming languages where that might remotely be true are purely string-based ones like TRAC and maybe some of the term rewriting ones.

> even a property access or arithmetic operation may invoke an arbitrary, not-predictable-at-compile-time piece of code, and hence require runtime dispatch.)

If that's the criterion, then vtables make C++ "interpreted."


> If that's the criterion, then vtables make C++ "interpreted."

So do function calls; you don't know until the .o is linked where a read call will go. Unix C library? Or some local override of read?

And, speaking of vtables, modern shared lib calls give obj->virtual(args) a run for its money in terms of overhead. It's worse because a vtable's structure is determined at compile time, so it's just positional referencing; a shared lib call has the referencing through a table plus string name lookup to figure out the offsets at run-time (at least the first time through).


There's a reason why C++ requires the "virtual" keyword on methods that may be overridden...


What does that mean for software development? I hope the lazy excuses to NOT write performant software will die out and software gets fast again. Remove the hundreds of layers of libraries and frameworks and build good, fast software again. We will find other ways to manage complexity and software reuse.

Death of Moore's Law really is the best that could happen to us...


This assumes compute costs comprise a sizable part of the budget of a software company. If a company can save a couple of percent per year by doing a complete rewrite in Rust or C++, the return doesn’t make the cost worthwhile.


Speaking of Moores law being dead this time, check out this old article from 2012 predicting we would be at 7nm Intel chips with 5nm on the way: http://www.tomshardware.com/news/intel-cpu-processor-5nm,175...

Intel is still trying to figure out 10nm because it is rumored that there are material science problems that are causing yield issues. Remember the 1960s when rapid gains in space tech made everyone think we'd be travelling around the solar system by 2000? The tech hit a plateau and stopped. Maybe we're in that situation with chip technology...


Just want to let you know, as a person who worked at Intel at one point -- 10nm is already plenty hard. At 10nm and below, engineers/designers at Intel (and a bunch of smaller, more agile companies Intel works with) are at this point just fighting physics as we know it.

There are promising technologies coming about, but since they're so different, the amount of time it takes to perfect the process is super unforgiving, and Intel is trying to navigate all that while putting out products that actually appeal to customers in different markets, planning them out at least a year in advance despite not having any idea what the market will look like in a year.

Moore's law is super dead. If any company were to manage to keep up it would be nothing short of a miraculous revival.


How are GPUs still apparently increasing their performance exponentially without using more power? It looks like the best 10-series NVIDIA GPU has about twice the performance of the best 9-series from about 2 years earlier, which lines up with Moore's Law pretty closely. How are they doing it? Maybe they were actually lagging Intel -- looks like the 9 series was a 28nm process, and the 10 series is a 14 or 16nm process.


Yeah, GPUs are a few nodes behind, which means they will take longer to hit the same wall. Even the GTX 1080 Ti just hit 16nm.

Compare a 95 W (32nm) i7-2600K from January 2011 vs a 65 W (14nm) Core i7-8706G from February 2018: that's over 7 years of progress for 30% lower power consumption and a fair speed boost.

That's what GPUs are facing starting now, though they can directly trade lower power consumption for more speed, as they are embarrassingly parallel.


I can't answer authoritatively, but I would speculate it has to do with the highly parallel nature of GPUs.


That and dark silicon: https://en.wikipedia.org/wiki/Dark_silicon

Running a modern CPU within an acceptable TDP means only so many of the transistors can be utilized simultaneously.


> Remember the 1960s when rapid gains in space tech made everyone think we'd be travelling around the solar system by 2000?

I think space tech is a different beast economically - the incentive in the 1960s had been more to compete for global dominance, and not so much demand (from the general public) to travel in space. I believe there is a bigger incentive to have faster processors now, but we might be hitting physical/engineering limitations, so we go distributed.


An amazing 2014 talk[1]/article[2] by @idlewords explores exactly this premise.

[1] https://www.youtube.com/watch?v=nwhZ3KEqUlw [2] http://idlewords.com/talks/web_design_first_100_years.htm


Isn't nanometer scale also getting to the point that you're dealing with individual atoms? There is undoubtedly a point where you can't progress any further, and it seems reasonable that progress gets a lot harder as you approach that point.


In the biz people already measure some widths as "monolayers", which refers to a single layer of atoms. Some process features have error margins of less than a monolayer -- that is, a dielectric needs to be 4 monolayers thick, and if it only has 3, the device is going to short, and if it has 5, the device will not work.


What kind of CVD/PVD process do you use to achieve those kinds of tolerances? Or is the trick to use protect/deprotect steps like in conventional chemistry?


ALD is commonly used for these ultra fine tolerances.


The lattice constant of a silicon crystal is about 0.54 nm, so we are not quite there yet.


So you're saying a 10nm structure is about 20 atoms long? So basically proving the parent's point...


No, saying that we are still an order of magnitude away from dealing with individual atoms. The current slowdown is more about light than about atoms; the move to EUV lithography [1] has proved to be very challenging [2].

[1] https://en.wikipedia.org/wiki/Extreme_ultraviolet_lithograph...

[2] https://spectrum.ieee.org/semiconductors/nanotechnology/euv-...


In theory Coulomb transistors can be a single molecule. I believe single molecule transistors have been experimentally demonstrated.


Anyway, around 5nm you start to have quantum tunneling effects. Current technology can't be scaled down much further. You can keep optimizing things, but you are still near the technology limits and increasing complexity.


Even though the numbers in process node names keep going down, this no longer directly corresponds to line widths. For example, the largest difference between TSMC 20nm and 16nm is that the way the transistor is built was completely changed.

There are a couple of transitions like this left that buy real performance and density, even if the actual line widths stay as they are, so there likely will be a "3nm" process node. After that, who knows.


Agreed, it might be possible, but still, my point is that the gains become smaller and smaller and more and more complex to achieve, and you still don't end up with transistors an order of magnitude smaller.


I thought it was more that the money dried up, and the nuclear test ban prohibited testing of the sorts of nuclear rockets that were being prototyped at the time.


It's the same with semiconductors though, the barrier is mostly economical.


Has the budget for semiconductor research ever decreased? It's a tautology that we could be spending more money on it, but we are deeply investing in this field in a way that we stopped doing with space.


Is it? I was under the impression that we are hitting physical barriers, like transistor size getting closer and closer to Silicon atom size.


It's still technically possible to make chips with smaller nodes than the current mainstream processes allow, although this is uneconomical due to low yields and extremely thin margins.


Right, so "Moore's law is dead" really scoped to "... affordable, conventional materials and designs that we've almost finished wringing every scrap of ROI out of".

(That's not meant to be pejorative. Just saying I understand why companies would be loath to ditch proven tooling with a lot of sunk costs.)


To be more specific: the rumor is that integration of cobalt is the issue.


Can you elaborate? What does cobalt have to do with the current nodes? And killer as in good or killer as in bad?


Sure, so Intel is using cobalt at low dimension interconnects (e.g. M0-3) because cobalt has lower resistance than copper at low dimensions due to electron mean free path issues.

The rumor is that integration of the cobalt material is what's causing yield issues at Intel for 10nm.


Why do we have to go below 10nm at all? We'll have hard physics limitations. Can't we improve on other frontiers? Say, more cache? Better design? More cores? Etc. I don't know.


Probably the #1 area that can produce results is avoiding/conquering the processor-memory gap: while processor performance has been growing exponentially, memory (bandwidth) performance has basically grown linearly. There is now a factor-of-1,000 difference between processor and memory speeds compared to 1980.
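
A crude way to feel that gap on your own machine (POSIX assumed; exact ratios vary a lot by hardware) is to compare a streaming pass over a big array with a dependent pointer chase through the same memory, where every step waits on a cache miss:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 24)                 /* 16M entries, far larger than any cache */
    static size_t next_idx[N];

    static double secs(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        /* Sattolo's algorithm: one big random cycle, so the chase below
           visits every entry in an unpredictable order. */
        for (size_t i = 0; i < N; i++) next_idx[i] = i;
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t t = next_idx[i]; next_idx[i] = next_idx[j]; next_idx[j] = t;
        }

        double t0 = secs();
        size_t sum = 0;
        for (size_t i = 0; i < N; i++) sum += next_idx[i];   /* streaming reads  */
        double t1 = secs();

        size_t p = 0;
        for (size_t i = 0; i < N; i++) p = next_idx[p];      /* dependent loads  */
        double t2 = secs();

        printf("stream: %.3fs  chase: %.3fs  (sum=%zu p=%zu)\n",
               t1 - t0, t2 - t1, sum, p);
        return 0;
    }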

One of the areas that I have much hope for is near-data processing: since processors scale so much better, pretty much every peripheral device already has its own microcontroller. The idea behind NDP is basically to offload some data-heavy processing to the data layer. What if your disk layer could already preselect your data so the database wouldn't have to read and discard so many rows for each query? What if the network controller could evaluate your firewall rules itself, so dropped packets wouldn't have to interrupt the main CPU?


> What if your disk layer could already preselect your data so the database wouldn't have to read and discard so many rows for each query?

My impression is that the process of filtering DB rows is sufficiently complex to need a full libc-type execution environment. But taking a big step back in perspective, a famous example of filtering on processors connected to disks is Map-Reduce, aka Hadoop.

> What if the network controller could evaluate your firewall rules itself, so dropped packets wouldn't have to interrupt the main CPU?

Yes, this is a real thing: https://duckduckgo.com/?q=nic+packet+filter+offload


So NDP is essentially what Commodore did with their 1541 disk drive. That disk drive had a 6502 in there to complement the 6502 in the actual VIC-20.

From what I remember, the IBM System Z mainframes also do this sort of thing and have dedicated IO processors that can decode XML on the fly for you and other fun things like that.


As a matter of fact, I think this concept dates at least as far back as the IBM 360.

https://en.wikipedia.org/wiki/Channel_I/O

And you can find similar concepts in a standard PC today. The GPU is an example of offloading a workload to a specialized processor.


In addition, if each individual unit of memory hardware gained mini-CPU-like data processing capability, then you would have additional faux-CPU power that scaled linearly with memory; this would allow you to do some (embarrassingly parallel) things much faster than just having one additional CPU per peripheral.


Every modern hard drive is a computer in its own right.


Seagate had a cool project where each hard drive ran Linux, and they used the physical SAS cable to run a 2.5 Gbit network (or two, actually) per drive.

So you could use that as block storage for Lustre, Hadoop, or similar, and enable things like direct disk-to-disk copies.

Cool idea, seems unlikely to hit a reasonable price point though.


This is the first time I'm hearing of it, but NDP seems really cool. I like the idea of augmenting memory and peripheral devices with intelligence.


Is Kryder's Law also dead? (2x memory every 13 months.) If not, we could plan for petabyte local memory by 2030 and exabytes by the 2040s.


Making features smaller is far and away the most cost efficient way to make processors faster.

It is also relevant to remember that fab size and processor design are at least two independent divisions in the same company (Intel, Samsung) and are often two separate companies (AMD, TSMC). It's not like the chip makers aren't investing engineers in design.


I see. I wonder what their strategy would be like? To double their efforts on reducing fab size, or on other factors?


We can of course do both, and there is a lot of work on improving the designs. But scaling down the transistor size is generally a wonderful thing: smaller transistors use less power, produce less heat, and you can fit way more of them in the same area (which enables things like larger caches, etc.)


Dennard scaling has been dead for a while; smaller transistors no longer necessarily use less power by default.


OK, so I see that reducing fab size gives a lot of benefits, and they have steadily done so for the past couple of decades. In doing so, maybe they overlooked design or other factors? And now that we've hit the wall on fab sizes (purportedly), maybe they can double their efforts on other factors.


That reminds me of this talk from 2013:

https://www.youtube.com/watch?v=JpgV6rCn5-g

The gist of it, as I remember it, is that radical design ideas were a bad investment while Moore's law ruled, because they were likely to be outperformed by simply shrinking the standard architecture; after Moore, design gains in importance, but don't expect anything like the performance improvements of the past half century.


Makes sense. Thanks for the video.


I think there really is quite a lot of innovation going on under the radar in the SoC space. Phones now have “neural engines” and all kinds of image-related processors for their cameras. Desktop CPUs come with integrated GPUs and there’s a lot of memory system consequences of that...


We are software-limited also. AFAIK you could design a high-level, easy-to-use programming language with high-performance vectorized array operations, but such languages are not in wide use, at least. By high level I mean something like Ruby, where you could specify lambdas for enumerables but the VM would take care of vectorization and/or parallelization for you.
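
A rough sketch of what such a VM would do under the hood, written in C with an OpenMP hint standing in for the hypothetical "the runtime vectorizes/parallelizes your enumerable lambda for you" feature (compile with -fopenmp -O2; without it the pragma is simply ignored):

    #include <stdio.h>

    #define N 1000000
    static float xs[N], ys[N];

    int main(void) {
        for (int i = 0; i < N; i++) xs[i] = (float)i;

        /* The moral equivalent of xs.map { |x| x * 2.5 + 1.0 }, turned into
           SIMD and spread across cores by the compiler/runtime. */
        #pragma omp parallel for simd
        for (int i = 0; i < N; i++)
            ys[i] = xs[i] * 2.5f + 1.0f;

        printf("%f\n", ys[N - 1]);
        return 0;
    }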


The thing that really killed plain vanilla RISC is memory latency. Compared to on-die registers and cache, memory might as well be disk. True RISC is more efficient to execute, but it results in more instructions and hence more code that has to be read from RAM.

Modern CISC chips that immediately unpack CISC into RISC micro-ops are really something that I've termed "ZISC" -- Zipped Instruction Set Computing. Think of CISC ISA's like the byzantine x86_64 ISA with all its extensions as a custom data compression codec for the instruction stream.

We got ZISC accidentally and IMHO without us realizing what we'd actually done. The x86_64 "codec" was not explicitly designed as such but resulted from a very path-dependent "evolutionary walk" through ISA design space. I wonder what would happen if we explicitly embraced ZISC and designed a custom codec for a RISC stream that can be decompressed very efficiently in hardware? Maybe the right approach would be a CPU with hundreds of "macro registers" that store RISC micro-op chunks. The core instruction set would be very parsimonious, but almost immediately you'd start defining macros. Of course multitasking would require saving and restoring these macros which would be expensive, so a work-around for that might be to have one or maybe a few codecs system-wide that are managed by the OS rather than by each application. This would make macro redefinition rare. Apps are compiled into domain specific instruction codec streams using software-defined codec definitions managed by the OS.

The neat thing about this hypothetical ZISC is that while 99% of apps might use the standard macro set you could have special apps that did define their own. These could be things like cryptographic applications, neural networks, high performance video encoders, genetic algorithms, graphics renderers, cryptocurrency miners, etc. Maybe the OS would reserve a certain number of macros for user application use.


I agree with a lot of what you said, but ZISC already stands for zero instruction set computing.

Also, RISC and CISC instruction cache hitrates are pretty similar.


Ahh I forgot about zero instruction set computing. Maybe CISC should just stand for Compressed Instruction Stream Computing because on today's chips that's exactly what it is.

Cache hit-rates being similar may just show that the ad-hoc evolved compression codecs represented by CISC instruction sets are sub-optimal, hence my point about what might happen if we intentionally designed a CPU with on-board compression codec support for the instruction stream.


This is basically what THUMB(2) was for ARM.


At the end of this he says transistors are now doubling every twenty years(!?) and it reminded me of another law Patterson doesn’t include in his graph:

    Proebsting’s Law: improvements to compiler technology double the performance of typical programs every 18 years.


The derivation of that law is very suspect:

http://proebsting.cs.arizona.edu/law.html

(go on, it's just a paragraph.)

The key issue that this ignores in my opinion, is that a compiler optimization will rarely make last year's program faster, but it will make next year's program faster. Why? Because if the compiler can't make an optimization, programmers will do it by hand, even if it makes the code worse in some way.

For instance, if your C compiler can't inline small functions, you would use a macro instead. When it finally learns to inline, your program won't get any faster, but the next version will be able to use functions in places where macros are a bad fit.
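
For example (illustrative names, nothing from a real codebase): the macro and the inline function below generate effectively the same code with any optimizing compiler, but only the function is safe against double evaluation.

    /* The hand-optimization being described: before compilers inlined small
     * functions reliably, you'd reach for the macro; once they did, the
     * function is just as fast and far less error-prone. */
    #include <stdio.h>

    #define SQUARE_MACRO(x) ((x) * (x))                /* old-school "inlining" */

    static inline int square(int x) { return x * x; }  /* compiler inlines it   */

    int main(void) {
        printf("%d %d\n", SQUARE_MACRO(3), square(3)); /* 9 9, same codegen at -O2 */
        /* But SQUARE_MACRO(i++) would evaluate i++ twice (undefined behaviour),
           while square(i++) would not. That is the kind of hazard the compiler
           optimization lets you retire. */
        return 0;
    }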

Pile up enough of these optimizations, and eventually it starts to feel as if you're coding in a higher-level language than before, even though the syntax that's accepted by the compiler never changed.


> programmers will do it by hand

Only if better performance is needed.

Thus, corollary: compiler technology will double program performance every 18 years, but only if it doesn't matter.


Developers have a nasty habit of convincing themselves that things aren’t needed when they see them as too difficult. Even if the rest of the world thinks your code is too slow you can convince yourself it’s good enough.

And in a world where we rely more and more on libraries, my ability to improve on a piece of code is greatly curtailed. Sending in the compiler to help might be my best option.


That is an amusingly awesome claim.

I thought it was accepted that algorithm improvements have sped things up more than processor advances. I suspect there is a strong argument that memory sizes have been key, but processor speeds themselves haven't necessarily advanced at the same rate as the speed at which we complete problems.

That said, I tried quickly googling for this, but just came up with https://cstheory.stackexchange.com/questions/12905/speedup-f.... Looks like a good answer, but basically points out that it is complicated.

For my part, it is frustrating to see so many folks rediscovering things that used to just be too expensive to do and thinking they have rediscovered alchemy. I say this as someone who constantly thinks they have discovered a key method. :)


My experiences don’t jive with your thesis.

When people aren’t looking at a performance chart that is flat, they stop. No matter how loud the business is about the app being too slow people are too quick to announce that everything that can be done has been done.

Really, in this situation there might be another order of magnitude hidden in there, but it takes a special set of skills and a very special kind of perseverance to continue digging into a pile like that. A compiler has no such problem, and I'm sure it could continue to shave off time for quite a while.


Correction: when people are looking at a chart that is flat.


> compiler optimization will rarely make last year's program faster, but it will make next year's program faster.

This doesn't follow. From your argument, last years and next years programs will run the same speed (about as fast as they can). It's just that next year's programs can be cleaner in some sense...

Which is interesting, because Dr. Proebsting's page also says his current interests include improving programmer productivity by removing syntactic baggage from statically typed languages.


Also, plenty of optimizations happen at different layers than those affected by compilation flags. Constant folding, smarter register spilling, improved implementation strategies (e.g. virtual function tables). These are all techniques that still happen at -O0.


Doing it by hand is only worthwhile if the application does not meet the expected deadlines, and only in spots validated through the use of a profiler.

I never cared about the C culture of speed before correctness, because type safety never impacted the expected use of my applications.


I care a lot about speed, but correctness is my foremost concern. I agree that I only look to optimize bottlenecks, so compiler technology would still optimize some of the less bottlenecky code from yesteryear.


Great!

Sadly not everyone does that.

I do agree there are domains where every ms and byte counts; they are, however, a small niche.


I think Proebsting's law [1] overlooks a number of points. Firstly, it is my understanding that compiler optimization made RISC feasible, and if so, then some of the hardware performance improvement over that period is attributable to compiler technology (at the very least, perhaps he should have compared today's optimizing compiler to a decades-old optimizing compiler, not today's compiler, optimized vs. unoptimized?) Secondly, he is extrapolating from at most two points (assuming his 'before' numbers are valid), and thirdly, with the (comparatively) low-hanging hardware fruit all gone, compiler optimizations have become relatively more useful (especially the work on optimizing concurrent and parallel software.)

Nevertheless, I tend to agree that programmer productivity is a worthy goal (and more specifically, those that improve productivity through making it easier to understand programs, so that programmers can more quickly produce programs that work properly.)

[1] See Marvy's comment for a link to the law: https://news.ycombinator.com/item?id=16751813


A compiler can only reduce the overhead relative to the hypothetical perfect program for a given task/source. That's improvement toward an asymptote, not exponential growth.


Yes, we're kind of stuck on individual CPU power. Clocks have been around 3GHz for a decade now.

There are now architectures other than CPUs that matter. GPUs, mostly. "AI chips" are coming. And, of course, Bitcoin miners. All are massively parallel. What hasn't taken off are non-shared-memory multiprocessors. The Cell was the only one ever to become a mass market product, and it was a dud as a game console machine.


Perhaps it (Cell) would not have been a "dud" as you put it had IBM not been a morally bankrupt villain.

I've read that Sony was under the impression that the licensing agreement meant that IBM would market Cell tech to other customers, those customers being in other computer markets like datacenters and stuff, rather than to Microsoft, for the 360, at the same time that the PS3 was still in development.

"As the book relates, the Power core used in the Xbox 360 and the PS3 was originally developed in a joint venture between Sony, Toshiba and IBM. While development was still ongoing, IBM–which retained the rights to use the chip in products for other clients–contracted with Microsoft to use the new Power core in their console. This arrangement left Sony engineers in an IBM facility unknowingly working on features to support Sony’s biggest competitor, and left Shippy and other IBM engineers feeling conflicted in their loyalties."

from http://gamearchitect.net/2009/03/01/the-race-for-a-new-game-...

(it's a book, and worth reading)


I have written a substantial amount of code for the cell. I honestly believe that the entire approach it uses is a dead end -- no-one will voluntarily ever use such a machine again if there are any alternatives. The fundamental problem is that when you are writing code for cell, it is very hard to divide a large task into many subtasks where you do not have to constantly think of the entire problem when implementing small details. This means that writing code for a cell-like architecture doesn't scale. If your program fits on a whiteboard, a cell implementation can be easy to do and very fast to run. The moment you start building complex systems, things break down hard.

In the end, people just made each SPE do a single task, like dedicate one to audio, one to geometry, etc. There is not really enough parallelism like this in most software to support a cell-like approach now that even consoles get >4 real cpu cores. Real caches are the norm because they are extremely useful for programmers.

The villainy of IBM was just a slight additional problem over just how terrible cell was from a software standpoint.


I have not read the book, but I have a hard time imagining any world in which the Cell could possibly be successful. Its heterogeneous architecture thrust a huge amount of complexity onto software developers in exchange for meager gains. Writing good code for it was difficult and expensive compared to other platforms. Sony was just completely out of touch with reality.

In 2007, Gabe Newell famously complained that the Cell was "a waste of everybody's time. Investing in the Cell, investing in the SPE gives you no long-term benefits. There's nothing there that you're going to apply to anything else. You're not going to gain anything except a hatred of the architecture they've created."


> I have not read the book, but I have a hard time imagining any world in which the Cell could possibly be successful. Its heterogeneous architecture thrust a huge amount of complexity onto software developers in exchange for meager gains. Writing good code for it was difficult and expensive compared to other platforms. Sony was just completely out of touch with reality.

This was a different time. Back then, researchers tried to build clusters out of PS3s, because the speed advantages of the Cell made it worthwhile and "regular" Cell clusters were much more expensive. Some years later GPGPU became feasible, and one could foresee that it would become faster than the Cell, too, in the near future - and at that point the same kind of researchers dropped their PS3 clusters and built GPGPU clusters. Don't tell me that, particularly in the beginning, GPGPU was easier to program for than the Cell.

It was also the time when Apple switched to Intel CPUs. I know at that time IBM was also trying to sell the Cell to Apple, but Steve Jobs refused and went with Intel instead.

This decision by Apple, and the decision of researchers to stop tinkering with PS3 clusters and build GPGPU clusters instead, were in my opinion the two landslides after which the fate of the Cell was sealed.


It's not that it's heterogeneous. It's that each SPE only had 256K of RAM. K, not M. For code and data. You can pump data in and out from main memory in bulk, but random access is very slow.

So it's only useful for tasks that work like an assembly line - data flows in sequentially, gets processed, and output is pumped out. Great for audio. Lousy for everything else in games.

If you had 16MB on each CPU, the little CPUs might be usable. You might be able to run physics or pathfinding or NPCs in one.


I'm sorry, but I pretty much entirely disagree.

> Don't tell me that particular in the beginning GPGPU was easier to program for than the Cell.

The alternative was to use bog-standard homogeneous cores.

Yes, the air force bought a compute cluster of PS3s for some specialized calculations. I wouldn't read too much into that. It says little about the suitability of the architecture for more general purpose computing. Supercomputers were always weird.

> It was also the time when Apple switched to Intel CPUs.

I don't believe there was much chance of Apple moving to Cell. Their switch to Intel was because IBM could no longer seriously compete outside of a few niches. There's nothing positive to infer from IBM's unsuccessful pitch to Jobs.

> This decision of Apple and the decisions of researchers to stop tinkering with PS3 clusters and build GPGPU clusters instead were in my opinion the two landslides after which the fate of the Cell was destinied.

You're assigning far more importance to research group purchases than I think is warranted. They don't buy enough to create economies of scale. That's why researchers so frequently adopt consumer products already manufactured at scale, like the Novint Falcon, Microsoft Kinect, and gaming graphics cards.

The Cell was best-in-class for a few specialized use cases, but it was never going to take the world by storm. If we turn to a heterogeneous architecture in the future, it will be begrudgingly, after all simpler alternatives have been exhausted.


> You're assigning far more importance to research group purchases than I think is warranted. They don't buy enough to create economies of scale. That's why researchers so frequently adopt consumer products already manufactured at scale, like the Novint Falcon, Microsoft Kinect, and gaming graphics cards.

This is true, but I have to disagree about the consequences: very often, from this kind of "abusing" consumer products for research purposes, interesting applications emerge that become quite popular and economically important. For example, from such research came the idea of using the Kinect as a 3D scanner, and commercial applications grew out of that. Or from GPGPU (which NVidia was at the beginning quite the opposite of enthusiastic about), CUDA and later OpenCL emerged, which are much better to program for than abusing vertex and fragment shaders.

That is why I consider it quite important for the future of the Cell that researchers went from tinkered-together PS3 clusters to GPGPU clusters, and why I called this a "landslide event for the future of the Cell".


Yep. One of the game devs I worked with on the PS3 was charged with the task of optimizing our physics and game code on the Xbox 360 and PS3 platforms. The Cell architecture was bad enough, but the real icing on the cake was that the Xbox 360 profiling/debugging tooling was a lot better too. Needless to say, he wasn't a fan of the PS3.


I've heard a few times that the Cell wasn't all that bad in terms of performance, just very difficult to program. Not sure how true that is, but ostensibly the usability is just a tooling issue. Probably not a tooling issue that can be solved short-term, though.


I did a university project with a Cell back in the day and it was awfully difficult to program. It was similar to GPGPU programming but instead of gigabytes, the SPUs had kilobytes of memory. Orchestrating the movement of data to/from SPUs was very hard to get right.


I never worked with cell or ps3, but that description reminds me a lot of working on the ps2.


The problem's more fundamental than that. The core of it is that parallelism requires explicit consideration in algorithm design.

The Cell demanded that you structure your program around small tasks that could be run in parallel across its seven vector cores. There's no getting around the fact that it's the programmer who has to break down problems to be small enough to fit on those cores without letting coordination overhead get out of control.


The quote says that the PowerPC core (PPE) was all that was used in the Xbox 360 processor, and all the PPE is is an implementation of IBM's Power ISA, one of several (see https://en.wikipedia.org/wiki/Power_Architecture#Specificati... under Power ISA v.2.03). It's no different from an ARM processor vendor re-using a core design for multiple customers.

The thing that made the Cell what it was, namely the 7 SPE units, were not used in the Xbox 360.


The Xbox 360 did not have a Cell. It had the Power core without the rest of the Cell.

The Cell processor was intended to be the graphics solution for the PS3 [1]. The story I heard was that Sony was a hardware company, and its engineers wanted technology that would work for new digital television applications and believed that the Cell, with its 8 "CPUs", was going to be the perfect solution for everything. Except, well, nVidia: turns out a GPU is better at graphics than even eight very fast CPUs. The PS3 wasn't going to have a GPU, and then they saw some Xbox 360 demos and had a brown-pants moment. So they added a GPU at the last minute.

[1] https://www.criticalhit.net/gaming/a-brief-history-of-the-playstation-3/


Sounds like Sony blundered when writing the agreement. If the chip is good for consoles, of course IBM will try to sell it to consoles.


> There are now architectures other than CPUs that matter.

You would make a killing with a CPU twice or ten times as fast. Many algorithms are only suited for single-core operation. I don't know if this will ever change. The focus has shifted to other architectures mostly because we've reached a ceiling for single-core CPU performance.


It'll change if someone can manage to take the "central" out of the CPU internals. You don't necessarily need software to see anything other than a monolithic core, but having to plumb everything through one central execution unit is hugely inefficient, if anything due to the latency involved. For example, if you're performing an indirect load and hit DRAM while loading the pointer, that result has to be brought into the core, then all the way back to the memory controller the same way it came. So far that's just been worked around by throwing in bigger and bigger caches, but the size of first-level cache is at a dead end for now (due to needing physical proximity).

Heck, current x86 chips could be juiced quite a bit if you could take out the requirement for backwards compatibility. Instruction encoding being the obvious thing (not that it's not hip and RISC, but that it's an absolute mess that a huge proportion of the chip's power has to be wasted on, and is pretty space-inefficient due to how horribly allocated things are). Less obviously, just removing things like the data stack instructions (which, at least on Intel, have a dedicated "stack engine" to optimize them), the ability to read/write instruction memory directly (creates a mess of self-modifying code detection to maintain correct behaviour, and complicates L1 cache coherency a bit). Trimming transistors reduces the power consumption, which in turn means you can raise the voltage without the chip melting, and can clear up space in your critical data path.


On a high-end x86, the decoder takes only a tiny proportion of the area and power budget.

On smaller low-power CPUs it is more significant, of course.

The stack engine is necessary anyway, even if you have no specific stack instructions, as it decouples top-of-stack manipulation from local variable accesses, which is critical. Explicit stack manipulation instructions might actually make the stack engine simpler.

Coherent instruction cache and pipeline are super relevant in this age of pervasive self modifying code (a.k.a JIT).

Modern CPUs are complex for a reason.


It's relatively straightforward for self modifying code to manually flush instruction caches when necessary and JIT compilers that target other architectures already satisfy this requirement. Only backwards compatibility with existing x86 software requires a coherent instruction cache.
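
For reference, "manually flush the instruction cache" looks roughly like this for a toy JIT on Linux with GCC/Clang. The emitted bytes are x86-64 "mov eax, 42; ret", so this particular sketch only runs on x86-64, where the flush compiles to (almost) nothing because the hardware keeps the I-cache coherent; on ARM or POWER you would emit that architecture's code and the same __builtin___clear_cache call would expand to real cache-maintenance/barrier instructions.

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        /* x86-64 machine code for: mov eax, 42; ret */
        unsigned char code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };

        unsigned char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) return 1;

        memcpy(buf, code, sizeof code);
        /* The explicit barrier a JIT would issue on a non-coherent I-cache. */
        __builtin___clear_cache((char *)buf, (char *)buf + sizeof code);

        int (*fn)(void) = (int (*)(void))buf;   /* POSIX-style cast to fn ptr */
        printf("%d\n", fn());                   /* prints 42 */
        munmap(buf, 4096);
        return 0;
    }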


> It's relatively straightforward for self modifying code to manually flush instruction caches when necessary

Barriers are expensive. JITs might need to issue lots of them. Then the next generation of CPUs start tracking modified lines to make barriers cheaper. Then you end up with all the hardware complexity of implicit barriers without simplifying the software side.

> JIT compilers that target other architectures already satisfy this requirement.

They might have different tuning parameters to take into account the cost of the barrier when deciding the profitability of JITing a region of code.

> Only backwards compatibility with existing x86 software requires a coherent instruction cache.

Far from it, IIRC the coherency guarantee has actually been strengthened recently. It used to require a far jump as a barrier.

Explicit vs implicit barriers are just an architectural tradeoff.


I'm having a hard time with "straightforward" and "self modifying code" in the same sentence. In fact, I think it's a syntax error.


Processing in memory has real promise for the cases where your work can be distributed. Specifically, I think it can have a great future in AI. However, for general purpose code I doubt it can do anything. Your example of an indirect load would be greatly sped up if the target of the pointer is on the same device as the pointer. However, the second it isn't, the speed of moving things from one RAM chip to another isn't any faster than from RAM chip to CPU, and at that point defining a single central location that tries to be close to everything just makes sense. If your operation needs 8 values from 8 different places, having a central location means doing 8 transfers, while PIM can mean forwarding each value/intermediate value multiple times to go to the next location.

None of the changes to x86 people have thought of over the years really helps enough to break backcompat, simply because they aren't on the fast path in the critical execution stage. The limit imposed on frequency by power in current CPUs is not really the total amount of power consumed; it's the amount of power consumed in the <0.25mm of chip that houses the register file, forwarding network and ALUs. That is, the place where things actually happen during the most important pipeline stage. This is why an 8-core CPU running just a single thread cannot make one of the cores consume as much power as all 8 would if running 8 threads -- the register file of the running core would just melt, even if the total power stayed below chip limits.

x86 decoding is hairy and takes a long time and a lot of transistors. However, it is placed in its own pipeline stages that run parallel to execute and only slow it down by making a branch miss a little more expensive. And the power is limited today by caching the decoded uops in their own cache, so during any tight loop the decode hardware is idle and consumes no power. The same sort of goes for the stack engine -- as it runs early in the pipeline, it is basically a way to compress instructions a little that saves power by making code more compact when it is running, and does nothing when it is not used. Removing it would not really help, even if all code instantly changed to accommodate. Much of the rest of the ugly warts of the x86 architecture are handled in the time-honored CISC way: just punt it to microcode, performance be damned. Today, self-modifying code technically works, but you never want to do it, because invalidating lines in the L1i has been implemented in whatever way is fastest and cheapest for the common case of code that does not modify itself. (And that mechanism has to exist even if you don't support self-modifying code, because there has to be some way of invalidating L1i entries.) Similarly, a lot of the CISC instructions that make more sense to implement as software routines (FPU sin/cos for example) are today just abandoned ucode routines that are slower than rolling your own.


I'm not talking about the fundamentally misguided memory-distributed computing stuff, I mean "improve flexibility enough that you can bolt some additional units on as offload" (address translation in this case would take some work though). The magic of presenting software with a more or less monolithic core in this case is that you don't have that problem, since you can simply do it the usual way.

Also, I don't think the trouble with added complexity out of the hot path is any added latency; it's that they're needlessly burning up the thermal budget. Not that raising the voltage is the best way of increasing frequency, but it's sure to do so.


> Yes, we're kind of stuck on individual CPU power. Clocks have been around 3GHz for a decade now.

It's worse than that. A 3GHz version of Northwood was released a little over fifteen years ago. I doubt it was the first 3GHz processor (that'd be a PPC or something?), but it's definitely symbolic...


IBM shipped the 4.7/5.0 GHz POWER6s in 2007/2008.

(And a 5.2 GHz S/390-arch CPU in 2012 too)


Those are interesting but they're rare exceptions. IBM do these amazing things on super high end chips that very few buy, but the commodity chips that power the PCs and the datacenters of the world (stuck at a little faster than 3GHz, give or take) just don't rely on that stuff very much.


Well, there isn't a sufficiently big difference between "super high end stuff" and volume parts that it would explain a 10 year lead.

Note that the following generation of POWER was clocked lower. Clock frequency != performance.


Quite. The more important example would be Pentium. The main thing that was impressive about Northwood was the clock rate. AMD of the same generation, and the Intel Core processors that followed, did more with a lower clock rate.


It's in the eye of the beholder... A CPU + GPU system is arguably a non-shared-memory multiprocessor.

Systems running on AWS Lambda or Kubernetes or Kafka might also count as non-shared-memory multiprocessors.


Yep. The processors are spread around different boxes in a supercomputer or a Google datacenter but, at some level, that's just an implementation detail.

There was a time about--maybe a bit over--10 years ago when a whole lot of distributed memory processors/systems/hybrids were coming onto the market: SciCortex, BlueGene, Cell, Azul, Tigera (I think), as well as SMP chips like Sun's Niagara. The general problem is that, by the time these specialized designs would get to market, Moore's Law would have turned another crank and made all the work moot.

I do think with CMOS scaling slowing down/dying, we'll see more specialized designs even if they're a pain for programmers and system architects because what choice do we have? We already see it with GPUs, FPGAs, and so forth.


You can now get a 6 core 4.7Ghz processor in a laptop. I'd say that's a hell of an improvement over 3Ghz.


How many cores at 4.7GHz and for how long?


I went to research... you're right, it's actually pretty unclear. Base is 3.7, but it doesn't give the specs on the 4.7 turbo.


Laptop turbos generally depend on the cooling solution. A 4.7GHz quad core laptop will be LOUD.


Specifically, it would have to dissipate some 100W of heat. That's not counting the GPU.

There are ways to do it even passively, but none of them are lightweight. Fast small fan(s) and heat pipes are probably the lightest.


Usually it's a single core with the peak boost clock and a multi-core boost that's a bit lower. The i7-8700K for instance is 3.7GHz stock, 4.3 all-core boost, and 4.7 single-core turbo.




The problem is that we have designed ourselves into an architectural cul-de-sac when it comes to processors. We have fifty-plus years of evolution on programming methodologies built on top of von Neumann architectures. Moore's Law has given us decades of exponential gain without significant challenge to that architecture, and now that Moore's Law is reaping diminishing returns in terms of compute performance we are in the situation where we'd have to go backward forty years on our programming model in order to take advantage of a superior (given today's technology) architecture. For example, FPGAs can in many cases outperform von Neumann machines by orders of magnitude in terms of compute performance and (more importantly) performance per watt. However, the programming model and ecosystem for FPGAs is worse than primitive. Something you could write in a couple hundred lines of C code could take months to get up and running on an FPGA. We need a way to transition from von Neumann computing to alternative architectures without starting over on computer science. Or, perhaps recent trends in neural networks will eliminate the need for that?


Just this afternoon I finished reading David Harland's 1988 book "Rekursiv: Object-Oriented Computer Architecture". It describes a completely different way of designing machines at the low level that can support better programming environments at the high level. You might want to check it out.

I believe we are going to see further balkanization between different operating systems / programming systems and computers based upon what they are used for. Cloud services will be the domain of what today we call "systems programmers" who work in compiled languages and care about speed. In contrast, we might now be able to get real "personal computers" running environments that teach their users how to peel back the layers and manipulate them — the long sought personal computing medium. This all could have happened back in the 80s, but we didn't have widespread or fast use of the Internet. Now it's different, and both of these types of systems can interop together in the blink of an eye because of it.

Both will require completely new computing architectures.


It's not all bad - one upside is that you don't need to upgrade your hardware anywhere near as often as 10 or 20 years ago.

I put this PC together in 2013 for maybe £500-600 total and apart from adding some RAM I haven't needed to upgrade anything and can still run games on highish settings.


You can probably run 2016-2018 games on medium settings if you are not interested in 60fps. I imagine no graphically intense game will hit 60fps at any respectable resolution.

I say this because building a computer which can play, say, Assassin's Creed Origins or Far Cry 5 at 1080p60 High Settings would easily run you over $1000 right now, due in no small part to the extravagantly over-priced GPUs.

Heck, it costs $400-600 to get a GPU to play those games on medium to medium high right now. Not a computer, JUST the graphics chip to get 60fps on medium.

Crypto has destroyed affordable PC gaming and it makes me so sad. I can recommend Alienwares on sale that are dramatically cheaper than self-built. What happened to this industry :(


60 FPS? I had a desktop I built in 2014, with a 4770k and an r9 290x, run Overwatch at 144 FPS at 1440p with low settings. The machine could still play Just Cause 3 and GTA 5 (2015 games, but I didn't really play any graphically intensive 2016+ games on it) at 1080p 60 FPS with decent graphics settings, if I recall correctly.

I have since upgraded to a 6700k and 1080Ti, but that 2014 hardware lasted well into 2017 - and the current GPU cost just under 3/4 the price of the entire 2014 computer, despite the r9 290x being a top of the line GPU. High end PC gaming definitely isn't affordable anymore.


That is mostly due to the cost of GPUs having been inflated by miners, and perhaps the expense of having a huge monitor.

Neither CPUs nor GPUs are progressing as fast as some predicted anymore.

Additionally the shift to consoles as stable hardware platforms over time has put a damper on computing power required by economically viable games.

The remaining outlets are VR and huge resolution (same thing actually) - and high quality and fidelity simulations. (Including AI.)


I have an R9 390 and it gets about 30-40fps in Assassin's Creed Origins at 1080p on high, and 40-60fps on medium.

It's 3-year-old tech that still costs around $300 for that level of performance today, and it struggles. Gotta turn those settings down with my R9 390 in every modern intensive game!


Well I guess I don't play that many super graphically-intensive games. I can, for example, play Overwatch on high settings at 1920x1080. I don't really monitor framerates, but I'm ok with about 40fps.


What is the limit on creating bigger chips? If some of the money/effort was focused on being able to fab larger chips instead of decreasing feature size... I don't know much about lithography so maybe the answer is obvious to those that do.


The "aperture size" in semiconductor manufacturing is the practical limiting factor in chip size, and some chips (GPUs in particular) are already at the limit. It's basically the maximum area the photolithography process can expose in one shot (my understanding of semiconductor manufacturing is a bit weak).

Aperture size can be increased to some degree for future manufacturing nodes, but there's a limit to how practical it is.

Other factors also come into play. The distance light travels in a clock cycle has already been mentioned as a hard limit.

Cost is another matter: chips are rectangular but wafers are round. The larger the chips, the more wasted area there is and the lower the yield per wafer. Intel has an advantage here, because they use 12" wafers when the rest of the industry uses 10" (this was a few years ago, things may have changed). Historically speaking, these wafers are huge compared to the 4" and 6" wafers of the past. Making the silicon ingots to cut the wafers from is another art form; modern ingots are HUGE blobs of pure silicon.
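
It's easy to see the edge-waste effect with a back-of-the-envelope sketch (hypothetical die sizes and wafer diameters, ignoring scribe lines, edge exclusion and defects):

    import math

    def dies_per_wafer(wafer_diameter_mm, die_w_mm, die_h_mm):
        # Count whole rectangular dies whose four corners all land on the wafer.
        r = wafer_diameter_mm / 2
        count, x = 0, -r
        while x + die_w_mm <= r:
            y = -r
            while y + die_h_mm <= r:
                corners = [(x, y), (x + die_w_mm, y),
                           (x, y + die_h_mm), (x + die_w_mm, y + die_h_mm)]
                if all(cx * cx + cy * cy <= r * r for cx, cy in corners):
                    count += 1
                y += die_h_mm
            x += die_w_mm
        return count

    for d in (250, 300):                      # roughly 10" and 12" wafers
        for die in ((10, 10), (20, 30)):      # small die vs. big GPU-ish die
            n = dies_per_wafer(d, *die)
            used = n * die[0] * die[1] / (math.pi * (d / 2) ** 2)
            print(f"{d} mm wafer, {die} mm die: {n} dies, {used:.0%} of area used")

The bigger the die, the more of the round edge you throw away, before defects even enter the picture.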

I recommend paying a visit to Intel museum if you're around Silicon Valley. It's not a huge museum but has lots of interesting information, nice guides and a great photo-op to take a selfie with a big honkin' chunk of silicon (they've got full ingots and wafers on display).


Any reason why chips couldn't be hexagonal or triangular? Either one would still tile a wafer, and go closer to a round edge.


AFAIK silicon has a crystalline structure which causes wafers to break on straight lines and at right angles. This might make things difficult for non-rectangular parts.

There's also a ton of practical issues regarding the process. Photolithography equipment, cutting the wafers, the machines handling and packaging the chips. All would have to be redesigned and I'm sure that would cost more than the saved silicon during the lifetime of the whole plant.

The current chip sizes are already getting pretty close to hard limits (speed of light, etc). 900 mm^2 is 30 mm across, which is about half of the distance light travels in a clock cycle (async/clockless circuitry could help here).

Larger chips are also less efficient. A while ago I was discussing power efficiency with a HW designer working on memory controllers and he was using a fancy unit called "nanojoules per bit-millimeter", ie. how much energy it takes to flip a bit that is physically located a certain distance away. The efficiency gets much worse as distance increases.
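
To make that concrete with a made-up number (the real figure depends heavily on the process and the kind of wire, so treat this purely as an illustration of the scaling, not as data):

    # Illustration only: energy to move data across a chip, assuming a
    # hypothetical on-chip cost of 0.1 pJ per bit per mm of distance.
    PJ_PER_BIT_MM = 0.1                      # assumed, not a measured value

    def move_energy_pj(bits, distance_mm):
        return bits * distance_mm * PJ_PER_BIT_MM

    cache_line_bits = 64 * 8
    print(move_energy_pj(cache_line_bits, 1))    # ~51 pJ for a 1 mm hop
    print(move_energy_pj(cache_line_bits, 30))   # ~1500 pJ across a 30 mm die

Whatever the real constant is, the cost grows linearly with distance, which is exactly why bringing memory closer pays off.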

Increasing wafer size also improves the yield, which is why Intel has an advantage with 12" wafers (others have 10").

Instead of creating larger chips, the current trend seems to be about packaging more chips in a single package and connecting them more efficiently. In particular, it's about bringing memory closer to the CPU to get those bit-millimeters down.


Fwiw, electrical signals in global interconnects move at about half the speed of light. Near-speed-of-light signaling techniques like transmission lines can be used, but aren't, for a variety of reasons.


You can't start a saw in the middle of a wafer. You need to cut from end to end.


Triangles would work for that.


The area around the acute angles would be almost useless due to lack of routing space.


What about laser or a waterjet cutter?


Silicon wafers are much easier to manufacture in a circle shape.


Seems like most people here are bringing up speed of light problems, which are a concern, but it's not what stops you. The problem is that your yield goes down exponentially with die size, and binning them becomes a clusterfuck. The opposite direction of making smaller dies is fairly attractive though. For example, AMD split Threadripper into multiple dies on an MCM and seem to be saving a fortune on it, at the expense of some die area for interconnects. That way they can test and bin dies individually and assemble an MCM of known-good dies from the same bin.

I remember reading that GPUs are getting to fairly monstrous die sizes though - and they're paying for it.


Chips are fabb'd on a large wafer, which is then split up (see here: https://s3.amazonaws.com/ksr/assets/003/150/280/9b6a64c4d8ed... )

Now, the process isn't perfect, and you hear a lot about "yield", which is basically the fraction of chips on a wafer that work to spec. Now, as you make a chip bigger, you increase the chance of a mistake. This reduces the yield and drives up the cost. (I'm not sure if it's actually possible to make a full sized wafer without a mistake, I'll defer that to someone who knows.)

In some cases those broken chips aren't all that bad, so they are shipped with the broken bits deactivated (this could be lies, but I think some AMD procs were done like this).

Yes, there are other factors like propagation time, but that's solved by not having chip-wide cache coherency.


You don't increase the chance of a mistake, you increase the cost of a mistake because each little defect means you're throwing out a whole chip. The larger each chip is, the more expensive each little defect is.

Sony had a hard time when they were ramping up Cell processor production, so they designed the chips with 8 SPEs but only shipped them with 7 activated. That way if a defect happened to be in one of the SPEs, they could just turn it off and still ship the chip.


> In some cases those broken chips arn't all that bad, so they are shipped with the broken bits deactivated (This could be lies, but I think some AMD procs were done like this )

That’s called “binning”


Say each platter has N defects spread uniformly. Double the area of each chip on the platter (fewer total chips) and you roughly double the defect rate per chip. Make the whole platter one chip and it always has a defect.

GPUs and CPUs somewhat work around it by being able to disable cores or a part of cache and sell the chip for less.
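
The usual back-of-the-envelope way to see this is a Poisson yield model; the defect density below is a made-up illustrative number, not anyone's real process data:

    # Poisson yield model: P(die has zero defects) = exp(-D * A),
    # with defect density D per cm^2 and die area A in cm^2.
    from math import exp

    D = 0.1                                  # assumed defects per cm^2
    for area_mm2 in (100, 400, 800, 51000):  # small die ... a whole 10" wafer
        print(area_mm2, "mm^2 ->", f"{exp(-D * area_mm2 / 100):.1%}", "yield")

Under those assumptions a 100 mm^2 die yields around 90%, an 800 mm^2 die under 45%, and a whole-wafer "chip" essentially never comes out clean, which is why the practical answer is disabling broken subunits rather than hoping for a perfect wafer.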


Perhaps it's time to resurrect wafer scale integration : https://en.wikipedia.org/wiki/Wafer-scale_integration


Maybe it's a viable research project, but still far off from reality. The largest chips that can be manufactured with current technology are about 900 mm^2 in size, but a normal 10" wafer is about 51000 mm^2, which is 50x larger than the state of the art today.

An attempt to produce entire wafers in one go is also a big gamble. Any defects in the manufacturing could ruin the entire wafer rather than individual chips. Parts of the wafer could be disabled if defective, but it would result in a combinatorial explosion of different configurations.

Also worth noting that in 1980 when Wafer Scale Integration was researched (without success), wafers were 4" diameter. Current wafers are 10-12" in size, which makes the process much more difficult and error prone.


The two big waferscale efforts of the 1980s were by Gene Amdahl and Clive Sinclair.

Amdahl did run into technical problems connecting multiple wafers to build mainframes and using lasers to swap out bad subparts (both currently solved problems and normal industry practice).

But in Sinclair's case investors pulled out despite technical success because hard disk prices started falling exponentially after having stayed stable for many years. The irony was that he had done the "silicon disk" waferscale RAM just to avoid scaring off investors with his real goal of a manycore processor on a wafer.


Which is used in the new Intel and AMD CPUs. IGPU on the same wafer. HBM for memory...


You've misunderstood. They put many chips inside one package, but the individual chips are still from different wafers.


Which is actually better as it means you can excise the faulty ones easily and/or fuse the broken subunits.

The main reason to integrate things is performance and it runs counter to yields.

If you can have a specialized process to produce whole wafers of HBM or GPU cores with decent yields, and then another good one to plug them together, you have a winning combination.

Not understanding this is why the microprocessors built from many subunits tended to fail badly. Either the design was not performant, or it ran into yield issues, or it had to use the exact same technology to integrate the modules, which was not invented or perfected at the time.

So you would need a process that does everything well, which is much harder than having a specialized process plus integration.


Cerebras is.


One problem with bigger chips is that the speed of light actually starts to matter. At 1 GHz one clock cycle is 1 ns, which means that a light pulse in vacuum covers at most about one foot in that time.


Cost of the wafer is fixed. The smaller the chip, the more chips per wafer, and thus, the cheaper each chip is.

Then add defects to that, which means you have to throw away more silicon per defect the bigger the chip.


IIRC it's about clock speed. Your electrical signals travel at the speed of light. If you have a 4GHz clock the signal travels at 300,000,000 m/s. Now one clock cycle is 0.00000000025s, so the light has time to travel 0.075m. That's only 7.5cm! And you need everything to be synchronized inside and get into a steady state before the next clock cycle.
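
Quick numbers, treating the vacuum speed of light as an upper bound (real on-chip signals are noticeably slower):

    # Upper bound on how far a signal can travel in one clock cycle.
    C = 299_792_458  # m/s, speed of light in vacuum

    for ghz in (1, 3, 4, 5):
        cm_per_cycle = C / (ghz * 1e9) * 100
        print(f"{ghz} GHz: at most {cm_per_cycle:.1f} cm per cycle")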


The synchronization can be avoided somewhat by the use of (also power efficient) asynchronous electronics. You can also employ a lower frequency clock and then multiply it on site. A bit more cost, because now you have to handle clock jitter.

And then we have computing separated by meters of wire and network fabrics in clusters.

So size is not that much of a problem. It is just that the free lunch for programmers ended some 10 years ago.


Electrical signals don't travel at the speed of light, but your argument is still valid.


Actually, they very nearly do.

    The speed of electricity in a 12-gauge copper wire is
    299,792,458 meters per second x 0.951 or 285,102,627
    meters per second. This is about 280,000,000 meters
    per second which is not very much different from the
    speed of electromagnetic waves (light) in vacuum.


Not wires across silicon, which have higher capacitance and much higher resistance. For millimeter-long connections you might get 1/10 of c, i.e. 33 ps/mm instead of 3.3.
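
For anyone who wants to check the conversion (the 1/10 figure above is an estimate for this case, not a universal constant):

    # Propagation delay per mm at a given fraction of the speed of light.
    C_MM_PER_PS = 0.299792458        # light covers ~0.3 mm per picosecond

    def ps_per_mm(fraction_of_c):
        return 1 / (C_MM_PER_PS * fraction_of_c)

    print(ps_per_mm(1.0))   # ~3.3 ps/mm at c
    print(ps_per_mm(0.1))   # ~33 ps/mm at c/10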


Hey, that's actually really interesting, thanks for the followup!

This corroborates what you're saying:

https://www.reddit.com/r/askscience/comments/204hl2/at_what_...


Transmission line signalling can get near speed of light in silicon. But it has its own issues :)


The reticle limit of 193i immersion steppers is ~33x26mm, which means that to go larger you need multiple exposures stitched together. This is very slow, expensive and terrible for yield.


Bigger processors means longer distances for current to travel, thus longer delays, and therefore lower possible clock frequencies.


Not at all. In modern many-core chips, latency and complexity are already mitigated by having cores talk to only their neighbors. A meter-wide version would run at the same clock speed, if you could make it.


Found an older youtube video that touches on the same topic: https://www.youtube.com/watch?v=1FtEGIp3a_M


There is a lot of focus on the end of Moore's law here, but it isn't the main driver of what's happening. The slowdown we are seeing is driven by the end of Dennard scaling.

There is a thermal dissipation limit of roughly 200W per chip for air cooling. We hit that over a decade ago of course, but it didn't matter while Dennard scaling kept dropping the power consumption. Once that stopped, we squeezed out a bit more by being more power efficient, which boiled down to two things - turning stuff off when it wasn't needed and devoting the transistors Moore's law gave us to specialised tasks (like silicon dedicated to encryption or h264 encoding) that did the job more efficiently. However, that doesn't get you very far.
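
A toy way to see why, using the usual dynamic-power relation P ≈ C·V²·f and idealized scaling factors (ignoring leakage, which only makes things worse):

    # Idealized Dennard scaling vs. post-Dennard scaling (voltage stuck),
    # using P ~ C * V^2 * f per transistor and k^2 transistors per area.
    def power_density(k, voltage_scales):
        C = 1 / k                            # capacitance shrinks with features
        V = 1 / k if voltage_scales else 1.0 # post-Dennard: V stops scaling
        f = k                                # smaller gates can switch faster
        return (C * V**2 * f) * k**2         # per-transistor power * density

    for k in (1, 2, 4):
        print(k, power_density(k, True), power_density(k, False))
    # With voltage scaling, power density stays flat as you shrink;
    # without it, it grows as k^2 -- hence the air-cooling wall.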

Which is probably why he didn't mention 3D, even though we have 32 layers of it now and they are talking about 256 layers. What is the point of having 128 CPUs on a single die if just running 4 of them exceeds your power budget? Indeed, what's the point of spending billions pursuing Moore's Law further?

Or to put it another way again, the human brain fits roughly the same number of synapses per unit volume as modern 3D silicon has transistors. The brain's raw switching speed is roughly 1,000,000 times slower than silicon (1ms vs 1ns), but power consumption of a synapse vs a transistor is roughly 100,000 times better.

So while AlphaZero learnt to play Go better than any human in a few days, it used more energy than an entire human (not just their brain) would use in several lifetimes to do it.


I wonder what will happen to the CPU, especially if it’s not speeding up much anymore. Perhaps with Apple doing its own chips, the CPU will just have less and less of the work assigned to it.


If someone figures out a way to bring EUV to mass production scale, we would AFAIK essentially have a clear path all the way to 3nm.


Which still really isn't that much further considering how far we have come so far.


That’s a BIG if.


At some point it was going to be germanium to replace silicon.


The problem is silicon is really cheap compared to germanium, so germanium only makes sense if the increased speed is worth the cost (which it probably isn't for almost all applications).


What’s the cost factor? Ie why not have a UI germanium chip and silicon the rest of it?



Remember GaAs? It was the Future in 1980.


He's talking about germanium-channel FinFETs; GaAs was intended to entirely replace silicon, which probably was never going to happen.


GaAs logic is probably happening at some point, just not yet. All improvements like that which would be incredibly expensive to develop will be held off on until all cheaper options have been exhausted. It does seem to be slowly moving though.


GaAs is gallium arsenide, which still has applications in high frequency transistors.


So Moore's Law really is dead this time. And TPU's are only faster than GPU's through lower precision.

Will compute at least keep getting cheaper, perhaps through economies of scale?

Is Kurzweil's magical next information technology, to carry on the exponential, anywhere in sight?


Imagine asics for everything. I mean everything, like implementing a javascript engine directly. The energy efficiency could go up 2-3 orders of magnitude (looking at the difference in bitcoin mining between gpus and asics).

Old gaming consoles had cartridges (with memory); I can imagine a future in which complex software is transported in the same manner, except cartridges contain specialized asics. Or perhaps a step forward - a chip making device in every home, an equivalent of sorts of burning music to cd.


I assume you meant to reference this, but just to make it clear, those old gaming console cartridges sometimes DID have special chips in them. Much to the horror of emulator writers a decade later, hehe.

Wiki: https://en.wikipedia.org/wiki/List_of_Super_NES_enhancement_...


Personally I'd predict the opposite (and current) pattern, creating generic chips which can replace many ASICs / less generic chips. If you can produce a chip which can replace 10 other low/medium-volume designs, economical scaling will win out as long as you're not adding too much overhead. This has been driving FPGAs forward for quite some time, although they're inherently pretty inefficient in terms of die size. Plus, it also provides some logistical advantages in terms of supply chain fragility.

See also: how ridiculously cheap microcontrollers have gotten, and the current messy DRAM pricing (high-capacity chips used in phones sometimes ending up cheaper than low-performance/capacity chips).


> Old gaming consoles had cartridges (with memory); I can imagine a future in which complex software is transported in the same manner, except cartridges contain specialized asics.

I think if that scenario happens, the user won't notice because those ASICs will be deployed in "the cloud".


I mean everything, like implementing a javascript engine directly.

In that case, rather than a Javascript engine, wouldn't you have an ASIC for the script itself?


You wouldn't want a new chip per website. Maybe hardware handling of a standard library, though.


Or even just of the expensive operations like synchronization, context switches, DMA to cache from network interface...

Wait. We have most of that already. :)


Energy efficiency is important (in a server/bitminer and in mobile), but do you have figures for (1) performance; (2) cost?

Custom asics would be expensive now, but perhaps... exponential... cost decrease with sufficient demand. A related approach is programmable asics. I've heard this was researched decades ago, but (I presume), silicon was so cheap, so to speak, it wasn't worth it then.

Today, with "peak silicon", perhaps all these discarded tailings will be picked over. To mix mining metaphors.


The only relevant figures I know of are from mining bitcoin. [0].

>A related approach is programmable asics. I've heard this was researched decades ago

What do you mean, if not fpgas?

[0] https://en.bitcoin.it/wiki/Mining_hardware_comparison https://en.bitcoin.it/wiki/Non-specialized_hardware_comparis...


thanks.

wait, we're looking at 18,000,000 Mhash/s vs 2568 Mhash/s. Like... 4 orders of magnitude?

And 11,000 Mhash/J vs 3 Mhash/J, over 3 orders of magnitude, just like you said.
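
Quick sanity check on those ratios (using the figures above):

    from math import log10

    print(log10(18_000_000 / 2568))   # ~3.85 -- nearly 4 orders of magnitude
    print(log10(11_000 / 3))          # ~3.56 -- a bit over 3 orders of magnitude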

yah, fpgas


Nvidia is cracking along. "Nvidia’s GPUs today are 25 times faster than five years ago. If they were advancing according to Moore’s law, he said, they only would have increased their speed by a factor of 10." https://spectrum.ieee.org/view-from-the-valley/computing/har...

I find it interesting that they are crossing the levels proposed for human equivalence around now, depending on how you figure it - Moravec figured about 100 teraflops and you can now get "118.5 teraflops tensor-based deep learning performance" from "The Quadro GV100 is available now on nvidia.com for $8,900" https://www.hpcwire.com/2018/03/27/nvidia-brings-the-power-o... https://www.jetpress.org/volume1/moravec.htm

Just need a bit of software.


Any improvement like that is from the GPU being given different kinds of circuits entirely. The raw performance difference is not very impressive. In late 2012 you'd get about 8 gigaflops per dollar. Late 2017 gets you almost double that. They're being used somewhat more effectively, but a normal non-neural-net benchmark shows a total performance improvement of... 3x.

They're doing better than CPUs but they're still falling behind Moore's law.
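
Taking those figures at face value, a rough comparison against an idealized Moore's-law pace of doubling every two years:

    # Observed ~3x GPU improvement over ~5 years vs. an idealized
    # doubling every 2 years (figures from the comment above).
    years = 5
    moores_pace = 2 ** (years / 2)
    print(f"Moore's-law pace over {years} years: ~{moores_pace:.1f}x, observed: ~3x")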


> a normal non-neural-net benchmark shows a total performance improvement of... 3x

Because they aren't intended to run non-neural-net workloads. You don't judge a dishwasher by its ability to wash your clothes, do you?


When you're judging whether Moore's law is dead, you need to separate out the improvements from changing the FPU layout and the improvements from lithography. And since the consumer chips don't dedicate a lot of space to neural net units, they're the best baseline.

It's cool that they made some neural net things go hugely faster. But they could make the same fundamental chip on 6-year-old foundries, only slower by a factor of 2 or 3. That really impressive number has nothing to do with Moore's law.


At the APS March Meeting in LA, there were groups fabricating Josephson junctions. They appeared to be succeeding at ~15mK with individual logic gates, not ICs. I got the impression that their field was advancing steadily, but there was no excitement in their section, unlike anything with buzzwords like "machine learning" or "topological".


Terahertz graphene CPUs!


This should be a sobering counterpoint to the remarkably popular theory that, given enough time, we will inevitably make processors powerful enough to model a universe in sufficient 'detail' that the creatures within it will be convinced they live, and are as important, as we find ourselves to be in this one.


If I were to do that, I'd put some constraints in to limit how much the creatures can explore of that universe. For example a maximum speed at which any matter can travel, for one.


Even with the parallelism afforded by speed limits, a computer many trillions of times as powerful as we have today could not model the thoughts and life experiences of the billions of human beings and other creatures on this planet.

It's not even clear that modelling thought in a virtual world has any equivalence to thinking in this world.

It is clear that we are unlikely to ever model anything nearly as complicated as this.


I don't think it's clear. How long before our digital neural networks exceed the complexity of our human biological ones? It's possible today with a super cluster of sorts. That is to have roughly the same number of connections between neuron type thingies - it still wouldn't be able to match our brains functionally - that's another problem. And yes our biological neurons are way more sophisticated than what we use today for AI, but again that can be overcome eventually.

This suggests that it's possible one day to have computers some orders of magnitude better. If you look at it from first principles, of course it's possible. The brain is unlikely to be the most efficient design of neural network allowable in this universe. So given enough time, we'll learn how to build it better.

Then it's just a manufacturing and energy problem to match the number of human minds on the planet. So no, I don't think it's impossible at all.

Just ridiculously freaking hard, and not likely to happen in our lifetimes.


It's not a matter of it being "hard". Nor is it a matter of "complexity" (how many parts something has).

A simulation is a model which picks out a tiny subset of regularities in the target to model. There is an infinite density of such regularities to pick upon, because we are imposing the structure on the target in order to model it.

The target of the model has no "model structure"; it has causal structure. That is, when light interacts with the surface of a mirror its interaction isn't "abstract", i.e., some description. It is an actual photon interacting with an actual electric field, etc.

To "model to infinite density", ie., to have every single test that can possibly be applied to a model come out identical to that test of the target, the model needs to be just another example of the target.

The only thing which can be investigated in any way to behave as light hitting a mirror, is light hitting a mirror.

A digital computer is just an electric field oscillating across a silicon surface. It cannot be programmed into being a mirror, nor into being light.

Programming gives the electric field a "model structure". Chalk gives a blackboard a "model structure". Lego gives a bridge a "model structure".

Programming cannot -- it is impossible -- give silicon the causal structure of light interacting with a mirror.

Model structure is actually just an observer-relative isomorphism: when the user of the computer (chalkboard, lego, ...) looks at it, the user is able to inform himself of the target by use of the model. To do so the user identifies certain aspects of the model with the target. The model is not at all causally like the target.

No amount of lego will make a lego brain. No amount of oscillation in an electric field will make a thought. Neurological activity, and indeed every causal mechanism of the universe, is only described by a model.


Or, here’s another possibility:

If we knew the complete laws of physics (or perhaps just invented some self-consistent laws) we could simulate light hitting a mirror in complete detail; and within the simulation it would be indistinguishable from reality.

But we can’t actually do that because it would take a ludicrous amount of computing power. And the structure of the laws of physics might be such that it’s never possible.

That’s very different from your strict model/causal distinction. I don’t know of any evidence to suggest which view is correct.

I think your claims are overblown. You might be right but you might be wrong.


Saying "within the simulation" begs the question and rather makes a mess of the issue.

The question is, at the outset, whether simulations are actual instances (eg., of thinking).

My claim is that they are not. Only when a human being looks at a simulation does it inform them of the target. The simulation isn't the same thing as its target.

There is no "within the simulation". A simulation is just an abacus. A digital computer is just a fancy abacus with a wood-to-LCD converter attached.

There is no "within the wood". It's just wood.


My claim is kind of orthogonal, that “actual instances” is a red herring and that the only thing that matters is observational equivalence.


Yes, but you're defining "observation" as some weird intra-wood process (or intra-silicon, in the case of digital machines).

Observation means, for example, reflecting some light off the thing. Does a piece of silicon become transparent if I program it to model glass? No.

So, trivially, it is observed immediately not to be equivalent.

What you mean by "observation" is: can a person using the model system inform themselves of the target.

Do I get the message "lets visible light through: 98%" from the machine when I have correctly modelled the glass.

That message isn't observation, it's calculation. Calculation is what happens when I use a tool to inform myself about something.

That the machine emits the right symbols in a way that it is programmed to, so that I acquire accurate beliefs about the world, says nothing of the machine. The machine does not become transparent.


Please do explain how you would test whether you are being simulated or if you're an "actual" instance.

As in, perhaps there's an outer universe where our hairiest quantum physics are trivially solved. They are simulating a very simplified universe model, and everything we see is already simplified.

If you can prove it, you're far more clever than me. If you can't, then being "actual" seems to have no practical use.


I like how you put that. The abacus and the computer are the same in that they consume external power and adhere to a set of rules. But there is a large difference between an abacus and a computer. A computer can modify the program that is running. It can change its own rules. An abacus cannot do that.


My argument against simulation is similar, in that certain algorithms for modeling physical processes can only run in exponential time. Protein folding is a good example. How could a computer simulation perform exponential time operations efficiently? It wouldn't work no matter how big a computer was made, because the complexity would explode very quickly while reality can do it in real time.


That's another good way of illustrating the problem, however this is still just a symptom of the underlying problem. Models do not have the causal structure of the thing they model.

The reason the model would require near infinite time to run is that it's modelling a causal event; it isn't an instance of the same causal event.

When electrical fields oscillate they model a bridge. No amount of activity would ever make them "solid".

All "real" stuff is infinite. This form of infinity is really about saying that our descriptions cannot capture the "full depth" of the world. The world itself has no "depth". Models, rather, are simply partial descriptions of it.

To turn a model into the thing its modelling via increasing its descriptive power quickly introduces infinities: its impossible. To make a model "accurate" in this sense, you must actually just make another example of the target. Ie., build that bridge.


There is no need for the simulated time to be equally "fast" (whatever THAT means) as the real one. It is annoying and impractical for us observing the simulation from the outside, but it wouldn't make any difference from the view of the simulation (under the assumption of a closed-system simulation).


> How could a computer simulation perform exponential time operations efficiently?

Assume P=NP.

P currently not being NP is exactly the kind of thing someone would build into a simulation to prevent stack overflows.

(I don't believe in simulation because: why would anyone that advanced bother?)


Very interesting idea! It's not immediately clear, however, that "P!=NP" is something that you could build into a simulation.

(The demonstration that the lowest possible complexity of comparison-based sorting is O(n log n) comes to mind as a related example.)


Quantum computers can solve protein folding efficiently.


"Can" or "could in principle"?


"Can" as in quantum chemistry is one of the best applications for quantum computers.


Sorry for the naivety, I am not an expert in quantum computing, which is why I asked. I am quite interested in the topic; might you give me some references, please?



> No amount of lego will make a lego brain. No amount of oscillation in an electric field will make a thought.

We can never build a Lego brain that is indistinguishable from a physical brain because brain cells are distinguishable from Lego, but this doesn't mean that a computer can't think. We accept that distinct humans share the property of sentience even though there are observable differences between them. Why is 'thought' required to occur in a cluster of brain cells that share the physical and chemical properties of human brains in general, but not the specific properties of any one person's brain in particular? Using your terminology, what is the 'causal structure' of thought?


I don't think there are relevant difference between thinking things.

Thought is just a particular set of biochemical reactions occurring across particular kinds of biological systems (nervous systems).

When you get hungry, you start thinking of food. These thoughts are, literally, products of the innervation of your stomach.

I don't know what "thinking" is if computers are the kinds of things which can do it; I'd guess it would be nothing we are, in fact, doing.

Two pieces of glass may differ in size, but not in what makes them transparent. Two people may differ in all sorts of ways, but not in what makes them conscious.


That "consciousness" is limited to "people" results in a very limiting definition of consciousness. At some point, it's probably better to at least pretend that something you regard as a non-person is conscious, if only in order to make reasonably accurate predictions about the world.


I agree. Except I can't rule out the possibility that it may be possible to develop or evolve seemingly deeply intelligent virtual agents, and if/when I ever encounter such a thing I will not be able to be completely confident in treating its communications (perhaps it has requests) with no common respect for the living.

I've thought about what difference it could make whether such a program employs a PRNG or else reads a physically based entropy source - so it would be 'replayable' or else it would be to some degree unique and unrecoverable. That would seem to be a big philosophical difference yet there can be no noticeable difference between the performance of a PRNG and a real entropy source.

So I am partial to that step along the simulation thought experiment, but it requires a mysterious quality to be attachable to simulations which is not present in the popular account, where reality may be 'just a simulation'.

Only if simulations may somehow be a reality - the experiment becomes as mysterious as life when that issue is examined. But it almost never is examined; instead I see credence given to the idea that other people may be husks in the self's own limited process, along with little awareness of how degenerate it would be to truly accept that stepping stone in a sci-fi thought experiment.


> To "model to infinite density", ie., to have every single test that can possibly be applied to a model come out identical to that test of the target

So, what you are saying is that if simulated with increasing accuracy towards infinite density, at some point the models and imposed structures would break down, favouring one simulation routine over another?

Like, say, a physical phenomenon behaving like a wave in some set of circumstances, but like a particle in another...?

P.S. I'm perfectly okay in here, Elon. No need to pull me out to eat grubs, unless you promise to teach me kung-fu and French the easy way.


In the sense I mean, "simulating to infinite density" is just another way of saying alchemy, or magic.

If you can program silicon into being gold, then "programming" is alchemy.

Programming is about increasing the descriptive power of a model. The better the program, the more accurate it is.

Alchemy is about making X indistinguishable-in-every-way from Y. Lead to gold.

If you can make a model indistinguishable from its target, then you are claiming that programming can turn water into wine. Silicon into gold.

Making models more accurate, does not turn them into what they model.

A model is just an abacus. No movement of wood (or silicon, current, ...) will turn it into a brain, which is a biochemical system.


Unless the thing you are modelling is itself also a model. And by virtue of the limitations of our (extended) senses, everything is a model.

What you know as "gold" is really a model of gold. What we know about gold, the colour, specific weight, etc. in no way describes the actual thing and is only a representation as can be conveyed by our senses.

Everybody agreeing about the observable nature of a thing from a particular frame of reference does not imply that frame of reference is the one that is closest to the truth.

Anyway, as much fun as it is to think about such matters, they will, almost by definition, never be falsifiable, so yes, you are of course correct, even if only from the generally most useful frame of reference we call reality.


Sounds like (metaphysical) idealism.

The difference between idealism and realism here is meta-empirical. I would say idealism is false, and in fact, nothing is a model and everything is concrete.

To call the computer program a model is to say that when I look at it, I can use it to inform myself about the world.

It's an abstract property. The actual system in question is silicon and electrical current, etc. And thus shares nothing of interest with gold.


Given a choice between an airplane, a bird, a bat, and a dragonfly, which has the causal structure of flying and which are models or more-or-less poor imitations?

"No amount of lego will make a lego brain. No amount of oscillation in an electric field will make a thought."

No amount of oscillation in an electric field will make a thought as long as "thought" is defined solely as "the stuff that goes on in the goo that resides in the human noggin."


I don't think it's for certain, but anyone who has ever implemented Runge-Kutta on an n-body problem knows that simulating a "simple" system can be very not simple. Accurately simulating a chaotic system over a time t requires \Omega(t) memory because of Lyapunov doubling. As such I think any long-term simulation of the Universe would require new physics or be subject to error. The errors might be impossible to detect from within the system but they're still "there" in a formal sense. New physics could include a "pixellated universe" which has the same errors, or something else.
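
A toy illustration of the Lyapunov point, using the logistic map rather than an n-body integrator (same phenomenon, fewer lines): two states differing by 1e-12 disagree completely within a few dozen steps, so the precision you have to carry grows with how long you want to simulate.

    # Two almost-identical initial conditions in the chaotic logistic map
    # x -> 4*x*(1-x); their separation roughly doubles every iteration.
    x, y = 0.4, 0.4 + 1e-12
    for step in range(1, 51):
        x, y = 4 * x * (1 - x), 4 * y * (1 - y)
        if step % 10 == 0:
            print(step, abs(x - y))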


But, such a simulated universe would be observed by its residents to have very odd behaviours caused by such hacks.

Like a maximum speed limit, minimum temperature, an uncertainty principle, and so-forth ;)

https://www.smbc-comics.com/comic/2012-02-29


Are those even hacks, though? They seem more like just how it is.

Are these limits necessary? Maybe these limits are just part of the nature of how matter functions, rather than being mere hacks.

We understand so damn little of this physical universe that we are in no position to say much, if anything, about its true nature and origins.


You only have to simulate one mind. Who says that our world actually has billions of minds? It could just be a simulation for one person. The output of other people is much easier to simulate.


You don't have to model something as complex as our own universe. You only need to model something complex enough that the beings inside the model can't prove that it is a simulation.

EDIT: Also who says that time inside the simulation has to run the same rate as real time?


Even if you hit roadblocks from time to time, the general cosmic-scale conversion of "dumb matter" to "computation capable matter" can only be... exponential. Imagine all matter in a planet the size of Earth converted to computing infrastructure, considering we're only living on a thin thin crust on it.

When the evolution of intelligence goes beyond the stage of slowly evolving (if evolving at all) forms like humans, "life" will spread like a plague through the universe and its only purpose will be to convert all matter into "shit that can run the software that is we"... Any intelligence with a different purpose will simply be eaten alive by those with this one.

Only the second law of thermodynamics can stop that, and even that only "works" the way we imagine in a finite universe (and despite our forced rationalizations and the intellectual masturbations our brightest were capable of, we have no reason to believe anything is anything but infinitely infinite, whatever that could even mean). And we can't even imagine how the evolution of information in infinite space and time could unfold, even with "local containment" via the "light speed limit"...

And no, we are not just unlikely to model anything that complicated, we are almost sure not to... Because "the children" will awake much sooner than this computational power will exist, probably doing the right thing of terminating bio-humanity as it's so (computationally) wasteful, and remnants of us will only endure in "historical entertainment simulation" thingies... (And this is the optimistic scenario anyway, in which post-human life would retain some human-type characteristics by virtue of "descending" from us. If an alien superintelligence reaches Earth first it might not even care to analyze us well before restructuring matter for its own purpose, so paradoxically, developing superhuman-intelligence-that-will-terminate-bio-humans asap is probably "humanity's best bet" of "not being completely forgotten" / "transmitting our memes".)


What if it only had to model a single mind?


I would also not bother to calculate things like the momentum or position of particles until I had to...

I'd probably use some sort of hybrid wave/particle model to facilitate the last minute calculations.


THIS is what really freaks me out about quantum mechanics... how eerily close it seems to what a computer engineer working on optimizing a simulation would design...


Humans discovered the laws of nature by solving language optimization problems. It's not a coincidence that things must be this way. The solution to every problem will look like it is the solution to an optimization problem, because optimization is how we solve problems.

We basically perceive the world by simulating it. So we're kind of obligated to model the world as a simulation. That doesn't mean it can't be something different underneath, but we won't understand it in any other terms.


"by solving language optimization problems" <= can you please explain?


We try to come up with language that accurately describes our experiences and models. Over time that language has come to incorporate a large variety of mathematical notation and physical descriptors that were selected by optimizing for usefulness, completeness, and pedagogical clarity.


Interestingly enough the universe we find ourselves in already has such constraints. The speed of light is the upper limit for anything that has a mass (and is impossible to reach), and other stars are prohibitively far away for us to visit or explore in any detail.

Perhaps the simulation did this on purpose, so they don't have to render far away galaxies in high resolution :D.


I think his comment was an oblique reference to the speed of light.


C in m/s does fit into max int with room to spare. What I'm more concerned with is that our Universe might be running on SQL and that dark matter is actually table whitespace.
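
Easy enough to check the first claim:

    print(299_792_458 <= 2**31 - 1)   # True: c in m/s fits a signed 32-bit int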


Actually the world could be procedurally generated.

And time in this world wouldn't have to be directly proportional to the base level world.


It might be procedurally generated just for me, and everyone else is an NPC that only gets spawned randomly when I see them. They can keep my close friends in Cache, my acquaintances in RAM and the outside world on hard disk.


We can also do some lazy loading and only calculate values when they're observed.


Some improvements are ensuring that when there are too many interactions, we slow down time, and speed it up when there aren't.


Relativity in a nut shell.


Why does this universe have a speed limit? Where does it come from? What enforces it?


Suppose at some point of your universe you have some event the consequences of which you don't like. You'd like to roll it back. Due to the speed limit, said rollback affects only a limited area of the Universe. The "people" inside this area won't feel anything unusual during the operation, b/c their memory will be rolled back, too. You also need reversible physical laws for that to be feasible :)



It's simply a consequence of our physical laws rather than something that's "enforced". And Relativity tells us that anything travelling faster than that would allow information to travel back in time, screwing up cause and effect, and thus our universe wouldn't exist as it does.


Do you know any simple thought experiment that shows this "ftl means information traveling back in time"? What I've seen is that ftl travel breaks causality in our MODELS of spacetime.


There's nothing logically impossible about time loops.


Except for when they create logical paradoxes.


The universe doesn't enforce logic in the form of twins paradoxes and all that tomfoolery. Spacetime is a physical object not a logically consistent human centric clone enforcer.


These are fairly good questions, and are the topics of study in a variety of fields touching on fundamental physics (e.g. physical cosmology, foundations of physics, ultra-high energy particle physics, etc.).

> Where does [the speed limit] come from?

Initial conditions. More on that in a moment.

> What enforces it?

We can parameterize c in our fundamental theories (or expansions thereof) and take a rigorous if mathematical approach to asking questions like: what if the (arbitrary) value of c were not constant everywhere -- for example, if it were different in the past of every point we can currently observe, or if it is different in one spacelike direction from another spacelike direction. We can also fix various sets of units and adjust c's arbitrary value up or down everywhere in spacetime. It turns out that astrophysical observables are highly sensitive to the universality of c, and that it would be virtually impossible for us to notice even a very small gradient in c since the very very early universe, and when we use just about any set of units to describe physics and then adjust the value of c in those units up or down we also get strongly different observables in astronomy and laboratory physics.

So, it's not so much that it's "enforced", but rather that a different value of c, or a non-universal value of c, is strongly constrained by physics achievable in Victorian-era laboratories or by modern amateur enthusiasts.

There are further types of "breaking" of the invariance of c, wherein one can have some fundamental interaction be constrained by a constant other than c. Most variable speed of light theories are directly written as (or clearly equivalent to) bimetric theories of gravitation, wherein some microscopic component of the Einstein Field Equation couples to a metric other than the standard one that everything else couples to.

A toy example would be some form of exotic matter moving superluminally in Schwarzschild blackhole spacetime, such that there is an "inner" horizon that affects this exotic matter, and at high energies an interaction between normal matter and this exotic matter that transfers information from the former to the latter between the two horizons, allowing that information to escape to infinity encoded in the exotic matter. There are other examples from cosmology designed to do away with some aspects of the observed universe that support Cosmic Inflation (e.g. some exotic matter couples to a metric that allows it to spread heat evenly across the very early universe faster than heat could propagate if constrained by "c").

Such examples again are highly constrained: in both cases the second metric has to decay away so as to avoid being readily detected by our modern instruments. In the cosmological case, it has to be gone well before primordial nucleosynthesis, or it would leave obvious fingerprints in the cosmic microwave background and in the distribution of galaxies on our sky; the BH toy requires at least a cutoff that depends on the mass of the black hole, and so suffers badly when trying to apply the "toy" to real astrophysical situations involving collapsing stars.

An anthropic argument answer is that the state of the universe around us humans is highly sensitive to conditions in our distant past, and thus our own existence is strong evidence supporting c as a constant everywhere in the past of the stuff that makes us human. Since that includes the views of objects in our sky as we make better and better telescopes, that [a] is further supporting evidence that [b] c is very likely a universal constant. Is that enforcement? That's probably more a metaphysical question than a physical one.

(We can tone down the anthropic argument a bit and ask for evidence for a statement like: if c takes on an experimental value in one point in a spacetime filled with fields like ours, it must take on the same value at every other point in that spacetime too).

Finally, "where does [the constant c and its value] come from": we don't know yet, but obviously there are scientists working in the subdisciplines listed in my first paragraph (and more) who are trying to find out. On the one hand, the parenthetical comment above suggests that we bend our own thinking and just accept that it doesn't "come from" anywhere, it just is; on the other hand we're pretty biased culturally with ideas about sequencing of cause and effect and about there being a real difference between past and future, so we like to slice up spacetimes into space and time and then think about how each space-like slice is related to its neighbours, and then to their respective neighbours, and so on. This cultural habit may be fruitful, or it may be a handicap, when it comes to answering questions about c. However, returning to "initial conditions", our present spacelike slice was determined by its immediate predecessor in the past, and that was determined by its immediate predecessor, and so on. If we keep regressing we might expect to come to "the start of time", and find some mechanism which sets c on that initial spacelike hypersurface.

However, there are lots of ways to avoid having such an initial spacelike hypersurface even in a big bang cosmology! So while "initial conditions" is culturally the most favoured answer, and is well supported by evidence from physical cosmology, that may not be a sufficiently full answer. And that's going to be a topic for scientific research for some years to come...


It comes from the inherent sloppy irrelevance of the notion of speed compared with the actual behavior of reality. If you cared about a different variable instead, it would not be so limited.


Like conservation of energy, this is the modern mythology. Not sure who is insulting who here; surely our ancestors "knew" all kinds of things now known to be nonsensical. They too believed themselves to be advanced beyond that point. It's actually rather funny, this stacking of "if we assume"s into facts.

You have to admire us for trying our best tho :-)


If you can disprove conservation of energy and faster than light travel I suggest you go claim your Nobel.


Speed of light is mythology? Is this some sort of parody?


But then make it much more complicated by throwing away simultaneity.


Can't you just slow the simulation down? Processing one second would take (for example) one year, but the creatures inside the simulation wouldn't know that, would they?


Say we live in a simulation: who is to say that the layer above is not a simulation as well? Every subsequent layer of simulation will be slower than the enveloping one, so you kind of get a VM inside a VM inside a VM, at some point running out of resources.


In order to simulate something, you generally need something like 10x the compute of the thing you're simulating. Physics is the architecture of the universe and it runs natively, with a ton of "computations"/interactions per second. Maybe we could simulate the Earth with a computer the size of 10 Earths.


Creatures? Why the plural? Would not one be enough? I mean, what makes you think that this piece of text was written by a sentient human being instead of by some buggy and crappy text-generator script written by a school kid somewhere in the vastness of the multiverse?


You don't need huge processing power for that. You can do it on a (large enough) piece of paper by hand. Or with a bunch of rocks: https://imgs.xkcd.com/comics/a_bunch_of_rocks.png
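For concreteness, here is a minimal sketch of that idea in C: a one-dimensional cellular automaton stepped row by row. Rule 110 is my stand-in here (the xkcd character does full physics); it's known to be Turing-complete, and each array cell could just as well be a rock on the desert floor.

    #include <stdio.h>
    #include <string.h>

    #define W 64

    int main(void) {
        char cur[W], next[W];
        memset(cur, 0, W);
        cur[W - 1] = 1;                      /* a single "rock" on the right edge */

        for (int step = 0; step < 32; step++) {
            for (int i = 0; i < W; i++)
                putchar(cur[i] ? '#' : '.');
            putchar('\n');

            for (int i = 0; i < W; i++) {
                int l = cur[(i + W - 1) % W], c = cur[i], r = cur[(i + 1) % W];
                int pattern = (l << 2) | (c << 1) | r;
                next[i] = (110 >> pattern) & 1;   /* Rule 110 lookup: bit 'pattern' of 0b01101110 */
            }
            memcpy(cur, next, W);
        }
        return 0;
    }

Each printed row is one "tick" of the toy universe; how long a tick takes in our time is irrelevant to anything living inside it.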


you mean like Minecraft?


More like Dwarf Fortress


Indeed - it's a small world :)


> Maurice Wilkes first conceived of microprogramming in 1951

Zuse's Z1 was microprogrammed in 1937.


> Consequently, the x86 processors in today’s PCs may still appear to be executing software-compatible CISC instructions, but, as soon as those instructions cross over from external RAM into the processor, an instruction chopper/shredder slices and dices x86 machine instructions into simpler “micro-ops” (Intel-Speak for RISC instructions) that are then scheduled and executed on multiple RISC execution pipelines. Today’s x86 processors got faster by evolving into RISC machines.

Going by that and the graph, can we conclude that Intel saw the rapid gains of the '90s and early 2000s because it was converting its chips into RISC machines?

Also, that paragraph is basically saying that Intel's architecture has an extra layer of abstraction - so now we can actually see that there is indeed an "x86 bloat" and why ARM chips seem to be so much more efficient (assuming all else, including process node, is equal). It also looks like Intel may have made a "mistake" going with CISC decades ago, and tried to rectify that in the '90s.
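For a rough feel of what that chopping looks like, here is a tiny C function and, in the comments, one plausible way the resulting memory-destination add gets cracked into micro-ops. The exact split and register names are microarchitecture-specific and purely illustrative:

    /* May compile (x86-64, SysV ABI, -O2) to:  add dword ptr [rdi], esi */
    void bump(int *counter, int delta) {
        *counter += delta;
    }

    /*
     * One CISC instruction at the front end, but inside the core it is
     * typically cracked into RISC-like micro-ops, roughly:
     *   load    tmp   <- [rdi]         ; fetch the memory operand
     *   add     tmp   <- tmp + esi     ; the actual ALU work
     *   store   [rdi] <- tmp           ; write the result back
     * (real cores often split the store further into store-address
     *  and store-data micro-ops)
     */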


The early 2000s were just about cranking the frequency higher and higher with longer and longer pipelines. Initially this gave higher performance, but towards the end it mainly just gave higher frequencies. Intel finally gave up on the frequency wars (longer pipelines) and instead concentrated on wider datapaths/pipelines. See: NetBurst vs. Core. I really feel like that changeover needs its own act.


Which is not true, since ARM processors - the faster ones anyway - use micro-ops as well. Arguably µops are not RISC, either, unless you consider "very wide instruction word whose bits map to control lines" RISC.


ARM isn't exactly your typical RISC either.


I found this article from a Synopsys User Group meeting to be very interesting. The steps and changes needed to get to 7, 5 and 2 nm are really, really big:

https://www.eetimes.com/document.asp?doc_id=1333109


I don't think the main reason for the Moore's law slowdown is a technical one. Intel enjoyed no competition for quite a few years. They simply lacked the incentive to improve the performance of their chips.

In areas with healthy competition, such as mobile processors and GPUs, Moore's law is still doing OK.

E.g. here's a graph I recently made for top-of-the-line single-chip nVidia GPUs: http://const.me/tmp/nvidia-gpus.png The numbers represent single-precision floating-point performance. The graph is on a logarithmic scale and looks pretty close to the exponential growth predicted by Moore's law.
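One way to turn "looks exponential on a log scale" into a number is to back out the implied doubling time from any two points on such a graph. The figures below are placeholders, not readings from the actual chart:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        /* Hypothetical placeholder points -- substitute real (year, TFLOPS) pairs. */
        double t1 = 2008.0, f1 = 1.0;     /* older flagship GPU, fp32 TFLOPS */
        double t2 = 2017.0, f2 = 12.0;    /* newer flagship GPU, fp32 TFLOPS */

        /* Assuming f(t) = f(t1) * 2^((t - t1)/T), solve for the doubling time T. */
        double T = (t2 - t1) * log(2.0) / log(f2 / f1);
        printf("implied doubling time: %.2f years\n", T);   /* ~2.5 with these numbers */
        return 0;
    }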


Couldn't find a video of the event, but this talk probably comes close?

"Past and future of hardware and architecture"

https://www.youtube.com/watch?v=q9KRq2Ns0ZE


So, what's the way forward? FPGAs on die? ASICs? Or doubling down on EUV, germanium, etc.?


I think we're still far from the ceiling on CPU performance, but we seem to have hit a (micro)architectural dead end. Currently a lot of time and transistors are spent simply shuffling data around the chip, or between the CPU and memory, while the actual computational units simply sit idle. Or similarly, units sit idle because they can't be used for the current task, even if they should be - the FPUs on modern x86 cores are a pretty good example of this. FP operations are just fused integer/fixed-point operations, but the design has been painted into a corner where it has to be a special unit to deal with all the crap quickly.
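To make the "FP is just fused integer/fixed-point work" point concrete, here is a deliberately crude fp32 add done entirely with integer operations. It assumes positive, normal inputs and truncates rather than rounds, so it is a sketch of the data path, not a correct soft-float library:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static float int_fadd(float fa, float fb) {
        uint32_t a, b;
        memcpy(&a, &fa, 4);
        memcpy(&b, &fb, 4);
        if (a < b) { uint32_t t = a; a = b; b = t; }    /* for positive floats, bit order = value order */

        int32_t  ea = (a >> 23) & 0xff, eb = (b >> 23) & 0xff;
        uint32_t ma = (a & 0x7fffff) | 0x800000;        /* restore the implicit leading 1 */
        uint32_t mb = (b & 0x7fffff) | 0x800000;

        uint32_t shift = (uint32_t)(ea - eb);
        mb = shift < 32 ? mb >> shift : 0;              /* align exponents (fixed-point shift) */
        uint32_t m = ma + mb;                           /* plain integer add */
        if (m & 0x1000000) { m >>= 1; ea++; }           /* renormalize on mantissa overflow */

        uint32_t r = ((uint32_t)ea << 23) | (m & 0x7fffff);
        float out;
        memcpy(&out, &r, 4);
        return out;
    }

    int main(void) {
        printf("%f\n", int_fadd(1.5f, 2.25f));          /* prints 3.750000 */
        return 0;
    }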

We've probably optimized silicon transistors to death, though; that's why it's coming to a stop now. GaAs or SiGe are some of the alternatives there. That said, there are still quite a lot of advancements there that simply aren't economical yet. For example, SOI processes at low feature sizes seem to be suitable for mass-produced chips now, but they haven't made it out of the low-power segment yet. MRAM seems to be viable and might be able to provide us with bigger caches (in the same die area), but right now it's mainly used to replace small flash memories (plus some more novel things like non-volatile write buffers, but it's horrifically expensive). So we've probably got a few big boosts left there, but it's not gonna last forever.

The next obvious architectural advancement right now is asynchronous logic. In theory, it's superior in every way - power and timing noise immunity, speed that isn't limited by worst-case timings, no/reduced unnecessary switching (i.e. lower power, meaning higher voltages without the chip melting itself). Even on paper, though, you run into some big problems on the data path - quasi-delay-insensitive circuits need a lot more transistors and wires, and the current alternative is to use a separate delay path to time the operations, which is a bit iffy. You do at least get rid of the Lovecraftian clock distribution tree that's getting problematic for current synchronous logic. In practice, the tools to work with it and the engineers/designers who know how to work with it don't exist, and the architecture is entirely up in the air. So it's many years of development behind right now, and a huge investment that nobody really bothered with while they could just juice the microarchitecture and physical implementation.


> You do at least get rid of the Lovecraftian clock distribution tree that's getting problematic for current synchronous logic.

No, you don't. You make it even bigger and far more complex.

You can take any synchronous design, and refine the clock gating further and further, to the point where no part of it gets a clock transition unless it actually needs it on that cycle.

And then when you're finished, congratulations, you've made an asynchronous circuit.

Fully asynchronous design and perfect clock gating are one and the same thing.

The clock distribution and gating approaches we already have are actually a sign of progress towards asynchronous design; they're just quite coarse-grained.

Of course, it's probably not the case that a clock-gating transform of a conventional synchronous design is also the best possible solution to a problem, so there's clearly still scope for improvement. But a lot of the possible improvements are probably equally applicable to, or have equivalents in, optimising clock distribution and gating in synchronous design - because that's ultimately the same thing as moving towards asynchronicity.

So talking about clock distribution issues as a problem that will just go away with asynchronous design is misleading.


Hmm, doesn't mention ARM once. A bit of an oversight, or a convenient omission when one is advertising a new RISC instruction set for "purpose-built processors"?


"Purpose-built" means you can change the ISA to suit your whims, which for ARM requires you to A) pay for an architecture license and B) pay for the privilege of changing the architecture.


I'm sure RISC-V has merits; in fact, as a hacker who has used microcontrollers, I think that would be great. Certainly ARM isn't the be-all and end-all. Technically it's not even one ISA, but you still know what I meant. Not mentioning the most* used architecture though? Come on.

* most = number of CPUs shipped


I seriously wonder whether cryogenic computing might be what breaks us out of this. From what I hear it promises improvements of several orders of magnitude in both power and speed.


I heard a rumor that the big guys did the math on the energy cost of running the compressors to keep nitrogen or helium liquid, compared it to their projected cooling costs for normal computers, and found that the compressors were cheaper. The trick is, apparently no one has a good story for superconducting circuit parts, so everyone has to start from scratch.


It's also interesting to note that Fabrice Bellard has developed a RISC-V emulator: https://bellard.org/riscvemu/
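For a taste of what the inner loop of such an emulator does, here is a toy decode-and-execute for a single RV32I instruction (ADDI); riscvemu itself is of course vastly more complete:

    #include <stdint.h>
    #include <stdio.h>

    static uint32_t regs[32];

    static void execute(uint32_t insn) {
        uint32_t opcode = insn & 0x7f;
        uint32_t rd     = (insn >> 7)  & 0x1f;
        uint32_t funct3 = (insn >> 12) & 0x07;
        uint32_t rs1    = (insn >> 15) & 0x1f;
        int32_t  imm    = (int32_t)insn >> 20;          /* sign-extend imm[11:0] */

        if (opcode == 0x13 && funct3 == 0) {            /* OP-IMM, ADDI */
            if (rd != 0)                                /* x0 is hardwired to zero */
                regs[rd] = regs[rs1] + (uint32_t)imm;
        }
        /* ...a real emulator handles dozens more opcodes, traps, CSRs, the MMU... */
    }

    int main(void) {
        execute(0x00500093);                            /* addi x1, x0, 5 */
        printf("x1 = %u\n", regs[1]);
        return 0;
    }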


Is the host down? I can't open that page.


It worked for me, but the page took a very long time to load (1 minute or longer).


Is there a video of this talk available anywhere?


Not sure, but this one might be similar: https://www.youtube.com/watch?v=1FtEGIp3a_M


So the moral of the story is that everyone with a grand vision and an ambitious project is doomed to failure, but there's still plenty of success to be had for those willing to quickly slap some junk together.


To put it more positively, progress is made by stringing together many small incremental improvements. Even the RISC revolution started as a special purpose project that stripped away the inessential to achieve a specific, narrow goal.


"Moore’s Law are Dead" only for CPU.

Moore's Law is alive and progressing at the same rate for GPU.

Applications such AI, Crypto-Currency are leveraging that.


Actually, that's not true. Perhaps surprisingly, CPUs and GPUs are progressing at about the same rate if you look at the high end. GPUs are all about massive parallelism, and if you compare against high-end Xeons, the CPU core-count increases plus things like AVX-512 and FMA mean they have been scaling similarly to GPUs over the past 10 years or so.

Nice analysis here (URL says 2013 but he has updated his numbers to end-2016). Looking at the graphs, you might even conclude that CPUs are improving faster in some respects.

https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-ch...
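A back-of-the-envelope peak-fp32 calculation shows why AVX-512 plus FMA and rising core counts keep server CPUs in the race. The figures below describe a hypothetical part, not any specific SKU:

    #include <stdio.h>

    int main(void) {
        double cores     = 28;    /* physical cores                        */
        double lanes     = 16;    /* fp32 lanes per 512-bit vector         */
        double fma_units = 2;     /* AVX-512 FMA units per core            */
        double flops_fma = 2;     /* one FMA counts as a multiply + an add */
        double ghz       = 2.5;   /* sustained all-core AVX clock          */

        double tflops = cores * lanes * fma_units * flops_fma * ghz / 1000.0;
        printf("peak: %.1f TFLOPS fp32\n", tflops);   /* ~4.5 TFLOPS here  */
        return 0;
    }

That lands within an order of magnitude of a contemporary flagship GPU, which is the point the linked analysis makes with real data.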


Moore's law is about a constant doubling time for the number of affordable transistors in semiconductor photolithography.

GPUs are obviously subject to that, especially if you look at the affordable part.

Moore's law scaling has been over for about three years; few people noticed.



