Off topic thought: "Esperanto Technologies" is apparently the name of the company, in case you were confused by the headline like I was. I was amused to discover their offices are literally 3 blocks away from where I live. (So is YCombinator and a thousand other tech companies, so not that surprising, really, but amusing.)
At this point I think we need to go back to descriptive old-school 70s company names like, "West Coast Microprocessor Solutions", "Digital Logic, Inc.", "Mountain View Artificial Intelligence Laboratories", etc.
You know, something that would blend into this map: https://s3.amazonaws.com/rumsey5/silicon/11492000.jpg
Edit: Looking at that map, some of the company names are fantastically generic! "Electronics Corporation", "California Devices", "General Technology", "Test International".
Looks efficient, at least on the face of it. It certainly seems credible (Dave Ditzel), and as a way of lowering the cost/improving the efficiency of targeted ad-serving, they could be onto a winner.
Looking at this, I'm confused by basic questions. Is this a MIMD or SIMD architecture chip? [1] What is the memory/caching structure here, and would it be fast or slow? Is this to replace a GPU, to replace the CPU you connect to the GPU, or both? Would you get code and/or data divergence here? I.e., "many cores" seems to imply each has its own instructions, but ML usually runs on vector machines like a GPU.
Edit: OK, I can see this has "network on a chip" architecture but I think that only answers some of my questions.
The presentation has all the answers. It's not MIMD or SIMD; it's a 32x4x8 grid of logically individual SMT2 cores. Groups of 8 share an I$, groups of 32 share a D$, and everyone shares an LLC and memory interface. There's no divergence; each core is independent.
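If it helps to picture that sharing, here's a toy sketch (my own; the linear core numbering and helper name are just assumptions for illustration) of how a core index would map onto those cache groups:

    # Toy model of the sharing described above; core numbering is an assumption.
    CORES_PER_ICACHE = 8    # groups of 8 share an instruction cache
    CORES_PER_DCACHE = 32   # groups of 32 share a data cache

    def cache_groups(core_id):
        """Return (I$ group, D$ group) for a linearly numbered core."""
        return core_id // CORES_PER_ICACHE, core_id // CORES_PER_DCACHE

    # Cores 0-7 share an I$; cores 0-31 share a D$; everyone shares the LLC.
    print(cache_groups(0))   # (0, 0)
    print(cache_groups(9))   # (1, 0)
    print(cache_groups(33))  # (4, 1)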
Don't miss the most important detail: low voltage. By running the Minions at a low voltage, you are reducing performance, but disproportionately increasing the efficiency (up to a point). It's an interesting trade-off of power, area, performance, and efficiency.
Add: it's clearly not intended to drive a GPU, but is presented as an accelerator. They have mentioned that it can also run standalone, and thus can be configured as both a PCIe host and target.
How does low voltage help? I mean, you can lower the voltage to increase efficiency on any chip. Nobody does that because it is a waste of expensive silicon.
Power consumption is a function of the voltage squared, but the maximum operating frequency drops more slowly than that as you lower the voltage, up to a point as GP said. So efficiency, defined as f/P, goes up as V goes down.
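A toy numerical sketch of that trade-off (the scaling exponents and constants here are made up for illustration, and it ignores the point where leakage takes over at very low voltage):

    # Toy DVFS model: dynamic power ~ C * f * V^2, and f falls roughly
    # linearly with V near nominal (both are rough assumptions).
    def perf_and_power(v, v_nom=1.0, f_nom=2.0e9, c=1e-9):
        f = f_nom * (v / v_nom)   # crude: frequency scales ~linearly with V
        p = c * f * v**2          # dynamic power ~ f * V^2
        return f, p

    for v in (1.0, 0.7, 0.4):
        f, p = perf_and_power(v)
        print(f"V={v:.1f}  f={f/1e9:.2f} GHz  P={p:.2f} W  ops/J={f/p:.2e}")

Performance (f) falls as V drops, but energy per operation falls faster, so ops per joule keeps rising, which is the trade-off GP described.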
People definitely do that. The voltage you choose to power your chip is always a consideration; it's one of the variables laptop manufacturers use to manage thermals, and it's also pretty common to lower the CPU voltage so that you can run a desktop system with passive cooling.
You can't just "lower the voltage on any chip" while preserving timings and reliability. These chips will be designed from scratch to run at that kind of voltage, which very few chips are.
> They have mentioned that it can also run standalone
I’d love to see this misused as a workstation. We’d need to do something with `htop` though, because showing all these cores would require an insanely tall terminal.
MIMD via 1088 cores per chip, each core has 512 bit short vector SIMD, and a 1024 bit tensor unit.
~100 MB of SRAM on chip, 8 GDDR busses to DRAM off chip.
It's purpose designed for parallel sparse matrix ML problems. It's more efficient than both a CPU and GPU at these, as well as faster in absolute terms, taking their numbers at face value.
I mean, GPUs are only SIMD for 32 lanes (Nvidia or AMD RDNA) or 64 lanes (AMD CDNA).
The rest of those lanes come from MIMD techniques.
-------
CPU cores today are more MIMD than SISD because of out-of-order and superscalar execution. So honestly, I think it's about time to retire Flynn's taxonomy. Everything is MIMD.
> Not sure what you mean here. Are there places where different groups of kernels can simultaneously execute different code?
Yes. Let's take AMD's GCN / Vega, since I'm most familiar with it.
The Vega 64 has 64 "compute units" (CUs for short), where a CU is the closest thing to a "core" that the Vega GPU has. So let's really look at how one of these CUs functions.
1. A kernel in GCN executes 64-wide SIMD assembly language. This is the programming model, but it's not what's going on under the hood. Each 64-wide SIMD instruction is executed over 4 clock ticks, 16 lanes at a time.
2. The CU has 4 vALUs, each 16 lanes wide (where a vALU is a set of 64-wide registers and 16-wide arithmetic units). The CU also has one group of sALUs ("scalar ALUs", with 32-bit registers). A kernel has access to up to 256 vGPRs (each executing in SIMD fashion: 16 wide over 4 clock ticks to cover the 64 lanes) plus 103 sGPRs, which are shared. (For example, a function call is handled in sGPRs: because the call is "shared" between all 64 lanes, it's more efficiently implemented in sGPRs than vGPRs.)
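To make point 1 concrete, here's a toy simulation (mine, not AMD's microarchitecture) of one 64-wide wavefront instruction being issued 16 lanes per clock over 4 ticks:

    # Toy model: one 64-wide vALU instruction retires 16 lanes per clock tick.
    WAVEFRONT = 64
    LANES_PER_TICK = 16

    def execute(op, data):  # data is a 64-element list (one value per lane)
        out = [None] * WAVEFRONT
        for tick in range(WAVEFRONT // LANES_PER_TICK):   # 4 clock ticks
            lanes = range(tick * LANES_PER_TICK, (tick + 1) * LANES_PER_TICK)
            for lane in lanes:
                out[lane] = op(data[lane])
            print(f"tick {tick}: lanes {lanes.start}-{lanes.stop - 1} done")
        return out

    execute(lambda x: x * 2, list(range(WAVEFRONT)))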
------------
There can be up to 40 simultaneous instruction pointers being executed per AMD GCN CU. Aka: occupancy 40. The exact instructions will be switching between the sALU (e.g., branching instructions, function calls, push/pop from the stack) and the vALUs (e.g., multiply-and-accumulate, which will be SIMD-executed 64-wide over 4 clock ticks). (Really: each vALU has up to 10 instruction pointers that it's tracking.)
As we can see, the modern GPU isn't SIMD at all. It's executing multiple instructions, from multiple different instruction pointers (possibly from different kernels, even), in parallel.
Even at the minimum number of threads... there are 4 wavefronts per CU being executed (4 instruction pointers, one for each vALU). That's MIMD in my opinion, since there's no smaller unit than the CU in the whole of AMD Vega.
Just because it's executing SIMD kernels / SIMD assembly code doesn't mean that the underlying machine is SIMD.
---------
NVidia is similar, except with a 32-wide SIMD programming model and 32 occupancy per SM (though these magic numbers seem to change every generation). NVidia is also beginning to implement superscalar SIMD (2 instructions per clock tick, if those instructions go to different execution units).
So I'd also classify modern NVidia as a MIMD machine.
--------
CPUs of course have superscalar units: something like 4-wide decode and something like 6+ instructions per clock tick (if executing out of the uop cache; 4 instructions per clock tick otherwise).
Your description fits with my understanding, but I'd draw a slightly different conclusion out of it. As I understand it, the terms SIMD/MIMD/etc. are meant to refer to the assembly instructions used to program the chip, and from that perspective, AMD GCN is very clearly SIMD. It's just that each "core" (CU) runs up to 40x SMT...
Each "core" runs 4-wavefronts (aka: instruction pointers) per clock tick, which is MIMD.
If you have fewer than 4-wavefronts running at any given time, you've got an underutilized CU. That's 256-SIMD lanes you need at a minimum to fully utilize any CU from an AMD Vega.
They happen to scale up to 10x wavefronts per vALU (aka: 40-wavefronts maximum). At least, if your kernels use few enough registers / __shared__ memory. But even if we remove all the SMT-like features in an AMD Vega CU, we still have 4x instruction pointers being juggled by the underlying system.
Which should be MIMD by any sane definition. (4 instruction pointers is MI or "multiple instructions", each of which is a SIMD instruction, so MD is also happening).
---------
Note that SMT / Hyperthreading seems to be considered MISD in Flynn's original 1966 paper. So SMT + SIMD == MIMD, in my opinion. These modern GPUs didn't exist back in 1966, so we don't know for sure how Flynn would have categorized today's computers.
But honestly, I'd say it's somewhat insane to be reading a paper/organization scheme from 1966 and trying to apply its labels to computers and architectures invented 50 years later.
Well, for the purpose of designing and running otherwise arbitrary algorithms/code, it seems like four independent SIMD cores/chips (with 1000/many kernels each) is quite different from a thousand cores each running instructions independently. The latter offers more options.
There are 64 CUs in an AMD Vega (a GPU from 2017, mind you: top end back then, but just kind of middle-of-the-road these days).
So each CU natively runs 4 wavefronts (and scales to 40 wavefronts as resources allow). Each wavefront is 64-wide SIMD, for 256 SIMD lanes total running per CU.
That's 16,384 "threads" of execution on a 4-year-old GPU before all cores are "lit up" and utilized... with the option of up to 163,840 "threads" at the maximum 40-way occupancy (useful if your kernels use few registers and have a dependency on something high-latency for some reason). These are mapped onto 4096 physical SIMD lanes / "threads" that actually execute per clock tick.
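Spelling out the arithmetic behind those numbers, using the figures above:

    CUS = 64                 # Vega 64
    WAVE_WIDTH = 64          # lanes per wavefront
    MIN_WAVES_PER_CU = 4     # one per vALU
    MAX_WAVES_PER_CU = 40    # 10 per vALU, resources permitting
    LANES_PER_VALU = 16

    print(CUS * MIN_WAVES_PER_CU * WAVE_WIDTH)  # 16384 "threads" just to light up every CU
    print(CUS * MAX_WAVES_PER_CU * WAVE_WIDTH)  # 163840 "threads" at max occupancy
    print(CUS * 4 * LANES_PER_VALU)             # 4096 SIMD lanes physically executing per tick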
---------
At any given time, there are only 256 instruction pointers actually being executed. Which is where and how a GPU manages to be efficient (but also the weakness of a GPU: why it has issues with "branch divergence").
EDIT: The general assumption of the SIMD model of computers (like GPUs) is that line #250 is probably going to be executed by many, many "threads". So those "threads" instead become SIMD lanes, and are batched together to execute line #250 all at once, saving power on decoding, instruction-pointer tracking, the stack, etc.
As long as enough threads are doing similar things at the same time, it's more efficient to batch them up as a 64-wide AMD wavefront or a 32-wide NVidia warp. The programmer must be aware of this assumption. However, the underlying machine still has gross amounts of flexibility in how it implements this. So it could be a MIMD machine under the hood, even if the programming model is SIMD.
-----
There's also the issue that when you KNOW 64 lanes are working together, you have assurances about where the data is. Things like permute and bpermute instructions can exist (i.e., shuffle data between the lanes) because you know all 64 lanes are on the same line of code. So in practice, cross-thread collaboration (such as prefix-sum, scan operations, compress, expand...) is more efficient in the SIMD model than the equivalent mutex/atomic/compare-and-swap style of programming on CPUs.
Something like a raytracer (bounce these 64 simulated light paths around), and organizing which rays go where (compress into hit_array vs. compress into miss_array), is fundamentally more efficient with GPU/SIMD-style programming than with the heavy CPU-based threaded model.
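As a rough illustration of that compress idea in data-parallel style (numpy standing in for the wavefront; the hit test is made up):

    import numpy as np

    # 64 "lanes", each with a ray; hit[i] says whether lane i's ray hit geometry.
    rng = np.random.default_rng(0)
    hit = rng.random(64) < 0.3                   # made-up hit test

    # An exclusive prefix sum of the hit mask gives each hitting lane its slot
    # in the compacted hit_array -- the scan-based compress a SIMD machine can
    # do with cross-lane (permute/bpermute-style) traffic instead of atomics.
    slots = np.cumsum(hit) - hit                 # exclusive scan
    hit_array = np.empty(hit.sum(), dtype=int)
    hit_array[slots[hit]] = np.flatnonzero(hit)  # lane indices of the hits
    miss_array = np.flatnonzero(~hit)

    print(hit_array, miss_array)

(On a CPU, np.flatnonzero alone would do the compress; the scan formulation is the part that maps onto SIMD lanes.)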
I should make it clearer - I'm looking to run something that's data-intensive and structurally MIMD, but with some SIMD aspects. I'm trying to figure out the most cost-effective chip with which to do this.
The other question is whether chips like Esperanto's product have effective primitives for reduce operations, and how much general memory throughput they have compared to a GPU.
Mentions that each ET-Minion core has a vector / tensor unit. From [1]
> The ET-Minion core, based on the open RISC-V ISA, adds proprietary extensions optimized for machine learning. This general-purpose 64-bit microprocessor executes instructions in order, for maximum efficiency, while extensions support vector and tensor operations on up to 256 bits of floating-point data (using 16-bit or 32-bit operands) or 512 bits of integer data (using 8-bit operands) per clock period.
So sounds like at least 8736 SP FP operations per cycle.
> Ditzel explained, were running x86 servers with open PCIe slots, leaving Esperanto an opening to enter existing datacenters through a high-performing PCIe card.
Brilliant move
> the entire chip would consume just 8.5 watts ... Ditzel said, one chip would take about 20 watts
You've misread it. Here are those lines with a bit more context:
> But if, instead, they followed the peak of the energy efficiency graph, the entire chip would consume just 8.5 watts [...] and operating at about 0.4 volts, Ditzel said, one chip would take about 20 watts.
"the peak of the energy efficiency graph" isn't stated explicitly, but is at around 0.32 volts.
Basically they want to use up all 120W of a PCIe slot, at the highest efficiency possible. They could have gotten higher efficiency (at 8.5W per chip), but that would have resulted in not being able to use the full 120W and thus actually having worse performance overall, even though it's more efficient.
> Basically they want to use up all 120W of a PCIe slot, at the highest efficiency possible. They could have gotten higher efficiency (at 8.5W per chip), but that would have resulted in not being able to use the full 120W
Right, at the end of the day they were constrained due to not being able to fit more than 6 chips on a single PCIe card. If they could fit 14, they would've been able to make each chip use as little as ~8.57W while still using up the 120W available to the card.
This looks similar to a research project I worked on called the Hammerblade Manycore. The cores were connected by a network, where you could read/write to an address by sending a packet which contained the address of the word, whether you were reading or writing, and if writing, the new value. The packet would then hop along the network one unit per cycle until it reached its destination.
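A rough sketch of that remote read/write idea (field names and the XY-routing hop count are my own guesses from memory, not the actual Hammerblade packet format):

    from dataclasses import dataclass

    @dataclass
    class Packet:
        dest: tuple        # (x, y) of the tile that owns the address
        addr: int
        is_write: bool
        value: int = 0     # only meaningful for writes

    def hops(src, dest):
        """Cycles to reach the destination at one hop per cycle (XY routing assumed)."""
        return abs(dest[0] - src[0]) + abs(dest[1] - src[1])

    p = Packet(dest=(3, 5), addr=0x1000, is_write=True, value=42)
    print(hops((0, 0), p.dest))  # 8 cycles of network latency in this toy model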
I've heard that specific term meaning more about consolidation via virtualization technology, but maybe there's some usage of it referring to increasing scales of transistor integration that I don't know about.
Could you share more about Glow? I am also curious as to why Facebook would make this technology open source. I am happy to be able to benefit from the open source work large corporations do, but am not always clear why they do it.
The link to our GitHub repo in a sibling comment probably does more justice than I could do in an HN comment, but it's essentially an ML-graph-to-machine-code compiler that focuses on accelerators.
The rationale for open-sourcing here, in addition to the general recruiting/hiring benefit, is that we want vendors to target a common interface so that it's easy to make direct comparisons amongst different hardware.
I'd say, though, that ML is moving somewhat away from the "graph compiler" approach. PyTorch (and users' experience with TPUs/XLA vs GPUs) has suggested that static graphs aren't desirable for usability or necessary for performance. These days, I'd say write a PyTorch device backend and a fast kernel library.
Well, it's explicitly designed so that vendors can add custom extensions. But code that sticks to standard instructions (anything that pure C/C++ code can compile to, basically) should be 100% compatible. Conformance tests are in the process of being created, but that's not fully in place yet. Still, with so few instructions, all of which get used by software of any significant size, if you successfully boot Linux then it can't be far off :-)
More extensions are in the process of being ratified before the end of the year. The big one is the Vector extension, which allows code using vectors/SIMD to execute completely unchanged on machines with vector registers anywhere from 128 bits to 64k bytes in size.
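The trick that lets the same binary run on any vector width is strip-mining: each loop iteration asks the hardware how many elements it will handle this time. A Python-level sketch of the pattern (hardware_vl standing in for what vsetvli would return):

    def vector_add(a, b, hardware_vl=8):
        """Strip-mined a + b: the same loop works whether the 'vector unit'
        handles 4 elements at a time or 4096 -- only hardware_vl changes."""
        out = []
        i = 0
        while i < len(a):
            vl = min(hardware_vl, len(a) - i)    # what vsetvli would grant
            out.extend(x + y for x, y in zip(a[i:i+vl], b[i:i+vl]))
            i += vl
        return out

    print(vector_add(list(range(10)), list(range(10)), hardware_vl=4))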
I know of one apparent implementation compatibility bug that's just been diagnosed. GeekBench, for whatever reason, is using the fairly new FENCE.TSO instruction in their code. A bit pointless, as I don't think there are yet any RISC-V cores that implement TSO memory semantics, so a FENCE RW,RW would do just as well.

The FENCE opcodes have a largish field, with the lower bits giving pred and succ R, W, I, O bits. In the base FENCE instructions the upper bits are all zero. Future FENCE instructions are supposed to be designed so that if a CPU ignores the upper bits they devolve to some slightly stronger standard fence. FENCE.TSO, for example, is FENCE RW,RW with one of the upper bits set.

It seems that the Alibaba C906 core (in the Allwinner D1 chip on the Nezha board) is not treating unknown upper bits in FENCE as if they were all zeros, as the spec says, but is instead raising an illegal instruction trap.
Fortunately, this can be worked around by adding a FENCE.TSO emulation handler (or, more generally, one for FENCE with any upper bits set) to OpenSBI, alongside the existing emulation of things such as misaligned loads and stores. These handlers can safely be present in the M-mode software for every CPU type, as they will never be triggered if the hardware handles those things directly.
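Conceptually the handler just has to notice "this is a FENCE with non-zero upper (fm) bits" and fall back to an ordinary fence. A sketch of that decode (bit positions as I remember them from the spec; this is not OpenSBI's actual code):

    MISC_MEM_OPCODE = 0b0001111

    def emulate_if_unknown_fence(insn):
        """If insn is a FENCE with a non-zero fm field (e.g. FENCE.TSO),
        treat it as a plain FENCE with the same pred/succ bits."""
        opcode = insn & 0x7f
        funct3 = (insn >> 12) & 0x7
        if opcode != MISC_MEM_OPCODE or funct3 != 0b000:
            return None                      # not a FENCE; normal trap handling
        fm   = (insn >> 28) & 0xf            # FENCE.TSO has fm = 0b1000
        pred = (insn >> 24) & 0xf            # PI PO PR PW
        succ = (insn >> 20) & 0xf            # SI SO SR SW
        if fm != 0:
            return ("fence", pred, succ)     # execute as the conservative base FENCE
        return None                          # ordinary FENCE, nothing to emulate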
Of course the trap and emulate performance penalty will be far far greater than any minor performance improvement from using FENCE.TSO instead of FENCE RW,RW on a (hypothetical at this point) machine that actually implements the TSO extension.
The RISC-V folks are in the process of ratifying standards for vector processing ("V") and lightweight packed-SIMD/DSP compute ("P") that should make it easier to expand compatibility in these domains. As of now, these standards are not quite ready and Esperanto are still using proprietary extensions for their stuff.
I really want one of these chips, but I think "here is our amazing chip; if you move your workload to our chip, then you'll be super duper fast" is a somewhat risky business model.
There have been so many of these, and yet we, DL practitioners, are still SOL trying to find NVIDIA cards for less than 3x MSRP. Not even gonna bother with this until I can buy it.
I'm not any kind of expert in the field, but trading single-chip speed for more chips surely has its downsides, which aren't mentioned in the article at all.
Read the article. It's about ML workloads, which scale well across many cores. It's also being compared to GPUs. The whole point of what they're doing is to be able to pack more cores than a CPU, but with a larger instruction set than a GPU core.
I like ML but it's not a very good language for this highly parallel HPC'ish stuff. We'll see how Rust does, it should be a lot closer to what's actually needed here.
Yes, but it’s meant to do ML inference, which can be parallelized decently. On those workloads, you can use GPUs, which are also composed of thousands of “wimpy” cores.
Sort of. GPU "cores" in the CPU space would be called SIMD lanes. Apples to apples GPU cores using the CPU terminology would put an Nvidia 3060 at 28 cores and a 3090 at 82 cores.
A full CPU is useful for decision intensive or time series intensive data. Normal ML inference is not necessarily either of those. You could have more complicated neurons (or just make normal compute tiles which they may be doing).
I thought the same thing back in 2015, considering the way GPUs supposedly handle branches with warps. However, my stock-trading simulator ran way better on 3 GTX Titans than on the Intel "Knights" many-core Phi preview I had exclusively been able to obtain. I was excited because it had something like 100 Pentium-derived cores on it and was supposed to be much faster than a GPU for logical code. Disappointment set in when the GPU stomped it performance-wise. I still don't even understand why, but I do know now that the whole "GPUs can't handle branching performantly" thing is a bit overstated. Intel discontinued the Phi, which I can only guess was due to its lack of competitiveness.
A standard way to handle branching in gpu code is with masking, like so (where x is a vector, and broadcasting is implied):
M = x > 0
y = M * f(x) + (1-M) * g(x)
So you end up evaluating both sides of the decision branch and adding the results. But this is fine if you've got a dumb number of cores. And often traditional cpus wind up evaluating both branches anyway.
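For anyone who wants to poke at it, here's a runnable numpy version of the same idea (f and g are arbitrary stand-ins):

    import numpy as np

    def f(x): return np.sqrt(np.abs(x))   # "then" branch, stand-in
    def g(x): return x * x                # "else" branch, stand-in

    x = np.array([-2.0, -0.5, 0.0, 1.0, 4.0])
    M = x > 0
    y = M * f(x) + (1 - M) * g(x)         # both branches evaluated, mask selects
    print(y)                              # same result as np.where(x > 0, f(x), g(x))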
> And often traditional cpus wind up evaluating both branches anyway.
That's actually really overstated. Evaluating both sides isn't really something CPUs tend to do; instead they predict one path and roll back on a mispredict. This is because the out-of-order hardware isn't fundamentally a tree, but is generally better thought of as a ring buffer where uncommitted state is whatever sits between the head and tail. Storing diverging paths is incredibly expensive there. I'm not going to say something as strong as "it's never been done", but I certainly don't know of a general-purpose CPU arch that will compute both sides of an architectural branch; they instead rely on making good predictions down one instruction stream, then rolling back and restarting when it's clear they mispredicted.
It's not even about the expense of implementing diverging paths in hardware.
This concept, of exploring like a tree vs a path was explored under the name Disjoint Eager Execution. You know what killed it? Branch predictors. In a world where branch predictors are maybe only 75% effective, DEE could make sense. We live in a world where branch predictors are far better than that. So it just isn't worth speculating off the predicted most likely path.
What killed it was more the effectiveness of Tomasulo-style out-of-order machinery, and the fact that the real problem isn't control hazards but data hazards. DEE was conceived in a day when memory was about as fast as the processor; that's why it's always being compared against cores like the R3000.
Sure, but this is a coprocessor on an expansion card, similar to a GPU. I've worked on a few systolic algorithms and this kind of chip has massive potential in that space. TPUs have been a big letdown in that regard, as they don't even have the comparison operation needed for the matrix-based shortest-path algorithm.
It would not take much to acknowledge them; just a "fastest RISC-V chip in many-core workloads" would go a long way.
I personally think those chips would be an absolute monster for solving MILP problems as they tend to have enormous parallelism and a lot of linear algebra (in particular simplex iterations).
However, there is no hype to attract funding in Mixed-Integer Linear Programming, so machine learning it is.
Well, it really depends on the computational intensity your algorithm needs. I've stumbled upon things of beauty porting things to GPUs, especially if you're going to perform huge amount of operations based on a very small amount of data. As long as you don't have too much intermediate data, register spilling, etc. these GPU things do fly. They're also very impressive on NN-based workloads... Even something 2 or 3 gens behind can be game changing, with some optimization effort. Tensor libraries leave a lot on the floor to pick up, especially if you're not using the canned 'competition winning' networks.
The nature of the workload certainly matters a lot, and for a lot of the work GPUs do, memory bandwidth isn't always the limiting factor (though it often is; that's why consumer-grade GPUs have 12 GB of RAM and beefier server-grade GPGPUs/TPUs have 40+ GB of memory).
Data structure and locality matter a lot for GPGPU programming. While PCI Express is really fast, it has a lot of latency and limited bandwidth.
I believe the applications are mostly what you describe. You take a very general algorithm like a NN or raytracing and scale the model and/or data.
For NN, it’s quite easy to use a huge amount of memory by making back-propagation chains huge, which is natural for deep models or a recurrent architecture. The model doesn’t have to be huge in a conventional sense; it just has to maintain state e.g., the recurrence.
So, a big image model doing video classification/segmentation (e.g., for self-driving cars) is probably the ideal combination for high memory consumption.