It's so unusual to find projects like this. I have so much respect for these guys. It's always fun to see what they're working on. Lots of the technical specifics are over my head, but I really appreciate their tone and approach.
Seems like a _fascinating_ architecture. I don't get the double program counter bit, also seems like too much cognitive overhead to me. The belt (queue of un-named slots) idea and function call for free with multiple return results, such that regular ops and user defined functions work the same way, that sure is clever.
Godard mentions that the CPU space has seen little to no innovation. But what about Transmeta? They were doing some cool stuff, no? And the DEC Alpha, that was super interesting. Probably lots more besides. I'm sure if I bothered to check that Wikipedia would have a CPU graveyard page :) Thing is, x86 has been the only game in town for a long time with Motorola challenging in the past and ARM challenging now. Maybe time for a three horse race?
Alpha was amazing, and the DEC chip people were best-in-the-world talented. If DEC had made some smarter choices about personal computing we might have had 10 or 20GHz desktops now.
Otherwise, CPU design is almost the poster-child for technical debt. The problems are more social and economic than technical. Once an architecture dominates the market it's almost impossible to replace it with something better. The best hope is to do what ARM are doing and sneak up on the market sideways, starting from a segment with much less competition.
I don't mind admitting that I _lusted_ after a PC powered by an Alpha chip. All I knew is that spec-wise they seemed to blow Intel and Motorola out of the water. They were not too much pricier either. But alas, I was young, had no real £ of my own, so I could only dream. Eventually got an Atari STfm with a Moto 68k as that is what my folks could afford.
Once something is baked into a CPU design that seems to be it forever alright, I take it that that's what you mean by technical debt. Besides, everybody up the stack (and that means everybody because you can't get much lower down the stack than the CPU) has got to re-tool. Mill Computing have been working on this for 12 years! Transmeta had Linus Torvalds for crying out loud. Best-in-the-world does not cut it when it comes to CPU design. Compatibility and small incremental logical changes seems to trump all. Or do what ARM are doing, as you say.
Intel's own Itanium was also very interesting and ultimately a failure. memristor is going to take over the world a year from now, just like it has been going to for the past 7 years.
Exotic computing (CPU and/or memory) architectures are easier (still hard, but easier) to dream up than to implement on real hardware with a real software ecosystem (while still keeping the theoretical benefits).
As a tech geek I think what they are doing with the Mill is pretty fascinating, but it "looks like a duck, quacks like a duck" in the way of transmeta, alpha, itanium, etc, and I'm old and jaded so I'd honestly be pretty shocked if anything came of it.
Well, part of their strategy is to be able to much more rapidly design a wide variety of specialized parts, such as their ability to generate the specification computationally in 15 minutes. They recognize that the needs of the market are changing rapidly, and they have to be able to meet any of them.
Hopefully this lets them pivot and take advantage of new opportunities better than previous efforts to make a superior CPU.
This quick & easy specialization might be a part of what could give them market share, but I don't think this is their best shot.
While this "IoT" might be the most stupid buzzword I've ever heard, what it stands for is true: everything will be networked in the future. The only thing that matters here is that devices can talk together, and this might be their biggest shot. These days micro-controllers are already being replaced by cheap ARM's that run circles around them and are easier to develop for. This is an area where the semi-conductor industry will grow massively. If they're able to prove that their design offers the same processing power for less power, a whole lot of companies could be interested.
Other way around really. AMD licensed a front side bus from DEC/Compaq. The chip designers all ended up at Intel. The closest living descendent of the Alpha 21264 is Skylake.
As to finding instruction boundaries with a variable-length instruction format: you could use the trick they use to start two instruction streams from a single address by using it for a stream of fixed-length instruction sizes running down that can be used to find the border in the stream of variable-length instructions running in the ’normal’ direction.
You could have:
offset : -----000000000111111
offset : 54321012345678901234
contents: 64212iijkkllllmmmmmm (digits are instruction lengths, letters are instructions)
So, reading offsets -1, -2, -3, etc. you read off the sizes of the instructions you can read in offsets 1, 2, 3, etc. If you have 4 decode units, you can replace the sizes of the instructions by their offsets, resetting the offset every 4th instruction.
With instruction sizes of 1-4 bytes, you could store instruction sizes in 2 bits per instruction. That would lower the pressure on your cache for the extra information you store. There also would be more space in your instruction set because you no longer would have to worry about creating an instruction that has a prefix that also is a valid instruction. For example, you could use 0x01 both for INC and ADD IMMEDIATE VALUE, where it means the former if the instruction length is 1, the latter if it is longer, with the size of the immediate determined by the size of the instruction.
Would that be a net win? I wouldn’t know, but it seems simpler (but ’simpler’ looks like a negative in their worldview) than creating those two streams that somehow must work together to do what a single statement sequence in the source code does.
The twin instruction streams have the advantage in that the instructions are split between two completely separate caches that can have reduced latency relative to (total) capacity, bringing benefits in the case of branch mispredict.
I don't understand why they don't use simply a prefix-free code, for example the Huffman code. You can have arbitrary length instructions at entropy-optimal efficiencies that way, and decoding in binary-tree fashion should be linear and very fast in hardware.
Could a prefix-free code work with all the instruction arguments like arbitrary 64 bit immediates? Perhaps you could have a separate argument stream but you still need to parse all the prior instructions to get the offset.
Not my area of expertise, but in principle any set of instructions can be encoded into a prefix-free code (e.g. a binary argument just expands the set of instructions by a factor of 2). If both arguments have equal probabilities this simplifies to simply appending the argument as a prefix. If the probabilities are unequal, you can save bits (entropy) by bundling frequently occurring arguments together in the Huffman scheme, at the cost of increased decoding complexity. I heard bandwidth/cache sizes is a much greater problem than circuit area though.
You are right that you need to parse all previous instructions to get an offset. Again, this is far from my area of expertise (first time I read about CPU design), but it seems this problem could easily be alleviated by slight modifications to Huffman codes, for example by setting "blocks" of instructions of certain sizes and assigning a prefix to each block containing the number of instructions in each -- you then can get offsets very quickly.
Anything is possible. But consider: Intel length-encoded opcodes for the 8086. Decades later, turns out they made bad choices. The shortest instructions (xchg r1, r2) are NEVER used. Compilers have changed; nobody codes much ASM any more. Its a crap-shoot.
Linear decoding is too slow. The talk is about efficient parallel decoding, which requires having decoders at every potential alignment and discarding the improperly aligned outputs after-the-fact.
I don't get what you mean by "linear is too slow" -- if you want to decode n bits you need at least O(n) time. Decoding every potential alignment seems equally silly since even in parallel processing each branch will have to performing at least one decoding task, so what exactly are the gains?
By quadratic you mean O(2^(2n)) physical space, right? By a naive approach (making decoding like a ROM memory), I still can't see how it's not at least linear on the number of bits even assuming simplifying things for circuit sizes.
Quadratic is O(n^2), not O(4^n). n is the number of bits.
The bits are decoded in parallel, ergo. in constant time. There may be a linear step afterwards (eg. numbering each instruction result) but the actual decode happens in parallel.
I'm a bit confused about why you think it's linear time. (I suppose technically it even has to be constant time, since the number of bits is capped at a low constant, but I suppose that's also besides the point.)
It's not actually quite that simple, since it's actually a 3-phase decode and there is some interdependence. However, the point remains that shifting the read along one instruction at a time would be too slow and parallel decode is key to fixing that.
I suppose you might instead be asking why it's not linear hardware space complexity (rather than quadratic); the answer to that is a little more subtle.
I don't really understand your solution, but then again I did not understand the Mill's solution. Can you explain it again?
All I will say is that they seem to have really tried to simplify and regularize the Mill _except_ in the opcode stream / decode step where they say that two streams is optimal! Seems like it would be a nightmare and introduce all sorts of complexities. Was the only bit of the talk I didn't grok, whereas most others bits had me nodding my head and thinking "oh, neat".
(Mill team)
The streams are not streams of instruction; they are streams of half-instructions. The Mill is a wide-issue machine, like a VLIW or EPIC; each instruction can contain many operations, which all issue together. Each instruction is split roughly in half, with some of the ops in one half and some in the other. On the Mill, the split is based on the kind of operation and the info that ikt needs to encode: all memory and control flow ops on one side, all arithmetic on the other, although other divisions are possible.
Each half is grouped with the same half of other instructions to make a stream, and the two streams decode together, step by step, so each cycle one instruction, comprising both halves, decodes and issues.
The result is to double the available top level instruction cache, and cut in half the work that has to be handled by each of the two decoders.
OK. Here's a simpler way to describe it: if every instruction starts with a bit sequence indicating the length of the instruction, you only need to read those bits to find instruction boundaries, but you still need to linearly scan the instructions to find, say, the 6th instruction.
With this two-stream approach, you can split the bits that describe the instruction lengths from the bits that are the actual instructions, and move the former into the downward-moving 'CPU counter'. That way, you have a simple array containing those bits. That allows you to find the length of an instruction or, alternatively, its offset within its group in O(1) fashion.
From another comment, it seems that doesn't solve the main problem, which is that, with variable-length instructions, you need the ability to shift about any contiguous sequence of bits from a cache line (worst case: cache lines) into an execution unit. That wouldn't be that hard, if it didn't have to be very fast, too.
That doesn't really solve the problem, and has several deficiencies. A split stream might seem complicated, but IMHO it's an amazingly clean way of simplifying the problem.
---
The problem is not figuring out what sizes things are, but the routing. When you have 30+ instructions a cycle, you need to route a lot to a lot of different places. You also can't do a parallel read since the size of previous instructions affects the placement of the next instruction - you will eventually fix this with routing but it's not easy.
Consider a 30-wide instruction. Decoding it involves knowing where in the instruction stream it starts. This can only be known once the prior instruction is partially parsed.
Consider its later half - this can only be read after the first half is read. In your case, you actually only need to read the 2-bit prefixes but the difficulty remains.
With a split instruction stream, since there are two independent (at least until a branch) program counters so you have twice the contextual information. You already know where the second half starts from the prior instruction.
One more great thing about this is that the contextual information is preserved in jumps, without increasing the size of jump instructions. Even if prior instructions had metadata about splitting the next instruction for parsing (at yet more space wastage), this would be lost across jumps. That would worsen jump latency, which is a common bottleneck.
The above point is way more important if you expect a flow instruction almost every bundle, due to the massive ILP the Mill has.
---
> With instruction sizes of 1-4 bytes, you could store instruction sizes in 2 bits per instruction.
However, the Mill does not constrain its instruction sizes to multiples of 8 bits. This means either you'd lose that bonus or you'd need even more bits.
The Mill also splits which instructions are allowed in each half. Not only does this give you a free implicit bit, it again halves the amount of work for each decoder. This could probably happen with interleaved halves like you describe, but not without difficulty.
> For example, you could use 0x01 both for INC and ADD IMMEDIATE VALUE, where it means the former if the instruction length is 1, the latter if it is longer, with the size of the immediate determined by the size of the instruction. Would that be a net win?
The answer is no. The instructions codes are generated to be space optimal. I'm not sure how this is done, but it has been stated. They also leave it open for metadata like you mention to be included in each instruction stream, and in fact their blocks do have headers.
> creating those two streams that somehow must work together to do what a single statement sequence in the source code does
Given the instructions are disjoint and they only need synchronization at jumps, I would be very surprised if this was problematic.
"The problem is not figuring out what sizes things are, but the routing"
Aha! Thanks.
"Given the instructions are disjoint and they only need synchronization at jumps"
But they are (AFAIK) executing a single thread of execution, so creating those disjoint streams in general will be far from easy, won't it? if it were possible to do that, we would see a 2x speed up of single-threaded code on CPUs with two cores.
Well, I was actually trying to say the decoders work independently and produce disjoint subsets of the instruction stream. How the calculations actually get done is harder, but should be largely independent of how the instructions are encoded.
One great thing about the Mill is that instructions within a bundle are mostly disjoint. The Mill has no data hazards and each instruction takes from the belt, which is not changed until the bundle is finished. There are "gangs" that allow combining multiple instructions, but I don't know how they are handled.
In general, this means a Mill has more instruction interdependence than an x86. Although an x86 could have a split-stream instruction cache, data dependence prevents a traditional architecture from parallelizing itself as much as the Mill and hazards prevent the kind of speculation the Mill can do. Since instruction decode isn't a big bottleneck on fixed-width architechtures, there's no real pressure to innovate there.
I know this architecture tries to innovate everywhere, but disjoint streams that share registers, but not caches?
I haven't read all documentation on the Mill, but I would expect that that sharing part means that both streams can read or one can write each register.
Each Mill core has two separate physically separated instruction caches that each feed a dedicated decoder, which decodes for a distinct set of pipelines.
I apologize for my mistake, I hope that it hasn't induced too many confusions (I take it that 'distinct set of pipelines' means distinct operation and 'belt' so the registers aren't shared).
The pipelines are distinct, but the set of ops that any particular pipeline supports is arbitrary (as in picked by the HW guys who have to fit the FUs in) and the belt is shared. The belt is where the two sides join.
We tend to get good entropy putting the flow on one side and exec on the other, and that's nice for humans to model too, but there is no technical requirement.
The thing about the Mill is its easily customisable. If you have some new FU (lets say you want a particular op that assists in say encrpytion) you can choose where you put it, and how many pipelines you put it in etc.
They've publicly stated that they intend to produce their own chips, though. Regarding licensing the architecture (a la ARM), Ivan said during a talk that Intel's quarterly dividend was as much as ARM's yearly revenue. So I don't think that selling the entire company is even on their radar right now, unless you're really cynical and doubt that declared intention.
It's a middlebrow dismissal. It should be downvoted. That's not sycophancy, it's just Hacker News' losing battle to keep the conversation stimulating and worthwhile.
I guess if your definition of "stimulating and worthwhile" is "uncritical fawning", then I can understand your support of downmodding anything critical. Thanks for the confirmation of my original suspicion.
I agree - an entirely new architecture with nothing to speak for it (no FPGA code to try it out, no production chips, just paper this, paper that). I'm much more impressed by Risc-V - it exists in a format you can actually use.
Yup...the RISC-V folks have a working, open source, ~1GHz processor. That runs real software. Today.
But what do they know?
Mill has...videos. And patent applications. A strong social media presence. And a simulator that they really, really promise you'll be able to use real soon now and is going to totally own everything. And they're going to fab the chips themselves, long after everyone but Intel figured out that was stupid. Or something.
Because that sort of "we have the off-the-wall architecture that will change fucking everything in computing forever" attitude worked so well for the iAPX432, Transputer, Rekursiv, i860, everything Chuck Moore has ever done, TI9900, most (not all) things called VLIW, etc." Right?
Best of all? A completely uncritical, fawning audience on HN.
As I understand, they aren't seeking venture capital backing, so progress is slow. I know they have (had) plans to implement a Mill on an FPGA, but even going to a usable ASIC will probably be a challenge.
It's a major architectural change as well. Most existing systems rely on virtual memory, which the Mill doesn't support.
> Most existing systems rely on virtual memory, which the Mill doesn't support.
I believe you're mistaken. The Mill does support virtual memory; the main differences to most systems are that its caches use virtual addresses (the TLB is between the L2 cache and the DRAM), and that it's a single-address-space architecture.
ARM did that with L1 cache in the '90s and it was terrible for targeting a VM OS to. Cache aliasing is a nightmare and a never-ending source of bugs for e.g. shared memory.
It's very likely that they will need to write their own OS from scratch for the Mill. They might be able to reuse a lot of code from Linux, but it won't be Linux, as the whole ethos of the Mill is to rethink everything.
In particular the Mill does not have the same notion of a process, the CPU itself is aware of threading, it has its own security design that isn't anything like user/supervisor mode, there are no syscalls, and programs have to be specialised before they can be run, a la ART on Android.
I suspect once they have LLVM up and running, one of the next biggest wins they'll want to go for is porting HotSpot. The only other company I'm aware of that was able to actually make money selling a really crazy custom CPU architecture is Azul with their Vega arch, and they did it by making Java bytecode their exposed bytecode. The actual CPU opcodes were an implementation detail and everything compiled on the fly.
The nice thing about porting the JVM is tons of business-critical, performance sensitive code would Just Work™. And then you have a ready made market for companies with huge Java (e.g. Hadoop) jobs that are very power/performance tradeoff sensitive.
Is VEGA an actual discrete CPU architecture or something they've licensed/copied from existing chips?
Azul made their money from selling high performance JVM's they have their "VEGA" compute appliances for Java applications but I've never actually seen their ISA documentation for the hardware which would be a very interesting thing to see....
All of the Mill's cache, not just one level, is virtual SAS.
Further, the Mill has no false aliasing, consistent caches, is always sequentially consistent and has no aliasing races. Some of this is covered in http://millcomputing.com/wiki/Memory.
The more criticisms I hear of this architecture, the more impressed I get.
If it's single address space, I wouldn't call it Virtual Memory. That sounds more like what the Blackfin had. I'm going to watch their memory talk today (when I can get a free 90 minutes).
[edit]
Reading the Memory page on the wiki, it is indeed SAS with an MPU. Obvious difference from Blackfin is that you have 2^28 times as many addresses available.
It would also be interesting to see how much smaller a PLB entry is compared to a typical TLB. TLB misses are responsible for a surprising amount of performance loss on some workloads as L2 caches have gotten so much larger, so if they can have e.g. an order of magnitude more entries in the PLB it would be a big performance win.
[edit3] It just occurred to me that your PLBs could be as slow as your L1 cache with little impact on performance. This can also let them be physically larger than your TLBs.
Assuming the ISA is designed for efficient PIC/PID a unix-like shouldn't be too terrible to port from that difference alone. I know there have been SAS unixes before.
I have run into the 46-bit userspace limit on linux AMD64 (think mmap), so I'm somewhat concerned 64-bits may not be enough in the future as disks get bigger, though that's a long way off as you have 2^14 times that room under this scheme.
Afraid this answer is quite late, but hopefully it finds the curious passing through anyway :)
The really big gains from splitting protection from translation are twofold:
First, you can do the protection lookup in parallel to the cache lookup, rather than before it. This allows the PLB to be much bigger than a conventional TLB could ever be.
Second, the PLB entries do not need to be fixed power-of-two sizes. Each PLB entry has a start and stop offset so protection can be as granular as protecting one byte. In general, though, a single PLB entry will be quite large, covering the whole mapped region.
The Mill still has a TLB but it sits between chip and the off-chip memory. This allows it to be larger than a conventional TLB, as it taking slightly longer will not have the same impact. Additionally, as a small implementation optimisation, the TLB on the bigger Mill family members will do a lot of things like allocating-from-a-pool dynamically, so you only trap to the OS when things get dire.
The Mill address has a special bit to signify if it is a global address, or whether it is "local" to the process (using a special register which is XORed into the address). This allows the Mill to fork() and all the other things that traditionally have been hard on a SASOS.
We are porting Linux, and it will be Linux. Only a very very small part of the HAL will need to know anything Mill-specific :)
> Second, the PLB entries do not need to be fixed power-of-two sizes. Each PLB entry has a start and stop offset so protection can be as granular as protecting one byte. In general, though, a single PLB entry will be quite large, covering the whole mapped region.
This seems orthogonal to the choice of PLB vs. TLB. You could make a TLB with a start and stop offset if you really wanted to (though I suppose the increased size per entry would be problematic with TLBs needing to be faster than PLBs (which only need to be as fast as an l1 cache hit).
> The Mill address has a special bit to signify if it is a global address, or whether it is "local" to the process (using a special register which is XORed into the address). This allows the Mill to fork() and all the other things that traditionally have been hard on a SASOS.
Would shared-memory (in the unix sense) be considered global or local? Would read-only text sections be marked global to improve fork performance? I assume CoW is out-of-the-question for non-read-only sections and fork()? If there's a wiki page relating to fork, just point me to it :)
> We are porting Linux, and it will be Linux. Only a very very small part of the HAL will need to know anything Mill-specific :)
The feasibility of this seems obvious to me, as I've seen linux run other SAS platforms that are far more constrained than The Mill.
The Mill has non-paged protection in the PLB and paged translation in the TLB.
It is true that a 'conventional' could have non-paged translation, but non-paged translation would be slow. Really slow. Shudder to imagine how it would work internally :)
Re SAS:
Conventional machines are Multi-address-space (MAS). In a MAS, addresses can alias each another. Not so on the Mill (at least, if you deliberately alias in the Mill TLB you'll have to know that reads and writes will merge as they go off-chip etc; tricky, but can be useful in select low-level scenarios).
On the Mill, everything that is shared between processes is global. On a Unix port, everything that is local to a process is 'local'. So MAP_SHARED gets addresses without the local bit set and MAP_PRIVATE does.
When we fork(), we have some choices. The OS picks a new local offset for the new process, meaning that local pointers don't alias one another.
We could straight away clone the protection entries and page tables.
Or, as an implementation detail, depending on the extent to which the PLB self-manages on the particular Mill member, just let the PLB trap. As the new forked process accesses memory ranges for the first time the PLB will miss and there will be the classic trap. In the trap, the handler can set up COW lazily and for the memory regions that are actually touched by the child.
I am simplifying it a bit ;)
Most fork() is just going to exec(), and the latter will win. Some people are still using fork() in other ways, and they may benefit from an aggressive COW up front on fork(). Swings and roundabouts.
> If it's single address space, I wouldn't call it Virtual Memory.
The term "virtual memory" refers to a dynamic mapping between physical and virtual addresses. SAS doesn't preclude virtual memory; you can still have many of the advantages of virtual memory without multiple address spaces (for example, memory mapped files).
I know right? Even a crappy CPU based on the Mill concept would be good. You could at least perform computational efficiency (ops/watt) style benchmarks. I'm sort of curious what the endgame is, myself.
I'm a little confused about how the "pointerness" of pointers is getting lost in LLVM. The mill can't be the only architecture that encodes extra data in the high bits of pointers right? The 64 bit objective-C runtime and Xbox 360 come to mind as systems where this is done. Clearly LLVM can generate good code 64 bit Objective-C... Am I missing something?
Seems my download was truncated at 2485374429 bytes, a mere 137934 over the 2 GB mark for some reason. I'll fix it and post the new link. EDIT: The link was updated.
Those CPUs look like having a huge potential, but there is no really working compiler for them. In my opinion, they could try to engage Fabrice Bellard, the genious behind QEMU and TCC. Or people with experience on Mali GPU architecture, which shares many points with the Mill CPU approach.
Can you elaborate on the common point with the Mali?
You may not know that Ivan has some compiler background, he is not really a newbie here! And the main problem right now is not a crappy compiler but no silicon...
I would fix the silicon problem last, if I were them. Once manufacturing begins, the company will be bleeding money and customers better not be waiting for something else before buying.
They'll get there. They aren't burning through money, they don't have any: they're not a properly formed company, just volunteers. It'll get done when it gets done.