It appears to be for the Myricom LanAI processor, which is available to others. It's an older 32-bit RISC processor, used as the offload engine for their NICs.
Sounds like it's not particularly interesting unless you want to write your own network offload code.
Edit: Probably just the offload processor they're choosing for their in-house routers/switches/servers. And they probably want their own firmware, either for security/NSA reasons, for performance, or both.
The page states:
"""
A generally accepted rule of thumb is that 1 hertz of CPU processing is required to send or receive 1 bit/s of TCP/IP.[3] For example, 5 Gbit/s (625 MB/s) of network traffic requires 5 GHz of CPU processing.
"""
Question for people in this line of work: is this accurate / reasonable?
The generally accepted rule of thumb is that 1 bps of network link requires 1 Hz of CPU processing. Figures 11, 12 give a full story of this rule of thumb (where Hz/bps ratio = %CPU utilization * processor speed / bandwidth). It had held up remarkably well over the years, albeit only for bulk data transfer at large sizes. For smaller transfers, we found the processing requirement to be 6-7 times as expected. Moreover, the figures show that network processing is not scaling with CPU speeds. The processing needs per byte increase when going from 800MHz to 2.4GHz. This happens because as CPU speed increases, the disparity between memory and I/O latencies versus CPU speeds intensifies.
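As a back-of-the-envelope check of that ratio, here's a minimal sketch; the utilization, clock, and throughput figures below are made up for illustration, not taken from the paper:

    # Hz-per-bps ratio as defined above:
    #   Hz/bps = %CPU utilization * processor speed / bandwidth
    def hz_per_bps(cpu_utilization, clock_hz, throughput_bps):
        # CPU cycles consumed per bit/s of network traffic
        return cpu_utilization * clock_hz / throughput_bps

    # e.g. a 2.4 GHz core at 40% utilization moving 1 Gbit/s of bulk TCP
    ratio = hz_per_bps(0.40, 2.4e9, 1e9)
    print(f"{ratio:.2f} Hz per bit/s")  # ~0.96, close to the 1 Hz/bps rule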
Does anyone know if memory latency is still causing problems in common implementations (like Linux and FreeBSD)? I would think that the parts that are the bottleneck could be rewritten with that in mind and gain quite a bit from it.
TSO really turbocharged bulk TCP transfers, so now 2 GHz can drive 10 Gbps as long as you're sending >=64KB chunks. This has made performance brittle, because 10 Gbps of small packets requires 10x-100x as much CPU as 10 Gbps of bulk traffic. Also, receiving requires more cycles than transmitting.
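To put rough numbers on that brittleness, here's a sketch of the per-packet event rates involved (the 10 Gbps link and the frame sizes are illustrative, and Ethernet preamble/inter-frame gap overhead is ignored):

    # Per-packet (or per-TSO-chunk) event rates needed to fill a 10 Gbps link.
    # With TSO the host hands the NIC ~64 KB chunks and the NIC segments them,
    # so the host-side event rate is tiny compared to a small-packet workload.
    LINK_BPS = 10e9

    def events_per_sec(bytes_per_event):
        return LINK_BPS / (bytes_per_event * 8)

    print(f"64 KB TSO chunks (host side): {events_per_sec(64 * 1024):>12,.0f} /s")
    print(f"1500-byte MTU packets:        {events_per_sec(1500):>12,.0f} /s")
    print(f"64-byte minimum frames:       {events_per_sec(64):>12,.0f} /s")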
Indeed, it's common to see DDoS protection offerings defined both in throughput (e.g. 50 Gb/s) and packets per second (e.g. 30 Mpps), which results in a bottleneck in packet size (e.g. ~200 bytes) at high throughput.
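The crossover size falls straight out of dividing those two advertised limits (using the same example figures as above):

    # At what average packet size does a 50 Gb/s + 30 Mpps offering become
    # pps-limited rather than bandwidth-limited?
    throughput_bps = 50e9  # advertised throughput cap
    pps_limit = 30e6       # advertised packet-rate cap

    crossover_bytes = throughput_bps / pps_limit / 8
    print(f"~{crossover_bytes:.0f} bytes per packet")  # ~208 bytes
    # Below ~208 B average packet size the 30 Mpps cap binds first;
    # above it, the 50 Gb/s cap does.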
I'm very biased against “approximately X per Y” statements in general: just about any non-pathological curve has a linear tangent somewhere or other, so unless the intervals over which that proportional approximation holds are stated, it's a pretty haphazard means of estimation.
Also, what kind of physical/logical process and attendant costs is it encoding? Does one per second of anything require one per second of something else? What if the processor were half or double the bit width?
Just now I was working on integrating and optimizing networking for some microcontroller-based software (using lwIP). In the end, I reached 7 MB/s send and 10 MB/s bulk receive speed.
I just did the calculation. It says for 10 MB/s (80 Mbit/s) I'd need 80 MHz. The chip runs at 84 MHz. So, it's pretty close :)
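For anyone following along, the arithmetic is just the rule of thumb applied to the receive figure (84 MHz is the clock stated above; the MB-to-Mbit conversion is the only step):

    # 10 MB/s of bulk receive, under the 1 Hz per bit/s rule of thumb.
    receive_bytes_per_sec = 10e6
    required_hz = receive_bytes_per_sec * 8 * 1.0   # 1 Hz per bit/s
    print(f"{required_hz / 1e6:.0f} MHz needed vs. 84 MHz available")  # 80 MHz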
There are other things you can offload now as well. Search for info on DPDK, Intel QuickAssist, etc. These sorts of things allow off-the-shelf hardware to displace expensive, proprietary ASIC-accelerated routers, firewalls, etc.
I thought this too, but someone in the email thread explicitly mentioned the myricom, and the _response_ to that message was that it was purely internal hardware, not useful for others...so it may just be a coincidence.
It certainly seems believable that this is a third-party processor architecture, but Google has a contract with them to build specific models of that processor that meet their needs (possibly alongside some other proprietary hardware, like a high-speed NIC), and those models aren't sold to the general public. That's pretty common outside x86, right? For instance, is there a way for me to buy a BCM2835 other than by buying a Raspberry Pi?
If you search a little, you'll find that Google hired Myricom's CEO, Founder/CTO, and several engineers. I suppose it's possible they licensed the IP too.
For Google, sure. Everybody working with the in-house hardware could do so without hunting for a version of the internal patch/fork. Using the fork would surely be easy enough when directly using LLVM, but the deeper it is hidden within a dependency tree, the more friction it adds to whatever the user at the root of that tree is doing.
Everybody else pays a small "maintenance tax" when working on the codebase or relying on work being done on that codebase.
Being by no means a compiler expert, I do suspect, however, that this "tax" is tiny and likely to be dwarfed by Google's other contributions, so letting them solve distribution of their private backend by piggybacking on the public release is most likely the right course of action.
Maybe they just don't want to keep porting their backend code forward for every LLVM snapshot they take, because that can be a tedious thing to do compared to just having it in origin/master forever for free.
Without the ability to test, though, they will still have to do a bunch of work to be able to use a new release. Possibly the release breaks their backend, but nobody will know until Google tries to use it.
Suppose the LLVM devs change a function signature used by Google's private backend. If the backend is private, Google will have to update their usage of that API when they do a merge. But if the backend is present upstream when the LLVM devs make that change, then the LLVM devs are responsible for making sure all supported backends update their usage of the API.
But as people have noted, if you don't have the hardware, it's impossible to test that any changes actually work.
If we're just talking about pure refactoring that doesn't change any output, you could test that the generated machine code is identical. But then you have to ask, why isn't there a stable API rather than all this refactoring churn?
I guess this is just the way LLVM and Clang are designed -- all components really tightly coupled together. And it's a successful project so it must be working out for them. But...!
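A minimal sketch of that kind of identical-output check, assuming you have llc binaries from before and after a refactor and a directory of .ll inputs (the binary paths, target name, and test directory are placeholders):

    # Compare the assembly generated by two LLVM builds for the same inputs.
    import glob
    import subprocess
    import sys

    OLD_LLC = "./llc-before-refactor"
    NEW_LLC = "./llc-after-refactor"
    TARGET = "lanai"  # any in-tree backend name accepted by llc's -march

    mismatches = 0
    for ll_file in glob.glob("tests/*.ll"):
        outputs = []
        for llc in (OLD_LLC, NEW_LLC):
            # Emit assembly to stdout so the two runs can be compared directly.
            result = subprocess.run(
                [llc, f"-march={TARGET}", "-o", "-", ll_file],
                capture_output=True, text=True, check=True)
            outputs.append(result.stdout)
        if outputs[0] != outputs[1]:
            mismatches += 1
            print(f"codegen changed for {ll_file}")

    sys.exit(1 if mismatches else 0)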
"If we're just talking about pure refactoring that doesn't change any output, you could test that the generated machine code is identical. But then you have to ask, why isn't there a stable API rather than all this refactoring churn?
"
LLVM deliberately does not want a stable API. It wants people to keep up with trunk.
They do this because they saw what has happened with other compilers, where the stable API became literally impossible to change over time.
This is one of the reasons GCC still has a crappy backend. You either have to build a new API and port everyone over, or you have to find an incremental way to change an API interface with hundreds of random interface points.
That is definitely not true about needing the hardware. There are loads of compilers for fictitious hardware and they work perfectly well. There was an entire x86-64 tool chain before anyone ever manufactured one of those.
So I'd venture to presume that there's an extensive battery of tests for each target processor. It seems unlikely to me that developers on the LLVM/Clang team all have physical access to every target CPU.
Looks like there's consensus to accept it as an "experimental" backend. I think that means the code is in tree, but other people changing LLVM aren't obligated to keep it working.
It may be accepted as a full backend in the future, but that discussion is deferred since people are happy with it being experimental for now.
GCC has done it for years.
For LLVM mainline, often, for small and simple architectures, having those people contributing to mainline is much more valuable than the cost of making mechanical API changes in that backend when codegen APIs are changed.
(Once you have more than ~5 backends, which LLVM does, the cost of doing the latter just doesn't really change much.)
If you want to build a real community, turning away contributions likely to lead to an overall net positive for the community tends not to be a good approach :)
(I await the arguments about corporations having no care about communities or whatever else)
The difference between hardware and software is getting fuzzy (Transmeta, NVIDIA Denver, ...), but there's at least one (admittedly terrible) FPGA implementation of a subset of MMIX large enough to execute small graphical demos: https://github.com/tommythorn/fpgammix
Ordinarily I'd agree with you. Except the tradeoff of having Google's other contributions to the main branch is probably worth it for a little bit of code that isn't harming anyone else.
It's perhaps not fair, but just a reality that a company like Google has a lot to offer so they can sometimes get special treatment.
Having been in this position before, I can tell you that maintaining large patch sets increases friction against submitting anything upstream. Even if this was a Google-only CPU (it's not), that'd still be an argument for merging upstream.
Nothing too interesting, as it doesn't use IBM's 7nm or Intel's 10nm chip tech.
Just a simple but parallel high-speed network chip, as used in the Myrinet network cards. The old ones ran at 33 MHz but with very low latency.
Really exciting would be a Power8 based on IBM's new 7nm process, which would finally blow away Intel's advantages with a fully open (and unbackdoored) design.
> Really exciting would be a Power8 based on IBM's new 7nm process, which would finally blow away Intel's advantages with a fully open (and unbackdoored) design.
Nothing about the POWER8 or IBM's process technology implies unbackdoored. I can't inspect the factories, or their supply chain, or the HDL they used, or the tools that processed the HDL. The only thing more "open" about POWER8 is firmware in some deployments, and maybe licensing the ISA (if you have enough clout/money to join the foundation, I can't find any licensing information at all). RISC-V is more interesting in every way, with respect to openness.
Lanai is a simple in-order 32-bit processor with 32 x 32-bit registers: two registers with fixed values, four used for program state tracking, and two reserved for explicit use by the user. It has no floating point support.
Might as well be a MIPS. The fact that Google has supposedly developed its own CPU is interesting, but the architecture itself seems quite mundane.
Edit: Comments on the article suggest it's the Myricom LANai, a NIC-embedded processor. Google may just happen to have these NICs in their machines and want to write firmware for them.
Must be funny working at Google and knowing the inside scoop (watching people speculate). So, here is my wild speculation about a similar secret project. There was a talk given by Dick Sites (from Google) [1, 2], where he talks about performance monitoring across the Google fleet. It's very technical and useful if you are into monitoring (highly recommended). A large part of his talk is dedicated to issues and limitations of off-the-shelf CPUs. Given that AWS has custom chips (talked about at re:Invent a few years back), why wouldn't Google solve this issue too? They have the talent and money.
It's already been posted in this thread that it's just a smart NIC evolved from existing Myricom NICs. (Of course, my coworker was just telling me a story about how John Cocke hid his RISC processor in a printer project.)
And Amazon's "custom" CPUs are 100 MHz faster Xeons.
But at least for Lanai, it looks like a CPU used just internally, not something that is going to power the next generation of Android devices.
How many times have we seen a huge corporation do something in house, then release it for public consumption several years later? At this time, it will probably stay in house, but if there is a revenue opportunity, trust me, they'll release it.
Google is also a member of the OpenPOWER Foundation (though they've done much more with that). I don't think sponsorship is necessarily endorsement or deployment plans -- it'd be nice if that were the case though!
Is Google planning on eventually releasing this architecture publicly or maybe licensing it to third party manufacturers, a-la ARM?