It appears to be for the Myricom LanAI processor, which is available to others. It's an older 32-bit RISC processor, used as the offload engine for their NICs.
Sounds like it's not particularly interesting unless you want to write your own network offload code.
Edit: Probably just the offload processor they're choosing for their in-house routers/switches/servers. And they probably want their own firmware, either for security/NSA reasons, for performance, or both.
The page states:
"""
A generally accepted rule of thumb is that 1 hertz of CPU processing is required to send or receive 1 bit/s of TCP/IP.[3] For example, 5 Gbit/s (625 MB/s) of network traffic requires 5 GHz of CPU processing.
"""
Question for people in this line of work: is this accurate / reasonable?
The generally accepted rule of thumb is that 1 bps of network link requires 1 Hz of CPU processing. Figures 11, 12 give a full story of this rule of thumb (where Hz/bps ratio = %CPU utilization * processor speed / bandwidth). It had held up remarkably well over the years, albeit only for bulk data transfer at large sizes. For smaller transfers, we found the processing requirement to be 6-7 times as expected. Moreover, the figures show that network processing is not scaling with CPU speeds. The processing needs per byte increase when going from 800MHz to 2.4GHz. This happens because as CPU speed increases, the disparity between memory and I/O latencies versus CPU speeds intensifies.
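As a back-of-the-envelope check of that ratio, here's a minimal sketch; the utilization, clock, and throughput figures below are made up for illustration, not taken from the paper:

    # Hz-per-bps ratio as defined above:
    #   Hz/bps = %CPU utilization * processor speed / bandwidth
    def hz_per_bps(cpu_utilization, clock_hz, throughput_bps):
        # CPU cycles consumed per bit/s of network traffic
        return cpu_utilization * clock_hz / throughput_bps

    # e.g. a 2.4 GHz core at 40% utilization moving 1 Gbit/s of bulk TCP
    ratio = hz_per_bps(0.40, 2.4e9, 1e9)
    print(f"{ratio:.2f} Hz per bit/s")  # ~0.96, close to the 1 Hz/bps rule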
Does anyone know if memory latency is still causing problems in common implementations (like Linux and FreeBSD)? I would think that the parts that are the bottleneck could be rewritten with that in mind and gain quite a bit from it.
TSO really turbocharged bulk TCP transfers, so now 2 GHz can drive 10 Gbps as long as you're sending >=64KB chunks. This has made performance brittle, because 10 Gbps of small packets requires 10x-100x as much CPU as 10 Gbps of bulk traffic. Also, receiving requires more cycles than transmitting.
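To put rough numbers on that brittleness, here's a sketch of the per-packet event rates involved (the 10 Gbps link and the frame sizes are illustrative, and Ethernet preamble/inter-frame gap overhead is ignored):

    # Per-packet (or per-TSO-chunk) event rates needed to fill a 10 Gbps link.
    # With TSO the host hands the NIC ~64 KB chunks and the NIC segments them,
    # so the host-side event rate is tiny compared to a small-packet workload.
    LINK_BPS = 10e9

    def events_per_sec(bytes_per_event):
        return LINK_BPS / (bytes_per_event * 8)

    print(f"64 KB TSO chunks (host side): {events_per_sec(64 * 1024):>12,.0f} /s")
    print(f"1500-byte MTU packets:        {events_per_sec(1500):>12,.0f} /s")
    print(f"64-byte minimum frames:       {events_per_sec(64):>12,.0f} /s")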
Indeed, it's common to see DDoS protection offerings defined both in throughput (e.g. 50 Gb/s) and packets per second (e.g. 30 Mpps), which results in a bottleneck in packet size (e.g. ~200 bytes) at high throughput.
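The crossover size falls straight out of dividing those two advertised limits (using the same example figures as above):

    # At what average packet size does a 50 Gb/s + 30 Mpps offering become
    # pps-limited rather than bandwidth-limited?
    throughput_bps = 50e9  # advertised throughput cap
    pps_limit = 30e6       # advertised packet-rate cap

    crossover_bytes = throughput_bps / pps_limit / 8
    print(f"~{crossover_bytes:.0f} bytes per packet")  # ~208 bytes
    # Below ~208 B average packet size the 30 Mpps cap binds first;
    # above it, the 50 Gb/s cap does.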
I'm very biased against “approximately X per Y” statements in general: just about any non-pathological curve has a linear tangent somewhere or other, so unless the intervals over which that proportional approximation holds are stated, it's a pretty haphazard means of estimation.
Also, what kind of physical/logical process and attendant costs is it encoding? Does one per second of anything require one per second of something else? What if the processor were half or double the bit width?
Just now I was working on integrating and optimizing networking for some microcontroller-based software (using lwIP). In the end, I reached 7 MB/s send and 10 MB/s bulk receive speed.
I just did the calculation. It says for 10 MB/s (80 Mbit/s) I'd need 80 MHz. The chip runs at 84 MHz. So, it's pretty close :)
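For anyone following along, the arithmetic is just the rule of thumb applied to the receive figure (84 MHz is the clock stated above; the MB-to-Mbit conversion is the only step):

    # 10 MB/s of bulk receive, under the 1 Hz per bit/s rule of thumb.
    receive_bytes_per_sec = 10e6
    required_hz = receive_bytes_per_sec * 8 * 1.0   # 1 Hz per bit/s
    print(f"{required_hz / 1e6:.0f} MHz needed vs. 84 MHz available")  # 80 MHz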
There are other things you can offload now as well. Search for info on DPDK, Intel QuickAssist, etc. These sorts of things allow off-the-shelf hardware to displace expensive, proprietary ASIC-accelerated routers, firewalls, etc.
I thought this too, but someone in the email thread explicitly mentioned the myricom, and the _response_ to that message was that it was purely internal hardware, not useful for others...so it may just be a coincidence.
It certainly seems believable that this is a third-party processor architecture, but Google has a contract with them to build specific models of that processor that meet their needs (possibly alongside some other proprietary hardware, like a high-speed NIC), and those models aren't sold to the general public. That's pretty common outside x86, right? For instance, is there a way for me to buy a BCM2835 other than by buying a Raspberry Pi?
If you search a little, you'll find that Google hired Myricom's CEO, Founder/CTO, and several engineers. I suppose it's possible they licensed the IP too.
For Google, sure. Everybody working with the in-house hardware could do so without hunting for a version of the internal patch/fork. Using the fork would surely be easy enough when directly using LLVM, but the deeper it is hidden within a dependency tree, the more friction it adds to whatever the user at the root of that tree is doing.
Everybody else pays a small "maintenance tax" when working on the codebase or relying on work being done on that codebase.
Being by no means a compiler expert, I do suspect, however, that this "tax" is tiny and likely to be dwarfed by Google's other contributions, so letting them solve distribution of their private backend by piggybacking on the public release is most likely the right course of action.
Maybe they just don't want to keep porting their backend code forward for every LLVM snapshot they take, because that can be a tedious thing to do compared to just having it in origin/master forever for free.
Without the ability to test, though, they will still have to do a bunch of work to be able to use a new release. Possibly the release breaks their backend, but nobody will know until Google tries to use it.
Suppose the LLVM devs change a function signature used by Google's private backend. If the backend is private, Google will have to update their usage of that API when they do a merge. But if the backend is present upstream when the LLVM devs make that change, then the LLVM devs are responsible for making sure all supported backends update their usage of the API.
But as people have noted, if you don't have the hardware, it's impossible to test that any changes actually work.
If we're just talking about pure refactoring that doesn't change any output, you could test that the generated machine code is identical. But then you have to ask, why isn't there a stable API rather than all this refactoring churn?
I guess this is just the way LLVM and Clang are designed -- all components really tightly coupled together. And it's a successful project so it must be working out for them. But...!
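A minimal sketch of that kind of identical-output check, assuming you have llc binaries from before and after a refactor and a directory of .ll inputs (the binary paths, target name, and test directory are placeholders):

    # Compare the assembly generated by two LLVM builds for the same inputs.
    import glob
    import subprocess
    import sys

    OLD_LLC = "./llc-before-refactor"
    NEW_LLC = "./llc-after-refactor"
    TARGET = "lanai"  # any in-tree backend name accepted by llc's -march

    mismatches = 0
    for ll_file in glob.glob("tests/*.ll"):
        outputs = []
        for llc in (OLD_LLC, NEW_LLC):
            # Emit assembly to stdout so the two runs can be compared directly.
            result = subprocess.run(
                [llc, f"-march={TARGET}", "-o", "-", ll_file],
                capture_output=True, text=True, check=True)
            outputs.append(result.stdout)
        if outputs[0] != outputs[1]:
            mismatches += 1
            print(f"codegen changed for {ll_file}")

    sys.exit(1 if mismatches else 0)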
"If we're just talking about pure refactoring that doesn't change any output, you could test that the generated machine code is identical. But then you have to ask, why isn't there a stable API rather than all this refactoring churn?
"
LLVM deliberately does not want a stable API. It wants people to keep up with trunk.
They do this because they saw what has happened with other compilers, where the stable API became literally impossible to change over time.
This is one of the reasons GCC still has a crappy backend. You either have to build a new API and port everyone over, or you have to find an incremental way to change an API interface with hundreds of random interface points.
That is definitely not true about needing the hardware. There are loads of compilers for fictitious hardware and they work perfectly well. There was an entire x86-64 tool chain before anyone ever manufactured one of those.
So I'd venture to presume that there's an extensive battery of tests for each target processor. It seems unlikely to me that developers on the LLVM/Clang team all have physical access to every target CPU.
Looks like there's consensus to accept it as an "experimental" backend. I think that means the code is in tree, but other people changing LLVM aren't obligated to keep it working.
It may be accepted as a full backend in the future, but that discussion is deferred since people are happy with it being experimental for now.
GCC has done it for years.
For LLVM mainline, often, for small and simple architectures, having those people contributing to mainline is much more valuable than the cost of making mechanical API changes in that backend when codegen APIs are changed.
(Once you have more than ~5 backends, which LLVM does, the cost of doing the latter just doesn't really change much.)
If you want to build a real community, turning away contributions likely to lead to an overall net positive for the community tends not to be a good approach :)
(I await the arguments about corporations having no care about communities or whatever else)
The difference between hardware and software is getting fuzzy (Transmeta, NVIDIA Denver, ...), but there's at least one (admittedly terrible) FPGA implementation of a subset of MMIX large enough to execute small graphical demos: https://github.com/tommythorn/fpgammix
Ordinarily I'd agree with you. Except the tradeoff of having Google's other contributions to the main branch is probably worth it for a little bit of code that isn't harming anyone else.
It's perhaps not fair, but just a reality that a company like Google has a lot to offer so they can sometimes get special treatment.
Having been in this position before, I can tell you that maintaining large patch sets increases friction against submitting anything upstream. Even if this was a Google-only CPU (it's not), that'd still be an argument for merging upstream.
Nothing too interesting, as it doesn't use IBM's 7nm or Intel's 10nm chip tech.
Just a simple but parallel high-speed network chip, as used in the Myrinet network cards. The old ones ran at 33 MHz but with very low latency.
Really exciting would be a Power8 based on IBM's new 7nm process, which would finally blow away Intel's advantages with a fully open (and unbackdoored) design.
> Really exciting would be a Power8 based on IBM's new 7nm process, which would finally blow away Intel's advantages with a fully open (and unbackdoored) design.
Nothing about the POWER8 or IBM's process technology implies unbackdoored. I can't inspect the factories, or their supply chain, or the HDL they used, or the tools that processed the HDL. The only thing more "open" about POWER8 is firmware in some deployments, and maybe licensing the ISA (if you have enough clout/money to join the foundation, I can't find any licensing information at all). RISC-V is more interesting in every way, with respect to openness.
Lanai is a simple in-order 32-bit processor with 32 x 32-bit registers: two registers with fixed values, four used for program state tracking, and two reserved for explicit use by the user. It has no floating point support.
Might as well be a MIPS. The fact that Google has supposedly developed its own CPU is interesting, but the architecture itself seems quite mundane.
Edit: Comments on the article suggest it's the Myricom LANai, a NIC-embedded processor. Google may just happen to have these NICs in their machines and want to write firmware for them.
Must be funny working at Google and knowing the inside scoop (watching people speculate). So, here is my wild speculation about a similar secret project. There was a talk given by Dick Sites (from Google) [1, 2], where he talks about performance monitoring across the Google fleet. It's very technical and useful if you are into monitoring (highly recommended). A large part of his talk is dedicated to issues and limitations of off-the-shelf CPUs. Given that AWS has custom chips (talked about at re:Invent a few years back), why wouldn't Google solve this issue too? They have the talent and money.
It's already been posted in this thread that it's just a smart NIC evolved from existing Myricom NICs. (Of course, my coworker was just telling me a story about how John Cocke hid his RISC processor in a printer project.)
And Amazon's "custom" CPUs are 100 MHz faster Xeons.
But at least for Lanai, it looks like a CPU used just internally, not something that is going to power the next generation of Android devices.
How many times have we seen a huge corporation do something in house, then release it for public consumption several years later? At this time, it will probably stay in house, but if there is a revenue opportunity, trust me, they'll release it.
Google is also a member of the OpenPOWER Foundation (though they've done much more with that). I don't think sponsorship is necessarily endorsement or deployment plans -- it'd be nice if that were the case though!
Is Google planning on eventually releasing this architecture publicly or maybe licensing it to third party manufacturers, a-la ARM?