> Larrabee ran Compute Shaders and OpenCL very well - in many cases better (in flops/watt) than rival GPUs
Oh god the horror that was OpenCL on Xeon Phi.
1) It was extremely buggy. Any OpenCL program of decent complexity was bound to encounter bugs in their driver.
2) FLOPS are not the only measure for computations / simulations. Bandwidth is extremely important for many applications. Whatever Xeon Phi's OpenCL was doing, it was not achieving anything close to its peak. Any OpenCL kernels we had tuned would end up improving the performance of the host CPU (16 x 2 Xeon) as well, and it would usually result in the Xeon CPUs performing better than the Xeon Phis.
3) NVIDIA has released multiple generations of GPUs since Xeon Phi came out. The performance per watt of NVIDIA and AMD GPUs has been improving substantially over the last 3 years, while Xeon Phi's has stagnated since 2012.
4) Even their claim of best FLOPS / watt is wrong. The Tesla K20, released around the same time, had the same power usage (225W) but about 50% higher FLOPS (roughly 3 TFLOPS vs 2 TFLOPS single precision).
The only selling point was that they were x86 cores and hence did not require rewriting your code. But you still had to write additional directives indicating which parts of your code you wanted to offload to the accelerator and when you wanted the results back. The best performance, ironically, came only after a lot of application-specific tuning.
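For reference, those directives were Intel's offload pragmas. A minimal sketch of the pattern, assuming the Intel compiler's in/out clauses (exact syntax varied by release, and the function and array names here are made up for illustration):

    // Sketch of the Intel offload pragmas used with KNC, built with the
    // Intel compiler. in()/out() say what is copied to the card and back.
    void scale(const float *a, float *b, int n, float alpha) {
        #pragma offload target(mic:0) in(a : length(n)) out(b : length(n))
        {
            #pragma omp parallel for
            for (int i = 0; i < n; ++i)
                b[i] = alpha * a[i];
        }
    }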
4 years down the line, GPU-based accelerators still dominate the market and are still growing, and I wonder how many of the researchers who went with Xeon Phis regret their decision.
> The only selling point was that they were x86 cores and hence did not require rewriting your code.
Not quite; the selling point was that they were x86 cores and hence did not require hiring graphics programmers to write your GPGPU code. You certainly did have to rewrite and refactor to get code shoved onto the Phi, but a non-HPC-focused team could do such a rewrite+refactor with just the people they already had.
CUDA did not share many similarities with the graphics pipeline back then. Even now it's a stretch to compare the two.
x86 was and still is a weak selling point. Developers from all backgrounds deploy their apps on varying architectures like ARM or the JVM without much stress. The hard part, writing code that is fast for a given architecture, is only made more complex by having your compute units be x86 cores rather than the simple SP vector units of a GPU.
I know many folks without a graphics background writing CUDA apps, but no one outside of HPC research environments dabbling in the complexities of Xeon Phi.
The reason for this is simple: if you get your code into a CUDA-friendly structure, you have created a data-parallel rewrite that leverages the highly parallel memory bus of GPUs for a fairly easy speedup. By being constrained to a semi-opinionated programming interface, people can see real speedups and not get bogged down in multithreading, multiprocessing, and buggy device drivers.
I agree with all of your points but I think KNL will be a turning point.
Traditional multi core CPUs simply aren't an option to advance in the HPC space any more. Per generation there's not enough of a performance improvement to make it effective. And buying more of them isn't feasible from a power perspective. So everyone is going to have to re-write for massive parallelism. I think Intel's KNL and future Knights will be the most attractive path for a lot of people. You have the tooling and familiarity of x86 and a simpler memory model than GPUs.
This is exactly the biggest downfall of the Larrabee/Phi programming model. The claim that you can just deploy your CPU code is simply wrong if vectorization isn't taken into account. I'd argue it's worse than porting it to GPU if you start with code that's "naively" parallelized for CPU with OpenMP statements without unrolling/vector intrinsics. The reason is that the actual kernel code can stay the same for CUDA, or it can be relatively easily transformed (something I'm working on [1]) - just take whatever was inside your OpenMP loops and move it into a CUDA kernel. Meanwhile for Phi I'd have to now care about two levels of parallelism instead of just one (multicore/vector).
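For readers who haven't done that transformation, here is a minimal sketch of the "move the OpenMP loop body into a CUDA kernel" step described above, using a SAXPY-style loop (names and sizes are illustrative, not from the linked project):

    // CPU version: the parallelism lives in the OpenMP pragma.
    // #pragma omp parallel for
    // for (int i = 0; i < n; ++i) y[i] = a * x[i] + y[i];

    // CUDA version: the loop body becomes the kernel, the loop index
    // becomes the thread index, and the launch configuration replaces
    // the loop bounds.
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    // Launch example: saxpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);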
The entire reason why GPU has been so successful, is that its performance is relatively predictable compared to x86 cores because of its much simpler architecture - and putting hundreds of them on a single chip doesn't change that. Granted, there are applications that will perform better on KNL than GPU (I expect), because it has a somewhat greater degree of freedom (although the bandwidth you get when actually using that freedom will be a deciding factor on whether it's actually worth it over the latest CPUs in those cases).
Right. The only thing that looks particularly different (compared with the typical HPC node intended for workhorse vectorized code) about KNL on paper is the memory system, and possibly integrated interconnect. Given the usual library support, I don't see why there should be significant re-writing to be done, any more than for other AVX transitions (unless you're developing BLAS etc.).
I've not dug into this enough yet, but all prior generations of Nvidia GPUs have had caveats when adding steps to unify main memory. It certainly isn't "one address space" as far as the OS is concerned; perhaps Nvidia is saying this is sufficiently well abstracted in CUDA 8.
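For context, the abstraction being referred to is CUDA managed memory. A minimal sketch of how it is used (the kernel is a made-up example; whether pages actually migrate on demand depends on the GPU generation and driver, which is exactly the caveat above):

    #include <cuda_runtime.h>

    __global__ void scale(float *data, int n, float s) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= s;
    }

    int main() {
        const int n = 1 << 20;
        float *data;
        cudaMallocManaged(&data, n * sizeof(float));  // one pointer, visible to CPU and GPU
        for (int i = 0; i < n; ++i) data[i] = 1.0f;   // touched on the host first
        scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);
        cudaDeviceSynchronize();                      // required before the host reads again
        float first = data[0];                        // no explicit copy-back
        cudaFree(data);
        return first == 2.0f ? 0 : 1;
    }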
The true land of milk and honey is prophesied in Volta when paired with Power 9 and CAPI, first appearing in the Summit and Sierra systems.
The jury is still out. Larrabee is in the wild but it's a hell of a long way from 'not failed' -- every single deployment I'm aware of suffers from massive failure rates of the hardware and every single developer I've worked with much prefers to work with Tesla gear.
It's all well and good for them to aim for 'prestige buys' like Tianhe 2, but when the marketing smoke clears, the product is not achieving the things we in HPC buy products to achieve. Utilization rates are in the toilet across the ten or so Phi contracts I touch. I'm sure KNL's integrated architecture will change this, since everyone will just get this by default from vendors, but I don't know a single site who decided to buy Phi twice.
As far as I can tell, Phi is a play at a numbers game: you can ramp up the Rmax of your system under LINPACK, regardless of whether it creates any actual value for your users. In other words, it's a cheap way to claw some false cred out of your top500 position, and not so much a useful HPC tool yet.
I am setting a unit up now for a customer. In the years I've been working on them, the toolchain you need has not gotten much better ... no ... it went from complete crap to meh.
You really can't get away without the Intel compilers for them; GCC won't optimize well for it at all. This means that in addition to the higher entry price for the hardware, you have a sometimes painfully incompatible compiler toolchain to buy as well, and then code for. Which means you have to adapt your code to the vagaries of this compiler relative to GCC. These adaptations are sometimes non-trivial.
I am not saying gcc is the be-all/end-all, but it is a useful starting point. One that the competitive technology works very well with.
From the system administration side, honestly, the way they've built it is a royal pain. Trying to turn this into something workable for my customer has taken many hours. This customer saw many cores as the shiny they wanted. And now my job is (after trying to convince them that there were other easier ways of accomplishing their real goals) to make this work in an acceptable manner.
The toolchain is different from that of the substrate system. You can't simply copy binaries over (and I know they have not yet internalized this). The debugging and running processes are different. The access to storage is different. The connection to networks is different.
What I am getting at is that there are too many deltas over what a user would normally want.
It is not a bad technology. It is just not a product.
That is, it isn't finished. It's not unlike an Ikea furniture SKU. Lots of assembly required. And you may be surprised by the effort to get something meaningful out of it.
As someone else mentioned elsewhere in the responses, the price (and all the other hidden costs) are quite high relative to the competition ... and their toolchain stack is far simpler/more complete.
The hardware isn't a failure. The software toolchain is IMO.
" Make the most powerful flops-per-watt machine. SUCCESS!"
Hum not at all... Larrabee gets destroyed by the competition in terms of flops-per-watt. Knights Corner is rated 8 GFLOPS/W at the device level (2440 GFLOPS at 300 W). For comparison Nvidia Tesla P100 rates 4× better: 32 GFLOPS/W (9519 GFLOPS at 300 W) and AMD Pro Duo rates ~6× better: 47 GFLOPS/W (16384 GFLOPS at 350 W although in the real-world it probably often hits thermal limits and throttles itself, so real-world perf is probably closer to 30-40 GFLOPS/W).
Also, if Intel wants to make Larrabee gain mindshare and marketshare, they need to sell sub-$400 Larrabee PCIe cards. Right now, everybody and their mother can buy a totally decent $200 AMD or Nvidia GPU and try their hands at GPGPU development. And if they need more performance, they can upgrade to a multiple high-end cards, and their code almost runs as is. But because Larrabee's cheapest model (3120A/P) starts at $1695 (http://www.intc.com/priceList.cfm), it completely prices it out of many potential customers (think students learning GPGPU, etc).
> Larrabee gets destroyed by the competition in terms of flops-per-watt
While I expect the graphics cards to have the edge in performance per watt for brute-force math, I'm surprised that the current GPUs would have that large of an advantage. Are you sure you are comparing the same size "flops" for each?
And while it's in keeping with the article, I don't think lumping all the generations together as "Larrabee" makes sense when comparing system performance. While availability is still minimal, Knights Landing is definitely the current generation (developer machines are shipping), and as you'd expect from a long-awaited update at a smaller process size, efficiency is a lot better than older generations.
2016 Tesla P100: 5.30 DP TFLOP using 300W == 18 DP GFLOPS/W.
I don't know if these numbers are particularly accurate (I think actual KNL numbers for the developer machines are still under NDA), but I think the real world performance gap will be something more like this than one approach "destroying" the other. If you are making full use of every cycle, the GPU's will win by a bit. If your problem requires significant branching, Phi won't be hurt quite as badly.
> Also, if Intel wants to make Larrabee gain mindshare and marketshare, they need to sell sub-$400 Larrabee PCIe cards.
The interesting move Intel is making is to concentrate initially on KNL as heart of the machine rather than as an add-in card. This gets you direct access to 384GB of DDR4 RAM, which opens up some problems that current graphics cards are not well suited for. I think this plays better to their strengths: if your problem is embarrassingly parallel with moderate memory requirements, modern graphics cards are a better fit. But if you need more RAM, or your parallelism requires medium-strength independent cores, Phi might be a better choice.
> But because Larrabee's cheapest model (3120A/P) starts at $1695
While it's true that Intel's list pricing shows they aren't targeting home use at this point, List Price may not be the best comparison. For a while last year, 31S1P's were available at blow out prices that beat all the graphic cards. In packs of 10, they were available at $125 each: http://www.colfax-intl.com/nd/xeonphi/31s1p-promo.aspx.
Would love to kick tires of one of those Knight's Landing cards.
It might be possible to pull off similar tricks as on normal Intel Xeons. I wonder what kind of memory controller it got. Does it for example have hardware swizzling and texture samplers accessible?
But we use exclusively Nvidia for our number crunching product. And since the relevant code is implemented in CUDA, we have an Nvidia platform lock-in. Other options are not even on the table, and will not be.
Intel would be very smart to offer a migration path from CUDA code.
You are right I shouldn't lump all generations together as "Larrabee". I didn't realize the definitive specs of Knights Landing came out (2 months ago?). KL does improve a lot over KC, and now almost matches current GPUs:
KC: 8 GFLOPS/W at the device level (Xeon Phi 7120: 2440 GFLOPS at 300 W).
KL: 28 GFLOPS/W at the device level (Xeon Phi 7290: 6912 GFLOPS at 245 W).
Compare this to 32-47 GFLOPS/W at the device level for current-generation AMD and Nvidia GPUs.
This article is a reminder to me that people are bloody critical of things that they don't really understand. Many products are killed, many research ideas never take off even if they're proven to work well. Ideas/products are often killed, not because they weren't great, but for a litany of other reasons having nothing to do with the awesomeness or usefulness of the idea. Products are killed b/c of politics, lack of a massive market, and most importantly poor execution. Even when things fail (research, products, otherwise), they often have a huge impact on the ecosystem. The product might fail (in this case, it really didn't), but the ideas will live on.
This is fun to read. I'm always amazed at the disconnect between public presentations and the engineering reality of what's really going on.
For me, "Larrabee" died during it's first public demo in IDF 2009 [1]. At the time I was working on CUDA libraries at NVIDIA. I remember everyone watching the stream to see how much of a threat it would be. When we saw the actual graphics, everyone started laughing. "Welcome to the 1990s!" At that point it was obvious that Larrabee would not be a graphics threat to NVIDIA, it just had too far to go. It was not a discrete GPU killer.
A graphics card where the graphics core is actually an x86 application running on FreeBSD running on the "GPU"? That you could even log in to? That sounds out-of-this-world-amazing. What a shame it wasn't released like that.
Actually Xeon Phi is almost the same thing, except it's not sold as a GPU. You can ssh into it or run a software rasterizer on its cores, and the only thing missing would be a host graphics driver.
What I hear from my friend who works in the GPU industry, one of the main outcomes from the Larrabee research project was: It is possible to do most of graphical operations on a SIMDed general purpose CPU, with good performance. Everything except texture decoding, which really needs dedicated texture decoding silicon. And with texture units taking up 10% of the silicon, Intel really needs to decide if they want to sell it as a GPU or as a general purpose compute unit with 10% more cores. Intel chose the latter, and you can't really blame them, as you can sell dedicated compute units for more money.
Sony ran into the same problem with the PS3 and the Cell. They originally designed it so the game developers could implement whatever rendering method they wanted, in software, on the SPUs. But the performance wasn't high enough, partly due to texture decoding taking up too much time. By the time it was discovered this was a problem, the Cell was more or less finalised. Sony were considering adding a second Cell to the console, to brute force through the problem, but eventually they asked Nvidia to hack together a traditional GPU.
> Intel really needs to decide if they want to sell it as a GPU or as a general purpose compute unit with 10% more cores. Intel chose the latter, and you can't really blame them, as you can sell dedicated compute units for more money.
From the article:
> Remember - KNC is literally the same chip as LRB2. It has texture samplers and a video out port sitting on the die. They don't test them or turn them on or expose them to software, but they're still there - it's still a graphics-capable part.
So the core space is still used, right? They didn't choose 10% more cores; they just chose to turn it off and not test it, but it still uses the die space.
On Knights Corner, yes. The silicon design was finished by the time this decision was made.
There was a long-term advantage, though: they removed the texture samplers for the next version, Knights Landing, which allowed them to fit more cores on that chip.
For the typical case of 2D uncompressed textures with some kind of trilinear/anisotropic filtering enabled, you are basically just calculating 8 addresses based on texture coordinates, clamping and mipmap levels, then doing a gathered load (and CPUs hate doing gathered loads). Remember to optimise for the case that each group of 4 addresses is typically, but not always, right next to each other in memory.
Then you use trilinear interpolation to filter your 8 raw texels into a single filtered texel. With the exception of the gathered load, none of these operations is actually that expensive on CPUs, but shaders do a lot of these texture reads and it's really cheap and fast to implement it all in dedicated hardware.
You can also put these texture decoders closer to the memory, so the full 8 texels don't need to travel all the way to the shader core, just the final filtered texel. And since each texture decoder serves many threads of execution, you have chances to share resources between similar/identical texture sample operations.
And while the texture sampler is doing its thing, the shader core is free to do other operations (typically another thread of execution). It's not that the CPU can't decode textures; it's just that CPU cores without dedicated texture decoding hardware can't compete with hardware that has it.
Also borders/mirroring, anisotropic filtering, decoding compressed formats like DXT# and BC#. To complicate things, there're also 1D and 3D textures, cubemaps, and texture arrays.
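Ignoring all of those complications, a bare-bones software bilinear fetch for a single mip level looks roughly like the sketch below; a trilinear sample is two of these on adjacent mip levels plus one more lerp, and the 4 (or 8) texel loads are the "gathered load" mentioned above. This is a hypothetical illustration, not Larrabee code, and clamping/wrapping at the texture edges is omitted:

    struct Texel { float r, g, b, a; };

    static Texel lerp(Texel a, Texel b, float t) {
        return { a.r + (b.r - a.r) * t, a.g + (b.g - a.g) * t,
                 a.b + (b.b - a.b) * t, a.a + (b.a - a.a) * t };
    }

    // One bilinear sample from a w x h texture at normalized coords (u, v).
    Texel sample_bilinear(const Texel *tex, int w, int h, float u, float v) {
        float x = u * w - 0.5f, y = v * h - 0.5f;
        int x0 = (int)x, y0 = (int)y;              // address calculation
        float fx = x - x0, fy = y - y0;            // filter weights
        Texel t00 = tex[y0 * w + x0],       t10 = tex[y0 * w + x0 + 1];
        Texel t01 = tex[(y0 + 1) * w + x0], t11 = tex[(y0 + 1) * w + x0 + 1];
        return lerp(lerp(t00, t10, fx), lerp(t01, t11, fx), fy);
    }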
And most games on PS3 ended up using the SPUs very little, or left them entirely to whatever middleware they used, since they were too complicated for the average programmer to work with. Though there was a really nice DICE presentation on their deferred shading using PS3 SPUs: http://www.dice.se/wp-content/uploads/2014/12/Christina_Coff...
The SPUs were basically the next evolution of the PS2's two vector units, which were often running middleware too. It was really hard to write code that ran fast (Sony were advocates of writing assembly in an Excel spreadsheet), but generally the vector units were doing about the same thing for every game/developer (brute-force stuff: transforming vertices, calculating vertex lighting, generating multi-pass command streams for the rasterizer). So most developers were just using Sony-provided examples, or improved versions from middleware developers. Very few developers wrote their own VU programs, or even needed to.
The Cell had 7 vector units, with comparatively more memory, but there was no default job for them; all the vertex transformation now ran on the GPU's vertex shaders. And Sony initially stuck to their guns of "SPU programs should be written in assembly, in a spreadsheet".
Because the single PPU really sucked and was nowhere near fast enough for anything, Sony eventually relented and released a version of GCC which would compile C++ code for the SPUs. Fast to develop, but nowhere near the performance of a spreadsheet-designed SPU program.
This resulted in a whole bunch of games running code on the SPUs that was really badly optimised. But at least it reduced load on the PPU.
I'm not actually sure how they had their spreadsheets set up, probably had a bunch of conditional formatting setup to highlight pipeline hazards, along with formulas to show total and wasted cycles counts.
Both architectures had exposed pipelines, meaning the result of an operation would take a few cycles to show up in the destination register, and some operations took longer than others. You might have to insert a bunch of NOPs to make sure the data would be ready for the next instruction that needed it. Both architectures were also dual issue, meaning two completely independent operations, operating on completely independent registers, would be manually packed into a single instruction by the programmer. There were also restrictions on which types of instructions could go in each half of the instruction; if you didn't have a suitable instruction, you had to put a NOP there.
I'm pretty sure Sony liked the spreadsheets because it forced the programmer to see where all the NOPs were. The programmer would be expected to refactor things and manually unroll loops until all the NOPs were filled with useful instructions and peak performance was reached.
Why the downvotes? It's true. If you stray from AAA titles, you get plenty of PS3-only games that had terrible lags and struggled to even manage 30fps on reduced resolution. GUST titles like the Atelier series come to mind.
On the PS3 they got around the problem by working together with the likes of Gran Turismo and other top AAA studios, to create a GPU profiler and also introduced their Phyre Engine.
I remember by the time Gran Turismo was launched, there was an event showing the new tooling to developers.
VirtualGL is here and working so I don't think it's hard to do. Sadly hardware pricing and lack of interest leave very little chance that anyone would ever implement it.
Well, the problem is that Intel started with something way worse than OpenCL when they brought their first Phi to market: OpenMP. The claim was to just wrap your loops in OpenMP directives and everything will be fine. Yea well...
If they had gone full-on OpenCL, including a unified programming model for vector and multicore parallelism (the way CUDA works), I think Phi might have taken off much more already.
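For contrast, here is a rough sketch of what "just wrap your loops in OpenMP directives" tended to turn into in practice once the 512-bit vector units also had to be fed, i.e. the two levels of parallelism (multicore and vector) mentioned upthread. The stencil and names are illustrative, not from any real Phi codebase:

    // Level 1: spread rows across the many cores/threads.
    // Level 2: ask the compiler to vectorize the inner loop for the wide
    // SIMD units; without this the cores run far below peak.
    void stencil(const float *in, float *out, int rows, int cols) {
        #pragma omp parallel for
        for (int r = 1; r < rows - 1; ++r) {
            #pragma omp simd
            for (int c = 1; c < cols - 1; ++c) {
                out[r * cols + c] = 0.25f * (in[(r - 1) * cols + c] +
                                             in[(r + 1) * cols + c] +
                                             in[r * cols + c - 1] +
                                             in[r * cols + c + 1]);
            }
        }
    }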
I'm not sold on the "1. Make the most powerful flops-per-watt machine" success.
From my knowledge that claim only makes sense for the ever-shrinking domain of applications that haven't been ported to GPGPUs with _simpler_ cores. I also don't understand how the best "flops-per-watt" can be extrapolated from the fact that the Phi is used in top ranked TOP500 machines.
The Phi really just seems like an intermediate between CPUs and GPGPUs in the ease-of-programming vs. perf/watt pareto curve.
Correct, GPUs dominate the flops-per-watt charts. IBM's Blue Gene systems were also pretty good in this regard but that product line is now a dead end.
Back in the days of the Itanium debacle, Intel made the mistake of assuming everybody would be willing to rewrite their code to work efficiently on whatever hardware Intel wanted to give them. Instead, AMD won (briefly) by making a 64-bit chip that was much more x86-compatible.
Now, though, it sounds as if Intel may be making the opposite mistake. The Larrabee/KnightsThingummy family was built around x86 cores to make it easier to port existing code to it, but now it looks like this time people really are willing to rewrite their code to work on GPGPUs after all.
Yet, from the point of view of someone sitting in the GDC Europe rooms back in 2009, hearing how Larrabee was going to change the world of game development, I think we can state it did fail.
Read again, the claims were very specific in regards to a discrete GPU. Also, the author worked at Intel on Larrabee so unless you have proof otherwise I think it's best to give them the benefit of the doubt as to accuracy of the claim.
At GDCE it was being sold as a way of doing graphics, AI and vector optimizations of code in a more developer-friendly way than GPGPU.
Of course, framing it around the graphics feature is a nice way of sidelining the issue that it also didn't deliver those other features to the games development community.
> At GDCE it was being sold as a way of doing graphics, AI and vector optimizations of code in a more developer-friendly way than GPGPU.
> Of course, framing it around the graphics feature is a nice way of sidelining the issue that it also didn't deliver those other features to the games development community.
There is clear evidence that the high-end cards NVidia delivered at the time could outcompete (or at least keep pace with) Larrabee. Thus, had it been released, Larrabee would not have been a strong contender against NVidia or AMD. In this sense I stand by my position that Larrabee failed as a GPU.
On the other hand, I see no evidence that the rival products by AMD and NVidia could keep pace with Larrabee for AI and vector optimizations. Thus Larrabee was not a failure here. So Intel probably just concluded that there is much more money to be earned in HPC than in consumer devices, and thus Larrabee was not released to consumers. And I can see good reasons: if game developers want to exploit the capabilities Larrabee has to offer, they have to depend on consumers having a Larrabee card in their computer. If Larrabee were an outstanding GPU, the probability that some enthusiasts would get one was much higher than if Larrabee were just an add-on that some exotic applications/games additionally require.
> This page requires JavaScript to function properly.
Hint: For a page with nothing but text content you're doing it wrong. Even more so since this is a static site and you have to go to extra effort to break normal degradation of HTML content.
Once again someone downvotes you because they disagree.
Personally I also browse with JavaScript disabled and thought the mandatory JS was silly. Hence I did View Source, and lo and behold it is a TiddlyWiki, with 2MB of HTML containing every article.
While github now has pages.github.com and the ability to git pull/push your wiki (git clone git@github.com:username/project.wiki.git), I'm not sure how well that works for use on another web server or locally.
Isn't this a case where Intel should not have talked about their internal R&D until they had a clear product and message to bring to market? I wonder if this lack of discipline cost Larrabee the correct narrative that it should have had.
If I remember correctly, the idea at the time was to catch up to Nvidia/ATI (leapfrog them, really) in the discrete GPU market. Thus by announcing the technology long before it was ready, this would put a damper on their competitor's sales and lead. Kind of like an externalized Osborne effect.
This might have something to do with their patent dispute with Nvidia that cost them $1.6 billion back in 2011. Probably Intel still wants to (and can) compete with Nvidia on compute, but not in the graphics market.
O.K. fucking major rant below. Disclaimer: I worked for a vendor selling large quantities of pretty much anything in the HPC space at the time KNC came to market, and I now work for a different (larger) HPC vendor.
When I say "Larrabee" I mean all of Knights, all of MIC, all of Xeon Phi, all of
the "Isle" cards - they're all exactly the same chip and the same people and the
same software effort. Marketing seemed to dream up a new codeword every week,
but there was only ever three chips:
Knights Ferry / Aubrey Isle / LRB1 - mostly a prototype, had some performance gotchas, but did work, and shipped to partners.
Knights Corner / Xeon Phi / LRB2 - the thing we actually shipped in bulk.
Knights Landing - the new version that is shipping any day now (mid 2016).
I can kind of see what the author is getting at here, but it's worth reminding general readers that KNL is a wildly different beast from KNF or KNC:
1. It's self-hosting: it runs the OS itself; there is no "host CPU".
- Yes, KNC ran its own OS, but it was in the form of a PCI-E card which still had to be put into a server of some kind.
- Yes, KNL is slated to have PCI-E card style variants, but the self-hosting variants are what will be appearing first, and I honestly don't think the PCI-E variants will gain much if any market traction.
2. It features MCDRAM, which provides a massive increase in total memory bandwidth within the chip. Arguably this is necessary with such large core counts. (See the allocation sketch after this list.)
3. Some KNL variants will have Intel's Omni-Path interconnect directly integrated, which should further drive down latency in HPC clusters.
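For what it's worth, the usual way applications target that MCDRAM explicitly (when it isn't configured as a transparent cache) is the memkind/hbwmalloc interface. A minimal sketch, assuming libmemkind is installed; the function name alloc_hot is made up for illustration:

    #include <hbwmalloc.h>   // hbw_* allocator from the memkind library
    #include <stdlib.h>

    // Put a bandwidth-critical array in MCDRAM if present, else plain DDR4.
    // The caller must remember which allocator was used (hbw_free vs free).
    double *alloc_hot(size_t n) {
        if (hbw_check_available() == 0)               // 0 => MCDRAM available
            return (double *)hbw_malloc(n * sizeof(double));
        return (double *)malloc(n * sizeof(double));
    }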
> Behind all that marketing, the design of Larrabee was of a CPU with a very wide SIMD unit, designed above all to be a real grown-up CPU - coherent caches, well-ordered memory rules, good memory protection, true multitasking, real threads, runs Linux/FreeBSD, etc.
Here's an interesting snippet: drop "Larrabee" from the above and replace it with "Xeon". I'm just making the point here that the Xeon and Xeon Phi product lines are only going to look more similar over time. For a while now the regular Xeons have been getting wider, which is to say more cores and fatter vector units. The Xeon Phi line started out with more cores but is having to drive the performance of those cores up. Over time we're going to see one line get features before the other, e.g. AVX512 is in KNL and will be in a "future Xeon" (Skylake). And it's widely rumored that "future Xeons" will support some form of non-volatile main memory, something not on any Xeon Phi product roadmap today. The main differentiator will be that Xeon products need to continue to cater to the mass market, whereas Xeon Phi can do more radical things to drive compute-intensive workloads (HPC).
> Larrabee, in the form of KNC, went on to become the fastest supercomputer in the world for a couple of years, and it's still making a ton of money for Intel in the HPC market that it was designed for, fighting very nicely against the GPUs and other custom architectures.
Yes, at the time Tianhe 2 made its debut it was largely powered by KNC, and it placed at number 1 on the Top 500.
Did KNC as a product make a lot of money for Intel? Probably not. Yes, they shipped a lot of parts for Tianhe 2, but for other HPC customers, not so much. At the time, putting a KNC card against a Sandy Bridge CPU, it wasn't leaps and bounds faster. Yes, you're always going to have to re-factor when making such a change in architecture, but the gains just weren't worth it in most cases. Not to say there haven't been useful deployments of KNC that have contributed to "real science", but it was not a wild success by any measure.
As for "fighting very nicely against the GPUs and other custom architectures",
I don't think so. When KNC got to the market Nvidia GPUs already had a pretty
well established base in the HPC communities. Many scientific domains already
had GPU optimized libraries users could pick off the self and use. Whilst KNC
was cheaper than the Kepler GPUs on the market at the time it wasn't much
cheaper. Again back to regular Xeons, for a cost/ performance benefit KNC wasn't
worth it for a lot of people.
> Its successor, KNL, is just being released right now (mid 2016) and should do very nicely in that space too.
I do agree with this; KNL is going to do very well. There are whole systems being built with KNL, rather than large clusters with some KNC or some GPUs.
> 1. Make the most powerful flops-per-watt machine.
Not sure what is meant here: most efficient machine in the top 10? Tianhe 2 is not an efficient machine by any stretch of the imagination. For reference, look at the top 10 of the Green 500 for when Tianhe 2 placed as number 1 on the Top 500.
> SUCCESS! Fastest supercomputer in the world, and powers a whole bunch of the others in the top 10. Big win, covered a very vulnerable market for Intel, made a lot of money and good press.
Yes, KNC provides the bulk of the computing power for Tianhe 2, but at the time of writing it is number 2, not number 1. Secondly, that is the ONLY machine in the top 10 that uses KNC AT ALL.
Just yesterday ASRock announced a Xeon Phi x200 (Knights Landing) system. This is now using a main CPU socket instead of a PCIe extension card, and it seems to support only a single LGA-3647 socket (that's a LOT of tiny legs). So ASRock took four of these, each taking half a U (Supermicro had similar formats with their Twins), and packed them into 2U.
Thinking of the Phi as a special-purpose part misses an important point: CPUs will not get much faster. Vector units will get wider and core count will increase. Programming for Xeon Phi gives you a view of what a CPU will be a couple years from now. Right now my laptop has 4 cores running 8 threads and the only reason it has exquisitely fast cores that dedicate large areas to extract performance from every thread (rather than investing all those transistors on something else) is because our software is built for ridiculously fast versions of computers from the 90s.
I think Larrabee was interesting because it had the potential to go the other way - it could emulate existing rasterization APIs decently but would allow for new realtime rendering approaches - like raytracing into some sparse voxel tree structure for non-animated geometry - maybe do hybrid rendering where you would do deferred shading/rasterization for the dynamic scene and then ray trace into the tree for global illumination with the static scene and similar stuff.
I'm sure emulating OpenGL on the CPU has its uses outside of being a reference for drivers and such - but it's not really that exciting; GPUs are already excellent at doing that kind of rendering.
Actually, it was not only possible to use it for ray tracing, but Intel had it working back in 2008. They also sponsored Uber Entertainment to implement it in their game, and that was ready before Intel cancelled it: http://www.polygon.com/features/2013/3/19/4094472/uber-hail-...
I also wish that had taken off; 3d-graphics via a software ray tracing engine would in a sense be a return to the way things worked in the 90's before hardware rendering, where every game had its own custom engine and developers were always trying crazy new things. In some sense, GPU hardware has gotten flexible enough that we're sort of back in that world again, but writing GPU code isn't quite the same as writing a renderer from scratch.
When it comes to ray-tracing in particular, as far as I know there really doesn't exist (except perhaps as research prototypes) any hardware platform that's great at ray tracing. CPUs are okay but they need more parallelism. GPUs are okay but (as far as I understand it) they aren't good with lots of branches and erratic memory accesses. It seems like KNC/KNL ought to be ideal for this kind of thing (and in fact, it appears a fair bit of effort has gone into optimizing Embree for Xeon Phi).
I joined up in 2009 (really too late), and somehow ended up with the "non polygon" rasterization path (Mike Abrash split rasterization: he had the polygon path). After the second time LRB was canceled, my team was shuttled off elsewhere.
I had a number of long conversations with D about soft texturing; I didn't understand Tom Forsyth's arguments about why soft texturing wouldn't work. (My back-of-the-envelope numbers were encouraging.) So, in winter of 2011 (12?) I decided to write my own soft texturing unit. What I found was that soft texturing is completely do-able---just not in any sort of reasonable power budget.
At the same time a friend, S, decided he wanted to implement a rasterizer (I'd gotten my fill). Between the two of us, we had a rockin' still-frame bilinearly sampled Stanford bunny. A third friend, W, decided to implement a threading model. Threading models for very-high count processors are enormously difficult. I ended up writing much of an OpenGL driver, a shader compiler (through LLVM), and part of a pipeline JIT, i.e., a vertex loader, fixed-function-to-programmable compiler, etc. That project was called SWR; the sanitized projection is OpenSWR.
(Looking at the current sources, I'd say that I probably no longer have any code in there.)
A simple answer would be to raise market share in the gpu market.
Also a lot of non-gaming software uses the gpu for various tasks, even office and photoshop nowadays. But they do not need a full-fledged gpu.