Facebook is the first to jump into ARM servers

whakojacko · on Aug 23, 2010

I doubt these are for running PHP or other services; more likely they're acting as key-value stores (which is what they already use memcached for). See http://www.cs.cmu.edu/~fawnproj/ for a CMU research project that did this and produced great perf/watt numbers.

Andys · on Aug 24, 2010

Hardware key/value store appliances is a great startup idea.

strlen · on Aug 24, 2010

Disclaimer: I work on a key/value store, so take my opinion with a grain of salt even though I'll steer clear of self-promotion, yadda yadda.

Without distribution a key/value store is a hash table. To have any value added it's going to need to be a distributed key/value store. Distributed key/value stores, however, run just as well on commodity hardware. Problem is very few companies work at a scale at which per machine efficiency, power or other cost savings are going to matter; that's a very small market and one that's difficult to sell to.

On the other hand, a customized Linux distribution with a nice UI for deploying key/value stores would be a good idea.
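To illustrate the point above that a single-machine key/value store is just a hash table, and that the added value comes from distribution, here is a minimal sketch. It assumes a fixed number of in-process "nodes" (plain dicts) and simple hash-modulo placement; the `ShardedStore` class and its methods are invented for this example and do not describe any particular product.

```python
import hashlib


class ShardedStore:
    """Toy key/value store that spreads keys over several in-process 'nodes'."""

    def __init__(self, node_count):
        # Each node is a plain dict, i.e. an ordinary hash table.
        self.nodes = [dict() for _ in range(node_count)]

    def _node_for(self, key):
        # Use a stable hash so a key maps to the same node on every run.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.nodes)
        return self.nodes[index]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key, default=None):
        return self._node_for(key).get(key, default)


store = ShardedStore(node_count=4)
for i in range(1000):
    store.put(f"user:{i}", i)

# How many keys landed on each node.
sizes = [len(node) for node in store.nodes]
print(sizes)
print(store.get("user:42"))
```

A real distributed store would also need replication, failure handling and rebalancing when nodes join or leave, none of which this sketch attempts.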

Andys · on Aug 24, 2010

My back-of-the-envelope calculations showed that a hardware-based key/value store (custom chips driving DRAM plus ethernet interfaces in a 1RU chassis) would use so much less power than an average x86 server that you could charge a great price thanks to the TCO / power savings.

ovi256 · on Aug 24, 2010

Can you check that you're not becoming obsolete due to Moore's law before the savings recoup the costs? Also, consider the potentially reduced product flexibility of hardware vs software.

stuntprogrammer · on Aug 24, 2010

Well, let's say you are operating at small scale and you can reduce your 32 servers sucking 20 kilowatts to 4 appliances sucking 5kW. Much reduced footprint, power cost, admin cost (at small scale your per server cost is going to be higher than the large scale guys). Plus someone else will have done the integration and tuning work.

I think even on a small scale, or perhaps that should be especially on a small scale, people could profitably look at the MySQL/NoSQL/memcached appliances (the latter optionally with persistence via flash) from some of the appliance vendors.

I have no financial interest or direct experience, but the data sheet on this:

http://www.schoonerinfotech.com/datasheets/Schooner_DS_Memca...

looks worthy of a closer read.
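As a rough check on the 32 servers at 20 kW versus 4 appliances at 5 kW example, the energy difference can be worked out as below. The electricity price is a purely hypothetical placeholder, and the sketch ignores cooling overhead, hardware cost and admin cost.

```python
HOURS_PER_YEAR = 24 * 365


def annual_energy_kwh(power_kw):
    """Energy used in a year by a load that draws power_kw continuously."""
    return power_kw * HOURS_PER_YEAR


def annual_savings(before_kw, after_kw, price_per_kwh):
    """Return (kWh saved per year, money saved per year)."""
    saved_kwh = annual_energy_kwh(before_kw) - annual_energy_kwh(after_kw)
    return saved_kwh, saved_kwh * price_per_kwh


# Hypothetical price per kWh, chosen only to make the arithmetic concrete.
ASSUMED_PRICE_PER_KWH = 0.10

saved_kwh, saved_money = annual_savings(20.0, 5.0, ASSUMED_PRICE_PER_KWH)
print(f"{saved_kwh:,.0f} kWh saved per year")
print(f"{saved_money:,.0f} saved per year at the assumed price")
```

Under these assumptions the 15 kW difference comes to 131,400 kWh per year; whether that recoups the appliance cost depends on prices the thread does not give.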

strlen · on Aug 24, 2010

The thing is that there aren't very many companies running 32-server (per colo) storage clusters (unless they're using grossly underpowered machines).

stuntprogrammer · on Aug 24, 2010

Granted, but it can make sense at even smaller scale. Say 8 servers. Or, I know one group of folks that were tempted by a beefier database machine and instead happily moved to such an appliance. There are a lot of ways to spend dollars that make sense depending on what you are trying to do. (Trading dev cycles off against hardware etc).

Personally, I don't use the appliances because for our extremely latency-sensitive apps we do it all from scratch for various reasons. For large scale storage we build off commodity machines. One of our prototypes runs very happily off machines from scalableinformatics.com - highly recommended.

Horses for courses and all that..

coderdude · on Aug 24, 2010

Just an FYI: Facebook uses HipHop to convert PHP to C++ in order to lighten the workload per request on their webservers. In a blog post debuting the project a developer noted that 90% of the site was using HipHop-generated C++ as the backend, and that number has likely increased.

drv · on Aug 24, 2010

Most recent comment on the story, (purporting to be) from Jonathan Heiliger:

> This story is completely false. Facebook continuously evaluates and helps develop new technologies we believe will improve the performance, efficiency or reliability of our infrastructure. However, we have no plans to deploy ARM servers in our Prineville, Oregon data center.

xtacy · on Aug 24, 2010

"One size does not fit all." Eventually, we're going to see a heterogenous data centre consisting of machines with different specs, that are suited for different workloads. (minor edit.)

whatusername · on Aug 24, 2010

Hint: The rest of the world already works like that. See: zOS on Power, Solaris on Sparc, HP/Tandem Nonstops (I think are now on x86), IBM i on whatever it runs on, HPUX on x86 (I think), AIX on Power, Linux on Mainframe, Windows NT4, XP, 2000, 2003, 2008 (all running critical tasks)...

Cheap Linux/x86 servers own most of the web space. But the rest of the world runs much more heterogeneously. (And that list above ignores real legacy stuff.. You can buy all of that currently (with the exception of the older windows stuff))

_delirium · on Aug 24, 2010

I don't think it's just the web space. Maybe academia's different, but over the past 10 years, a lot of infrastructure we used to use Solaris/Sparc for (and miscellaneous other more exotic things) has all been transitioned to a mix of Linux/x86 and Windows Server/x86--- email servers, compute clusters, payroll, course registration, pretty much everything.

whatusername · on Aug 24, 2010

Oh I agree - the general trend is in the direction of win/linux on x86. But it feels like the web world has gone almost exclusively x86. Maybe it's just that the web got there quicker -- but acting like linux on ARM is revolutionary is perhaps overdone. (Especially considering Linux runs on x86, x86-64, IA64, PowerPC, Mainframe, Cell, and probably some other processors that are used in datacenters)

Seth_Kriticos · on Aug 24, 2010

The list of supported Linux architectures (from kernel.org):

> Although originally developed first for 32-bit x86-based PCs (386 or higher), today Linux also runs on (at least) the Alpha AXP, Sun SPARC, Motorola 68000, PowerPC, ARM, Hitachi SuperH, IBM S/390, MIPS, HP PA-RISC, Intel IA-64, AMD x86-64, AXIS CRIS, Renesas M32R, Atmel AVR32, Renesas H8/300, NEC V850, Tensilica Xtensa, and Analog Devices Blackfin architectures; for many of these architectures in both 32- and 64-bit variants.

We could also say: pretty much everything under the sun.

Seth_Kriticos · on Aug 24, 2010

> Cheap Linux/x86 servers own most of the web space.

oh, and supercomputers: http://www.top500.org/stats/list/35/osfam

akadruid · on Aug 24, 2010

We're running HP-UX on PA-RISC and AIX on RS64. Don't know whether you can buy these architectures still, but these are real production machines in a datacentre.

barrkel · on Aug 24, 2010

The trend has rather been the reverse, with specialized platforms being overtaken by commodity platforms which are better positioned to take advantage of economies of scale. Mostly that's been x86, but for low power applications it's largely been ARM. As power becomes a more important element, and with ARM relatively more performant, I think we'll see an increasing share of it in what would otherwise be x86 territory, but I don't think we'll see real heterogeneity, unless it's a better competitor to either of these.

I see an analogy between x86 and ARM and Porter's generic strategies. ARM is like the cost leadership, with x86 being the differentiation strategy - where the cost is the cost of energy. x86 is able to do a lot more things than ARM, but with higher energy costs. And then there is perhaps CUDA etc. for the segmentation strategy.

borisk · on Aug 24, 2010

I highly doubt it. Having multiple hardware architectures in a datacenter is very costly. One gets better prices buying more items from the same supplier. It's a lot easier to maintain replacement parts/units for 1 architecture. Developing, testing, fire fixing bugs on multiple architectures is slower.

mrtron · on Aug 24, 2010

Google would argue with you.

They roll a heterogenous server setup intentionally to minimize impact from hardware faults, bugs, etc. You build your software at a high level where the hardware underneath is abstracted away.

gaius · on Aug 24, 2010

Google's experience is only directly transferable if you too are a search engine.

mkr-hn · on Aug 24, 2010

They do a lot more than search, and as far as I know App Engine runs on the same distributed system.

gaius · on Aug 24, 2010

OK, but is your workload like App Engine either? I can tell you right now that mine isn't; we'd grind to a halt if we tried to pass and share as much state as we do on it.

jsz0 · on Aug 24, 2010

Depends how good of a deal they can offer you. Saving 20% on a fleet of x86 servers is great but over the long term the supplier isn't going to be paying your electricity bills.

borisk · on Aug 24, 2010

Right, but say FB can move 100% to ARM. Unless there are big differences in power consumption per x64/ARM unit in Web, Cache and DB servers.

tlack · on Aug 24, 2010

but when you have thousands of machines of each type, perhaps those economies of scale level out a bit

CountHackulus · on Aug 24, 2010

The heterogeneous nature of a data centre has really come to an interesting crossroads. If you pay attention to the mainframe world (and let's face it, nearly no one does) you might've noticed that IBM recently announced their new Z system [1] and it had an interesting feature.

Namely, it can run Z books along with POWER7 blades and x86 blades in the same (admittedly very large) box. Basically, run your compute workloads on Power, database on x86, and control on Z. Seems like a pretty good idea to me!

[1] http://www-03.ibm.com/systems/z/news/announcement/20100722_a...

patrickgzill · on Aug 24, 2010

My thought was that they want enough separate cores to handle a case where 1 core is handling 1 user's page load.

So for a midsized datacenter with say 200 racks, each ARM based rack could have 320 cores in it, giving 64K cores.

Since a FB page may take 200ms to load (or some other load time internal to FB response time targets), such a setup could handle 64K * 5 (i.e. 5 page loads per core per second at 200ms/page) or 320K page loads per second; figuring 30 seconds of viewing time per user, this gives you almost 10 million users' worth of capacity.

(all numbers back of the envelope, hypothetical)
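The estimate above can be reproduced step by step. This sketch only restates the comment's own hypothetical figures (200 racks, 320 cores per rack, 200ms per page, 30 seconds of viewing per page load); the function and variable names are made up for illustration.

```python
def capacity_estimate(racks, cores_per_rack, page_time_s, viewing_time_s):
    """Back-of-the-envelope capacity, assuming one core serves one page load at a time."""
    total_cores = racks * cores_per_rack
    # Each core finishes 1 / page_time_s page loads per second.
    page_loads_per_s = total_cores / page_time_s
    # Each user requests a page once per viewing_time_s, so this many
    # users can be kept busy at that page-load rate.
    supported_users = page_loads_per_s * viewing_time_s
    return total_cores, page_loads_per_s, supported_users


total_cores, page_loads_per_s, supported_users = capacity_estimate(
    racks=200, cores_per_rack=320, page_time_s=0.2, viewing_time_s=30
)
print(total_cores)              # 64,000 cores
print(round(page_loads_per_s))  # 320,000 page loads per second
print(round(supported_users))   # 9,600,000 users
```

The result, 9.6 million, is the "almost 10 million" in the comment. It assumes perfect load balancing and no time spent waiting on other tiers.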

stuntprogrammer · on Aug 24, 2010

For a fleet of "scale-down" servers to be practical the first problem is that the admin cost per server must be very low. With, ohhh guesstimate 70,000+ servers, Facebook presumably has that metric under control.

Next, certain types of workloads don't make sense. Anything CPU-bound or poorly scalable (eg. traditional database workloads). Again FB should have plenty of work that scales out relatively effortlessly (note the relatively) and has moderate memory and bandwidth requirements per process. Though in the aggregate you'd expect very large usage of both!

For suitable workloads, working back of the envelope, ARM will hopefully lead to some highly competitive, if not new record, scores on metrics like requests/joule or requests/$. Energy consumption and server cost being important at that scale..

Now for the in-memory caching or database workloads, which want either more memory or faster CPUs, Flash can be used to address capacity at the cost of a couple of orders of magnitude extra latency - albeit a couple less than hitting disk. Back of the envelope, anything that looks too much like a traditional database workload I'd leave on grunty x86 machines. Ditto for anything CPU-bound.

So, let's speculate on how to build a machine based on ARM for the types of workloads we care about. Let's assume we don't design a new core but work with a vendor on a System-on-Chip using ARM hard macros. This is very back of the envelope and we'd need to break out spreadsheets to get this nailed down right.. but let's have some fun..

Our hypothetical SoC would be

- Cache-coherent quad-core Cortex-A9 @ 2GHz
- PCIe interface on chip
- 1Gb ethernet interface on chip
- SATA ports on chip
- Memory controller

connected to 4GB ECC memory per server. I'll get back to storage.

Now, this should be a small and quite low-power server. Within a 1U sled we should be able to pack say 6 or perhaps 8 of these, along with dual power supplies (for the entire sled of machines). If possible I'd have distributed power redundancy by including a battery in the sled rather than hooking up to external UPS.

I'd use an internal 1Gb switch which itself is connected to the top of rack switch. We get local cheap communication between the servers, plus we make cabling significantly easier and keep the cost of the top-of-rack switch down. A more wacky alternative would be to use short-range radio. Fewer wires, potentially more bandwidth, but something I'd like to hammer on in the lab before going anywhere near the datacenter.

Now, for the pesky 4GB per server memory limit and storage. We have a few interesting options. We can add flash per server, or, given the $/GB perhaps one machine in the sled gets it and acts as a local memcached server with the ability to fall back to accessing remote ones. We could also have a local file server with, for example, 1TB of storage via 2 flash-augmented disks (eg. seagate momentus xt disks). With good staging of data, we could even make this sled a good building block for throughput-oriented data-intensive work (eg. mapreduce type work). We have lower IO bandwidth but we've also kept the CPU performance down to levels where we have a fighting chance of feeding them.
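A rough tally of what one such sled would hold, using only the figures given above (quad-core SoC, 4GB of memory and one 1Gb Ethernet port per server, 6 or 8 servers per 1U sled). The `Server` and `sled_totals` names are invented for this sketch, and the 40 sleds per rack figure is purely an assumption, not part of the design described.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    cores: int = 4       # quad-core SoC
    memory_gb: int = 4   # 4GB ECC memory per server
    nic_gbps: int = 1    # one on-chip 1Gb Ethernet interface


def sled_totals(servers_per_sled, server=Server()):
    """Aggregate cores, memory and NIC bandwidth for one 1U sled."""
    return {
        "servers": servers_per_sled,
        "cores": servers_per_sled * server.cores,
        "memory_gb": servers_per_sled * server.memory_gb,
        "nic_gbps": servers_per_sled * server.nic_gbps,
    }


# Hypothetical rack layout: not stated anywhere in the thread.
ASSUMED_SLEDS_PER_RACK = 40

for servers_per_sled in (6, 8):
    sled = sled_totals(servers_per_sled)
    rack_cores = sled["cores"] * ASSUMED_SLEDS_PER_RACK
    print(sled, "cores per rack:", rack_cores)
```

The per-sled NIC total matters for the internal switch: with 8 servers, up to 8Gb/s of server traffic shares whatever uplink connects the sled to the top-of-rack switch.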

Obviously, you need your software to run there. Chalk one up for relatively easily ported open source code without being dependent on a slow-moving vendor.

The above is rampant speculation and there are many interesting design points - it's great to see someone trying new things and taking advantage of changing hardware ratios to profit.

nailer · on Aug 24, 2010

> With, ohhh guesstimate 70,000+ servers

That could be about right. I know the number was 20,000 at the end of 2008 from sources within Facebook.

Tamerlin · on Aug 24, 2010

I wonder how long it will be until someone designs a multi-core ARM processor with lockstep processing.

One application for this is in mission-critical transaction processing where computing power is not the active limitation. Instead, the idea is that each core executes the same set of instructions with the same set of input data, and in the end if one disagrees with the other, the entire CPU rolls back the current transaction, takes itself out of the mesh, and alerts an operator... who, if it's an IBM mainframe, walks over with a new processor, yanks the old one, and replaces it. Hot. Downtime: none. Transactions lost: none.

Sure, it would need more I/O capability + ECC, but still -- it's potentially a low-cost, low-power, highly reliable competitor to some of IBM's POWER and PowerPC processors.
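The lockstep idea can be modelled in software, with the caveat that real lockstep is done in hardware at the instruction level. In this sketch each "core" is just a function that runs the same transaction on its own copy of the state; a disagreement raises an error and the committed state is left untouched. All names here are hypothetical.

```python
class LockstepMismatch(Exception):
    """Raised when the cores disagree on the result of a transaction."""


def run_in_lockstep(transaction, state, cores):
    """Run the transaction on every core; commit only if all results match."""
    # Each core works on its own copy, so a failed run never touches `state`.
    results = [core(transaction, dict(state)) for core in cores]
    first = results[0]
    if any(result != first for result in results[1:]):
        raise LockstepMismatch("cores disagree, transaction rolled back")
    return first


def healthy_core(transaction, state):
    return transaction(state)


def faulty_core(transaction, state):
    result = transaction(state)
    result["balance"] += 1  # simulated hardware fault
    return result


def deposit_ten(state):
    state["balance"] += 10
    return state


account = {"balance": 100}
account = run_in_lockstep(deposit_ten, account, [healthy_core, healthy_core])

rolled_back = False
try:
    account = run_in_lockstep(deposit_ten, account, [healthy_core, faulty_core])
except LockstepMismatch:
    # The committed state is unchanged; an operator would be alerted here.
    rolled_back = True

print(account, rolled_back)
```

With only two cores a mismatch can be detected but not attributed to a specific core; identifying the faulty one would need a third core and voting, which this sketch leaves out.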

stcredzero · on Aug 24, 2010

Perhaps predictions from the mid 90's will finally come true: RISC will win and x86 will be held back by its legacy baggage. This is more notable for how long the predictions failed to come true. (The turnabout would make it even more interesting.)

MichaelSalib · on Aug 24, 2010

Precisely what baggage is holding x86 back? The only thing that I know of is the variable length instruction encoding and these days, that seems helpful: it allows you to use smaller instructions which benefits caching. The extra cost of decoding instructions on silicon is negligible.

In general, I don't understand your comparison of RISC vs x86. The original premise of RISC, namely having very simple chips that can be clocked real fast because of their simplicity has long since been abandoned. Modern RISC chips have all the crazy complexity of x86 chips: you'll find big pipelines and dynamic register renaming and multiple execution units galore. And modern x86 chips internally look a lot like modern RISC chips as well: after they convert native instructions into micro-ops, there doesn't seem to be much difference.

stcredzero · on Aug 24, 2010

> Precisely what baggage is holding x86 back?

A part of my point is that it hasn't been holding x86 back for a long time. A part of the article's point is that perhaps it will soon. Reread my comment with a skeptical tone.

MichaelSalib · on Aug 24, 2010

Ah, thanks for explaining.

zwieback · on Aug 24, 2010

Still have a working Netwinder in my drawer. Maybe this time around it'll work out for ARM, would be cool.

borisk · on Aug 24, 2010

Expect Intel 'Empire strikes back' in 2011 - they have the talent, fabs, money, patents.

jbarham · on Aug 24, 2010

All that and they still couldn't ship Larrabee. There's only so much you can do with a haphazardly designed ISA that dates from the late 70's. ARM is overwhelmingly dominant in the mobile market and very, very cheap.

To my mind Intel's quixotic purchase of McAfee only shows that they've cornered themselves into a market that's peaked.

hga · on Aug 24, 2010

They couldn't ship Larrabee to hit its market window (i.e. to compete with the GPUs when it would/might have been ready). They haven't given up yet and in the meantime they're shipping it for HPC experimentation and the like.

You're right about the "haphazardly designed ISA" but they've still made it run very fast by throwing a lot of engineering effort at it.

As for McAfee, I don't know what the bleep they're trying to do there, but I suspect they can afford it (especially since AMD didn't keep their eye on the ball for so long). Their previous failed communications and media ventures didn't seem to materially hinder their bread and butter CPU/chipset business.

Andys · on Aug 24, 2010

For sure... Atom is one of Intel's most backward products, which has received the smallest investment out of nearly all of their CPU lines. They have the capability to scale it much much higher, if there's a market for it.

jonursenbach · on Aug 23, 2010

HipHop is going to get a lot less production-ready.

alecco · on Aug 23, 2010

Why? [I don't know much about it] Facebook's HipHop JIT transforms PHP to C++ and then they compile it with g++ (or at least that's what they say.) GCC supports ARM already.

viraptor · on Aug 24, 2010

For some reason HipHop messed with direct memory locations enough to work only on 64b to start with. If they couldn't make it work cleanly on 32b, ARM is probably also going to be affected.

Aegean · on Aug 24, 2010

Who says ARM is going to be 32 bits? Rumors say they're going for 64 bits, especially for servers. They will need to surpass the 4GB limit on >=8 cores per chip.

viraptor · on Aug 24, 2010

I didn't mean the 32/64 as the only difference. They targeted an architecture - it was x86_64/amd64. That affects memory layouts, pointer lengths, available instruction sets, etc. Changing from amd64 -> armel for example will also affect some of those properties.

If you write clean ANSI C code, you're basically compatible by default. If they failed to run on i386, that means they did something special that assumed a different architecture. In that case you're likely to hit the same problem whether they use arm, alpha, infineons, intels, or whatever else.
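The properties mentioned above (pointer length, integer sizes, byte order) are easy to inspect on whatever machine is at hand. A minimal sketch using Python's standard library; the `describe_platform` helper is an invented name, and the output will differ between, say, an amd64 box and a 32-bit ARM board.

```python
import platform
import struct
import sys


def describe_platform():
    """Report a few architecture-dependent properties of the running machine."""
    return {
        "machine": platform.machine(),
        # "P" is the native size of a C pointer (void *).
        "pointer_bits": struct.calcsize("P") * 8,
        # "l" is the native size of a C long, which varies between platforms.
        "long_bits": struct.calcsize("l") * 8,
        "byte_order": sys.byteorder,
    }


info = describe_platform()
for name, value in info.items():
    print(f"{name}: {value}")
```

Code that hard-codes any of these values, for example by storing pointers in a fixed-width integer, is the kind that tends to break when the target architecture changes.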

vetinari · on Aug 24, 2010

ARM is going for a 40-bit extension, to allow addressing 1 TB of RAM.

Source: http://www.eetimes.com/electronics-news/4206387/ARM7-40bit-v...
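The 4GB and 1 TB figures in this subthread follow directly from the address width, assuming byte-addressable memory: n address bits reach 2^n bytes. A quick check:

```python
GIB = 2 ** 30
TIB = 2 ** 40


def addressable_bytes(address_bits):
    """Bytes reachable with the given number of address bits (byte addressing)."""
    return 2 ** address_bits


print(addressable_bytes(32) // GIB, "GiB with 32-bit addresses")
print(addressable_bytes(40) // TIB, "TiB with 40-bit addresses")
print(addressable_bytes(40) // addressable_bytes(32), "times more than 32-bit")
```

The extra 8 bits multiply the reachable memory by 256.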

kingkilr · on Aug 24, 2010

HipHop is not a JIT, it is an AOT compiler.

coderdude · on Aug 24, 2010

While we're correcting people, HipHop isn't an AOT compiler, it is a source code transformer[1].

[1] http://developers.facebook.com/blog/post/358

wlievens · on Aug 24, 2010

If your definition of Compiler is broad enough, there's no difference.

kingkilr · on Aug 25, 2010

That doesn't even require a broad definition. From Aho, Sethi, and Ullman: "a compiler is a program that reads a program written in one language - the source language - and translates it into an equivalent program in another language - the target language."

wlievens · on Aug 25, 2010

Except that if you interpret that literally, pretty much any program is a compiler. I can translate a text file in the "ASCII language" to the "UTF-8 language" with Notepad++, for instance.
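The ASCII to UTF-8 example is an especially degenerate "translation", since every ASCII byte sequence is already valid UTF-8 with the same meaning. A small sketch (the `ascii_to_utf8` helper is a made-up name):

```python
def ascii_to_utf8(data):
    """'Translate' ASCII-encoded bytes into UTF-8-encoded bytes."""
    # decode() raises UnicodeDecodeError if any byte is outside 0-127.
    return data.decode("ascii").encode("utf-8")


source = b"HipHop turns PHP into C++"
translated = ascii_to_utf8(source)
print(translated == source)  # the output bytes are identical to the input

try:
    ascii_to_utf8(b"caf\xe9")  # 0xE9 is not an ASCII byte
    rejected = False
except UnicodeDecodeError:
    rejected = True
print(rejected)
```

So the "compiler" in this case is the identity function on valid input, plus a check that rejects anything that is not ASCII.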

kingkilr · on Aug 26, 2010

I don't remember my models of computation class so well, but wouldn't ASCII and UTF-8 be alphabets, not languages?