Hacker News
Meta reveals serverless platform processing trillions of function calls a day (engineercodex.substack.com)
61 points by thunderbong on Oct 23, 2023 | 59 comments



As an aside, I’m always surprised at just how bad Meta’s software quality is, especially for a FAANG company with lots of expensive engineers.

Instagram and Facebook are some of the buggiest platforms I use. I often run into bugs in several key user journeys that go unresolved and it baffles me how they don’t pick up on them.

One of the problems is it’s basically impossible to get in touch with Meta to report them.


No positive incentives, nobody gets promoted doing bug fixes, everybody tries to focus on hot features.

No negative incentives, it's not safety-critical so nobody will recall or sue.


If you think they are buggy, become a seller on Amazon. That is the buggiest software I have ever used. And they are influencers with the "two pizza rule". The second worst is Spotify: after a decade it still does not remember where you stopped listening on radio plays. From the influencer people who brought you "tribes".


Strong agree here, at least on FB (I don't use Instagram, because why would I). I fully tolerated their growing pains back in the day, when the site went down for hours, or you'd get frequent blank pages after some standard click on something. The thing is, even as things improved over time, it's still buggy as hell.

I had so many problems with uploading photo albums, for example, constantly over the past 15 years. I used to put thousands of pictures from my full-frame camera, from travel and mountain adventures, on FB until I got fed up with wasting so much time redoing it all and battling the site. They literally pushed away a genuine content creator; at least in my social graph I was by far the most active one, and folks liked what I uploaded.

Sometimes it uploaded twice. Sometimes some photos refused to upload, while being identical in every respect to the rest, which worked fine. Wiping out descriptions I painstakingly added to every photo. Sometimes, even these days, the whole feed is blank, just menus on the side. I deleted their mobile app since it was snooping for all the data it could get and draining the battery even when not in use; that was actually a great move for my own personal happiness, so I'm not complaining about that one.

I'd say that FB is so successful despite their consistent lack of technical quality. They perfectly nailed a hole in the market people didn't even know they wanted, and their timing to market. This is not unique to FB: when there is little pressure from roughly equal competition, every business I've seen is subpar and/or overpriced.

Then you use Google's products and it's night and day. I don't recall a single bug that affected me, ever. Too bad Google+ never stood a chance.


> I had so many problems with uploading photo albums for example

I suspect it is by design, in a way. FB is not a photo storage platform, they need to compress the images a lot. A single photo may increase interaction and hence metrics like MAU but a whole album is unlikely to do so.


I can't speak to what is going on, but one thing I've heard that makes life difficult for at least the frontend teams is that they are battling ad blockers. To work around them, they generate incredibly obfuscated HTML that no one would normally ever write.


And this is where some of the greatest, most educated minds in computing are going. What a waste.


So how is this working out for them? Because thanks to uBlock Origin I haven't seen an ad on FB in years.


Adblock plus is what they focused on, back in the day at least.


somehow every comment in here is downplaying this achievement as “low” or unimpressive

you truly underestimate the scale of an operation like this

the vast majority of software companies will never count a trillion of anything. even big companies that scale will only have a small subset of teams work on something this large


The article is too light on details to judge whether trillions is impressive. For example, if my single-server system easily handles 100 million per day and the load is almost exclusively CPU-bound (as with most AI tasks), then scaling to 1 trillion per day might be as easy as buying 10k servers, which is totally a thing that mid-to-large-sized companies do to scale up.

What makes this Meta paper impressive is NOT scaling up to 1 trillion per day; it's that they manage to do so while keeping request latency low and CPU utilization high. Anyone who's been with Heroku long enough probably remembers when instances would suddenly sit 80% idle while requests were still slow: that was when Heroku changed their routing from intelligent to dumb. Meta is doing the opposite here, reducing overall deployment costs by squeezing more requests out of each instance than would have been possible with a simple random load balancer.
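The scaling estimate above can be sanity-checked with trivial arithmetic (a sketch; the 100-million-calls-per-day per-server capacity is the hypothetical figure from this comment, not a number from the paper):

```python
# Hypothetical per-server capacity from the comment above, not a measured number.
calls_per_day_target = 1_000_000_000_000   # 1 trillion calls/day
calls_per_server_per_day = 100_000_000     # 100 million calls/day/server (assumed)

servers_needed = calls_per_day_target // calls_per_server_per_day
avg_qps = calls_per_day_target / 86_400    # 24 * 60 * 60 seconds per day

print(servers_needed)   # 10000 -- the "10k servers" above
print(round(avg_qps))   # roughly 11.6 million calls/sec sustained
```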


>then scaling to 1 trillion per day might be as easy as buying 10k servers [...]

I doubt that. How would you distribute the requests between those? An instance of mod_proxy_balancer?


DNS round robin so that clients get randomly distributed among multiple load balancers

They have 12M RPS, so about 10 HAProxy servers should do the trick.


It's an interesting paper, but they've made some weird trade-offs, sacrificing latency for resource efficiency, that make it seem niche even for FaaS tech, and the TPS they're hitting is surprisingly low for something that is supposedly in widespread use at their company. Some of their suggestions at the end are also already features of FaaS products in some form or another.

I think this paper is awesome and this platform is not a trivial piece of engineering to be clear, but it doesn't seem particularly novel or even reaching close to the larger workloads that public cloud services offer.

>the vast majority of software companies will never count a trillion of anything

As others have noted, it's not impossible that many of our own laptops run a "trillion functions" a day. The devil is in the details here for systems researchers and engineers, and based on those details, XFaaS isn't nearly as novel as, say, Presto was.


This is HN, there are definitely users on this site who have experienced or have worked at places with these workloads.


And this is also the HN where people boast they could rebuild $SUCCESSFUL_SOFTWARE on their own over a 3-day weekend.

There are tons of very brilliant and very smart people here, but there are also many who are too fond of themselves, or who have real trouble understanding a problem's ramifications in real life / real business.


How do you know the difference between stupidity and ambition unless you try?


Ambition is saying "I think I can bootstrap a successful competitor to $POPULAR_SOFTWARE if I work hard enough, I have the talent and perseverance".

Adding "in 3 days" is stupidity.


The best inoculation against hubris is trying to fly to the sun.

Humans, to generalize, bias towards talking over walking.


12 million QPS isn't nothing but it's pretty common at big companies.


So what? Meta is a trillion-dollar company. It should be able to create a website that works.

Compare its budget to WhatsApp before it was acquired!


“Trillions of functions” is a metric that’s hard to know whether to be impressed by or not. I don’t think it’s impossible that my laptop runs “trillions of functions” every day.

But the callers on this platform are likely remote, and therefore it handles I/O as well, etc. Like I said, hard to understand whether it’s impressive or not.

I’ll assume it’s impressive.


TBH anyone using "per day" I assume is dishonest.

QPS is the standard measure, and per day is just a sly way to multiply by 86400.

If you want an average measure then report the average and peak QPS ...


I divided through, assuming sustained rate:

1T/day ≈ 11.574M/s

I personally find 11.5M/s a lot more impressive sounding. Though another comment suggests 100k servers — for about 100/s per server.

10ms per request isn’t particularly good or bad; volume is still impressive.


Servers have lots and lots of cores; you might as well divide by 64 or 128, which brings it down to roughly 1-2 req/s per core. A few req/s per core is nothing.


Servers have no parallelism?


Works out to around 115 function calls a second per server


The calls per server is probably not the difficult part - this is the type of scale where you start hitting much, much harder problems, e.g.:

- Load balancing across regions [0] without significant latency overhead

- Service-to-service mesh/discovery that scales with less than O(# of servers)

- Reliable processing (retries without causing retry storms, optimistic hedging)

- Coordinating error handling and observability

All without the engineers actually writing the functions needing to know anything about the system (which requires airtight reliability, abstractions, observability).

I don't mean to comment on whether this is impressive or not, just pointing out that per-server throughput would never be the difficult part of reaching this scale.

[0] And apparently for this system, load balancing across time, which is at least a mildly interesting way of thinking about it
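On the retry-storm bullet: a minimal sketch of capped exponential backoff with full jitter, the usual defense against synchronized retries (my own toy code, nothing from the paper):

```python
import random

def retry_delays(attempts, base=0.1, cap=30.0, seed=None):
    """'Full jitter' backoff: the n-th delay is uniform in [0, min(cap, base * 2**n)]."""
    rng = random.Random(seed)
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]

# Failed calls wait randomized, growing intervals before retrying, so a burst
# of failures spreads out instead of hammering the backend in lockstep.
delays = retry_delays(5, seed=42)
```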


That doesn’t sound right… maybe per user?


1 trillion function calls over 100,000 servers. Technically they say trillions over hundreds of thousands, but I went with the lower-bound case.

1,000,000,000,000 / 100,000 = 10,000,000

10,000,000 / 24 / 3600 = 115.7
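Spelled out in code for anyone who wants to tweak the assumptions (same lower-bound figures as above):

```python
calls_per_day = 1_000_000_000_000   # 1 trillion: lower bound of "trillions"
servers = 100_000                   # lower bound of "hundreds of thousands"

per_server_per_day = calls_per_day // servers      # 10,000,000
per_server_per_sec = per_server_per_day / 86_400   # about 115.7 calls/sec/server
```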


We don't know what the servers are; dual-socket EPYC, i.e. 128 cores, makes the number look beyond trivial.


So... 115.7 rps doesn't sound groundbreaking, right?


Was just trying to break the number down into something easier to understand; I don't know enough to say whether this is impressive or not! Depends on the complexity of the request, and I guess on the complexity of routing that many requests over such a large network. I've never worked at that scale.


This isn't unreasonable. ML workloads benefit from more computational time per request. Lower QPS = better results.


> One example of load they demonstrate has 20 million function calls submitted to XFaaS within 15 minutes.

> Meta’s XFaaS is their serverless platform that “processes trillions of function calls per day on more than 100,000 servers spread across tens of datacenter regions.”

This seems very low per server?


Just for one trillion (not "trillions"), that'd be 10 million function calls per server per day, ~7k per minute. Sounds about right to me. They'll surely want to leave some room for traffic spikes, server failures and unforeseen issues. Server uplink could also be a factor - at least that was the major bottleneck the last time I ran infrastructure serving lots of clients (about 100 million), but they are probably smarter about this than I was back then.


The only impressive thing about this is the abuse of technology to do such a slow job at such a large scale. I recall Ruby on Rails doing at most 150 requests per second on a laptop 15 years ago, which was laughably low even for those days. I could do hundreds of times more using C++, or at least ten times more using unoptimized Python.

Their achievement should not be praised; on the contrary, they need to be told they're wasting resources, heating up the planet and emitting carbon dioxide for no good reason. This is not what optimization looks like. And no, it is not serverless, since obviously something serves that stuff; stop lying already.


A trillion per day is over 10M per second. That's hard.

“Serverless” just means you’re not running a long-running process that handles these specific requests.


On multiple 100k servers? Not as hard as you think.

Serverless means you have no server. Stop repeating Amazon's lies. It is nonsense.


“Server” is a long-running process that serves requests. “Serverless” means that your requests are handled by short-lived processes of some kind.

This is just the name for it. If you want to poke holes in the definition I gave, it won’t do any good, because definitions are kinda fuzzy and incomplete to begin with. If you try to argue about what “serverless” means, you’re just going to end up drifting away from the consensus about what words mean. Consensus is the only thing that matters here. Consensus generally won’t be universal—I know people that refuse to accept “literally” as an intensifier—but that is a problem we have to deal with.


But he has a point.

There is definitely a layer of long-running processes queuing and distributing these requests. In this sense, "serverless" is a buzzword.


Likewise, every pair of “wireless headphones” has many sets of wires inside, if you cut them open.

“Vegan” products you buy at the store are made by humans, who are a type of animal, so you cannot say that vegan products contain no animal products. Plastic is an animal product because humans make it, and humans are animals.

If we go down this road, the only conclusions we get are absurd and useless ones. Language is a consensus process dominated by metaphor and shared context. It is not a mathematical process.


Your comment actually made me wonder if there are people who don't eat honey because it is made by bees.


>“Server” is a long-running process that serves requests. “Serverless” means that your requests are handled by short-lived processes of some kind.

Stateless?


Why are you defining server in that way? Do you have a reference?

A server is a physical machine with an operating system. It can run many processes.


> A server is a physical machine with an operating system. It can run many processes.

That is an acceptable and different definition of “server”. Words have multiple definitions, that’s why any time you look something up in the dictionary, you are likely to see multiple entries for the same word.

I generally don’t use the word “server” that way. I say “machine” when I am talking about machines, and “server” when I am talking about daemon processes. A great many people I have worked with in the past ten years or so, in multiple parts of the country, have adopted similar terminology to lessen confusion.

If you accept that words have multiple definitions, and accept that “server” has multiple definitions, then it is easy to accept that “serverless” means something like “the response is not ultimately processed by a daemon process”. Just like we accept that wireless headphones are called “wireless headphones”, even though they have wires inside. They are only missing one particular set of wires. Like, “the audio is not transmitted to the headphones over a wire” wireless, “the request is ultimately processed by something other than a long-running daemon” serverless.

I find these definitions easy to understand, natural to use, and I find that they don’t confuse listeners when I use them.

I avoid calling the machine a “server” because it is sometimes confusing, but I’m not going to argue with someone who uses it that way and tell them that they are wrong. That would be horrible.


My point was that I have never heard anyone describe a server or serverless in that way: as simply a process. I agree that thinking about server as a process/Daemon is a great way to frame serverless.

I've also not considered that wireless headphones have wires so it's been fun to think about that too.


What sort of isolation did you achieve with your Ruby experiments?


I assume these are mostly procedures, not functions (that is, they have side effects), as applying a function 1.3M times per minute immediately raises the question of caching, which doesn't seem to be important to them.

But if you execute that many procedure calls, how do you guard against them influencing each other through their side effects? Memory leaks come to mind, or other weird bugs. Also, how do you manage the credentials needed to issue side effects in a (hopefully) zero-trust environment?
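The function-vs-procedure distinction in miniature (my own toy example, nothing to do with XFaaS): a pure function can be memoized on its arguments, while a side-effecting procedure has to execute every time.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def vat(amount_cents: int) -> int:
    # Pure: the result depends only on the argument, so caching is safe.
    return amount_cents * 21 // 100

emails_sent = 0

def send_email(user_id: int) -> None:
    # Procedure: the side effect IS the point, so every call must run.
    global emails_sent
    emails_sent += 1

vat(1000); vat(1000)           # second call is served from the cache
send_email(7); send_email(7)   # both calls execute
```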


Off topic: I ended up not finishing the article, because first it presented me with a full-screen overlay asking me to sign up for their newsletter, forcing me to click "I want to read it first" to see the article.

Then after scrolling down two paragraphs, it presented me with another overlay blocking the article, again asking me to sign up for the newsletter.

Can we please stop rewarding these kinds of practices with traffic by posting and upvoting them?


A solution to these would be to build a browser extension that automatically submits any such form with a bogus email address.

It can't easily be blocked by a captcha as that would hurt conversion rates of legitimate users.


I wonder how much actual conversion happens with these kinds of forms anyway. Who are the people who sign up for a random website's newsletter before even reading anything? And who signs up after reading just an intro?

I get the idea of the mailing lists - generate traffic that does not rely on external platforms, with their ever changing algorithms that need to be appeased and courted, and who might even change the rules or ban you from promoting your content.

But surely the conversion rate from these full-screen modal overlays must be lower than the attrition from scaring away passing readers?


Thank you for this.

I am a beginner hobbyist in server tech.

I am running into the perennial latency vs. throughput problem.

A batched queue can hide latency problems. "Processing 1 million per second" could mean 1 million separate requests at 100 nanoseconds of latency each, or batches at 10 microseconds per batch.

In XFaaS, do they run an interpreter in a thread? Or a warm process?


What's the difference between a PHP function and regular PHP?


Not having to spin a whole interpreter/VM up and down is, apparently, such a performance boost that some companies used to make a whole business out of it.


You don't have to restart the VM after each request?


> interpreter/VM

IOW, runtime. It is rather typical to refer to runtimes of languages not compiling to native code as VMs. Runtime for Facebook's fork of PHP is literally called "HipHop Virtual Machine"


about 500x price markup

going from a dumb bare metal server to AWS Lambda

I believe for PHP there is little difference. These function-hosting solutions are typically used with Java, where starting the JVM for every single request would be too much overhead.


My first question there is...could they do the same work with fewer calls?



