Hacker News
Meta reveals serverless platform processing trillions of function calls a day (engineercodex.substack.com)
61 points by thunderbong on Oct 23, 2023 | 59 comments



As an aside, I’m always surprised at just how bad Meta’s software quality is, especially for a FAANG company with lots of expensive engineers.

Instagram and Facebook are some of the buggiest platforms I use. I often run into bugs in several key user journeys that go unresolved and it baffles me how they don’t pick up on them.

One of the problems is it’s basically impossible to get in touch with Meta to report them.


No positive incentives, nobody gets promoted doing bug fixes, everybody tries to focus on hot features.

No negative incentives, it's not safety-critical so nobody will recall or sue.


If you think they are buggy, become a seller on Amazon. That is the buggiest software I have ever used. And they are influencers with the "two pizza rule". The second worst is Spotify: after a decade it still does not remember where you stopped listening on radio plays. From the influencer people who brought you "tribes".


Strong agree here, at least on FB (I don't use Instagram, because why would I). I fully tolerated their growing pains back in the day, when the site went down for hours, or you'd get frequent blank pages after some standard click on something. The thing is, even as things improved over time, it's still buggy as hell.

I had so many problems with uploading photo albums, for example, constantly over the past 15 years. I used to put thousands of pictures from my full-frame camera, from travel and mountain adventures, on FB until I got fed up with wasting so much time redoing it all and battling the site. They literally pushed away a genuine content creator; at least in my social graph I was by far the most active one, and folks liked what I uploaded.

Sometimes it uploaded twice. Sometimes some photos refused to upload, while being identical in every respect to the rest, which worked fine. Wiping out descriptions I painstakingly added to every photo. Sometimes, even these days, the whole feed is blank, just menus on the side. I deleted their mobile app since it was snooping for all the data it could get and draining the battery even when not in use; that was actually a great move for my own personal happiness, so I'm not complaining about that one.

I'd say that FB is so successful despite their consistent lack of technical quality. They perfectly nailed a hole in the market people didn't even know they wanted, and their timing to market. This is not unique to FB: when there is little pressure from roughly equal competition, every business I've seen is subpar and/or overpriced.

Then you use Google's products and it's night and day. I don't recall a single bug that affected me, ever. Too bad Google+ never stood a chance.


> I had so many problems with uploading photo albums for example

I suspect it is by design, in a way. FB is not a photo storage platform, they need to compress the images a lot. A single photo may increase interaction and hence metrics like MAU but a whole album is unlikely to do so.


I can't speak to what is going on, but one thing I've heard that makes life difficult for at least the frontend teams is that they are battling ad blockers. To work around them, they generate incredibly obfuscated HTML that no one would normally ever write.


And this is where some of the greatest, most educated minds in computing are going. What a waste.


So how is this working out for them? Because thanks to uBlock Origin I haven't seen an ad on FB in years.


Adblock plus is what they focused on, back in the day at least.


somehow every comment in here is downplaying this achievement as “low” or unimpressive

you truly underestimate the scale of an operation like this

the vast majority of software companies will never count a trillion of anything. even big companies that scale will only have a small subset of teams work on something this large


The article is too light on details to judge whether trillions is impressive. For example, if my single-server system easily handles 100 million per day and the load is almost exclusively CPU-bound (as with most AI tasks), then scaling to 1 trillion per day might be as easy as buying 10k servers, which is totally a thing that mid-to-large-sized companies do to scale up.

What makes this Meta paper impressive is NOT scaling up to 1 trillion per day; it's that they manage to do so while keeping request latency low and CPU utilization high. Anyone who's been with Heroku long enough probably remembers when instances would suddenly sit 80% idle while requests were still slow: that was when Heroku changed their routing from intelligent to dumb. Meta is doing the opposite here, reducing overall deployment costs by squeezing more requests out of each instance than would have been possible with a simple random load balancer.
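The scaling estimate above can be sanity-checked with trivial arithmetic (a sketch; the 100-million-calls-per-day per-server capacity is the hypothetical figure from this comment, not a number from the paper):

```python
# Hypothetical per-server capacity from the comment above, not a measured number.
calls_per_day_target = 1_000_000_000_000   # 1 trillion calls/day
calls_per_server_per_day = 100_000_000     # 100 million calls/day/server (assumed)

servers_needed = calls_per_day_target // calls_per_server_per_day
avg_qps = calls_per_day_target / 86_400    # 24 * 60 * 60 seconds per day

print(servers_needed)   # 10000 -- the "10k servers" above
print(round(avg_qps))   # roughly 11.6 million calls/sec sustained
```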


>then scaling to 1 trillion per day might be as easy as buying 10k servers [...]

I doubt that. How would you distribute the requests between those? An instance of mod_proxy_balancer?


DNS round robin so that clients get randomly distributed among multiple load balancers

They have 12M RPS, so about 10 HAProxy servers should do the trick.


It's an interesting paper, but they've made some weird trade-offs, sacrificing latency for resource efficiency, that make it seem niche even for FaaS tech, and the TPS they're hitting is surprisingly low for something that is supposedly in widespread use at their company. Some of their suggestions at the end are also already features of FaaS products in some form or another.

I think this paper is awesome and this platform is not a trivial piece of engineering to be clear, but it doesn't seem particularly novel or even reaching close to the larger workloads that public cloud services offer.

>the vast majority of software companies will never count a trillion of anything

As others have noted, it's not impossible that many of our own laptops run a "trillion functions" a day. The devil is in the details here for systems researchers and engineers, and based on those details, XFaaS isn't nearly as novel as, say, Presto was.


This is HN, there are definitely users on this site who have experienced or have worked at places with these workloads.


And this is also the HN where people boast they could rebuild $SUCCESSFUL_SOFTWARE on their own over a 3-day weekend.

There are tons of very brilliant and very smart people here, but there are also many who are too fond of themselves, or who have real trouble understanding a problem's ramifications in real life / real business.


How do you know the difference between stupidity and ambition unless you try?


Ambition is saying "I think I can bootstrap a successful competitor to $POPULAR_SOFTWARE if I work hard enough, I have the talent and perseverance".

Adding "in 3 days" is stupidity.


The best inoculation against hubris is trying to fly to the sun.

Humans, to generalize, bias towards talking over walking.


12 million QPS isn't nothing but it's pretty common at big companies.


So what? Meta is a trillion-dollar company. It should be able to create a website that works.

Compare its budget to WhatsApp before it was acquired!


“Trillions of functions” is a metric that’s hard to know whether to be impressed by or not. I don’t think it’s impossible that my laptop runs “trillions of functions” every day.

But the callers on this platform are likely remote, and therefore it handles I/O as well, etc. Like I said, hard to understand whether it’s impressive or not.

I’ll assume it’s impressive.


TBH anyone using "per day" I assume is dishonest.

QPS is the standard measure, and per day is just a sly way to multiply by 86400.

If you want an average measure then report the average and peak QPS ...


I divided through, assuming sustained rate:

1T/day ≈ 11.574M/s

I personally find 11.5M/s a lot more impressive sounding. Though another comment suggests 100k servers — for about 100/s per server.

10ms per request isn’t particularly good or bad; volume is still impressive.


Servers have lots and lots of cores; you might as well divide by 64 or 128, which brings it down to roughly 1-2 req/s per core. A few req/s per core is nothing.


Servers have no parallelism?


Works out to around 115 function calls a second per server


The calls per server is probably not the difficult part - this is the type of scale where you start hitting much, much harder problems, e.g.:

- Load balancing across regions [0] without significant latency overhead

- Service-to-service mesh/discovery that scales with less than O(# of servers)

- Reliable processing (retries without causing retry storms, optimistic hedging)

- Coordinating error handling and observability

All without the engineers actually writing the functions needing to know anything about the system (which requires airtight reliability, abstractions, observability).

I don't mean to comment on whether this is impressive or not, just pointing out that per-server throughput would never be the difficult part of reaching this scale.

[0] And apparently for this system, load balancing across time, which is at least a mildly interesting way of thinking about it
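On the retry-storm bullet: a minimal sketch of capped exponential backoff with full jitter, the usual defense against synchronized retries (my own toy code, nothing from the paper):

```python
import random

def retry_delays(attempts, base=0.1, cap=30.0, seed=None):
    """'Full jitter' backoff: the n-th delay is uniform in [0, min(cap, base * 2**n)]."""
    rng = random.Random(seed)
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]

# Failed calls wait randomized, growing intervals before retrying, so a burst
# of failures spreads out instead of hammering the backend in lockstep.
delays = retry_delays(5, seed=42)
```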


That doesn’t sound right… maybe per user?


1 trillion function calls over 100,000 servers. Technically they say trillions over hundreds of thousands, but I went with the lower-bound case.

1,000,000,000,000 / 100,000 = 10,000,000

10,000,000 / 24 / 3600 = 115.7
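Spelled out in code for anyone who wants to tweak the assumptions (same lower-bound figures as above):

```python
calls_per_day = 1_000_000_000_000   # 1 trillion: lower bound of "trillions"
servers = 100_000                   # lower bound of "hundreds of thousands"

per_server_per_day = calls_per_day // servers      # 10,000,000
per_server_per_sec = per_server_per_day / 86_400   # about 115.7 calls/sec/server
```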


We don't know what the servers are; dual-socket EPYC, i.e. 128 cores, makes the number look beyond trivial.


So... 115.7 rps doesn't sound groundbreaking, right?


Was just trying to break the number down into something easier to understand; I don't know enough to say whether this is impressive or not! Depends on the complexity of the request, and I guess on the complexity of routing that many requests over such a large network. I've never worked at that scale.


This isn't unreasonable. ML workloads benefit from more computational time per request. Lower QPS = better results.


> One example of load they demonstrate has 20 million function calls submitted to XFaaS within 15 minutes.

> Meta’s XFaaS is their serverless platform that “processes trillions of function calls per day on more than 100,000 servers spread across tens of datacenter regions.”

This seems very low per server?


Just for one trillion (not "trillions"), that'd be 10 million function calls per server per day, ~7k per minute. Sounds about right to me. They'll surely want to leave some room for traffic spikes, server failures and unforeseen issues. Server uplink could also be a factor - at least that was the major bottleneck the last time I ran infrastructure serving lots of clients (about 100 million), but they are probably smarter about this than I was back then.


The only impressive thing about this is the abuse of technology to do such a slow job at such a large scale. I recall Ruby on Rails doing at most 150 requests per second on a laptop 15 years ago, which was laughably low even for those days. I could do hundreds of times more using C++, or at least ten times more using unoptimized Python.

Their achievement should not be praised; on the contrary, they need to be told they're wasting resources, heating up the planet and emitting carbon dioxide for no good reason. This is not what optimization looks like. And no, it is not serverless, since obviously something serves that stuff; stop lying already.


A trillion per day is over 10M per second. That's hard.

“Serverless” just means you’re not running a long-running process that handles these specific requests.


On multiple 100k servers? Not as hard as you think.

Serverless means you have no server. Stop repeating Amazon's lies. It is nonsense.


“Server” is a long-running process that serves requests. “Serverless” means that your requests are handled by short-lived processes of some kind.

This is just the name for it. If you want to poke holes in the definition I gave, it won’t do any good, because definitions are kinda fuzzy and incomplete to begin with. If you try to argue about what “serverless” means, you’re just going to end up drifting away from the consensus about what words mean. Consensus is the only thing that matters here. Consensus generally won’t be universal—I know people that refuse to accept “literally” as an intensifier—but that is a problem we have to deal with.


But he has a point.

There is definitely a layer of long-running processes queuing and distributing these requests. In this sense, "serverless" is a buzzword.


Likewise, every pair of “wireless headphones” has many sets of wires inside, if you cut them open.

“Vegan” products you buy at the store are made by humans, who are a type of animal, so you cannot say that vegan products contain no animal products. Plastic is an animal product because humans make it, and humans are animals.

If we go down this road, the only conclusions we get are absurd and useless ones. Language is a consensus process dominated by metaphor and shared context. It is not a mathematical process.


Your comment actually made me wonder if there are people who don't eat honey because it is made by bees.


>“Server” is a long-running process that serves requests. “Serverless” means that your requests are handled by short-lived processes of some kind.

Stateless?


Why are you defining server in that way? Do you have a reference?

A server is a physical machine with an operating system. It can run many processes.


> A server is a physical machine with an operating system. It can run many processes.

That is an acceptable and different definition of “server”. Words have multiple definitions, that’s why any time you look something up in the dictionary, you are likely to see multiple entries for the same word.

I generally don’t use the word “server” that way. I say “machine” when I am talking about machines, and “server” when I am talking about daemon processes. A great many people I have worked with in the past ten years or so, in multiple parts of the country, have adopted similar terminology to lessen confusion.

If you accept that words have multiple definitions, and accept that “server” has multiple definitions, then it is easy to accept that “serverless” means something like “the response is not ultimately processed by a daemon process”. Just like we accept that wireless headphones are called “wireless headphones”, even though they have wires inside. They are only missing one particular set of wires. Like, “the audio is not transmitted to the headphones over a wire” wireless, “the request is ultimately processed by something other than a long-running daemon” serverless.

I find these definitions easy to understand, natural to use, and I find that they don’t confuse listeners when I use them.

I avoid calling the machine a “server” because it is sometimes confusing, but I’m not going to argue with someone who uses it that way and tell them that they are wrong. That would be horrible.


My point was that I have never heard anyone describe a server or serverless in that way: as simply a process. I agree that thinking about server as a process/Daemon is a great way to frame serverless.

I've also not considered that wireless headphones have wires so it's been fun to think about that too.


What sort of isolation did you achieve with your Ruby experiments?


I assume these are mostly procedures, not functions (that is, they have side effects), as applying a function 1.3M times per minute immediately raises the question of caching, which doesn't seem to be important to them.

But if you execute that many procedure calls, how do you guard against them influencing each other through their side effects? Memory leaks come to mind, or other weird bugs. Also, how do you manage the credentials needed to issue side effects in a (hopefully) zero-trust environment?
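The function-vs-procedure distinction in miniature (my own toy example, nothing to do with XFaaS): a pure function can be memoized on its arguments, while a side-effecting procedure has to execute every time.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def vat(amount_cents: int) -> int:
    # Pure: the result depends only on the argument, so caching is safe.
    return amount_cents * 21 // 100

emails_sent = 0

def send_email(user_id: int) -> None:
    # Procedure: the side effect IS the point, so every call must run.
    global emails_sent
    emails_sent += 1

vat(1000); vat(1000)           # second call is served from the cache
send_email(7); send_email(7)   # both calls execute
```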


Off topic: I ended up not finishing the article, because first it presented me with a full-screen overlay asking me to sign up for their newsletter, forcing me to click "I want to read it first" to see the article.

Then after scrolling down two paragraphs, it presented me with another overlay blocking the article, again asking me to sign up for the newsletter.

Can we please stop rewarding these kinds of practices with traffic by posting and upvoting them?


A solution to these would be to build a browser extension that automatically submits any such form with a bogus email address.

It can't easily be blocked by a captcha as that would hurt conversion rates of legitimate users.


I wonder how much actual conversion happens with these kinds of forms anyway. Who are the people who sign up for a random website's newsletter before even reading anything? And who signs up after reading just an intro?

I get the idea of the mailing lists - generate traffic that does not rely on external platforms, with their ever changing algorithms that need to be appeased and courted, and who might even change the rules or ban you from promoting your content.

But surely the conversion rate from these full-screen modal overlays must be lower than the attrition from scaring away passing readers?


Thank you for this.

I am a beginner hobbyist in server tech.

I am running into the perennial latency vs. throughput problem.

A batched queue can hide latency problems. "Processing 1 million per second" could mean 1 million separate requests at 100 nanoseconds of latency each, or batches at 10 microseconds per batch.

In XFaaS, do they run an interpreter in a thread? Or a warm process?


What's the difference between a PHP function and regular PHP?


Not having to spin a whole interpreter/VM up and down is, apparently, such a performance boost that some companies used to make a whole business out of it.


You don't have to restart the VM after each request?


> interpreter/VM

IOW, runtime. It is rather typical to refer to runtimes of languages not compiling to native code as VMs. Runtime for Facebook's fork of PHP is literally called "HipHop Virtual Machine"


about 500x price markup

going from a dumb bare metal server to AWS Lambda

I believe for PHP there is little difference. These function-hosting solutions are typically used with Java, where starting the JVM for every single request would be too much overhead.


My first question there is...could they do the same work with fewer calls?



