As an aside, I’m always surprised at just how bad Meta’s software quality is, especially for a FAANG company with lots of expensive engineers.
Instagram and Facebook are some of the buggiest platforms I use. I often run into bugs in several key user journeys that go unresolved and it baffles me how they don’t pick up on them.
One of the problems is it’s basically impossible to get in touch with Meta to report them.
If you think they are buggy, try becoming a seller on Amazon. It's the buggiest software I have ever used. And they're the influencers behind the "two pizza rule". The second worst is Spotify: after a decade it still does not remember where you've stopped listening on radio plays. From the influencer people who brought you "tribes".
Strong agree here, at least on FB (I don't use Instagram, because why would I). I fully tolerated their growing pains back in the day, when the site went down for hours or you got frequent blank pages after some standard click on something. The thing is, even after things improved over time it's still buggy as hell.
I had so many problems with uploading photo albums, for example, all the time for the past 15 years. I used to put thousands of pictures from my full-frame camera travel and mountain adventures on FB, until I got fed up with wasting so much time redoing it all and battling the site. They literally pushed away a genuine content creator; at least in my social graph I was by far the most active one, and folks liked what I uploaded.
Sometimes it uploaded twice. Sometimes some photos refused to upload while being identical in every respect to the rest, which worked fine. It kept wiping out descriptions I had painstakingly added to every photo. Sometimes, even these days, the whole feed is blank, just the menus on the side. I deleted their mobile app since it was snooping for all the data it could get and draining the battery while not being used at all; that was actually a great move for my own personal happiness, so I'm not complaining about that one.
I'd say that FB is this successful despite their consistent lack of technical quality. They just perfectly nailed a hole in the market people didn't even know they wanted filled, and their timing to market. This is not unique to FB: where there is little pressure from roughly equal competition, every business I've seen is subpar and/or overpriced.
Then you use Google's products and it's night and day. I don't recall a single bug that ever affected me. Too bad Google+ never stood a chance.
> I had so many problems with uploading photo albums for example
I suspect it is by design, in a way. FB is not a photo storage platform, they need to compress the images a lot. A single photo may increase interaction and hence metrics like MAU but a whole album is unlikely to do so.
I can't speak to what is going on, but one thing I've heard that makes things difficult, at least for the frontend teams, is that they are battling ad blockers. To work around them, they generate incredibly obfuscated HTML that no one would normally ever create.
Somehow every comment in here is downplaying this achievement as "low" or unimpressive.
You truly underestimate the scale of an operation like this.
The vast majority of software companies will never count a trillion of anything. Even big companies that scale will only have a small subset of teams work on something this large.
The article is too light on details to estimate whether trillions is impressive or not. For example, if my single-server system easily handles 100 million calls per day and the load is almost exclusively CPU-bound (like most AI tasks), then scaling to 1 trillion per day might be as easy as buying 10,000 servers, which is totally a thing that mid-to-large-sized companies do to scale up.
What makes this Meta paper impressive is NOT scaling up to 1 trillion per day; it's that they manage to do so while keeping request latency low and CPU utilization high. Anyone who's been with Heroku long enough probably remembers when instances would suddenly sit 80% idle while requests were still slow. That was when Heroku changed their routing from intelligent to dumb. Meta is doing the opposite here, reducing overall deployment costs by squeezing more requests out of each instance than a simple random load balancer ever could.
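To make the routing point concrete, here's a toy queueing simulation. It's entirely my own illustration with made-up arrival and service rates (nothing from the paper or from Heroku): at the same load, sending each request to the less loaded of two randomly sampled servers cuts queueing delay dramatically compared to blind random assignment.

```python
import random

# Toy simulation, not Meta's or Heroku's actual router. Arrival and service
# rates are invented; the point is only the gap between the two policies.
def simulate(num_servers=100, num_requests=200_000, smart=False, seed=1):
    rng = random.Random(seed)
    free_at = [0.0] * num_servers        # time each server next becomes free
    clock = 0.0
    total_wait = 0.0
    for _ in range(num_requests):
        clock += rng.expovariate(0.9 * num_servers)  # ~90% fleet utilization
        if smart:
            # "power of two choices": pick the less loaded of two samples
            a, b = rng.randrange(num_servers), rng.randrange(num_servers)
            s = a if free_at[a] <= free_at[b] else b
        else:
            s = rng.randrange(num_servers)           # blind random routing
        start = max(clock, free_at[s])
        total_wait += start - clock                  # time spent queued
        free_at[s] = start + rng.expovariate(1.0)    # mean service time: 1 unit
    return total_wait / num_requests

print("random routing, mean queueing delay:   ", simulate(smart=False))
print("two-choice routing, mean queueing delay:", simulate(smart=True))
```

At that utilization, random assignment leaves some servers with long queues while others sit idle, which is the "80% idle yet still slow" symptom described above.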
It's an interesting paper, but they've made some odd trade-offs, sacrificing latency for resource efficiency, which makes it seem niche, especially for FaaS tech, and the TPS they're hitting is surprisingly low for something supposedly in widespread use at their company. Some of their suggestions at the end also already exist, in some form or another, as features of FaaS products.
To be clear, I think this paper is awesome and this platform is not a trivial piece of engineering, but it doesn't seem particularly novel, nor does it come close to the larger workloads that public cloud services handle.
>the vast majority of software companies will never count a trillion of anything
As others have noted, it's not impossible that many of our own laptops have run a "trillion functions". The devil is in the details here for systems researchers and engineers, and based on those details XFaaS isn't nearly as novel as, say, Presto was.
And this is also the HN where people boast they could rebuild $SUCCESSFUL_SOFTWARE on their own over a 3-day weekend.
There are tons of very brilliant and very smart people here, but there are also many who are too fond of themselves, or who have big issues understanding a problem's ramifications in real life / real business.
“Trillions of functions” is a metric that’s hard to know whether to be impressed by or not. I don’t think it’s impossible that my laptop runs “trillions of functions” every day.
But the callers on this platform are likely remote, and therefore it handles I/O as well, etc. Like I said, hard to understand whether it’s impressive or not.
The calls per server is probably not the difficult part - this is the type of scale where you start hitting much, much harder problems, e.g.:
- Load balancing across regions [0] without significant latency overhead
- Service-to-service mesh/discovery that scales with less than O(# of servers)
- Reliable processing (retries without causing retry storms, optimistic hedging; see the sketch below)
- Coordinating error handling and observability
All without the engineers who actually write the functions needing to know anything about the system (which requires airtight reliability, abstractions, and observability).
I don't mean to comment on whether this is impressive or not, just pointing out that per-server throughput would never be the difficult part of reaching this scale.
[0] And apparently for this system, load balancing across time, which is at least a mildly interesting way of thinking about it
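To illustrate two of the items above, here's a rough sketch of retries with jittered exponential backoff plus a hedged request. It's only my own illustration: `call_remote` is a hypothetical stand-in for whatever RPC the platform really makes, and none of this is taken from the XFaaS paper.

```python
import random
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def call_remote(payload):
    """Hypothetical flaky RPC, used only so the sketch runs on its own."""
    if random.random() < 0.2:
        raise TimeoutError("simulated timeout")
    time.sleep(random.uniform(0.01, 0.05))
    return f"ok:{payload}"

def call_with_backoff(payload, attempts=4, base=0.05, cap=1.0):
    # Full jitter keeps a fleet of retrying clients from synchronizing into
    # a retry storm after a shared failure.
    for attempt in range(attempts):
        try:
            return call_remote(payload)
        except TimeoutError:
            if attempt == attempts - 1:
                raise
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

def hedged_call(payload, hedge_after=0.03):
    # Fire a second copy if the first is slow, then take whichever finishes first.
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(call_with_backoff, payload)]
        done, _ = wait(futures, timeout=hedge_after)
        if not done:
            futures.append(pool.submit(call_with_backoff, payload))
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        return next(iter(done)).result()

print(hedged_call("example"))
```

In a real system the hedge would also cancel the losing request; the pool here simply lets it finish.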
I was just trying to break the number down into something easier to understand; I don't know enough to say whether this is impressive or not! It depends on the complexity of the requests, and I guess on the complexity of routing that many requests over such a large network. I've never worked at that scale.
> One example of load they demonstrate has 20 million function calls submitted to XFaaS within 15 minutes.
> Meta’s XFaaS is their serverless platform that “processes trillions of function calls per day on more than 100,000 servers spread across tens of datacenter regions.”
Just for one trillion (not "trillions"), that'd be 10 million function calls per server per day, ~7k per minute. Sounds about right to me. They'll surely want to leave some room for traffic spikes, server failures and unforeseen issues. Server uplink could also be a factor - at least that was the major bottleneck the last time I ran infrastructure serving lots of clients (about 100 million), but they are probably smarter about this than I was back then.
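For what it's worth, here is that back-of-envelope math spelled out (my own arithmetic, assuming exactly one trillion calls per day spread evenly over 100,000 servers):

```python
calls_per_day = 1_000_000_000_000   # one trillion
servers = 100_000

per_server_per_day = calls_per_day / servers          # 10,000,000
per_server_per_minute = per_server_per_day / 1_440    # ~6,944
per_server_per_second = per_server_per_day / 86_400   # ~116

print(per_server_per_day, round(per_server_per_minute), round(per_server_per_second))
```

A bit over a hundred calls per second per server is nothing exotic on its own; as noted elsewhere in the thread, the hard part is everything around it.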
The only impressive thing about this is the technology abuse required to do such a slow job at such a large scale.
I recall Ruby on Rails doing at most 150 requests per second on a laptop 15 years ago, which was laughably low even for those days. I could do hundreds of times more using C++, or at least tens of times more using unoptimized Python.
Their achievement should not be praised; on the contrary, they need to be told they're wasting resources, heating up the planet and emitting carbon dioxide for no good reason. This is not what optimization looks like.
And no, it is not serverless, since obviously something serves that stuff. Stop lying already.
“Server” is a long-running process that serves requests. “Serverless” means that your requests are handled by short-lived processes of some kind.
This is just the name for it. If you want to poke holes in the definition I gave, it won’t do any good, because definitions are kinda fuzzy and incomplete to begin with. If you try to argue about what “serverless” means, you’re just going to end up drifting away from the consensus about what words mean. Consensus is the only thing that matters here. Consensus generally won’t be universal—I know people that refuse to accept “literally” as an intensifier—but that is a problem we have to deal with.
Likewise, every pair of "wireless headphones" has many sets of wires inside it, if you cut them open.
“Vegan” products you buy at the store are made by humans, who are a type of animal, so you cannot say that vegan products contain no animal products. Plastic is an animal product because humans make it, and humans are animals.
If we go down this road, the only conclusions we get are absurd and useless ones. Language is a consensus process dominated by metaphor and shared context. It is not a mathematical process.
> A server is a physical machine with an operating system. It can run many processes.
That is an acceptable and different definition of “server”. Words have multiple definitions, that’s why any time you look something up in the dictionary, you are likely to see multiple entries for the same word.
I generally don’t use the word “server” that way. I say “machine” when I am talking about machines, and “server” when I am talking about daemon processes. A great many people I have worked with in the past ten years or so, in multiple parts of the country, have adopted similar terminology to lessen confusion.
If you accept that words have multiple definitions, and accept that “server” has multiple definitions, then it is easy to accept that “serverless” means something like “the response is not ultimately processed by a daemon process”. Just like we accept that wireless headphones are called “wireless headphones”, even though they have wires inside. They are only missing one particular set of wires. Like, “the audio is not transmitted to the headphones over a wire” wireless, “the request is ultimately processed by something other than a long-running daemon” serverless.
I find these definitions easy to understand, natural to use, and I find that they don’t confuse listeners when I use them.
I avoid calling the machine a “server” because it is sometimes confusing, but I’m not going to argue with someone who uses it that way and tell them that they are wrong. That would be horrible.
My point was that I have never heard anyone describe a server or serverless in that way: as simply a process. I agree that thinking about server as a process/Daemon is a great way to frame serverless.
I'd also never considered that wireless headphones have wires, so it's been fun to think about that too.
I assume these are mostly procedures, not functions (that is, they have side effects), since applying a pure function 1.3M times per minute would immediately raise the question of caching, which doesn't seem to be important for them.
But if you execute that many procedure calls, how do you guard against them influencing each other through their side effects? Memory leaks come to mind, or other weird bugs. Also, how do you manage the credentials needed to issue side effects in a (hopefully) zero-trust environment?
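On the memory-leak worry specifically, one common mitigation (my own sketch, not something the XFaaS paper is known to describe) is to run handlers in short-lived worker processes and recycle each worker after a fixed number of calls, so a leaky or misbehaving procedure can only poison a bounded number of invocations:

```python
from multiprocessing import Pool

def handler(payload):
    # Hypothetical side-effecting handler body; imagine it slowly leaks memory.
    return len(str(payload))

if __name__ == "__main__":
    # maxtasksperchild=100: each worker process is torn down and replaced
    # after 100 calls, discarding whatever state or leaked memory it built up.
    with Pool(processes=4, maxtasksperchild=100) as pool:
        results = pool.map(handler, range(1_000))
    print(sum(results))
```

PHP-FPM's pm.max_requests setting exists for the same reason.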
Off topic: I ended up not finishing the article, because it first presented me with a full-screen overlay asking me to sign up for their newsletter, forcing me to click "I want to read it first" to see the article.
Then after scrolling down two paragraphs, it presented me with another overlay blocking the article, again asking me to sign up for the newsletter.
Can we please stop rewarding these kinds of practices with traffic by posting and upvoting them?
I wonder how much actual conversion happens with these kinds of forms anyway. Who are the people who sign up for a random website's newsletter before even reading anything? And who signs up after reading just an intro?
I get the idea of the mailing lists - generate traffic that does not rely on external platforms, with their ever changing algorithms that need to be appeased and courted, and who might even change the rules or ban you from promoting your content.
But surely the conversion rate for these full-screen modal overlays must be lower than the attrition from scaring away passing readers?
I am running into the perennial latency vs. throughput problem.
Batching via a queue can hide latency problems: you can process 1 million calls per second either as 1 million separate requests at ~100 nanoseconds of latency each, or as far fewer batches at ~10 microseconds per batch. The throughput is the same, but the latency profile is very different.
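A toy calculation of that trade-off (made-up overhead numbers, loosely echoing the figures above): a fixed per-request overhead dominates when items are dispatched one by one, and is amortized away when they are sent in batches.

```python
PER_REQUEST_OVERHEAD_US = 10.0   # hypothetical dispatch/queueing cost per request
PER_ITEM_WORK_US = 0.1           # hypothetical work per item (100 ns)

def cost_unbatched(items):
    # One request per item: pay the overhead every time.
    return items * (PER_REQUEST_OVERHEAD_US + PER_ITEM_WORK_US)

def cost_batched(items, batch_size):
    # One request per batch: overhead is shared by batch_size items.
    batches = -(-items // batch_size)  # ceiling division
    return batches * PER_REQUEST_OVERHEAD_US + items * PER_ITEM_WORK_US

n = 1_000_000
print("one request per item:", cost_unbatched(n) / 1e6, "seconds")
print("batches of 10,000:   ", cost_batched(n, 10_000) / 1e6, "seconds")
```

Per-item latency goes up, since an item may wait for its batch to fill, which is exactly the latency-for-throughput trade being debated here.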
In XFaaS, do they run an interpreter in a thread? Or a warm process?
In other words, a runtime. It is fairly typical to refer to the runtimes of languages that don't compile to native code as VMs; the runtime for Facebook's fork of PHP is literally called the "HipHop Virtual Machine".
I believe for PHP there is little difference. These function-hosting solutions are typically used with Java, where starting a JVM for every single request would be too much overhead.