
Works out to around 115 function calls a second per server



The calls per server is probably not the difficult part - this is the type of scale where you start hitting much, much harder problems, e.g.:

- Load balancing across regions [0] without significant latency overhead

- Service-to-service mesh/discovery that scales with less than O(# of servers)

- Reliable processing (retries without causing retry storms, optimistic hedging; see the sketch at the end of this comment)

- Coordinating error handling and observability

All without the engineers who actually write the functions needing to know anything about the system (which requires airtight reliability, abstractions, and observability).

I don't mean to comment on whether this is impressive or not, just pointing out that per-server throughput would never be the difficult part of reaching this scale.

[0] And apparently for this system, load balancing across time, which is at least a mildly interesting way of thinking about it
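
To make the retry-storm point concrete, here is a minimal client-side sketch in Python: jittered exponential backoff plus a single hedged request. call_remote() is a hypothetical stub, and this isn't claimed to be how the system in the article actually works.

    import random
    import time
    from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

    def call_remote(payload):
        # Hypothetical placeholder for the actual RPC.
        ...

    def call_with_backoff(payload, attempts=4, base=0.05, cap=2.0):
        # Retry with full jitter so a wave of failures doesn't retry in lockstep.
        for attempt in range(attempts):
            try:
                return call_remote(payload)
            except Exception:
                if attempt == attempts - 1:
                    raise
                time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

    def call_with_hedge(payload, hedge_after=0.1):
        # Send a second ("hedged") request if the first is slow;
        # take whichever finishes first.
        with ThreadPoolExecutor(max_workers=2) as pool:
            first = pool.submit(call_remote, payload)
            done, _ = wait([first], timeout=hedge_after)
            futures = [first] if done else [first, pool.submit(call_remote, payload)]
            done, _ = wait(futures, return_when=FIRST_COMPLETED)
            return next(iter(done)).result()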


That doesn’t sound right… maybe per user?


1 trillion function calls over 100,000 servers. Technically they say trillions over hundreds of thousands, but I went with the lower-bound case.

1,000,000,000,000 / 100,000 = 10,000,000

10,000,000 / 24 / 3600 = 115.7
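
For what it's worth, the same back-of-the-envelope in Python, assuming (as the division by 24 and 3600 implies) that the trillion calls are per day:

    total_calls = 1_000_000_000_000   # 1 trillion function calls per day (lower bound)
    servers = 100_000                 # "hundreds of thousands" of servers, lower bound
    per_server_per_day = total_calls / servers              # 10,000,000
    per_server_per_second = per_server_per_day / (24 * 3600)
    print(round(per_server_per_second, 1))                  # 115.7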


We don't know what the servers are; a dual-socket EPYC, i.e. 128 cores, makes the number look beyond trivial.


So... 115.7 rps doesn't sound groundbreaking, right?


Was just trying to break the number down into something easier to understand; I don't know enough to say whether this is impressive or not! It depends on the complexity of the requests, and I guess the complexity of routing that many requests over such a large network. I've never worked at that scale.


This isn't unreasonable. ML workloads benefit from more computational time per request. Lower QPS = better results.



