There may be further opportunities for improvement.
Chrome and Curl both report it takes about 1100ms to load the linked page's HTML, split about 50/50 between establishing a connection and fetching content. I'm not sure how the implementation works internally but that seems like a long time for a site served from memory and aiming to be "high-performance". The images bring the total time up to around 5.7s.
As a point of comparison, my site (nginx serving static content on a 0.25-CPU GCP instance) serves the index page in 250ms. Of that, ~140ms is connection setup (DNS, TCP, TLS). The whole page loads in < 1000ms. Screenshots:
https://i.imgur.com/X4LDbWj.png
https://i.imgur.com/Ccwzmgz.png
One thing to remember is that when a server like nginx serves static content, it's often serving it from the page cache (memory). The author of Varnish has written at some length about the benefits of using the OS page cache, for example <https://varnish-cache.org/docs/trunk/phk/notes.html>. Some of the same principles can be applied even for servers that render dynamically (by caching expensive fragments).
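To make the "caching expensive fragments" idea concrete, here's a hedged sketch; the FragmentCache type and the keys are made up for illustration, not from Varnish or the article:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical fragment cache: expensive-to-render pieces (sidebar, post list)
// are rendered once and reused, even though the page around them is dynamic.
struct FragmentCache {
    rendered: Mutex<HashMap<String, String>>,
}

impl FragmentCache {
    fn get_or_render(&self, key: &str, render: impl FnOnce() -> String) -> String {
        let mut map = self.rendered.lock().unwrap();
        map.entry(key.to_string()).or_insert_with(render).clone()
    }
}

fn main() {
    let cache = FragmentCache { rendered: Mutex::new(HashMap::new()) };
    // First call renders; later calls are just a hash lookup.
    let sidebar = cache.get_or_render("sidebar", || "<nav>...</nav>".to_string());
    let again = cache.get_or_render("sidebar", || unreachable!("already cached"));
    assert_eq!(sidebar, again);
    println!("{sidebar}");
}
```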
Author here. I wrote that post before I axed the CDN for the blog itself. It was true at the time of writing, but it isn't anymore; I still need to redo the CDN for the blog. All the images are CDNed with XeDN though.
The main thing the CDN provided was nodes on basically every continent that kept the site in cache. Without those, requests have to travel all the way to the Netherlands before the site can load, and the speed of light is only so fast.
If they are keeping your article cached, what's the point of saying it's a high-performance blog? Saying it's slow because the CDN is down means that it's just slow... You could have a 'high performance' blog running on a Raspberry Pi Zero if it's globally cached by someone else, but I wouldn't call that high performance.
Cool article though. Agree on the ructe part; I also dislike how whitespace is handled. I wish Jade/Pug templates could be done in Rust, but I will check out Maud.
I thought of a better way to phrase it. The website itself is fast, but the process of you observing the website is slow because of limitations of the speed of light (or other interconnects the internet uses to get your traffic to Helsinki).
My examples are in Ruby, which is super slow compared to what you’re doing. Now I’m super curious what kind of performance you’d get globally on Fly if you deployed to a bunch of different regions.
I know next to nothing about Nix at Fly, except that there are a few folks looking at it there. The repo at https://github.com/fly-apps/nix-base shows how it's working for a Rails app. Is that enough to get you running with Nix flakes?
Isn't the Internet connected internationally (US -> Europe, for example) via cables underneath the ocean? Wouldn't "speed of light" apply to satellites, i.e., actual light, not electric current?
Or is electricity flowing through a wire also "speed of light"?
First of all, the "speed of light" is usually referring to c, the maximum speed that matter or energy can move at.
Second of all, electrical signals in cables move at speeds slightly lower than c, but very close to it, so the speed of light is still a very good approximation of the possible upper bound.
Third of all, intercontinental cables are normally fiber optic, for several reasons. That is, they directly transmit light through the cable.
Fourth, it should be noted that electricity is actually the same thing as light, since photons are the carrier particles of the electromagnetic field (when two charged particles interact, they are actually exchanging a photon). It's of course not visible light, but satellite communication normally uses radio waves too, which are not visible light either.
Finally, whether through cables or satellite links, the distance/c theoretical minimum one-way latency usually significantly underestimates the actually achievable minimum latency, because the straight-line distance is much shorter than the real cable (or up-to-the-satellite-and-back) path the signals must travel. The difference between the straight-line and physical path distance is typically much larger than the difference between the theoretical speed of light and the actual signal propagation speed.
We tend to think of wires as a garden hose for electrons, but it's the EM field that propagates, more so than the electrons. Especially for AC power and for signals.
I'm not an expert in cross-continent interconnects; I have no idea what cables are being used there. I'd imagine that a lot of the backbone of the internet is fiber, because that's what all the SRE memes say about wandering backhoes and sharks being the primary predators of fiber optic cables.
Nothing is faster than the speed of light. In fact, signals transmitted via copper wires travel at about 2/3 the speed of light. I don't know the details about fiber ocean links, but they certainly won't be faster than the speed of light.
Fiber optics are also about 2/3rds the speed of light. One of the interesting things about Starlink is that when they have laser links between satellites they should be able to beat terrestrial latencies over very long distances like California to London.
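For a rough sense of the numbers, here's a back-of-the-envelope sketch; the distance, the ~2/3 c figure, and the path factor are all assumptions, not measurements:

```rust
// Back-of-the-envelope latency floor for a transatlantic request.
// All inputs are assumptions: ~6,600 km is roughly New York -> Helsinki
// as the crow flies, fiber carries signals at roughly 2/3 c, and real
// cable routes are longer than the great-circle path.
fn main() {
    let distance_km = 6_600.0_f64;
    let c_km_per_ms = 299_792.458 / 1_000.0; // speed of light in vacuum
    let fiber_km_per_ms = c_km_per_ms * 2.0 / 3.0;
    let path_factor = 1.5; // assumed detour vs. the straight line

    let one_way_vacuum = distance_km / c_km_per_ms;
    let one_way_fiber = distance_km * path_factor / fiber_km_per_ms;

    println!("one-way, straight line at c:       {one_way_vacuum:.0} ms"); // ~22 ms
    println!("one-way, realistic fiber path:     {one_way_fiber:.0} ms");  // ~50 ms
    println!("round trip before any server work: {:.0} ms", 2.0 * one_way_fiber); // ~99 ms
}
```

Even before TCP and TLS handshakes multiply that round trip, the physics alone accounts for a large slice of the latencies people report in this thread.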
As a contrasting point: I'm consistently getting 150ms from their main domain, and 25-35ms from their cdn subdomain. I suspect most of your latency is from "the internet".
After getting to the end of a long post, I'm disappointed not to find any latency or throughput metrics. The author seems to claim he has a very popular, high-traffic blog and that it is super fast, faster than all the popular web servers serving static pages. Where's the performance data to prove this?
edit: web.dev measure gave this blog post URL a performance score of 30/100, which is quite poor.
Ripping out Cloudflare made those metrics worse. I wrote this post before I ripped out Cloudflare, and it was accurate at the time of writing. It will be better once I can re-engineer things to be anycast.
It would be good if the post contained some data to justify its points, like a graph of loading times. Otherwise assertions like "So fast that it's faster than a static website." don't seem supportable.
I would have liked to see the actual results from this comparison: "I compared my site to Nginx, openresty, tengine, Apache, Go's standard library, Warp in Rust, Axum in Rust, and finally a Go standard library HTTP server that had the site data compiled into ram."
I'm sorry, but I lost that data after some machines got reinstalled. I can attempt to recreate it, but that will have to wait for a future blog post.
I want to see this taken to the logical extreme. A real OS with actual drivers (no unikernel, no virtio) for a small set of hardware that only serves static pages. No need for virtual memory. Just hardcode the blog posts right into the OS and use the most minimal TCP stack you can make.
I think that should be possible with Cosmopolitan Rust (https://ahgamut.github.io/2022/07/27/ape-rust-example/). It would create a bare-metal-runnable ELF binary with just Cosmopolitan libc statically linked; not sure about driver support though.
If it's amd64, long mode requires a page table. Otherwise, a page table is handy so you can get page faults for null pointer dereferencing. Of course, you could do that only for development, and let production run without a page table.
My hobby OS can almost fill your needs, but the TCP stack isn't really good enough yet (I'm pretty sure I haven't fixed retransmits after I broke them; no selective ACK; ICMP path MTU discovery is probably broken; certainly no path MTU blackhole detection; IPv4 only; etc.), and I only support one Realtek NIC, because that's what I could put in my test machine. Performance probably isn't great, but it's not far enough along to make a fair test.
I am actually not sure a more minimal TCP stack would be best, especially if you need to handle packet loss due to congestion. Recent work such as RACK-TLP gives certain workloads better performance, but it's not something you would have in a minimal TCP stack.
One approach is to run some kind of optimizer on a docker image that throws away everything that does not contribute to the end goal of yeeting text at http clients.
I think they typically target hypervisors because it's far more likely that that's what people will want to run on, but there's nothing fundamental stopping a unikernel from running on bare metal.
I remember working in 2008 on a project for some geothermal devices that were spitting IoT data onto a "hardcoded" HTML page directly in the C code of the program. The device used a Chinese 8051-like CPU, so there was no OS per se.
How can it be faster than a static page that is already in memory? The bytes are there; you just send them over a socket. Is transforming some template into Rust code and back into a string buffer somehow faster?
I don't think the author is claiming it is faster than a static site stored in memory, they're saying it is faster than a traditional static site that loads files from the disk. At least that's how I read it.
That “traditional” site doesn’t actually load the data from disk, in practice. It does once, after a reboot, but that’s true for this solution’s executable file as well.
Does Apache/Nginx/IIS load static files into memory ahead of time? I would assume not, unless someone went through and did some optimization. Even so, there is always a point where memory runs out, and in that case a templating engine is essentially compression. I would assume that if the author output his whole website as static files and stored them in memory it would be even faster, but that would require quite a bit more memory.
> Does Apache/Nginx/IIS load static files in memory ahead of time?
Linux loads them on the first use. If you have enough memory, they'll just stay there. It doesn't take that much memory; most sites are pretty small.
But the article's way does use less memory and fewer system calls, and it is completely optimized for that one site only. So yeah, it will surely be faster. Besides, his site appears to not be static.
The problem with OS file caches has always been that people look at a box, see that the programs aren't consuming all of the available memory, and argue that they should be able to cram more shit onto the box because it's 'underutilized'.
There are very reasonable and sane system architectures that let the OS handle caching, but you need a way to defend against these sorts of situations.
The performance falloff for this failure mode is exponential, so people try it a few times, and not getting any negative feedback, they add it to their toolbox only to get lectured months later once the bad behavior has not only become standard for them but also spread to other people.
It almost begs for a different system call that can earmark the memory usage by the app in a way that's easier for people to see.
With Apache/Nginx etc., a file is cached by the VM/FS layer on the first request and will stay in RAM for a long time unless there is memory pressure. For most sites this is good enough. For cases where it isn't, one can pre-load files after a reboot using `find /path -type f -exec cat {} + > /dev/null`.
Author here. I don't identify as male. It would be nice if you could update your comment to not make a factual error when referring to me. Please use https://pronoun.is/they.
It can be a tiny amount more efficient, since an async disk IO implementation might dispatch the file read() call to a thread pool, wait for the result, and then send the data back to the client. That makes two extra context switches compared to sending data from memory. Now, if the user is super confident that the data is hot and in the page cache, then a synchronous disk read will fix the problem. Or try a read with RWF_NOWAIT and only fall back to a thread pool if necessary.
On the other hand, rendering a template on each request also requires CPU, which might be either more or less expensive than doing a syscall.
All in all, the efficiency differences are likely negligible unless you run a CDN which does thousands of requests per second.
In terms of throughput to the end user it will make zero measurable difference unless the box ran out of CPU.
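A minimal sketch of that RWF_NOWAIT fallback, assuming Linux and the libc crate; the function name and error handling here are mine, not from the article:

```rust
use std::fs::File;
use std::os::unix::io::AsRawFd;

/// Try a non-blocking page-cache read. Ok(Some(n)) means the data was already
/// cached and we read n bytes; Ok(None) means it wasn't cached and the caller
/// should fall back to a blocking read on a thread pool.
fn read_if_cached(file: &File, buf: &mut [u8], offset: libc::off_t) -> std::io::Result<Option<usize>> {
    let iov = libc::iovec {
        iov_base: buf.as_mut_ptr() as *mut libc::c_void,
        iov_len: buf.len(),
    };
    let n = unsafe { libc::preadv2(file.as_raw_fd(), &iov, 1, offset, libc::RWF_NOWAIT) };
    if n >= 0 {
        return Ok(Some(n as usize));
    }
    match std::io::Error::last_os_error() {
        e if e.raw_os_error() == Some(libc::EAGAIN) => Ok(None), // not in the page cache
        e => Err(e),
    }
}

fn main() -> std::io::Result<()> {
    let file = File::open("/etc/hostname")?;
    let mut buf = [0u8; 4096];
    match read_if_cached(&file, &mut buf, 0)? {
        Some(n) => println!("served {n} bytes straight from the page cache"),
        None => println!("cold data: hand this read off to a blocking thread pool"),
    }
    Ok(())
}
```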
On the one hand, sure, you can probably squeeze a cycle or two out of buffering everything in memory. Even though your disk read is in all likelihood a memory read, given how filesystem caching works, it's still an IO call, which isn't free.
Keeping everything in user space buffers might just be faster.
On the other hand, you're sending that sucker over network, and what you save doing this is most likely best counted in microseconds/request. It's piss in the ocean compared to the delay introduced even over a local network.
> Even though your disk read is a memory read in all likelihood given how filesystem caching works, it's still an IO call, which isn't free.
I wonder if io_uring could be used to issue a single syscall that would read data from disk (actually using page cache) and send it on the network.
Of course, you could use DPDK or similar technologies to do the opposite - read the data from disk once and keep it in user-space buffers, then write it directly to NIC memory without another syscall. That should still theoretically be faster, since there would be 0 syscalls per request, where the other approach would require 1 per request.
You can do kernel TLS for sendfile at least, maybe for io_uring too? Probably not for HTTP/2, but I'm not convinced multiplexed tcp in tcp is a good protocol for the public internet anyway.
That's indeed possible, if one has a TLS stack that supports kTLS. I don't think there are too many of those yet, though, and probably even fewer in Rust, where both the library and a potential Rust wrapper would need to support it.
Just like the Rust executable is still on disk. Sure, if it is running, it is memory mapped, but it can still be paged out. This is not theoretical. In practice, upon request, the probability of finding the static page in cache should be similar to the probability of the executable not being paged out. (That's true as long as the actual data is the same, and the differing factors, like the size of the web server executable, are small compared to the amount of free memory.)
Interesting. I've never heard of that happening before. Can you link a reference to where I can find out more about that aspect of the linux memory subsystem?
The mlock manual[0] has a "notes" section that provides a good brief summary. The GNU libc manual has more than anyone would ever want to read about memory management, including a section on memory locking[1].
On an intuitive level, think of swap as being a place the kernel can put memory the program has written. When you malloc(4096) and write some bytes into it, the kernel can't evict that page to disk unless there's some swap space to stick it in. However, executables are different because they're already on disk -- the in-memory version is just a cache (everything is cache (computers have too many caches)). The kernel is allowed to drop the copy of the program it has in memory, because it can always read it back from the original executable.
It's one of the reasons why running without swap can have even worse pathological behaviour than running with swap. With swap, the kernel can prioritise keeping code in RAM over little-used data, whereas without it, when RAM fills up with data, eventually the currently running hot code gets evicted and performance completely tanks, meaning the system doesn't actually hit the nice OOM error you hope it would (hence userspace utilities like earlyoom that kick in before the kernel's absolute last-resort strategy).
I believe that when a file is mmap'd, a mapping is set up for it in the process's page tables. As you read/write the file, faults load the actual entries into that page table. Just as pages can be mapped in, they can also be unmapped under memory pressure, without that falling back to swap (since it is already a file-backed mapping; you wouldn't swap a file-backed mapping to a different file, after all).
There are a few relevant bits to this. You can MAP_POPULATE the file to prepopulate the entries, and you can use MAP_LOCKED to MAP_POPULATE and lock the pages in (unreliably). As mentioned in the man page for mmap, MAP_LOCKED has some failure modes that you don't get with mlock.
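A hedged sketch of the prefault-and-pin combination using the libc crate (Linux-only; the file path is just a placeholder):

```rust
use std::fs::File;
use std::os::unix::io::AsRawFd;

fn main() -> std::io::Result<()> {
    // Placeholder path: imagine this is the pre-rendered site data.
    let file = File::open("/var/www/site.bin")?;
    let len = file.metadata()?.len() as usize;

    unsafe {
        // MAP_POPULATE prefaults every page so the first request never waits on disk.
        let addr = libc::mmap(
            std::ptr::null_mut(),
            len,
            libc::PROT_READ,
            libc::MAP_PRIVATE | libc::MAP_POPULATE,
            file.as_raw_fd(),
            0,
        );
        if addr == libc::MAP_FAILED {
            return Err(std::io::Error::last_os_error());
        }
        // mlock pins the pages in RAM; unlike MAP_LOCKED it fails loudly
        // instead of silently leaving pages evictable.
        if libc::mlock(addr as *const libc::c_void, len) != 0 {
            return Err(std::io::Error::last_os_error());
        }
        // In a real server you'd hold on to `addr`/`len` and serve responses
        // out of this mapping for the lifetime of the process.
        println!("mapped and locked {len} bytes");
        libc::munmap(addr, len);
    }
    Ok(())
}
```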
I was thinking the same. They said a Go precompiled version was faster, but it was 200MB, which I don't understand.
200MB of pages and assets, sure. Code? No. If you compile it into the binary then the storage is no worse than having a small binary and all the resources separate.
Taking a statically generated site and returning the raw bytes is 100% faster. The author said so themselves.
The tech is cool, but some of the language is so cringy. For example, the statement "websites are social constructs" makes zero sense. You could say that websites are material objects of a symbolic network of computer languages, like physical paper money is a material, fetishized object of the social construct of money. Websites themselves are not constructed socially. Maybe the author means how websites are perceived, or conventions of web tech itself, is constructed socially?
I don't have the time to get into a hardcore semiotics discussion at the moment, but basically I'm using words in the ways that normal people use words, which generally treats perception of the conventions of a thing as the thing itself. People do this mostly for convenience.
A website is a social construct because it can only function by the agreement of everyone involved (i.e., we all agree on how to parse HTML).
The individual site may be constructed individually (maybe) but it can only work if the society of people-who-use-the-internet all agree to follow a series of conventions about how websites work; you can't start using \<soul\> instead of \<body\> and expect everything to work as normal, because the reason the \<body\> tag is used to define the body of a page is because we needed a way to make sure people can use a webpage without having to define an entire new language for each one.
Sure, but that's as useful as saying shit is a social construct because we as humans decided to name it that. Technically true; practically it's vapid, useless speech that doesn't bring anything to the discussion aside from the person using the term feeling smart.
Shit is not a social construct. A human being can produce shit without the cooperation of any other human, no matter what language that human uses to describe it. Bathrooms are a social construct; we had to all agree that it's not acceptable to shit in the sink.
Hey mate, you need to try a web browser. I found my experience using the Internet greatly improved when I stopped trying to parse the html documents myself.
I don't agree with this at all. What makes one set of frequency changes over a wire a website and another a voice call? A big pile of socially constructed concepts, from written language to Unicode to TCP and HTML. The electrical impulses are physically real; the website is a construct and makes sense only in the context of society.
Wow, NoraCodes! I just finished your Rust book! It's great and you're a hero!
But no, a website is not a social construct because you don't have to have a society to have a website. I can have two machines connected and host an html file on one of them and stare at it on the other one all by myself and it will still be a website on a web! No contractual agreement is necessary!
But anyway, it's amazing that you posted on my comment! I am a huge fan!
Talking about cringe, the furry stuff does, in my opinion, also negatively impact the impression the article makes. Also, I distinctly remember an extremely similar blog written by someone under a different name, but apparently the author changed it yet again.
You don't need Rust for this -- you can do the same in Go, Node, etc. In 2012 my cheap VPS had a crappy HDD share but fairly acceptable memory, so I rendered the Markdown files and stored them in a little structure, returning them directly from memory.
Everyone thought it was amazing even though it was just a dumb http server returning pages[req.path] :-) Latency was under 10ms which was pretty amazing for a 2012 KVM VPS.
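For reference, the whole trick really is just a map keyed by request path. Here's a minimal sketch in Rust, assuming axum 0.7 and tokio; the routes and page contents are invented:

```rust
use std::collections::HashMap;
use std::sync::Arc;

use axum::{
    extract::State,
    http::{StatusCode, Uri},
    response::Html,
    Router,
};

type Pages = Arc<HashMap<String, String>>;

#[tokio::main]
async fn main() {
    // Render everything once at startup and keep it in memory.
    let mut pages = HashMap::new();
    pages.insert("/".to_string(), "<h1>Home</h1>".to_string());
    pages.insert("/about".to_string(), "<h1>About</h1>".to_string());
    let pages: Pages = Arc::new(pages);

    let app = Router::new().fallback(serve_page).with_state(pages);

    let listener = tokio::net::TcpListener::bind("0.0.0.0:3000").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}

// Every request is a single hash lookup: pages[req.path].
async fn serve_page(State(pages): State<Pages>, uri: Uri) -> Result<Html<String>, StatusCode> {
    pages
        .get(uri.path())
        .cloned()
        .map(Html)
        .ok_or(StatusCode::NOT_FOUND)
}
```

Each request is then a hash lookup plus a socket write, which is why the language barely matters.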
I don't think OP was implying that Rust was a requirement, just what was actually used in this case. And, indeed, OP gives some reasons that Rust might be preferable:
> And when I say fast, I mean that I have tried so hard to find some static file server that could beat what my site does. I tried really hard. I compared my site to Nginx, openresty, tengine, Apache, Go's standard library, Warp in Rust, Axum in Rust, and finally a Go standard library HTTP server that had the site data compiled into ram. None of them were faster, save the precompiled Go binary (which was like 200 MB and not viable for my needs). It was hilarious. I have accidentally created something so efficient that it's hard to really express how fast it is.
I did that in Go, although it was "only" caching the Markdown rendering - the page templates were written in Go (via some lib that gave tools to make that manageable) and compiled with the app, so the whole template building was blazingly fast.
I get the fun for a developer of setting up something like this to experiment and learn new things. But I'm left with a question: why? Like, is there really a point aside from the aforementioned intrinsic dev fun?
There has to be a point of diminishing returns. And again, I'm not discarding the dev side of things, but it seems like a lot of extra tooling and complexity for not much gain.
I admire the OP's ability to use their blog as a rapid prototyping platform that is constantly growing and changing. Over engineering on a personal project like this is the whole point! Very cool.
I am too much of an OCD perfectionist and don't have the guts to ship this often.
Relentless Refactoring is a great tool, but one that is often stymied by faddish behaviors like micro-services/modules. Small projects tend not to have that problem and so make a better petri dish. Of course then you have to take your knowledge out of the 'lab' and apply it in vivo...
A lot of our (and in particular, my) best features come from relocating the boundaries between things, to make space for features that weren't considered in the original design. With monolithic systems we see this late in the lifecycle in the form of Conway's Law. If you stick this problem in front of the CI/CD mirror, it's painful to face. CI/CD argues that if something is difficult we should do it all the time so that it's routine (or stop doing it entirely).
However there's a conspicuous lack of tools and techniques to make that practical. The only one I really know of is service retirement (replace 2-3 services with 2 new, refactored services), and we don't have static analysis tools that can tell us deterministically when we can remove an API. We have to do it empirically, which is fundamentally on par with println debugging.
I had the numbers at one point, but I have lost them. I can try to recreate them, but I'd probably have to use my old mac pro again to be sure the results are consistent.
See https://en.wikipedia.org/wiki/Singular_they for more information. It's been around since the 1300s, but it's only gone mainstream as a way to refer directly to specific people fairly recently.
This loaded pretty slowly for me (2 seconds) and also has aggressive page layout changes. It's almost like, for 99% of software, the most important part is the UX, not the low-level programming language that is chosen.
Can you use more rust to serve the 7 readers of a blog? You know what: use caching or something that compiles to plain html (hugo, jekyll etc.). No need for hardcore memory optimization.
And then all the gains were entirely eaten by the first hop to a network device. Speaking from experience, as I did a similar thing, although speed was not a concern, just perpetual annoyance with the available tools for blogging.
It's pretty bad faith to post an inflammatory comment, initially with several errors like referring to the Ref<> struct instead of Rc<>, add edits to complain about downvotes, and then steadily edit it to be more correct and reasonable while leaving in the complaints about downvotes. That leaves the impression that it was the current iteration that attracted the downvotes because "the truth hurts".
It suggests to me you know as well as we do that the downvotes were about snarkily expressing your views while making mistakes that might suggest you aren't all that familiar with rust in the first place, and not that you're expressing "forbidden thoughts."
As always, Venn diagram of "people with extremely pointless gripes about the Rust language" and "people who compare the slightest criticism to being accused of thoughtcrime" is a circle.
That's not to say everyone who dislikes Rust is like this. I have plenty of gripes about Rust. Just the people who say things like "if you use a refcount and make a cycle, you can leak memory, oh noez".
Some seem to interpret the very existence of other people's ideas, and especially what they perceive as an emerging consensus, as a threat to their autonomy? I don't understand it, but I do observe it. I feel like what people actually take issue with is the existence of Rust and with people talking about it, so they complain about how often it's on the front page, or they'll complain that it's in the title of the post, etc. But I feel like the root cause here is that they want it to go away, because its existence bothers them.
There is value in having unsafe parts of a program clearly annotated (not just with comments). It is similar to how in some languages you annotate pure functions and they do not compile unless they are pure.
Anytime you need to implement anything that shares references (just about any data structure worth its weight in implementation time), you need to use Rc<> and friends.
It's not anytime. It's limited to recursive types with interior mutability. These two conditions are specific to mutable graphs, not just any shared ownership. There are plenty of uses of Rc that cannot possibly cause a cycle.
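To make the distinction concrete, here's a small illustration (the types and names are made up): plain Rc sharing can't form a cycle, and even a mutable parent/child graph avoids the leak when the back-edge is a Weak.

```rust
use std::cell::RefCell;
use std::rc::{Rc, Weak};

// Shared ownership with no interior mutability: a cycle is impossible here.
struct Config {
    name: String,
}

// The risky shape: a recursive type whose links can be mutated after creation.
// Using Weak for the back-edge keeps strong counts able to reach zero.
struct Node {
    parent: RefCell<Weak<Node>>,
    children: RefCell<Vec<Rc<Node>>>,
}

fn main() {
    // Case 1: many owners of one immutable value. Refcounting just works.
    let cfg = Rc::new(Config { name: "blog".into() });
    let cfg2 = Rc::clone(&cfg);
    println!("{} has {} owners", cfg.name, Rc::strong_count(&cfg2));

    // Case 2: a mutable parent/child graph. Child -> parent is Weak, so
    // dropping `parent` and `child` still frees everything; had both edges
    // been strong Rc's, the pair would leak.
    let parent = Rc::new(Node {
        parent: RefCell::new(Weak::new()),
        children: RefCell::new(Vec::new()),
    });
    let child = Rc::new(Node {
        parent: RefCell::new(Rc::downgrade(&parent)),
        children: RefCell::new(Vec::new()),
    });
    parent.children.borrow_mut().push(Rc::clone(&child));
    println!("strong refs to child: {}", Rc::strong_count(&child)); // 2
}
```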