So what's wrong with 1975 programming? (2006) (varnish-cache.org)
91 points by mooreds on March 25, 2015 | 20 comments



All that slagging off of Squid looks pretty stupid now that Varnish are implementing... their own OS independent disk and memory management! (see https://www.varnish-software.com/blog/introducing-varnish-ma...).

Maybe those Squid developers weren't such knuckle-dragging morons after all.


Unwarranted hostility aside...

The fact that Varnish changed over the years neither invalidates this article, nor vindicates Squid's design.

On any remotely modern system (say, 2006 or later), Squid's design is absurd. The critique in this article is spot on. Squid basically pretends that the operating system's virtual memory system and disk cache simply don't exist, and spends its time working against them. This causes exactly the kind of problems detailed in the article.

Of course, that's because Squid is not Varnish. Squid was designed a long time ago, with maximum portability in mind, and intended to run on operating systems with very poor VM and disk cache systems. With that in mind, Squid's design makes sense. It just doesn't make sense on newer systems.

In Varnish, all of this work was delegated to the operating system. This works very well. It's certainly a lot simpler than Squid, in addition to being a lot faster.

As long as most of your hot data can fit in the disk cache, at least. The infrequently used parts, which could well be a lot larger than the frequently used parts, can be kicked out to disk by the OS, and although reading them back in incurs a performance penalty, it's not that bad. It only really affects less commonly accessed data, and doesn't interfere with everything else.

The original varnish design works great for that. It's less good if your entire working set fits in RAM (in which case, the slightly newer malloc-based system is faster because it has lower overhead, but becomes much slower if you really need to swap).

Varnish starts to fall down if your working set doesn't fit in RAM (in which case, you're doomed regardless), or if the total cache is really huge (think somewhere in the terabyte range).
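
To make the "delegate it to the OS" part concrete, here's a minimal sketch of an mmap-backed store (my own illustration, not Varnish's actual storage code): the application just reads and writes through pointers into the mapping, and the kernel's VM decides which pages stay resident and which get paged out.

    /* Minimal sketch of an mmap-backed object store (illustrative only,
     * not Varnish's real storage code). The kernel's VM and page cache
     * decide what stays in RAM; the application just uses pointers. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t cache_size = 1UL << 30;          /* 1 GB backing file */
        int fd = open("cache.bin", O_RDWR | O_CREAT, 0600);
        if (fd < 0 || ftruncate(fd, cache_size) < 0)
            return 1;

        /* Map the whole file; pages are faulted in and evicted by the OS. */
        char *store = mmap(NULL, cache_size, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
        if (store == MAP_FAILED)
            return 1;

        /* "Caching" an object is just a memcpy into the mapping;
         * dirty pages are written back whenever the kernel sees fit. */
        const char *obj = "HTTP/1.1 200 OK\r\n\r\nhello";
        memcpy(store + 4096, obj, strlen(obj) + 1);

        /* Serving it later is just reading through the same pointer;
         * if the page was evicted, a fault brings it back from disk. */
        printf("%s\n", store + 4096);

        munmap(store, cache_size);
        close(fd);
        return 0;
    }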

The new storage engine mostly just re-organizes the existing mmap-based caches. It has better cache eviction algorithms, which give a much higher cache hit rate, and much lower internal fragmentation. That alone accounts for nearly all of the performance benefit.

The only I/O change I can find is that it uses the write syscall to write newly cached objects to the file directly, rather than writing into the mmap'd file. That allows them to replace the contents of those pages wholesale: the OS can just drop the new data into the disk cache, rather than potentially having to re-read the old pages from disk if they happen not to be in the cache.

All of the reading, memory management and I/O is still done by the VM and disk cache systems of the OS. That hasn't changed.
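
As I read it (this is my sketch of the idea, not their code), the win is roughly this: dirtying a non-resident page of the mapping forces the kernel to fault the stale contents in from disk first, while a pwrite() at a page-aligned offset covering whole pages can go straight into the page cache.

    /* Sketch (mine, not Varnish's): replace a cached object either by
     * dirtying the mapping or by pwrite()ing whole pages into the file. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* (a) Through the mapping: if the target page was evicted, the store
     *     faults the stale contents back in from disk just to overwrite
     *     them immediately. */
    static void replace_via_mmap(char *store, size_t off,
                                 const char *obj, size_t len)
    {
        memcpy(store + off, obj, len);
    }

    /* (b) Via pwrite() at a page-aligned offset covering whole pages:
     *     the kernel can drop the new data straight into the page cache
     *     without re-reading the old page from disk first. */
    static void replace_via_write(int fd, size_t off,
                                  const char *obj, size_t len)
    {
        if (pwrite(fd, obj, len, (off_t)off) != (ssize_t)len)
            perror("pwrite");
    }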


Thanks for the details. It still stands that the original Varnish design was also less than optimal, even for the computers of 2006, and that the more recent changes made it use the hardware better.


Apart from that, it's also not really a bad idea to write your own virtual memory management layer if you know you can do a better job than the OS, because you know much more about the expected access patterns, or because you know that the OS does a really bad job in some common cases. Sometimes the OS simply can't handle large virtual memory areas (think of an OpenVZ/Virtuozzo VPS where the 256MB RAM limit is counted against virtual memory, so the 50GB of storage cannot be used this way).
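
There's also a middle ground between rolling your own VM layer and trusting the OS blindly: telling the kernel what you know about your access patterns. A minimal sketch, assuming a POSIX system (the function names are just for illustration):

    /* Sketch: hinting the kernel about access patterns instead of (or
     * before) writing your own VM layer. `map` is an existing mmap'd
     * region of `len` bytes. */
    #include <sys/mman.h>

    /* Before streaming through the region once, front to back: */
    static void hint_sequential(void *map, size_t len)
    {
        posix_madvise(map, len, POSIX_MADV_SEQUENTIAL);
    }

    /* After finishing with a chunk we won't touch again soon,
     * mark its pages as fair game to evict: */
    static void hint_done(void *map, size_t len)
    {
        posix_madvise(map, len, POSIX_MADV_DONTNEED);
    }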

Perhaps Squid didn't do such a good job, but just leaving everything to an idealized OS is short-sighted (as proven by the more recent Varnish code).


It's all about the cache hit rate. The difference between 99% and 99.9% cache hit rate is not .9%, it's 900%!
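
To spell out the arithmetic (a trivial sketch): what the backend sees is the miss rate, and going from 99% to 99.9% cuts it from 1% to 0.1%, a factor of ten.

    /* The 99% vs 99.9% point, spelled out: what the backend sees is the
     * miss rate, and 1% vs 0.1% is a factor of ten. */
    #include <stdio.h>

    int main(void)
    {
        const double requests = 1000000.0;     /* e.g. requests per hour */
        const double hit_rates[] = { 0.99, 0.999 };

        for (int i = 0; i < 2; i++) {
            double misses = requests * (1.0 - hit_rates[i]);
            printf("hit rate %.1f%% -> %.0f requests hit the backend\n",
                   hit_rates[i] * 100.0, misses);
        }
        /* Prints 10000 vs 1000: the 99% cache sends 10x (i.e. 900% more)
         * traffic to the backend than the 99.9% cache. */
        return 0;
    }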



And, even earlier (also with excellent commentary and occasional rebuttal): https://news.ycombinator.com/item?id=1554656


Thanks, some really intelligent comments on that thread.


It's still interesting to read about people minimising the number of syscalls and memory copy/allocation actions. When I was working at Zeus on their webserver back in 2001, it had a lot of effort devoted to exactly that. Strings referring to chunks of header rather than malloc-and-copy. A stat() cache to avoid touching the disk.
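
For anyone who hasn't seen the trick: instead of malloc'ing a copy of every header value, you keep (pointer, length) slices into the buffer the request was read into. A rough sketch of the idea (my illustration, not Zeus's code):

    /* Sketch of the "no malloc-and-copy" trick for header parsing:
     * a header value is just a (pointer, length) slice of the buffer
     * the request was read into. Illustration only, not Zeus's code. */
    #include <stdio.h>
    #include <string.h>

    struct strview {
        const char *ptr;   /* points into the request buffer, not owned */
        size_t      len;
    };

    /* Return the value of `name` as a slice of `buf`, no allocation. */
    static struct strview header_value(const char *buf, const char *name)
    {
        struct strview v = { NULL, 0 };
        const char *p = strstr(buf, name);
        if (!p)
            return v;
        p += strlen(name);
        while (*p == ':' || *p == ' ')
            p++;
        v.ptr = p;
        while (p[v.len] && p[v.len] != '\r' && p[v.len] != '\n')
            v.len++;
        return v;
    }

    int main(void)
    {
        const char req[] = "GET / HTTP/1.1\r\nHost: example.com\r\n\r\n";
        struct strview host = header_value(req, "Host");
        /* %.*s prints the slice without ever copying it. */
        printf("host = %.*s\n", (int)host.len, host.ptr);
        return 0;
    }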


God, what a mess we are in.

I can't wait for memristors to become commercialized and get TBs of register speed memory on every processor core. None of this cache, paging, NUMA nonsense.


This is not a problem solved by memristors. The more memory you have, the more addressing and multiplexing you need to address it. The delay in a multiplexer grows logarithmically with the number of inputs. With a cache, it is even worse, because you have to address it by the real address, not the address in cache. So there will always be a hierarchy of speeds, unless you can figure out a completely different way to design a multiplexer.

In the best case scenario, memristors give us TBs of NVRAM.
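
Back-of-the-envelope for the logarithmic part, assuming a tree of 2-way muxes and ignoring wire delay (which only makes large memories worse):

    /* Back-of-the-envelope for "mux delay grows logarithmically": a tree
     * of 2-way muxes selecting one of N words is about log2(N) levels
     * deep, before wire delay is even counted. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const double sizes_bytes[] = { 32e3, 1e6, 16e9, 1e12 };
        const char  *labels[]      = { "32 KB", "1 MB", "16 GB", "1 TB" };

        for (int i = 0; i < 4; i++) {
            double words  = sizes_bytes[i] / 8.0;     /* 64-bit words */
            double levels = ceil(log2(words));
            printf("%-6s -> ~%.0f mux levels\n", labels[i], levels);
        }
        return 0;
    }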


I am not a hardware designer, but are you saying one large 64-bit multiplexer to access the whole memory would be impractically slow? Even if we don't get register speed, it would simplify software design, wouldn't it?


Yes, it would be slow and large. Because it's large, you'd get less memory in the same area. Another factor is that RAM and CPUs are usually on completely different dies to begin with, which are manufactured with somewhat different processes so you can't just copy and paste them onto the same chip.

Incidentally, this is what computers looked like 30 years ago. You could have a CPU with a bunch of address and data pins wired up to a RAM chip that would give you whatever address you wanted right away.

Loosely speaking, a modern computer still works the same way, but memory speeds haven't kept up with CPU speeds. So, to make our software run faster, we put layers of smaller, faster memory between the CPU and the larger, slower memory. But the hardware hides all of this from the software: you don't have to care, unless you want to optimize things. So we have registers, L1 cache, L2 cache, RAM, SSDs, HDDs, and the network. You can write a program today and all seven layers of caching might be mostly transparent, some more so than others.

A lot of this complexity is in the hardware, and other parts of it are in the OS. Application developers have it easy.


I'm not dietrichepp, but that's the gist of it. Look into how multiplexers are actually implemented, and it becomes painfully obvious that hardware cannot magically solve your problems.

And the "even if we don't get register speed" already applies today. In a sense, your system does have one large multiplexer to access all of memory. It's called the memory controller. The point is that this is painfully slow and far from register speed, so caches are built on top of it to make it faster. The same applies to disk access.


We could throw away von Neumann architecture, move computation out to the data, and build an ALU for every 32k of memory (plus some FPUs here and there). That gives us 33.5 million compute units per TB of register-speed memory :)


The programming model would look a lot like a GPU does today. If computation were regular and uniformly distributed across the memory, then life would be easy and everything would work quickly. When computation patterns get chunky, there would be a hellish fight to make them fit the architecture. So processing trees and graphs would require a lot of tricky coding to work around the memory model.


This is not going to happen any time soon. Even within the core, you can't have everything at "register speed" anyway; and as soon as you have multiple sockets NUMA pops up again.


Yep, memristors will be cool, sure, but they aren't going to solve all our problems. Even within a core, not everything will work at "register speed".

I'd consider memristors to be something more helpful to our VM subsystems than anything else.


Or you could just use mlock to prevent a civil war? (Prevent the kernel from paging it out.)
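
Something like this, assuming a POSIX system (and subject to RLIMIT_MEMLOCK):

    /* Sketch: pin a region so the kernel won't page it out. Subject to
     * RLIMIT_MEMLOCK (and CAP_IPC_LOCK on Linux); see the reply below
     * on why this is a dangerous game. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 4 << 20;                /* 4 MB of "hot" data */
        void *hot = malloc(len);
        if (!hot)
            return 1;

        if (mlock(hot, len) != 0)            /* pin: no paging to disk */
            perror("mlock");

        /* ... use the data, confident it stays resident ... */

        munlock(hot, len);
        free(hot);
        return 0;
    }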


Locking pages to try to improve performance is a dangerous game.



