For the record, I think this will be a disaster. There is far too much
code that will get broken, largely silently, and much of it is not
under our control.
regards, tom lane
For those who don't follow PHP closely - that version was an attempted refactor of the string implementation which essentially shut down nearly all work on PHP for a decade, stagnating the language until it became pretty terrible compared to other options. They finally gave up and started work on PHP 7 which uses the (perfectly good) PHP 5 strings.
Ten years of wasted time by the best internal PHP developers crippled the project - I'm amazed it survived at all.
On the other hand, there's also the case of the Lunar Module guidance software that was hard-coded to run exactly every two seconds. If the previous subroutine call was still running when the next one was due, the previous one was harshly terminated (with weird side effects).
One of the main programmers suggested making it so that the next guidance routine wouldn't run until the previous one was done. This would make the code less sensitive to race conditions and allow more useful functionality for the pilots (who were the actual users and did seem to want it). However everyone assumed the two-second constant was implicitly embedded everywhere.
It wasn't -- only in a few places -- and with that fixed the code got more general and the proof of concept ran better than ever in just about every simulator available. The amount of control it gave pilots was years ahead of the curve. But it never got a chance to fly on a real mission because what was there was "good enough" and nobody bothered to try.
In our combined comments there's a lesson about growing experiments and figuring out how to achieve failure quickly.
I think this is a great article that makes a maximalist point, and that’s its flaw.
You should rewrite code only when the cost of adding a new feature (one that is actually necessary) to the old codebase becomes comparable to designing your entire system from scratch to allow for that feature to be added easily. That is to say that the cost of the rewrite should become comparable to the cost of continuing development. I have been a part of a couple of rewrites like that, one of them quite complex, and yes they were warranted and yes they worked.
But having said that you should absolutely be conservative with rewriting code. It’s a bad habit to always jump to a rewrite.
I think it’s very dependent on how you use words like “rewrite” or “refactor”. The point the author makes about the two page function, and all the bug-fixes (lessons learned) makes sense only if you “rewrite” from scratch without looking at the history. You can absolutely “rewrite” the function in a manner that is “refactoring”, but will often get called “rewrite” in the real world. This may be because “refactor” is sort of this English CS term that doesn’t have a real translation or usage in many languages and “rewrite” is sort of universal for changing text, but in CS is sort of “rebuilding” things.
I don’t think you necessarily need to be conservative about rewriting things. We do it all the time, in fact. We build something to get it out there and see the usage, and then we build it better, and then we do it again. That often involves a lot of “rewriting”, but principles like SOLID’s single responsibility make this rather easy to both do and maintain (we write a lot of semi-functional code and try to avoid using OOP unless necessary, so we don’t really use all the parts of SOLID religiously).
I do agree that it’s never a good idea to get into things with the mind-set of “we can do this better if we start from scratch” because you can’t.
There's currently a trend towards shitting on microservices-everything, imo largely justified. But missing from that is that identifying a logical feature and moving it to a microservice is one of the safer ways to begin a gradual rewrite of a critical system. And usually possible to get cross-dept buyin for various not always wholesome reasons. It may not always be the best technical solution but it's often the best political one when a rewrite is necessary.
>identifying a logical feature and moving it to a microservice is one of the safer ways to begin a gradual rewrite of a critical system
Why not identify that same logical feature and move it into a library. How does a Microservice add value here?
Identifying, extracting and adding tests to logical features has been the sane way to rewrite software for ages. Michael Feathers even wrote a book about it [1]. This Ship of Theseus approach works because it's incremental and allows for a complete rewrite without ever having non-functional software in between.
Adding a REST/networking/orchestration boundary to make it a Microservice just for the sake of extracting it adds a lot of complexity for no gain.
Microservice can be the right architecture, but not if all you want is to extract a library.
The big problem there is that the people you are letting loose on the alternative are lost from the original, so O loses the steam that A gains. You still have to produce bug fixes and features for _both_ O and A to keep them in sync. So you essentially need double the production rate from the same staff.
So in order for there to be a net gain, the gang working on the alternative has to find wins so big as to be nigh impossible.
This is a very very hard problem in our domain. 99% of the time, we have to simply resist the urge to _just rewrite the sucker_. No! Don't do it! (And this is incredibly hard because we all want to.)
I'm saying that the productivity gain would have to be incredibly large since it has to encompass a doubled output of features and bug fixes for a net zero change.
Say you have product A with features P, Q, R, and S; writing product B has to reproduce P, Q, R, and S, plus X and Y that are currently being produced by the A team. On top of that, it has to fix (conceptual) bugs within P, Q, R, and S. All this is to be done by the new, crack team that aims to make it so that: cost-of-development(B) < cost-of-development(A).
But the point is that the difference in magnitude of cost-of-development(B) and cost-of-development(A) has to be rather large considering the amount of work needed to have a return on that investment at all.
Yes, that was a rough migration process, but the long-term result is we have an improved language and growing community instead of Python going the way of PHP and Perl.
Between the 3 P's, Python's strategic decisions in 2000s were clearly the most successful.
Python3 changes were many "little" things; some more fundamental than others (Unicode str). So I guess they were able to split the work into tiny pieces and, ultimately, were able to manage the project...
Process isolation affects so many things in C. The strategy change is going to require changes to so many modules that it will either be a re-write or buggy.
In practical terms, if every line needs to be audited and updated, it is a re-write
What makes you think that it will require that many changes? There will be some widespread mechanical changes (which can be verified to be complete with a bit of low level work, like a script using objdump/nm to look for non-TLS mutable variables) and some areas changing more heavily (e.g. connection establishment, crash detection, signal handling, minor details of the locking code). But large portions of the code won't need to change. Note that we/postgres already shares a lot of state across processes.
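To make that concrete, here is a rough sketch of the kind of low level check I mean (an illustration only, not an actual Postgres tool): it shells out to objdump and flags symbols living in writable, non-thread-local sections, i.e. mutable state that hasn't been moved to TLS or into per-session structs.

    /* Rough sketch only: list symbols in writable, non-TLS sections
     * (.data/.bss). TLS symbols live in .tdata/.tbss and so won't match. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <binary-or-object>\n", argv[0]);
            return 1;
        }

        char cmd[1024];
        snprintf(cmd, sizeof(cmd), "objdump -t %s", argv[1]);

        FILE *p = popen(cmd, "r");
        if (!p) {
            perror("popen");
            return 1;
        }

        char line[1024];
        while (fgets(line, sizeof(line), p)) {
            /* Leading space keeps ".tdata"/".tbss" (TLS) from matching. */
            if (strstr(line, " .data\t") || strstr(line, " .data ") ||
                strstr(line, " .bss\t")  || strstr(line, " .bss "))
                fputs(line, stdout);
        }
        return pclose(p);
    }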
I'm not the person you asked and I don't have any particular knowledge of postgres internals.
Experience with other systems has taught me that in a system that's been in active use and development for decades, entanglement will be deep, subtle, and pervasive. If this isn't true of postgres then it's an absolute freak anomaly of a codebase. It is that in other ways, so it's possible.
But the article mentions there being thousands of global variables. And Tom Lane himself says he considers it untenable for exactly this reason. That's a very good reason to think that it will require that many changes imo.
A large refactor at best. It will touch lots of parts of the code base, but the vast majority of the source code would remain intact. Otherwise they could just Rewrite it in Rust™ while they’re at it
> if every line needs to be audited and updated, it is a re-write
I’m not sure why you believe every line needs to be updated. Most code is thread agnostic.
I've used PHP in the past (PHP 4 and 5), as well as some simple templated projects in PHP 7. I try to keep up on news with what is happening in the PHP world, and it's difficult because of the hate for the language. Is the solution to Unicode strings still to just use the "mb_*" functions?
I got my real professional start using PHP, and have even built financial systems in the language (since ported to .NET 6 for ease of maintenance and better number handling). I'm still very interested in the language itself, in case I ever have the need to freelance or provide a solution to a client that can't afford what I can build in .NET (although to be honest, at this point I'm roughly able to code at the same speed in .NET as in PHP, but with added type-safety - though I know PHP has really stepped up in providing this).
I believe so - most (all?) string functions have an mb_ equivalent, for working on multibyte strings.
Regular PHP strings are actually pretty great, since you can treat them like byte arrays. Fun fact: PHP's streaming API has an “in-memory” option and it’s… just a string under the hood.
Just don’t forget to use multibyte functions when you’re handling things like user input.
I have the "Professional PHP6" book which I feel like should be a collectors item or something.
Weird book IMO, because it has a lot of content that's just about general software development, rather than anything to do with PHP specifically, or the theoretical PHP6 APIs in particular.
PHP used to be the first computer language learned by people wanting to create a scripted web page. This was more true in the 90s but maybe it stuck. So it would be OK to add some general guidance about writing software and organizing projects.
I don't expect you or others to buy into any particular code change at
this point, or to contribute time into it. Just to accept that it's a
worthwhile goal. If the implementation turns out to be a disaster, then
it won't be accepted, of course. But I'm optimistic.
The reply is much more reasonable than this blanket assertion of a disaster.
As an outsider it doesn't sound like something a few people could spin off in a branch in a couple months and see how code review goes. They're talking about doing it over multiple (yearly?) releases. It seems like it'll take a lot of expert attention, which won't be available for other work and the changes themselves will impact all other ongoing work.
I'm not trying to naysay it per se, bc again I don't have technical knowledge of this codebase. But that's exactly the sort of scenario that can cause a large project to splinter or stall for years. Talking about "the implementation" absent the context that would be necessary to create that implementation seems naively optimistic, or at worst irresponsible.
You are talking about implementation, the OP was talking about raising the concept with interested parties and seeing whether it is worth even starting to think about it.
They could fork, they could add threading to some sub systems and roll it out over several versions.
I don't know enough about the code and, of course, it is a hard problem, but the solution might be to build it from the ground up as a threaded system, using the skills learned over 30 years and taking the hit of a rebuild instead of reworking what is there.
I am most interested because I didn't realise there was a performance problem in the first place.
Am I going crazy, or has the obvious implementation of such a change been lost on people? If they were proposing taking a multi-threaded app and splitting it into a multi-process one, I would predict they would find a hell of a lot of unexpected or unknown implicit communication between threads, which would be a nightmare to untangle.
Going the other way, there is an extremely well understood interface between all the processes which run in isolation: shared memory. Nearly by definition this must be well coordinated between the processes.
So the first step in moving to a multi-threaded implementation would be to change nearly nothing about each process, and then just run each process in its own pthread, keeping all the shared memory ‘n all.
You would expect performance to be about the same, maybe a little better with the reduced TLB churn, but the architecture is basically unchanged. At that point, you can start to look at what are more appropriate communication/synchronisation mechanisms now that you’re working in the same address space.
I just don’t understand why so many people seem to think this requires an enormous rewrite - having developed as a multi-process system means you’ve had to make so much of the problematic things explicit and control for them, and none of these threads would know anything at all about each other’s internals.
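As a toy illustration of that point (my own sketch, unrelated to the real Postgres shared-memory code): state in a MAP_SHARED mapping behaves the same whether the workers are forked processes or pthreads, so mapping-based coordination keeps working unchanged when the workers become threads.

    /* Toy sketch: a counter in a MAP_SHARED|MAP_ANONYMOUS mapping works the
     * same for forked workers and for pthreads in one address space. */
    #define _DEFAULT_SOURCE
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static atomic_long *counter;   /* lives in the shared mapping */

    static void do_work(void)
    {
        for (int i = 0; i < 100000; i++)
            atomic_fetch_add(counter, 1);
    }

    static void *thread_worker(void *arg)
    {
        (void) arg;
        do_work();
        return NULL;
    }

    int main(void)
    {
        counter = mmap(NULL, sizeof(*counter), PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        atomic_store(counter, 0);

        /* Process model: two forked workers touching the shared mapping. */
        for (int i = 0; i < 2; i++)
            if (fork() == 0) { do_work(); _exit(0); }
        while (wait(NULL) > 0)
            ;

        /* Thread model: same code, same shared mapping, just pthreads. */
        pthread_t t[2];
        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, thread_worker, NULL);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);

        printf("counter = %ld\n", atomic_load(counter)); /* 400000 */
        return 0;
    }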
This should be considered a research effort, assuming it will be a complete rewrite. In light of that, you should not draw down resources from the established code base to work on it.
Ignoring the above, first state the explicit requirements driving this change and let people weigh in on those. This sounds like a geeky dev itch.
There are several forks of PostgreSQL, with various levels of licensing, additional features and activity. However, maintaining a fork in addition to a main project is inherently more expensive than maintaining just a single project, so adding features to new major releases of the main project is generally preferred over forking every release into its own, newly named, project. After all, that is what we have major (feature) releases and stabilization windows (beta releases) for.
Without being familiar with the Postgres source, this seems to be what I call a "somersault problem": hard to break down into sub-goals. I have heard that the Postgres codebase is solid which makes it easier but it's still mature and highly complex. It doesn't sound feasible to me.
The original post does describe several sub-problems. The group could first chip away at global state, signals, libraries. They can do this before changing the process model in any way.
Feel like the PostgreSQL Core Team should just build a new database from scratch using what they have learned from experience instead of attempting such a fundamental architectural migration. It would give them more freedom to change things also. Call it "postgendb" and provide a data migrator.
That's a great idea. I've been considering whether or not to use CockroachDB at work, and I love the fact that it's distributed from the get go.
Why not work on something like that instead of changing something that works? Especially since the process model really only runs into trouble on large systems.
If the existing code is old-school enough to use thousands of global variables in a thread-unsafe way, seems like changing it enough to compile as safe Rust code would push the "non-trivial" envelope pretty far.
Sorry if I offend anybody, but this sounds like such a bad idea. I have been running various versions of postgres in production for 15 years with thousands of processes on super beefy machines, and I can tell you without a doubt that sometimes those processes crash - especially if you are running any of the extensions. Nevertheless, Postgres has 99% of the time proven to be resilient. The idea that a bad client can bring the whole cluster down because it hit a bug sounds scary. Ever try creating a spatial index on thousands/millions of records that have nasty, overly complex or badly digitized geometries? Sadly, crashes are part of that workflow, and changing this from processes to threads would mean all the other clients also crashing and losing their connections. Taking that on as a potential problem just because I want to avoid context-switching overhead or cache misses? No thanks.
However, it's already the case that if a postgres process crashes, the whole cluster gets restarted. I've occasionally seen this message:
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
LOG: all server processes terminated; reinitializing
> However, it's already the case that if a postgres process crashes, the whole cluster gets restarted. I've occasionally seen this message:
Sure, but the blast radius of corruption is limited to that shared memory, not all the memory of all the processes. You can at least use the fact that a process has crashed to ensure that the corruption doesn't spread.
(This is why it restarts: there is no guarantee that the shared memory is valid, so the other processes are stopped before they attempt to use that potentially invalid memory)
With threads, all memory is shared memory. A single thread that crashes can make other threads data invalid before the detection of the crash.
yes, but postmaster is still running to roll back the transaction. If you crash a single multi-threaded process, you may lose postmaster as well and then sadness would ensue
The threaded design wouldn't necessarily be single-process, it would just not have 1 process for every connection. Things like crash detection could still be handled in a separate process. The reason to use threading in most cases is to reduce communication and switching overhead, but for low-traffic backends like a crash handler the overhead of it being a process is quite limited - when it gets triggered context switching overhead is the least of your problems.
Seconded. For instance, Firefox' crash reporter has always been a separate process, even at the time Firefox was mostly single-process, single-threaded. Last time I checked, this was still the case.
PostgreSQL can recover from abruptly aborted transactions (think "pulled the power cord") by replaying the journal. This is not going to change anyway.
Transaction roll back is a part of the WAL. Databases write to the disk an intent to change things, what should be changed, and a "commit" of the change when finished so that all changes happen as a unit. If the DB process is interrupted during that log write then all changes associated with that transaction are rolled back.
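Purely as a schematic of that "intent, change, commit" idea (NOT the actual PostgreSQL WAL record format; names invented for illustration):

    /* Schematic only: a transaction's changes count once its COMMIT record
     * reached disk; anything without a COMMIT is rolled back on recovery. */
    #include <stdio.h>

    typedef enum { WAL_BEGIN, WAL_UPDATE, WAL_COMMIT } wal_type;

    typedef struct {
        unsigned long xid;   /* which transaction this record belongs to */
        wal_type      type;
        const char   *what;  /* description of the change (payload)      */
    } wal_record;

    int is_durable(const wal_record *log, int n, unsigned long xid)
    {
        for (int i = 0; i < n; i++)
            if (log[i].xid == xid && log[i].type == WAL_COMMIT)
                return 1;
        return 0;
    }

    int main(void)
    {
        wal_record log[] = {
            { 1, WAL_BEGIN,  "" },
            { 1, WAL_UPDATE, "page 42: balance 10 -> 20" },
            { 1, WAL_COMMIT, "" },
            { 2, WAL_BEGIN,  "" },
            { 2, WAL_UPDATE, "page 7: balance 5 -> 0" }, /* no commit: lost */
        };
        printf("xid 1 durable? %d\n", is_durable(log, 5, 1)); /* 1 */
        printf("xid 2 durable? %d\n", is_durable(log, 5, 2)); /* 0 */
        return 0;
    }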
Running the whole DBMS as a bunch of threads in a single process changes how fast the recovery from some kind of temporary inconsistency is. In the ideal world this should not happen, but in reality it does, and you do not want to bring the whole thing down because of some superficial data corruption.
On the other hand, all cases of fixable corrupted data in PostgreSQL I have seen were the result of somebody doing something totally dumb (rsyncing a live cluster, even between architectures), while on InnoDB it seems to happen somewhat randomly, without anybody obviously doing something stupid.
Reading your comment makes me think it is not only a good idea, it is a necessity.
Relying on crashing as a bug recovery system is a good idea? Crashing is just part of the workflow? That's insane, and a good argument against PostgreSQL in any production system.
It is possible PostgreSQL doesn't migrate to a thread based model, and I am not arguing they should.
But debug and patch the causes of these crashes? Absolutely yes, and the sooner, the better.
A database has to handle situations outside its control, e.g. someone cutting the power to the server. That should not result in a corrupted database, and with Postgres it doesn't.
The fundamental problem is that when you're sharing memory, you cannot safely just stop a single process when encountering an unexpected error. You do not know the current state of your shared data, and if it could lead to further corruption. So restarting everything is the only safe choice in this case.
We do fix crashes etc, even if the postgres manages to restart.
I think the post upthread references an out-of-core extension we don't control, which in turn depends on many external libraries it doesn't control either...
Building a database which is never gonna crash might be possible, but at what cost? Can you name any single real world system that achieved that? Also, there can be regressions. More tests? Sure, but again, at what cost?
While we are trying to get there, having a crash proof architecture is also a very practical approach.
We don't want stuff to crash. But we also want data integrity to be maintained. We also want things to work. In a world with extensions written in C to support a lot of cool things with Postgres, you want to walk and chew bubblegum on this front.
Though to your point, a C extension can totally destroy your data in other ways, and there are likely ways to add more barriers. And hey, we should fix bugs!
Is the actual number you got 99%? Seems low to me but I don’t really know about Postgres. That’s 3 and a half days of downtime per year, or an hour and a half per week.
Well, an hour and a half per week is the amount of downtime that you need for a modestly sized database (units of TB) accessed by legacy clients that have ridiculously long running transactions that interfere with autovacuum.
I'm honestly surprised it took them so long to reach this conclusion.
> That idea quickly loses its appeal, though, when one considers trying to create and maintain a 2,000-member structure, so the project is unlikely to go this way.
As repulsive as this might sound at first, I've seen structures of hundreds of fields work fine if the hierarchy inside them is well organized and they're not just flat. Still, I have no real knowledge of the complexity of the code and wish the Postgres devs all the luck in the world to get this working smoothly.
> I'm honestly surprised it took them so long to reach this conclusion.
I'm not. You can get a long way with conventional IPC, and OS processes provide a lot of value. For most PostgreSQL instances the TLB flush penalty is at least 3rd or 4th on the list of performance concerns, far below prevailing storage and network bottlenecks.
I share the concerns cited in this LWN story. Reworking this massive code base around multithreading carries a large amount of risk. PostgreSQL developers will have to level up substantially to pull it off.
A PostgreSQL endorsed "second-system" with the (likely impossible, but close enough that it wouldn't matter) goal of 100% client compatibility could be a better approach. Adopting a memory safe language would make this both tractable and attractive (to both developers and users.) The home truth is that any "new process model" effort would actually play out exactly this way, so why not be deliberate about it?
From what I gather postgres isn't doing conventional IPC but instead it uses shared memory, which means the same mechanism threads use but with way higher complexity
IPC, to me, includes the conventional shared memory resources (memory segments, locks, semaphores, condition variable, etc.) used by these systems: resources acquired by processes for the purpose of communication with other processes.
I get it though. The most general concept of shared memory is not coupled to an OS "process." You made me question whether my concept of term IPC was valid, however. So what does one do when a question appears? Stop thinking immediately and consult a language model!
Q: Is shared memory considered a form of interprocess communication?
GPT-4: Yes, shared memory is indeed considered a form of interprocess communication (IPC). It's one of the several mechanisms provided by an operating system to allow processes to share and exchange data.
...
Why does citing ChatGPT make me feel so ugly inside?
I always understood IPC, "interprocess communication", in general sense, as anything and everything that can be used by processes to communicate with each other - of course with a narrowing provision that common use of the term refers to those means that are typically used for that purpose, are relatively efficient, and the process in question run on the same machine.
In that view, I always saw shared memory as IPC, in that it is a tool commonly used to exchange data between processes, but of course it is not strictly tied to any process in particular. This is similar to files, which if you squint are a form of IPC too, and are also not tied to any specific process.
> Why does citing ChatGPT make me feel so ugly inside?
That's probably because, in cases like this, it's not much different to stating it yourself, but is more noisy.
> Why does citing ChatGPT make me feel so ugly inside?
It's the modern "let me Google that for you". Just like people don't care what the #1 result on Google is, they also don't care what ChatGPT has to say about it. If they did, they'd ask it themselves.
With regard to client compatibility there are related precedents for this already; the PostgreSQL wire protocol has emerged as a de facto standard. CockroachDB and ClickHouse are two examples that come to mind.
Postmaster would just share the already-shared memory between processes (which also contains the locks). That explicit part of memory would opt in to thread-like sharing and thus get faster/less TLB switching and lower memory usage, while all the rest of the state would still be per-process and safe.
tl;dr super share the existing shared memory area with kernel patch
All operating systems not supporting it would keep working as is.
Yeah. I think as a straightforward, easily correct transition from 2000 globals, a giant structure isn't an awful idea. It's not like the globals were organized before! You're just making the ambient state (awful as it is) explicit.
We did this with a project I worked on. I came on after the code was mature.
While we didn't have 2000 globals, we did have a non-trivial amount, spread over about 300kLOC of C++.
We started by just stuffing them into a "context" struct, and every function that accessed a global thus needed to take a context instance as a new parameter. This was tedious but easy.
However the upside was that this highlighted poor architecture. Over time we refactored those bits and the main context struct shrunk significantly.
The result was better and more modular code, and overall well worth the effort in our case, in my opinion.
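Roughly what that looked like, as a simplified sketch (names invented, not the project's real code):

    #include <stdio.h>

    /* Before: file-scope globals, invisible dependencies.
     *
     *   static int   verbosity;
     *   static char *output_path;
     *
     * After: the same state pulled into an explicit "context" struct that
     * every function which used the globals now takes as a parameter. */
    typedef struct AppContext {
        int         verbosity;
        const char *output_path;
        /* ...everything else that used to be a global... */
    } AppContext;

    static void write_report(const AppContext *ctx, const char *msg)
    {
        if (ctx->verbosity > 1)
            printf("writing '%s' to %s\n", msg, ctx->output_path);
    }

    int main(void)
    {
        AppContext ctx = { .verbosity = 2, .output_path = "/tmp/report.txt" };
        write_report(&ctx, "hello");
        return 0;
    }

The tedious part was threading that extra parameter through every call chain, but it's exactly that tedium that exposed which modules were touching state they had no business touching.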
> I think as a straightforward, easily correct transition from 2000 globals, a giant structure isn't an awful idea.
Agree.
> It's not like the globals were organized before!
Using a struct with 2000 fields loses some encapsulation.
When a global is defined in a ".c" file (and not exported via a ".h" file), it can only be accessed in that one ".c" file, sort of like a "private" field in a class.
Switching to a single struct would mean that all globals can be accessed by all code.
There's probably a way to define things that allows you to regain some encapsulation, though. For example, some spin on the opaque type pattern: https://stackoverflow.com/a/29121847/163832
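Something along these lines, for example (my own sketch with made-up names, not a Postgres proposal) - the opaque typedef and accessors go in the header, the layout stays private to one .c file:

    #include <stdio.h>

    /* --- what would live in session.h: callers only see an opaque type --- */
    typedef struct Session Session;
    int  session_get_work_mem(const Session *s);
    void session_set_work_mem(Session *s, int kb);

    /* --- what would live in session.c: the layout stays private there --- */
    struct Session {
        int work_mem_kb;
        /* ...other formerly-global, per-backend state... */
    };

    int  session_get_work_mem(const Session *s)   { return s->work_mem_kb; }
    void session_set_work_mem(Session *s, int kb) { s->work_mem_kb = kb; }

    /* --- demo caller; in real code it would only ever hold a Session *,
     *     obtained from session.c, and could not poke at the fields --- */
    int main(void)
    {
        struct Session s = { 0 };
        session_set_work_mem(&s, 4096);
        printf("work_mem = %dkB\n", session_get_work_mem(&s));
        return 0;
    }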
A plain global can be accessed from other compilation units - agreed, with no .h entry it is much more error prone, e.g. you don't know the type, but the variable's name is still exposed to other objects.
At most they'd be determined to be read-only constants that get inlined during constant folding. That mostly covers integral-sized/typed scalar values that fit into registers, and nothing you've taken the address of - those remain as static data.
I think there might be a terminology mix-up here. In C, a global variable with the `static` keyword is still mutable. So it typically can't be constant-folded/inlined.
The `static` modifier in that context just means that the symbol is not exported, so other ".c" files can't access it.
A static variable in C is mutable in the same sense that a local variable is, but since it's not visible outside the current compilation unit the optimizer is allowed to observe that it's never actually modified or published and constant fold it away.
Check out the generated assembly for this simple program, notice that kBase is folded even though it's not marked const: https://godbolt.org/z/h45vYo5x5
It is also possible for a link-time optimizer to observe that a non-static global variable is never modified and optimize that away too.
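Roughly the shape of what that godbolt link shows (my reconstruction; names assumed):

    #include <stdio.h>

    /* Not marked const, but never written: within this translation unit the
     * optimizer can see that and fold kBase into its uses. */
    static int kBase = 40;

    int add_base(int x)
    {
        return x + kBase;   /* compiles to x + 40 at -O2 */
    }

    int main(void)
    {
        printf("%d\n", add_base(2));
        return 0;
    }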
But the Postgres mailing list is talking about 2000 global variables being a hurdle to multi-threading. I doubt they just didn't realize that most of them can be optimized into constants.
Yea. Just about none of them could be optimized to constants because, uh, they're not constant. We're not perfect, but we do add const etc to TU level statics/globals that are actually read only. And if they are actually read only, we don't care about them in the context of threading, since they wouldn't need any different behaviour anyway.
Exactly, if you're now forced to put everything in one place you're forced to acknowledge and understand the complexity of your state, and might have incentives to simplify it.
I believe I can safely say that nobody acknowledges and understands the complexity of all state within that class, and that whatever incentives there may be to simplify it are not enough for that to actually happen.
Right but that would still be true if they were globals instead. Putting all the globals in a class doesn't make any difference to how much state you have.
> Putting all the globals in a class doesn't make any difference to how much state you have.
I didn't make any claims about the _amount_ of state. My claim was that “you're forced to acknowledge and understand the complexity of your state” (i.e., moving it all together in one place helps understanding the state) is plain-out wrong.
It's not wrong. Obviously putting it all in one place makes you consider just how much of it you have, rather than having it hidden away all over your code.
Yes, it’s the most pragmatic and it’s only “awful” because it makes the actual problem visible. And would likely encourage slowly refactoring code to handle its state in a more sane way, until you’re only left with the really gnarly stuff, which shouldn’t be too much anymore and you can put them in individual thread local storages.
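For those leftovers, that would look something like C11 thread-local storage (a sketch, not actual Postgres code):

    /* Formerly:  static int error_count;   (accidentally shared by threads)
     * Now each thread gets its own copy, so old single-backend assumptions
     * keep holding within one thread. */
    static _Thread_local int error_count;

    void note_error(void)
    {
        error_count++;     /* touches only the calling thread's copy */
    }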
I think my bigger fear is around security. A process per connection keeps things pretty secure for that connection regardless of what the global variables are doing (somewhat hard to mess that up with no concurrency going on in a process).
Merge all that into one process with many threads and it becomes a nightmare problem to ensure some random addon didn't decide to change a global var mid processing (which causes wrong data to be read).
Access checking, yes, but the scope of memory corruption does increase unavoidably, given the main thing the pgsql-hackers investigating threads want: one virtual memory context when toggling between concurrent work.
Of course, there's a huge amount of shared space already, so a willful corruption can already do virtually anything. But, more is more.
I've never really been limited by CPU when running postgres (few TB instances). The bottleneck is always IO. Do others have different experience? Plus there's elegance and a feeling of being in control when you know query is associated with specific process which you can deal with and monitor just like any other process.
But I'm very much clueless about internals, so this is a question rather than an opinion.
I see postgres become CPU bound regularly: lots of hash joins, COPY from or to CSV, index or materialized view rebuilds. PostGIS eats CPU. tds_fdw tends to spend a lot of time doing charset conversion, more than actually talking to MSSQL over the network.
I was surprised when starting with postgres. Then again, I have smaller databases (A few TB) and the cache hit ratio tends to be about 95%. Combine that with SSDs, and it becomes understandable.
Even so, I am wary of this change. Postgres is very reliable, and I have no problem throwing some extra hardware to it in return. But these people have proven they know what they are doing, so I'll go with their opinion.
It's not just CPU - memory usage is also higher. In particular, idle connections still consume significant memory, and this is why PostgreSQL has much lower connection limits than e.g. MySQL. Pooling can help in some cases, but pooling also breaks some important PostgreSQL features (like prepared statements...) since poolers generally can't preserve session state. Other features (eg. notify) are just incompatible with pooling. And pooling cannot help with connections that are idle but inside a transaction.
That said, many of these things are solvable without a full switch to a threaded model (eg. by having pooling built-in and session-state-aware).
> solvable without a full switch to a threaded model (eg. by having pooling built-in and session-state-aware).
Yeeeeesssss, but solving that is solving the hardest part of switching to a threaded model. It requires the team to come to terms with the global state and encapsulate session state in a non-global struct.
> That said, many of these things are solvable without a full switch to a threaded model (eg. by having pooling built-in and session-state-aware).
The thing is that that's a lot easier with threads. Much of the session state lives in process private memory (prepared statements etc), and it can't be statically sized ahead of time. If you move all that state into dynamically allocated shared memory, you've basically paid all the price for threading already, except you can't use any tooling for threads.
I've generally had buffer-cache hit rates in the 99.9% range, which ends up being minimal read I/O. (This is on AWS Aurora, where there's no disk cache and so shared_buffers is the primary cache, but an equivalent measure for vanilla postgres exists.)
In those scenarios, there's very little read I/O. CPU is the primary bottleneck. That's why we run up as many as 10 Aurora readers (autoscaled with traffic).
Throwing a ridiculous amount of RAM at it is the more correct assessment. NVMe reads are still "I/O" and that is slow. And for at least 10 years, buying enough RAM to have all of the interesting parts of an OLTP psql database either in shared_buffers or in the OS-level buffer cache has been completely feasible.
It's orders of magnitude faster than SAS/SATA SSDs and you can throw 10 of them into 1U server. It's nowhere near "slow" and still easy enough to be CPU bottlenecked before you get IO bottlenecked.
But yes, a pair of 1TB RAM servers will cost you less than half a year's worth of developer salary.
an array of modern SSDs can get to a similar bandwidth to RAM, albeit with significantly worse latency still. It's not that hard to push the bottleneck elsewhere in a lot of workloads. High performance fileservers, for example, need pretty beefy CPUs to keep up.
With modern SSDs that can push 1M IOPs+, you can get into a situation where I/O latency starts to become a problem, but in my experience, they far outpace what the CPU can do. Even the I/O stack can be optimized further in some of these cases, but often it comes with the trade off of shifting more work into the CPU.
Postgres uses lots of cpu and memory if you have many connections and especially clients that come and go frequently. Pooling and bouncers help with that. That experience should better come out of the box, not by bolting on tools around it.
+significant and unknown set of new problems, including new bugs.
This reminds me of the time they lifted entire streets in Chicago by 14 feet to address new urban requirements. Chicago, we can safely assume, did not have the option of just starting a brand new city a few miles away.
The interesting question here is should a system design that works quite well upto a certain scale be abandoned in order to extend its market reach.
Also, even if a 2k-member structure is obnoxious, consider the alternative - having to think about and manage 2k global variables is probably even worse!
I think this is a situation where a message-passing Actor-based model would do well. Maybe pass variable updates to a single writer process/thread through channels or a queue.
Years ago I wrote an algorithmic trader in Python (and Cython for the hotspots) using Multiprocessing and I was able to get away with a lot using that approach. I had one process receiving websocket updates from the exchange, another process writing them to an order book that used a custom data structure, and multiple other processes reading from that data structure. Ran well enough that trade decisions could be made in a few thousand nanoseconds on an average EC2 instance. Not sure what their latency requirements are, though I imagine they may need to be faster.
Obviously mutexes are the bottleneck for them at this point, and while my idea might be a bit slower in a low-load situation, perhaps it would be faster when you start getting to higher load.
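A bare-bones sketch of that single-writer pattern, in C with pthreads rather than Python multiprocessing (my own illustration, not tied to Postgres or to the trading system described): many producers enqueue updates, exactly one writer applies them, so the shared state itself never needs fine-grained locking.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct Update {
        int            value;
        struct Update *next;
    } Update;

    static Update         *head;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    static int             done;

    static void enqueue(int value)
    {
        Update *u = malloc(sizeof(*u));
        u->value = value;
        pthread_mutex_lock(&lock);
        u->next = head;
        head = u;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }

    static void *writer(void *arg)
    {
        long applied = 0;
        (void) arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            while (!head && !done)
                pthread_cond_wait(&cond, &lock);
            Update *batch = head;          /* grab everything queued so far */
            head = NULL;
            int finished = done;
            pthread_mutex_unlock(&lock);

            for (Update *u = batch; u; ) { /* only this thread mutates state */
                Update *next = u->next;
                applied += u->value;
                free(u);
                u = next;
            }
            if (finished && !batch)
                break;
        }
        printf("writer applied total %ld\n", applied);
        return NULL;
    }

    static void *producer(void *arg)
    {
        (void) arg;
        for (int i = 0; i < 1000; i++)
            enqueue(1);
        return NULL;
    }

    int main(void)
    {
        pthread_t w, p[4];
        pthread_create(&w, NULL, writer, NULL);
        for (int i = 0; i < 4; i++)
            pthread_create(&p[i], NULL, producer, NULL);
        for (int i = 0; i < 4; i++)
            pthread_join(p[i], NULL);

        pthread_mutex_lock(&lock);
        done = 1;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
        pthread_join(w, NULL);
        return 0;
    }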
> I'm honestly surprised it took them so long to reach this conclusion.
Oracle also uses a process model on Linux. At some point (I think starting with 12.x), it can now be configured on Linux to use a threaded model, but the default is still a process-per-connection model.
Why does everybody think it's a bad thing in Postgres, but nobody thinks it's a bad thing in Oracle.
With the multi-threaded tcc above it scales about as well as multiprocess. With mainline it doesn't scale well at all.
So far I haven't gotten around to reusing anything across libtcc handles/instances, but would eventually like to share mmap()'d headers across instances, as well as cache include paths, and take invocation arguments through stdin one compilation unit per line.
I don't see the problem. All variables are either set in config or at runtime and then for every new query they are read and used by PostgreSQL (at least this is my understanding).
Regarding the threading issue, I think you can do the connections part multithreaded instead of one process per connection and still use IPC between this and postmaster. Because of the way PostgreSQL currently works, seems feasible to move parts one by one into a threaded model and instead of tens/hundreds of processes you can have just a few and a lot of threads.
Honestly, they should prototype it and see what it looks like, and then decide on the way forward.
I don't get it. How is a 2000-member structure any different from having 2000 global variables? How is maintaining the struct possibly harder than maintaining the globals? Refactoring globals to struct members is semantically nearly identical, it may as well just be a mechanical, cosmetic change, while also giving the possibility to move to a threaded architecture.
Because global variables can be confined to individual cpp files, exclusively visible in that compilation unit. That makes them far easier to reason about than hoisting them into the "global and globally visible" option of one gargantuan struct. Which is why a more invasive refactor might be required.
What if the global variable has a greater scope than just a single TU? For simple variables of limited scope this approach would work but for more complex variables that are impacting multiple "modules" in the code it would introduce yet another code design problem to solve.
Yeah, I was really into that before there was even a cross-compiler/cross-platform syntax for declaring TLS values in C++, but have since “upgraded” to avoiding TLS altogether where possible. The quality of the implementations varies greatly from compiler to compiler and platform to platform, you run into weird issues with destructors at thread exit if they’re not primitive types, they run afoul of any fibers/coroutines/etc that have since become extremely prevalent, and a few other things.
I recently looked through the source code of postgresql and every source file starts with a (really good) description of what the file is supposed to do, which made it really easy to get into the code compared to other open source projects I've seen. So thanks for that.
I have no idea why that isn't standard practice in every codebase. I should be able to figure out your code without having to ask, or dig through issues or commit messages. Just tell me what it's for!
Because it takes a lot of time and because the comments can get outdated. I also want this for all my code bases. But do I always do this myself? No, especially on green field projects. I will sometimes go back and annotate them later.
Trying to understand what I previously wrote and why I wrote it takes more time than I ever care to spend. I'd much rather have the comments, plus at this point, by making them a "first class" part of my code, I find them much easier to write and I find the narrative style I use incredibly useful in laying out a new structure but also in refactoring old ones.
Even outdated comments can tell you the original purpose of the code, which helps if you're looking for a bug. Especially if you're looking for a bug.
If someone didn't take the time to update the comments and the reviewers didn't point it out, then you've probably found the bug because someone was cowboying some shitty code.
Outdated comments are often way worse than no comments, because they can give you wrong ideas that aren't true anymore, and send you off in the wrong direction before you finally figure out the comment was wrong.
It kind of is in rust now, with module-level documentation given its own specific AST representation instead of just being a comment at the top of the file (a file is a module).
Having been using and administering a lot of PostgreSQL servers, I hope they don't lose any stability over this.
I've seen (and reported) bugs that caused panics/segfaults in specific psql processes. Not just connections, also processes related to wal writing or replication. The way it's built right now, a child process can be just forced to quit and it does not affect other processes. Hopefully switching into thread won't force whole PostgreSQL to panic and shut down.
Because of shared memory most panics and seg faults in a worker process take down the entire server already (this wasn’t always the case, but not doing so was a bug).
Most likely, the postmaster will maintain a separate process, much like today with pg, or similar to Firefox or Chrome's control process that can catch the panic'd process, cleanup and restart them. The WAL can be recovered as well if there were broken transactions in flight.
100%. Same here. There's a lot of baby in the processes, not just bathwater.
As a longstanding PG dev/DBA who doesn't know much about its internals, I would say that they should just move connection pooling into the main product.
Essentially, pgbouncer should be part of PG and should be able to manage connections with knowledge of what each connection is doing. That, plus some sort of dynamic max connection setting based on what's actually going on.
That'll remove almost all the dev/DBA pain from separate processes.
Of course it will. That's better than continuing to work with damaged memory structures and unpredictable consequences. For a database it's more important than ever. Imagine writing corrupted data because another thread went crazy.
You're implying that only an OS can provide memory separation between units of execution - at least in .NET AppDomains give you the same protection within a single process, so why couldn't postgres have its own such mechanism?
I'd also think with a database engine shared state is not just in-memory - i.e. one process can potentially corrupt the behaviour of another by what it writes to disk, so moving to a single-process model doesn't necessarily introduce problems that could never have existed previously (but, yes, would arguably make them more likely)
No, AppDomains are not as good as processes. I have tried to go that route before; you cannot stop unruly code reliably in an AppDomain (you must use Thread.Abort(), which is not good) and memory can still leak in any native code used there.
The only reliable way to stop bad code like say an infinite loop is to run in another process even in .Net.
They also removed AppDomains in later versions of .NET because they had little benefit and weak protections compared to a full process.
Not claiming they're as good, just noting that there are alternative ways to provide memory barriers, though obviously if it's not enforced at the language/runtime level, it requires either super strong developer discipline or the use of some other tool to do so.
I can't find anything suggesting AppDomains have been removed completely though, just they're not fully supported on non-Windows platforms, which is interesting, I wonder if that means they do have OS-level support.
"On .NET Core, the AppDomain implementation is limited by design and does not provide isolation, unloading, or security boundaries. For .NET Core, there is exactly one AppDomain. Isolation and unloading are provided through AssemblyLoadContext. Security boundaries should be provided by process boundaries and appropriate remoting techniques."
AppDomains pretty much only allowed you to load unload assemblies and provided little else. If you wanted to stop bad code you still used Thread.Abort which left your runtime in a potentially bad state due to no isolation between threads.
Is that saying global variables are shared between AppDomains on .NET core then? Scary if so, we have a bunch of .NET framework code we're looking at porting to .NET core in the near future, and I know it relies on AppDomain separation currently. It's not the first framework->Core conversation I've done, but I don't remember changes in AppDomain behaviour causing any issues the first time.
As it happens I already know there are bits of code currently not working "as expected" exactly because of AppDomain separation - i.e. attempting to use a shared-memory cache to improve performance and in one or two cases in an attempt to share state, and I got the impression whoever wrote that code didn't understand that there even were two AppDomains involved, and used various ugly hacks to "fall back" to alternative means of state-sharing, but in fact the fall-back is the only thing that actually ever works.
> Is that saying global variables are shared between AppDomains on .NET core then?
No, you can't create a second AppDomain at all. AppDomains are dead and buried; you would need to remove all of that from your code in order to migrate to current .NET. The class only remains to serve a couple ancillary functions that don't involve actually creating additional AppDomains.
I don't know .NET enough to comment here, but I'm pretty sure that if you would manage to run bare metal C inside your .NET app (should be possible), it'll destroy all your domains easily. RAM is RAM. The only memory protection that we have is across process boundary (even that protection is not perfect with shared memory, but at least it allows to protect private memory).
At least I'm not aware of any way to protect private thread memory from other threads.
Postgres is C and that's not going to change ever.
I certainly wasn't suggesting it would make sense to rewrite Postgres to run on .NET (using any language, even managed C++, assuming anyone still uses that).
Yes, it's inherent in the C/C++ language that it's able to randomly access any memory that a process has access to, and obviously on that basis OS-provided process-separation is the "best" protection you can get, just pointing out that it's not the only possibility.
.NET is a managed language with a VM. In such a language, a memory error in managed code will often trigger a jump back to the VM, which can attempt to recover from there.
For native code, there's no such safety net. Likewise, even for managed language, an error in the interpreter code will still crash the VM, since there's nothing to fallback to anymore.
True, if you're talking unrestricted native code, I'd essentially agree with the OP's implication that only the OS (and the CPU itself) is capable of providing that sort of memory protection. I guess I was just wondering what something like AppDomains in C might even look like (e.g. all global variables are implicitly "thread_local"), and how much could be done at compile-time using tools to prevent potentially "dangerous" memory accesses. I've never looked at the postgres source in any detail so I'm likely underestimating the difficulty of it.
Back about a decade ago I was "auditing" someone else's threaded code. And couldn't figure it out. But he was the company's "golden child" so by default it must be working code because he wrote it.
And then it started causing deadlocks in prod.
"What do you want me to do about it? It's the golden child's code. He's not even gonna show up til 2pm today."
I'm not sure if I'd judge it as harshly, but you have a good point: A lot of debugging / validation tooling understands threads, but not memory shared between processes.
In Linux, multi process with shared memory regions is basically just threads. The kernel doesn’t know anything about threads, it knows about processes and it lets you share memory regions between those processes if you so desire.
By bespoke you mean using standard interfaces to create shared memory pools?
They do roll some of their own locking primitives, but that's not particularly unusual in a large portable program (and quite likely what they wanted is/was not available in glibc or other standard libraries, at least when first written).
On UNIX systems, Oracle uses a multi-process model, and you can see these:
$ ps -ef | grep smon
USER PID PPID STARTED TIME %CPU %MEM COMMAND
oracle 22131 1 Mar 28 3:09 0.0 4.0 ora_smon_yourdb
Windows forks processes about 100x slower than Linux, so Oracle runs threaded on that platform in one great big PID.
Sybase was the first major database that fully adopted threads from an architectural perspective, and Microsoft SQL Server has certainly retained and improved on that model.
> Windows forks processes about 100x slower than Linux...
I work with a Windows-based COTS webapp that uses Postgres w/o any connection pooling. It's nearly excruciating to use because it spins-up new Postgres processes for each page load. If not for the fact that the Postgres install is "turnkey" with the app I'd just move Postgres over to a Linux machine.
If you run postgres under WSLv1 (now available on Server Edition as well), the WSL subsystem handles processes and virtual memory in a way that has been specifically designed to optimize process initialization as compared to the traditional Win32 approach.
Could you point out, aside from the large numbers of clients I mentioned (and the development overhead of implementing multi-process memory management code), what the article mentions is a primary drawback of using processes over threads?
> The overhead of cross-process context switches is inherently higher than switching between threads in the same process - and my suspicion is that that overhead will continue to increase. Once you have a significant number of connections we end up spending a lot of time in TLB misses, and that's inherent to the process model, because you can't share the TLB across processes.
Yes, that's per-client performance scaling ("significant number of connections"), which indicates a pooled connection model might mitigate most of the performance impact while allowing some core code to remain process-oriented (and thus, not rewritten).
pgbouncer is not transparent, you lose features, particularly when using the pooling mode that actually allows a larger number of active concurrent connections. Solving those issues is a lot easier with threads than with processes.
That helps a lot but it's not a replacement for large number of persistent connections. If you had that you could simplify things in the application layer and do interesting things with the DB.
Didn't Oracle switch to threaded model in 12c - at least on Linux I remember there being a parameter to do that - it dropped the number of processes significantly.
It's always amazed me with databases why they don't go the other way.
Create an operating system specifically for the database and make it so you boot the database.
Databases seem to spend most of their time working around the operating system abstractions. So why not look at the OS, and streamline it for database use - dropping all the stuff a database will never need.
That then is a completely separate project, which is far easier to get started on than shoehorning the database into an operating system thread model that is already a hack of the process model.
I'm not sure what you mean by OS.
If you mean a whole new kernel, it will take decades, and it would support only a small range of hardware. If you mean a specialized Linux distro, many companies do that already.
I don't know how that would make the process-based / thread-based problem any easier.
That was/is part of the promise of the whole unikernel thing, no?
https://mirage.io/ or similar could then let you boot your database. That said, it's not really taken off from what I can tell, so I'm guessing there's more to it than that.
Yeah indeed, that was my feeling on it as well. As much as Linux et al might get in ones way at times, what we get for free by relying on them is too useful to ignore for most tasks I think.
That said, perhaps at AWS or Google scale that would be different? I wonder if they've looked at this stuff internally.
You can get most of these speedups by using advanced APIs like io_uring and friends, while still benefiting from using an OS, which takes care of the messy and thankless task of hardware support.
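For example, with liburing a read ends up looking roughly like this (minimal sketch, error handling omitted; link with -luring):

    /* Minimal liburing sketch: queue one read, submit, reap the completion. */
    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        struct io_uring ring;
        io_uring_queue_init(8, &ring, 0);

        int fd = open("/etc/hostname", O_RDONLY);
        char buf[256];

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
        io_uring_submit(&ring);             /* one syscall for N queued ops */

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);     /* completion, no readiness loop */
        printf("read %d bytes\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        close(fd);
        io_uring_queue_exit(&ring);
        return 0;
    }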
> Create an operating system specifically for the database and make it so you boot the database.
(Others downthread have pointed out unikernels and I agree with the criticisms)
This proposal is an excellent PhD project for someone like me :-)
It ticks all of the things I like to work on the most[1]:
Will involve writing low-level OS code
Get to hyper-focus on performance
Writing a language parser and executor
Implement scheduler, threads, processes, etc.
Implement the listening protocol in the kernel.
I have to say, though, it might be easier to start off with a rump kernel (NetBSD), then add in a specific raw-disk access path that bypasses the OS (no, or fewer, syscalls to use it), and create a kernel module for accepting a limited type of task and executing that task in-kernel (avoiding a context-switch on every syscall)[2].
Programs in userspace must have the lowest priority (using starvation-prevention mechanisms to ensure that user input would eventually get processed).
I'd expect a non-insignificant speedup by doing all the work in the kernel.
The way it is now,
userspace requests read() on a socket (context-switch to kernel),
gets data (context-switch to userspace),
parses a query,
requests read on disk (multiple context-switches to kernel for open, stat, etc, multiple switches back to userspace after each call is complete). This latency is probably fairly well mitigated with mmap, though.
logs diagnostic (multiple context-switches to and from kernel)
requests write on client socket (context switch to kernel back and forth until all data is written).
The goal of the DBOS would be to remove almost all the context-switching between userspace and kernel.
[1] My side projects include a bootable (but unfinished) x86 OS, various programming languages, performant (or otherwise) C libraries.
[2] Similar to the way RealTime Linux calls work (caller shares a memory buffer with rt kernel module, populates the buffer and issues a call, kernel only returns when that task is complete). The BPF mechanism works the same. It's the only way to reduce latency to the absolute physical minimum.
> Create an operating system specifically for the database and make it so you boot the database.
I have the impression that this is similar to the ad-hoc filesystem idea; this seems in principle very advantageous (why employ two layers that do approximately the same thing on top of each other?), but in reality, when implemented (by Oracle), it led to only a minor improvement (a few % points, AFAIR).
It sounds like the specific concerns here are actually around buffer pool management performance in and around the TLB: "Once you have a significant number of connections we end up spending a *lot* of time in TLB misses, and that's inherent to the process model, because you can't share the TLB across processes. "
Many of the comments here seem to be missing this and are talking about CPU-boundedness in general and thread-per-request vs. process models, but that is orthogonal: the point above is quite specific to the VM subsystem and seems like a legitimate bottleneck of the approach Postgres has to take for buffer/page management with its current process model.
I'm no Postgres hacker (or Linux kernel hacker), and I only did a 6-month stint doing DB internals, but it feels to me like the right answer might be, instead of Postgres getting deep into the weeds refactoring and rewriting to a thread-based model -- with all the risks people have pointed out -- to reach for specific, targeted patches in the Linux kernel.
The addition of e.g. userfaultfd shows that there is room for innovation and acceptance of changes in and around the kernel re: page management. Some new flags for mmap, shm_open, etc. to handle some specific targeted use cases to help Postgres out?
Also wouldn't be the first time that people have done custom kernel patches or tuning parameters to crank performance out of a database.
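For what it's worth, one knob that already exists in this area is explicit huge pages (PostgreSQL exposes this as the huge_pages setting): the TLB entries still can't be shared across processes, but each process needs far fewer of them to map the shared buffer pool. A minimal sketch, assuming huge pages have been reserved (e.g. via vm.nr_hugepages) and using an arbitrary 1 GiB size:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t size = 1UL << 30;   /* 1 GiB, hypothetical buffer pool size */
        void *pool = mmap(NULL, size,
                          PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB,
                          -1, 0);
        if (pool == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");   /* fails if no huge pages are reserved */
            return 1;
        }
        /* Processes forked after this point inherit the mapping, so they all
         * see the same physical pages while needing fewer TLB entries each. */
        munmap(pool, size);
        return 0;
    }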
The TLB issue is more a hardware issue than a software/OS one. To my knowledge, neither x86 nor ARM provides a way to partially share TLB contents between processes/entities. TLB entries can be tagged with a process context identifier, but that's an all-or-nothing thing: either the entire address space is shared, or it's not.
> Yet rewrites are always easier and more sexy. At first.
Pretty sure Tom Lane said this will be a disaster in that same pgsql-hackers thread. Not entirely sure what benefits the multi-threaded model will have when you can easily saturate the entire CPU with just 128 connections and a pooler. So I doubt there is consensus, or even strong desire from the community, to undertake this boil-the-ocean project.
On the other hand, having the ability to shut down and cleanup the entire memory space of a single connection by just disconnecting is really nice, especially if you have extensions that do interesting things.
> Tom Lane said: "I think this will be a disaster. There is far too much code that will get broken". He added later that the cost of this change would be "enormous", it would create "more than one security-grade bug", and that the benefits would not justify the cost.
> Not entirely sure what benefits the multi-threaded model will have when you can easily saturate the entire CPU with just 128 connections and a pooler.
That all of those connections would work faster, because of the performance benefits mentioned in the article.
"the benefits would not justify the cost". PostgreSQL, like any software, at some point in it's life need to be refactored. Why not refactor with a thread model. Of course there will be bugs. Of course it will be difficult. But I think it is a worthwhile endeavor. Doesn't sound like this will happen but a new project would be cool.
> like any software, at some point in it's life need to be refactored.
This is simply not true for most software. Software has a product life cycle like everything else and major refactors/rewrites should be weighed carefully against cost/risk of the refactor. Many traditional engineering fields do much better at this analysis.
Although, because I run a contracting shop, I have personally profited greatly by clients thinking this is true and being unable to convince them otherwise.
"Difficult" doesn't even begin to do it justice. Making a code which has 2k global variables and probably order of magnitude as many underlying assumptions (the code should know that now every time you touch X you may be influenced or influence all other threads that may touch X) is a gargantuan task, and will absolutely for sure involve many iterations which any sane person would never let anywhere near valuable data (and how long would it take until you'd consider it safe enough?). And making this all performant - given that shared-state code requires completely different approach to thinking about workload distribution, something that performs when running in isolated processes may very well get bogged down in locking or cache races hell when sharing the state - would be even harder. I am not doubting Postgres has some very smart people - much, much smarter than me, in any case - but I'd say it could be more practical to write new core from scratch than trying to "refactor" the core that organically grew for decades with assumptions of share-nothing model.
A better option would be to create an experimental fork with a different name that is obviously a different product, but based on the original source. That way Postgres keeps getting updates and remains stable, and if the fork fails, it fails without hurting all the Postgres installations in production.
If you're more interested in horizontal scaling, you may want to look into CockroachDB, which speaks a Postgres-compatible protocol but is still quite different. There are a lot more limitations with CDB than with Pg, though.
With the changes suggested, I'm not sure it's the best idea from where Postgres is... it might be an opportunity to rewrite bits in Rust, but even then, there is a LOT that can go wrong. The use of shared memory is apparently already in place, and the separate processes and inter-process communication aren't the most dangerous part... it's the presumptions, variables, and other contextual bits that are currently process-global that wouldn't be in the "after" version.
The overall surface is just massive... That doesn't even get into plugin compatibility.
This sounds like a problem that would border on the complexity of replacing the GIL in Ruby or Python. The performance benefits are obvious but it seems like the correctness problems would be myriad and a constant source of (unpleasant) surprises.
This is different because there isn’t a whole ecosystem of packages that depend on access to a thread unsafe C API. Getting the GIL out of core Python isn’t too challenging. Getting all of the packages that depend on Python’s C API working is.
Another component of the GIL story is that removing the GIL requires adding fine-grained locks, which (aside from making VM development more complicated) significantly increases lock traffic and thus runtime costs, which noticeably impacts single-threaded performance, which is of major import.
Postgres starts from a share-nothing architecture, so it's quite a bit easier to evaluate the addition of sharing.
Postgres already shares a lot of state between processes via shared memory. There's not a whole lot that would initially change from a concurrency perspective.
> which (aside from making VM development more complicated) significantly increases lock traffic and thus runtime costs, which noticeably impacts single-threaded performance, which is of major import.
I don't think that's a fair characterization of the trade offs. Acquiring uncontended mutexes is basically free (and fairly side-effect free) so single-threaded performance will not be noticeably impacted.
Every large C project I'm aware of (read: kernels) that has publicly switched from coarse locks to fine-grained locks has considered it to be a huge win with little to no impact on single-threaded performance. You can even gain performance if you chop up objects or allocations into finer-grained blobs to fit your finer-grained locking strategy because it can play nicer with cache friendliness (accessing one bit of code doesn't kick the other bits of code out of the cache).
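As a crude sanity check of the "basically free" claim, you can time an uncontended pthread mutex in a tight loop; the exact numbers depend on CPU and libc, so treat this as illustrative only (build with something like cc -O2 -pthread; an uncontended pair typically comes out in the tens of nanoseconds):

    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
        const long iters = 100000000L;   /* 100M lock/unlock pairs, no contention */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++) {
            pthread_mutex_lock(&m);
            pthread_mutex_unlock(&m);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns per uncontended lock/unlock pair\n", ns / iters);
        return 0;
    }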
> which noticeably impacts single-threaded performance, which is of major import.
1) I don't buy this a priori. Almost everybody who removed a gigantic lock suddenly realizes that there was more contention than they thought and that atomizing it made performance improve.
2) Had Python bitten the bullet and removed the GIL back at Python 3.0, the performance would likely already be back to normal or better. You can't optimize hypothetically. Optimization on something like Python is an accumulation of lots of small wins.
You don’t have to buy anything, that’s been the result of every attempt so far and a big reason for their rejection. The latest effort only gained some traction because the backers also did optimisation work which compensated (and then was merged separately).
> Almost everybody who removed a gigantic lock
See that’s the issue with your response, you’re not actually reading the comment you’re replying to.
And the “almost” is a big tell.
> suddenly realizes that there was more contention than they thought and that atomizing it made performance improve.
There is no contention on the gil in single threaded workloads.
> Had Python bitten the bullet and removed the GIL back at Python 3.0
It would have taken several more years and been completely DOA.
> there isn’t a whole ecosystem of packages that depend on access to a thread unsafe C API
They mentioned a similar issue for Postgres extensions, no?
> Haas, though, is not convinced that it would ever be possible to remove support for the process-based mode. Threads might not perform better for all use cases, or some important extensions may never gain support for running in threads.
The correctness problem should be handled by a suite of automated tests, which PostgreSQL has. If all tests pass, the application must work correctly. The project is too big, and has too many developers, to make much progress without full test coverage. Where else would up-to-date documentation of the correct behavior of PostgreSQL exist? In some developer's head? SQLite is pretty famous for their extreme approach to testing, including out-of-memory conditions and other rare circumstances: https://www.sqlite.org/testing.html
Parallelism is often incredibly hard to write automated tests for, and this will most likely create parallelism issues that were not dreamed of by the authors of the test suite.
> If all tests pass, the application must work correctly.
These are "famous last words" in many contexts, but when talking about difficult-to-reproduce parallelism issues, I just don't think it's a particularly applicable viewpoint at all. No disrespect. :)
Even the performance benefits are not big enough, compared to the GIL case.
The biggest problem of the process model might be the cost of having too many DB connections: each client needs a dedicated server process, with the attendant memory usage and context-switching overhead. And without a connection pool, connection-setup overhead is very high.
This problem has been well addressed with a connection pool, or with middleware instead of exposing the DB directly. That has worked very well so far.
Oracle has supported a thread-based model, and it has been usable, for decades. I remember trying the thread-based configuration option (MTS, or shared server) in the 1990s. But no one likes it, at least within my Oracle DBA network.
It would be a great research project, but it would be a big problem if the community pushes this too early.
It would be interesting to have something between threads and processes. I'll call them heavy-threads for the sake of discussion (a rough sketch in terms of clone(2) flags follows below).
Like light-threads, heavy-threads would share the same process-security-boundary and therefore switching between them would be cheap. No need to flush TLB, I$, D$.
Like processes, heavy-threads would have mostly-separate address spaces by default. Similar to forking a process, they could share read-only mappings for shared libraries, code, COW global variables, and explicitly defined shared writable memory regions.
Like processes, heavy-threads would isolate failure states. A C++ exception, UNIX signal, segfault, etc. would kill only the heavy-thread responsible.
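For illustration only: Linux's clone(2) already lets a new task pick which resources it shares, which covers much of this spectrum, except for the address-space/TLB part that the replies below get into. A rough sketch, with simplified error handling and stack sizing:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>

    static int worker(void *arg)
    {
        printf("heavy-thread running: %s\n", (const char *)arg);
        return 0;
    }

    int main(void)
    {
        const size_t stack_size = 1 << 20;
        char *stack = malloc(stack_size);

        /* Thread-like:  CLONE_VM | CLONE_FILES | CLONE_SIGHAND | CLONE_THREAD | ...
         * Fork-like:    no sharing flags at all.
         * "Heavy-thread"-ish: share files and filesystem state, but keep a
         * copy-on-write private address space (no CLONE_VM here, which is
         * exactly why the TLB still can't be shared). */
        int flags = CLONE_FILES | CLONE_FS | SIGCHLD;

        int pid = clone(worker, stack + stack_size, flags,
                        "private COW address space");
        if (pid < 0) { perror("clone"); return 1; }

        waitpid(pid, NULL, 0);
        free(stack);
        return 0;
    }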
TLB isn't "flushed" so much as it is useless across different memory address spaces. Switching processes means switching address spaces, which means you have to switch the contents of the TLB to the new process' TLB entries, which eventually indeed flushes the TLB, but that is only over time, not necessarily the moment you switch processes.
> Like processes, heavy-threads would have mostly-separate address spaces by default.
This thus conflicts with the need to not flush TLBs. You can't not change TLB contents across address spaces.
1. Mostly-separate address spaces require changing the TLB on context switch (modern hardware lets it be partial). You could use MPKs to share a single address space with fast protection switches.
2. Threads share the global heap, but your heavy threads would require explicitly defined shared writeable memory regions, so presumably each one has its own heap. That's a fair bit of overhead.
3. Failure isolation is more complicated than deciding what to kill.
To expand on the last point: Postgres doesn't isolate failures to a single process, because the processes do share memory and a crashing one might have corrupted those shared memory regions. But even without shared memory, failure recovery isn't always easy; software has to be written specifically to plan for it. You can kill processes because everything in the OS is written around allowing for that possibility; for example, shells know what to do if a sub-process is killed unexpectedly. Killing a heavy-thread (= process) is no good if the parent process is going to wait forever for a reply from it because it wasn't written to handle the process going away.
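To make "written to handle the process going away" concrete, here is roughly the shape of what a supervising parent (postmaster-style) does: reap the child and distinguish a crash from a clean exit. A minimal, hypothetical sketch:

    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid = fork();
        if (pid == 0) {
            raise(SIGSEGV);   /* child: simulate a crashing backend */
            _exit(0);
        }

        int status;
        waitpid(pid, &status, 0);
        if (WIFSIGNALED(status))
            fprintf(stderr, "backend %d died on signal %d: clean up its state, "
                            "notify the client\n", (int)pid, WTERMSIG(status));
        else if (WIFEXITED(status))
            printf("backend %d exited normally (code %d)\n",
                   (int)pid, WEXITSTATUS(status));
        return 0;
    }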
I've been pondering this too; I've been somewhat surprised that few operating systems have played with reserving per-thread address space as thread-local storage, or with requiring something akin to a 'far' pointer to access commonly-addressed shared memory.
> Like light-threads, heavy-threads would share the same process-security-boundary and therefore switching between them would be cheap. No need to flush TLB, I$, D$.
> Like processes, heavy-threads would have mostly-separate address spaces by default. Similar to forking a process, they could share read-only mappings for shared libraries, code, COW global variables, and explicitly defined shared writable memory regions.
I don't think you realistically can have separate address spaces and not have TLB etc impact. If they're separate address spaces, you need separate TLB entries => lower TLB hit ratio.
A close-to-impossible task; if anyone can do it, though, it's probably Heikki.
Unfortunately I expect this to go the way of zheap et al. Fundamental design changes like this have just had such a rough time of succeeding thus far.
I think for such a change to work it probably needs not just the support of Neon but also of say Microsoft (current stewards of Citus) that have larger engineering resources to throw at the problem and grind out all the bugs.
I know that at least people from EDB (Robert) and Microsoft (Thomas, me) are quite interested in eventually making this transition, it's not just Heikki. Personally I won't have a lot of cycles for the next release or two, but after that...
Not who you asked, but: he is a longtime contributor who has written/redesigned important parts of Postgres (WAL format, concurrent WAL insertion, 2PC support, parts of SSI support, much more). And he is just a nice person to work with.
So compromise. Take the current process model, add threading and shared memory, with feature flags to limit number of processes and number of threads.
Want to run an extension that isn't threadsafe? Run with 10 processes, 1 thread each. Want to run high-performance? Run with 1 process, 10 threads. Afraid of "stability issues"? Run with 1 process, 1 thread.
Will it be hard to do? Sure. Impossible? Not at all. Plan for it, give a very long runway, throw all your new features into the next major version branch, and tell people everything else is off the table for the next few years. If you're really sure threading is going to be increasingly necessary, better to start now than to wait until it's too late. But this idea of "oh it's hard", "oh it's dangerous", "too complicated", etc is bullshit. We've built fucking spaceships that visit other planets. We can make a database with threads that doesn't break. Otherwise we admit that basic software development using practices from the past 30 years is too much for us to figure out.
Worked on a codebase which was separate processes, each of which had a shedload of global variables. It was a nightmare working out what was going on, not helped by the fact that there was no naming convention for the globals, plus they were not declared in a single place. I believe their use was a performance move, i.e. having the linker pin a variable to a specific memory location rather than copying it to the stack and referencing it by offset the whole time. Premature optimisation? Optimisation at all? Who knows, but there's a good reason coding standards typically militate against globals.
Per discussion on this very page, in the headlined article, and in the mailing list discussion it references, PostgreSQL is not in that category. It has lots of static storage duration variables, which do not necessarily have external linkage.
Robert Haas pointed out in one message that an implementation pattern was to use things like file-scope static storage duration variables to provide session-local state for individual components. This is why they've been arguing against a single giant structure declared in "session.h" as an approach, as it requires every future addition to session state to touch the central core of the entire program.
They want to keep the advantage of the fact that these variables are in fact not global. They are local; and the problem is rather that they have static storage duration and are not per-thread, and thus are not per-session in a thread-per-session model.
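A hypothetical illustration of that pattern (not actual PostgreSQL code): the component keeps its per-session state in a file-scope static that nothing outside the file can see, which is precisely the locality a single central struct in "session.h" would take away.

    /* some_component.c -- names are made up for illustration */

    /* Per-session state, visible only inside this file. In a process-per-session
     * server this is effectively per-session; in a thread-per-session server it
     * would be shared by every session unless made per-thread. */
    static int cached_setting = -1;

    int
    get_cached_setting(void)
    {
        if (cached_setting < 0)
            cached_setting = 42;   /* placeholder for the real (expensive) lookup */
        return cached_setting;
    }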
There's something to be said for globals whose access is well-managed, though.
IMO: if the variable is _truly_ global, i.e. code all over the codebase cares about it, then it should just be global instead of pretending like it's not with some fancy architecture.
The tricky part is reacting to changes to a global variable. Writing a bunch of "on update" logic leads to madness. The ideal solution is for there to be some sort of one-directional flow for updates, like when a React component tree is re-rendered... but that's very hard to build in an application that doesn't start out using a library like React in the first place.
There are 2000 globals here, so more like a couple of shedloads. While this is something you'd sort of expect for a product that's been around 30+ years, it really seems like there's a lot of optimization that could happen and still stick with the process model.
I wish they would do some kind of easy shared storage instead, or in addition. This sounds like an odd solution, but I've scaled pgsql since version 9 on very, very large machines, and doing 1 pgsql cluster per physical socket ended up scaling near-linearly even on 100+ core machines with TB+ of memory.
The challenge with this setup is that you need to do 1 writer and multiple reader clusters so you end up doing localhost replication which is super weird. If that requirement was somehow removed that’d be awesome for scaling really huge clusters.
I mentally snarked to myself that "obviously they should rewrite it in Rust first".
Then, after more thought, I'm not entirely sure that would be a bad approach. I say this not to advocate for actually rewriting it in Rust, but as a way of describing how difficult this is. I'm not actually sure rewriting the relevant bits of the system in Rust wouldn't be easier in the end, and obviously, that's really, really hard.
This is a really hard transition.
I don't think multithreaded code quality should be measured in absolutes. Some approaches are so difficult as to be effectively impossible - the lock-based style that was dominant in the 90s convinced developers that threading itself is impossibly difficult, but it's not multithreaded code that's impossibly difficult, it's lock-based multithreading. Other approaches range from doable to not that hard once you learn the relevant techniques (Haskell's full immutability and Rust's borrow checker are both very solid), though of course even "not that hard" adds up to a lot of bugs at the scale of something like Postgres. But it's not like the current model is immune to that either.
It's not the same at all for global variables, of which pgsql apparently has around a couple thousand.
If every process is single threaded, you don't have to consider the possibility of race conditions when accessing any of those ~2000 global variables. And you can pretty much guarantee that little if any of the existing code was written with that possibility in mind.
Those global variables would be converted to thread locals and most of the code would be oblivious of the change. This is not the hard part of the change.
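A sketch of that mechanical conversion; the variable name is made up, and real code would pick one of C11 _Thread_local or the GCC/Clang __thread extension:

    /* Before: one copy per backend process, effectively per-session state. */
    static int session_counter = 0;

    /* After: one copy per thread, so a thread-per-session server still gives
     * each session its own instance. */
    static _Thread_local int session_counter_tl = 0;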
I'm assuming you're referring to formally proven programs. If that's the case, do you have any pointers?
Aside from the trivial while(!transactionSucceeded){retry()} loop, I have trouble proving the correctness of my programs when the number of threads is not small and finite.
This is true. However, the blast radius may be smaller with a process model. Also recovering from a fatal error in one session could possibly be easier. I say this as a 30-year threading proponent.
Seems like a bad idea. Processes are more elegant and scalable than threads as they discourage the use of shared memory. Shared memory is often a bad idea. You end up with different threads competing and queuing up to access or write the same data (e.g. waiting on each other to acquire a lock with mutexes - This immediately disqualifies the system from becoming embarrassingly parallel) and it becomes the OS's problem to figure out when to allow which thread to access what memory... This is bad because the OS doesn't care about optimizing memory access for your specific use case. It will treat your 'high performance' database in the same way as it treats a run-of-the-mill Gimp desktop application....
With the process model, it encourages using separate memory for each process; this forces developers to think about things like memory consistency and availability and gives them more flexibility in terms of scalability across multiple CPU cores or even hosts. Processes are far better abstractions than threads for modeling concurrent systems since their logic is fundamentally the same regardless of whether they run across different CPU cores or different hosts.
> The overhead of cross-process context switches is inherently higher than switching between threads in the same process
I remember researching this a while back. It depends on the specific OS and hardware; it's not so straightforward, it tends to change over time, and the differences are usually insignificant anyway.
Also, it's important not to conflate performance with scalability - These two characteristics are orthogonal at best and oftentimes conflicting.
Oftentimes, to scale horizontally, a system needs to incur a performance penalty as additional work is required to route and coordinate actions across multiple CPUs or hosts. A scalable system can service a much larger number (or even sometimes theoretically unlimited) number of requests but it will typically perform worse than a non-scalable system if you judge it on a requests-per-CPU-core basis.
Why should TLB flush performance ever be a problem on big machines? You can have one process per core with 128 or more cores, never flush any TLB if you pin those processes. And as it is a database, shoveling data from/to disk/SSD is your main concern anyways.
PostgreSQL uses synchronous IO, so you won't saturate the CPU with one process (or thread) per core.
That said, I think there have been efforts to use io_uring on Linux. I'm not sure how that would work with the process per connection model. Haven't been following it...
> That said, I think there have been efforts to use io_uring on Linux. I'm not sure how that would work with the process per connection model. Haven't been following it...
There's some minor details that are easier with threads in that context, but on the whole it doesn't make much of a difference.
I don't understand how it works with thread per connection either. io_uring is designed for systems that have a thread and ring per core, for you to give it a bunch of IO to do at once (batches and chains), and your threads to do other work in the meantime. The syscall cost is amortized or even (through IORING_SETUP_SQPOLL) eliminated. If your code is instead designed to be synchronous and thus can only do one IO at a time and needs a syscall to block on it, I don't think there's much if any benefit in using io_uring.
Possibly they'd have a ring per connection and just get an advantage when there's parallel IO going on for a single query? or these per-connection processes wouldn't directly do IO but send it via IPC to some IO-handling thread/process? Not sure either of those models are actually an improvement over the status quo, but who knows.
> io_uring is designed for systems that have a thread and ring per core
That's not needed to benefit from io_uring
> for you to give it a bunch of IO to do at once (batches and chains), and your threads to do other work in the meantime.
You can see substantial gains even if you just submit multiple IOs at once, and then block waiting for any of them to complete. The cost of blocking on IO is amortized to some degree over multiple IOs. Of course it's even better to not block at all...
> If your code is instead designed to be synchronous and thus can only do one IO at a time and needs a syscall to block on it, I don't think there's much if any benefit in using io_uring.
We/I have done the work to issue multiple IOs at a time as part of the patchset introducing AIO support (with among others, an io_uring backend). There's definitely more to do, particularly around index scans, but ...
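For readers unfamiliar with the pattern being described, here is a minimal liburing sketch (emphatically not Postgres's actual AIO code) of submitting several reads with a single syscall and then blocking as the completions arrive; file name, block size, and offsets are arbitrary, and you'd link with -luring:

    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>
    #include <unistd.h>

    #define NREADS 8
    #define BLKSZ  8192

    int main(void)
    {
        struct io_uring ring;
        if (io_uring_queue_init(32, &ring, 0) < 0) return 1;

        int fd = open("datafile", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        static char bufs[NREADS][BLKSZ];
        for (int i = 0; i < NREADS; i++) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, fd, bufs[i], BLKSZ, (off_t)i * BLKSZ);
            io_uring_sqe_set_data(sqe, (void *)(long)i);
        }
        io_uring_submit(&ring);            /* one syscall submits all eight reads */

        for (int i = 0; i < NREADS; i++) { /* block as completions come back */
            struct io_uring_cqe *cqe;
            io_uring_wait_cqe(&ring, &cqe);
            printf("read %ld finished: %d bytes\n",
                   (long)io_uring_cqe_get_data(cqe), cqe->res);
            io_uring_cqe_seen(&ring, cqe);
        }

        io_uring_queue_exit(&ring);
        close(fd);
        return 0;
    }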
Oh, I hadn't realized until now I was talking with someone actually doing this work. Thanks for popping into this discussion!
> > io_uring is designed for systems that have a thread and ring per core
> That's not needed to benefit from io_uring
90% sure I read Axboe saying that's what he designed io_uring for. If it helps in other scenarios, though, great.
> Of course it's even better to not block at all...
Out of curiosity, is that something you ever want/hope to achieve in PostgreSQL? Many high-performance systems use this model, but switching a synchronous system in plain C to it sounds uncomfortably exciting, both in terms of the transition itself and the additional complexity of maintaining the result. To me it seems like a much riskier change than the process->thread one discussed here that Tom Lane already stated will be a disaster.
> We/I have done the work to issue multiple IOs at a time as part of the patchset introducing AIO support (with among others, an io_uring backend). There's definitely more to do, particularly around index scans, but ...
Nice.
Is the benefit you're getting simply from adding IO parallelism where there was none, or is there also a CPU reduction?
Is having a large number of rings (as when supporting a large number of incoming connections) practical? I'm thinking of each ring being a significant reserved block of RAM, but maybe in this scenario that's not really true. A smallish ring for a smallish number of IOs for the query is enough.
Speaking of large number of incoming connections, would/could the process->thread change be a step toward having a thread per active query rather than per (potentially idle) connection? To me it seems like it could be: all the idle ones could just be watched over by one thread and queries dispatched. That'd be a nice operational improvement if it meant folks no longer needed a pooler [1] to get decent performance. All else being equal, fewer moving parts is more pleasant...
[1] or even if they only needed one layer of pooler instead of two, as I read some people have!
> > Of course it's even better to not block at all...
> Out of curiosity, is that something you ever want/hope to achieve in PostgreSQL? Many high-performance systems use this model, but switching a synchronous system in plain C to it sounds uncomfortably exciting, both in terms of the delta and the additional complexity of maintaining the result. To me it seems like a much riskier change than the process->thread one discussed here that Tom Lane already stated will be a disaster.
Depends on how you define it. In a lot of scenarios you can avoid blocking by scheduling IO in a smart way - and I think we can get quite far towards that for a lot of workloads, and the wins are substantial. But that obviously cannot alone guarantee that you never block.
I think we can get quite far avoiding blocking, but I don't think we're going to a complete asynchronous model in the foreseeable future. But it seems more feasible to incrementally make common blocking locations support asynchronicity. E.g. when a query scans multiple partitions, switch to processing a different partition while waiting for IO.
> Is having a large number of rings (as when supporting a large number of incoming connections) practical? I'm thinking of each ring being a significant reserved block of RAM, but maybe in this scenario that's not really true. A smallish ring for a smallish number of IOs for the query is enough.
It depends on the kernel version etc. The amount of memory isn't huge but initially it was affected by RLIMIT_MEMLOCK... That's one reason why the AIO patchset has a smaller number of io_uring "instances" than the allowed connections. The other reason is that we need to be able to complete IOs that other backends started (otherwise there would be deadlocks), which in turn requires having the file descriptor for each ring available in all processes... Which wouldn't be fun with a high max_connections.
> Speaking of large number of incoming connections, would/could the process->thread be a step toward having a thread per active query rather than per (potentially idle) connection?
Yes. Moving to threads really mainly would be to make subsequent improvements more realistic...
> That'd be a nice operational improvement if it meant folks no longer needed a pooler [1] to get decent performance. All else being equal, fewer moving parts is more pleasant...
You'd likely often still want a pooler on the "application server" side, to avoid TCP / SSL connection establishment overhead. But that can be a quite simple implementation.
Problem with all kinds of asynchronous I/O is that your processes then need internal multiplexing, akin to what certain lightweight userspace thread models are doing. In the end, it might be harder to introduce than just using OS threads.
I've had a similar situation with PHP, where we had written quite a large engine (https://github.com/Qbix/Platform) with many features (https://qbix.com/features.pdf). It took advantage of the fact that PHP isolated each script and gave it its own global variables, etc.; much of the request handling relied on exactly that.
Then I looked at Swoole, fibers, and the other async options. It seemed so cool! PHP could behave like Node! It would have an event loop and everything. Fibers are basically PHP's version of Swoole's coroutines, etc. etc.
Then I realized... we would have to go through the entire code and redo how it all works. We'd also no longer benefit from PHP's process isolation. If one process crapped out or had a memory leak, it could take down everything else.
There's a reason PHP still runs 80% of all web servers in the world (https://kinsta.com/blog/is-php-dead/) ... and one of the biggest is that commodity servers can host terrible PHP code and it's mostly isolated in little processes that finish "quickly" before they can wreak havoc on other processes or on long-running stuff.
So now back to postgres. It's been praised for its rock-solid reliability and security. It's got so many features and the MVCC is very flexible. It seems to use a lot of global variables. They can spend their time on many other things, like making it byzantine-fault-tolerant, or something.
The clincher for me was when I learned that php-fpm (which spins up processes which sleep when waiting for I/O) is only 50% slower than all those fancy things above. Sure, PHP with Swoole can outperform even Node.js, and can handle twice as many requests. But we'd rather focus on soo many other things we need to do :)
I've been using PHP for decades and have found its isolated process model to be about the best around, certainly for any mainstream language. Also Symfony's Process component encapsulates most of the errata around process management in a cross-platform way:
Going from a working process implementation to async/threads with shared memory is pretty much always a mistake IMHO, especially if it's only done for performance reasons. Any speed gains will be eclipsed by endless whack-a-mole bug fixes, until the code devolves into something unrecognizable. Especially when there are other approaches similar to map-reduce and scatter-gather arrays where data is processed in a distributed fashion and then joined into a final representation through mechanisms like copy-on-write, which are supported by very few languages outside of PHP and the functional programming world.
The real problem here is the process spawning and context-switching overhead of all versions of Windows. I'd vote to scrap their process code in its entirety and write a new version based on atomic operations/lists/queues/buffers/rings with no locks and present an interface which emulates the previous poor behavior, then run it through something like a SAT solver to ensure that any errata that existing software depends on is still present. Then apps could opt to use the direct unix-style interface and skip the cruft, or refactor their code to use the new interface.
Apple did something similar to this when OS X was released, built on a mostly POSIX Darwin: NeXTSTEP, Mach, and BSD Unix. I have no idea how many times Microsoft has rewritten their process model, or whether they've succeeded in getting performance on par with their competitors (unlikely).
Edit: I realized that the PHP philosophy may not make a lot of sense to people today. In the 90s, OS code was universally terrible, so for example the graphics libraries of Mac and Windows ran roughly 100 times slower than they should for various reasons, and developers wrote blitters to make it possible for games to run in real time. That was how I was introduced to programming. PHP encapsulated the lackluster OS calls in a cross-platform way, using existing keywords from popular languages to reduce the learning curve to maybe a day (unlike Perl/Ruby, which are weird in a way that can be fun but impractical to grok later). So it's best to think of PHP more like something like Unity, where the nonsense is abstracted and developers can get down to business. Even though it looks like Javascript with dollar signs on the variables. It's also more like the shell, where it tries to be as close as possible to bare-metal performance, even while restricted to the 100x interpreter slowdown of languages like Python. I find that PHP easily saturates the processor when doing things in a data-driven way by piping bytes around.
At the end of the day, this doesn't solve any problems. Small setups use postgres directly just fine, and large setups use pgbouncer, and having process isolation with extensions is a good thing and probably simplifies things a lot.
A big advantage of the process-based model is its resilience against many classes of errors.
If a bug in PostgreSQL (or in an extension) causes the server to crash, then only that process will crash. Postmaster will detect the child process termination, and send an error message to the client. The connection will be lost, but other connections will be unaffected.
It's not foolproof (there are ways to bring the whole server down), but it does protect against many error conditions.
It is possible to trap some exceptions in a threaded environment, but cleaning up after e.g. an attempted NULL pointer dereference is going to be very difficult or impossible.
I'm curious if they can take advantage of vfork / CLONE_VM, to get the benefits of sharing memory and lower overhead context switches, with the trade of still getting benefits from the scheduler, and sysadmin-friendliness.
The other thing that might be interesting is FUTEX_SWAP / UMCG. Although it doesn't remove the overhead induced by context switches entirely (specifically, you would still deal with TLB misses), you can avoid dealing with things like speculative execution exploit mitigations.
Per the article, Postgres has many, many global variables, many of which track per-session state; much session state is “freed” via process exit rather than being explicitly cleaned up. Switching to CLONE_VM requires these problems to all be solved.
I think an interesting point of comparison is the latest incarnation of SQL Server. You can't even point at 1 specific machine anymore with their hyperscale architecture.
I know I’m probably being naive about this, but is it stupid to ask if there’s a way to make multi process work better on Linux - rather than “fixing” PG?
I feel like the thread vs process thing is one of those pendulums/fads that comes and goes. I’d hate to see PG go down a rabbit hole only to discover the OS could be modified to make things go better.
(I understand not all PG instances run on Linux, just using it as an example)
That'll likely be an even bigger task, and harder to get into mainline kernel.
Linux multi-process is already pretty efficient compared to Windows. However, multi-process is inherently less efficient than multi-thread due to more safety predicates / isolation guaranteed by the kernel, I feel lowering it might lead to more security issues, similar to how Hyper Threading triggered a bunch of issues with Intel Processors.
Right - yeah I was really just wondering if some of the safety predicates could be reduced when there is a relationship between processes, such as the mitigations against cache attacks. I think the cache misses caused by multi-process were one of the reasons given that it's slower than threading. But I don't understand why this is necessarily the case given that the shared memory and executable text ultimately refer to the same data. But I suppose this would need to work with processor affinity and other elements to prevent the cache being knocked around by non-PG processes, and I guess this is one place where it starts getting complicated.
That said, please understand that I'm just being curious - I really don't know what I'm talking about, I haven't built a Linux kernel or dabbled in Unix internals in like 20 years, but thanks for replying :) Postgresql is my favourite open source project and I'm spooked by the threading naysayers.
The TLB is basically keyed by (address space, virtual address % granularity), or needs to be flushed entirely when switching between different views of the address space (e.g. switching between processes). Unless your address space is exactly the same, you're at least going to duplicate TLB contents. Leading to a lower hit rate.
This isn't really an OS issue, more a hardware one, although potential hardware improvements would likely have to be explicitly utilized by operating systems.
Note that the TLB issue is different from the data / instruction cache situation.
> I feel like the thread vs process thing is one of those pendulums/fads that comes and goes.
In this context threads can be understood as processes that share the same address space and vice-versa processes as threads with separate address space.
One gives you isolation, the other convenience and performance. Either can be desirable.
This would be one of those places where a language like Rust would be helpful. In C/C++, with undefined behavior and crashes, process isolation makes a lot of sense to limit the blast radius. Rust's borrow checker gives you at compile time a lot of the safety that you would otherwise rely on process isolation for.
Yes, but note that the blast radius of a PostgreSQL process crash is already "the whole system reboots", so there are not a lot of differences between process- and thread-based PostgreSQL written in C.
Rewriting in Rust would be interesting, but it would also probably be too invasive to be worthwhile at all - all code in PostgreSQL is C, while not all code in PostgreSQL interacts with the intrinsics of processes vs. threads. Any rewrite in Rust would likely take several times more effort than a port to threads.
It sounds to me like migrating to a fully multi-threaded architecture may not be worth the effort. Simply reducing the number of processes from thousands to hundreds would be a huge win and likely much more feasible than a complete re-architecture.
They could host those current subprocesses inside Wasm environments. This would be largely a mechanical transformation. Even the current shm-based architecture would stay intact.
The issue is costlier (runtime, complexity, memory) resource sharing, not the cost of the fork itself. Pre-forking isn't going to help with any of that.
Isn’t this why pgbouncer is so effective? Maybe it’s not the forking itself, but there is something about creating connections that is expensive, warranting such external connection poolers.
So when we as a whole decided that multiprocessing is a much better approach from a security and application-stability point of view, they decide to go with threads?
Horses for courses, I guess - purely threaded vs. purely multi-process each have a different set of tradeoffs, and shoehorning one onto the other always fails some use cases. The article says they are also considering the possibility of having to keep both process and thread models indefinitely, for this and other reasons.
I know nothing of PG internals, but I can see why the process-per-connection model doesn't work for large machines and/or high numbers of connections. One way to do it would be to keep connection handling per thread and still keep the multiprocess approach where it makes sense for security and doesn't add linear overheads.
I wonder if it would be easier to create a C virtual machine that emulates all the OS interaction, then recompile Postgres and the extensions to run on this. Perhaps TruffleC would work?
Because it makes it a lot easier to address some of Postgres' weaknesses? This is a proposal by a long-time contributor to Postgres, one that a number of other long-time contributors agree with (and others disagree with!). Why shouldn't Heikki have brought this up for discussion?
This is interesting because Google just created AlloyDb[0] which is decidedly multiprocess for performance and switches out the storage layer from a read/write model to write+replicate + read-only model.
The deep dive[1] has some details; the tl;dr: is that the main process only has to output Write Ahead Logs to the durable storage layer which minimizes transaction latency. The log processing service materializes postgres-compatible on-disk blocks that read-only replicas can read from, with a caching layer for block reads which sends cache invalidations from the LPS to read replicas.
I'm not sure if similar benefits could be seen within a single machine; using network DMA or even rDMA to transfer bytes to and from remote machines also avoids TLB invalidation. There are some mentions in the mailing list of waiting for Linux to support shared page mappings between processes as a solution.
I'm not exactly sure I understand the reasoning behind process separation as crash recovery. As far as I understand, each connection is responsible for correctness, and if a process crashes there seems to be an assumption that the database can recover and keep working by killing that process - but that seems to risk silent data corruption. Perhaps it's equivalently mitigated by separating the materialization of blocks from the synced WAL into a process distinct from the multithreaded connection process producing WAL entries?
I have. What I've observed more often is outside attackers with their own agenda using "nobody objected, because they were unprepared and unable to respond in the 2 minutes I gave them to object" as proof that their agenda is supported.
Heikki Linnakangas is one of the top Postgres contributors of all time, he isn't just "someone." The fact he's working for a startup on a fork (that already exists, which you can run right now on your local machine) doesn't warrant any snide dismissal. Robert Haas admitted that it would be a huge amount of work and that it would only be achievable by a small few people anyway, Heikki being among them.
Anyway, I think there are definitely limits that are starting to appear with Postgres in some spots. This is probably one of the most difficult possible solutions to some of those problems, but even if they don't switch to a fully threaded model, being more CPU efficient, better connection handling, etc will all go a substantial way. Doing some of the really hard work is better than none of it, probably.
Is there any reason at all people use intrinsically bug-prone and broken multithreading mode instead of fork() and IPC apart from WinAPI having no proper fork?
TLB misses? They are just a detail of a particular CPU implementation, and architectures change. Also, aren't TLBs per core rather than per process? What would switching to MT solve, then?
I feel this sort of undertaking could only be done by programmers who truly value domain knowledge above all else (money, etc.). I'm more of the entrepreneurial mind, so I generally only learn as much as needed to do some task (even if it's very difficult), but seeking information merely as a means to an end doesn't feel fulfilling to me. Of course many people DO find it fulfilling, and it's upon those people's shoulders that heroic things like this rest, and I'm very thankful to them.
Please don't use mutable global state in your work. Global variables are universally bad and don't provide much of a benefit. The number of desirable architectural refactorings that I've witnessed turn into a muddy mess because of them is daunting. This is one more example.
Thank you for sharing your ideological views, but this is not the appropriate venue for that. If you want to have a software _engineering_ discussion about the trade offs involved in sharing global mutable state, this is a good venue for that. All engineering is trade offs. As soon as you make blanket statements that X is always bad, you’ve transitioned into the realm of ideology. Now presumably you mean to say it’s almost always bad. But that really depends on the context. It may well be almost always bad in average software projects, but PostgreSQL is not your average software project. Databases are a different realm.
Discrediting my argument by labeling it as ideology, and by implying that "blanket statements are always bad", is a logical fallacy that does not touch the merits of what is being discussed, and I would argue that your argument, not mine, is the one that does not belong here.
If you want to contribute to the discussion, I'd be happy to be given an example of successful usage of global variables that made a project a long term success under changing requirements compared to the alternatives.
Global mutable state being a poor choice in software architecture isn’t an ideology. There is no ideology that argues it is awesome.
If you want to have a software _engineering_ discussion about the trade offs involved in sharing global mutable state, this is a good venue for that.
All engineering is trade offs. As soon as you start telling people they’re making blanket statements that X is always bad, you’ve transitioned into the realm of nitpicking.
It's awesome where performance considerations are paramount. It's awesome in databases. It's awesome in embedded software. It's awesome in operating system kernels.
The fact is sometimes it's good. Saying it's universally bad is going beyond the realm of logic and evidence and into the realm of ideology.
Using globals is simpler, it's also pretty natural in event driven architectures. Passing everything via function arguments is welcome for library code, but there's little point to using it in application code. It just complicates things.
Knizhnik made these variables thread-local, which is fine if you have a fixed association of threads to data. This loses some flexibility if your runtime needs to put multiple sessions on one thread (for example, to hide IO latency) in the future. In the end, the best solution is to associate the data that belongs to a session with the session itself, making it independent of which thread it's running on. This is described by Knizhnik as "cumbersome", which is exactly why people should not have started with global variables in the first place. (No blame; Postgres is from 1986 and times were very different back then.)
You know what a database is, don't you? It is the place where you store your mutable global state. You can't kick the can down the road forever; someone has to tackle the complexity of managing state.
If Tom Lane says it will be a disaster, I believe it will be a disaster.