OrioleDB – solving some PostgreSQL wicked problems (github.com/orioledb)
308 points by samokhvalov on Feb 25, 2022 | 95 comments



Wow, this looks to potentially save my team hundreds of hours of work. This product and its aims are nearly 1:1 with a massive Postgres extension project we were to start developing next quarter. Who should I get in touch with to see about us contributing instead to Oriole? I would love for our team to support the project with man hours and possibly financial support as well, as it seems this will accelerate our project significantly.


Hi! My name is Alexander Korotkov, I'm the founder of OrioleDB. Thank you for your feedback and support. Please write to me at alexander@orioledb.com Thank you!


Alexandr is a very impressive contributor to the Postgres community. Met him a few times. He's on a mission to change and improve the community. He has a clear vision of what has to be changed, and he's still 30+ yo and has many many years ahead. Just talking to him for 15 minutes will save you years. Hope he can make a business out of this and find resourceful business partners.


> he’s still 30+ yo and has many many years ahead

I know where you are going with this, but I'm not sure his age is relevant, as somebody 50+ yo could have many years ahead of them. Having said all of that, here are his contributions to Postgres, which speak to his credibility:

     month  | commits | files | churn |                        langs                        
   ---------+---------+-------+-------+-----------------------------------------------------
    2022-02 |       3 |     7 |   290 | c,sql,text,unknown
    2021-12 |       1 |     5 |    30 | c,sql,unknown
    2021-11 |       2 |     1 |    38 | c
    2021-09 |       2 |     4 |    74 | c,c_header
    2021-07 |       3 |     7 |   135 | c,c_header,sql,unknown
    2021-06 |       4 |    12 |   992 | c,c_header,sql,unknown
    2021-05 |       4 |     5 |    88 | c,sql,unknown
    2021-04 |       1 |     4 |    30 | c,c_header,unknown
    2021-03 |       1 |     1 |     2 | unknown
    2021-02 |       2 |     3 |    20 | c,c_header
    2021-01 |       6 |    16 |  1781 | c,c_header,makefile,sql,unknown
    2020-12 |      11 |    69 | 12331 | c,c_header,makefile,perl,sql,unknown
    2020-11 |       8 |    11 |   288 | c,c_header,makefile,sql,unknown
    2020-09 |       4 |     4 |   258 | c,sql,unknown
    2020-08 |       1 |     3 |    23 | c
    2020-07 |       7 |    12 |   486 | c,c_header,makefile,sql,unknown
    2020-06 |       8 |     8 |   766 | c,unknown
    2020-05 |       5 |     5 |    96 | c,sql,unknown
    2020-04 |       6 |    28 |  1225 | c,c_header,makefile,perl,sql,unknown
    2020-03 |       8 |   119 |  5651 | c,c_header,makefile,sql,unknown
    2020-01 |       1 |    10 |   627 | c,c_header,sql,unknown
    2019-11 |       4 |     5 |   303 | c,text
    2019-10 |       2 |     7 |   312 | c,c_header,sql,unknown
    2019-09 |      18 |    48 |  3054 | c,c_header,sql,text,unknown
    2019-08 |       3 |     4 |   280 | c,sql,unknown
    2019-07 |       6 |    18 |  1046 | c,c_header,sql,unknown
    2019-06 |       6 |    12 |   134 | c,c_header,sql,unknown
    2019-05 |       6 |     9 |   309 | c,unknown
    2019-04 |       4 |    17 |  1651 | c,c_header,sql,unknown
    2019-03 |      15 |    55 | 11647 | c,c_header,gitignore,makefile,perl,sql,text,unknown
    2018-12 |       6 |    24 |  1289 | c,c_header,text,unknown
    2018-10 |       3 |     1 |    21 | unknown
    2018-09 |       8 |    43 |  2589 | c,c_header,makefile,perl,sql,text,unknown
    2018-08 |       5 |    11 |  2474 | c,makefile,sql,unknown
    2018-07 |       4 |    17 |   101 | c,c_header
    2018-06 |       7 |     9 |   117 | c,unknown
Note, the analysis is only for the email akorotkov@postgresql.org, so if there is more, let me know and I'll update the metrics.


this is the guy who leads the project (as far as I know): https://www.linkedin.com/in/alexander-korotkov-41b1b582/



This is exciting:

> ...This log architecture is optimized for raft consensus-based replication allowing the implementation of active-active multimaster.

I'm a developer but manage 2 PostgreSQL instances each with an async replica. I also manage 1 three-node CockroachDB cluster. It's night and day when it comes to any type of ops (e.g. upgrading a version). There's a lot of reasons to use PostgreSQL (e.g. stored procedures, extensions, ...), but if none of those apply to you, CockroachDB is a better default.

I just finished interviewing someone who went into detail about the difficulty they're having doing a large-scale migration of a mission-critical large PG cluster (from one DC to another). It's a multi-team, year-long initiative that no one is comfortable about.

In my experience, PG is always the most out-of-date piece of any system. I still see 9.6 here and there, which is now EOL.


Raft doesn't magically make a database multi-master. It's a consensus algorithm that has a leader election process. There's still a leader, and therefore a single master to which writes must be directed. The problem it solves is ambiguity about who the active master is at the moment.


Right. The idea of active-active OrioleDB multimaster is to apply changes locally and in parallel send them to the leader, then sync on commit and ensure there are no conflicts.

The design document will be published later.


Please have it subjected to a Jepsen test suite ASAP!


My thinking is that since it's Postgres wire-compatible, the existing tests should work?


Jepsen does much more than basic end-to-end tests, including intentionally partitioning the cluster. Tests written for a non-distributed system are downright friendly compared to what Jepsen does to distributed systems.


Does Jepsen just run automated tests?


I mean they're automated but also hand crafted.

https://github.com/jepsen-io/jepsen


Those are very expensive and booked months in advance. Aphyr sometimes comments on HN and is of course a better source on this but that's what I recall reading.


It’s the price you pay to reassure customers that your newfangled distributed database won’t experience data loss or inconsistency.


Feature-wise I like PostgreSQL.

Operationally? MySQL. I pick MySQL 8 whenever I can because over the lifetime of a project it's easier.

Pragmatically I'm coming to the uncomfortable (as a FOSS advocate) idea that Microsoft SQL Server is a good choice and worth paying for: features and relative ease of use.


Can you explain more about why MySQL is operationally easier? I personally don't see any difference.


Besides what's already mentioned by siblings:

- Forwards (and usually also backwards) compatible disk format, meaning version updates don't require more than a few minutes of downtime. On large datasets, PostgreSQL can require days or weeks.

- Replication works across version differences, making upgrades without any downtime at all easier.

- No need for a vacuum process that can run into trouble with a sustained high write load.

- Page-level compression, reducing storage needs for some types of data quite a bit.

- Clustered primary keys. Without these, some types of queries can become rather slow when your table doesn't fit in memory. (Example: a chat log containing many billions of messages, which you're querying by conversation-id; see the sketch after this list.)

- xtrabackup (innobackupex) is awesome.
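
A rough sketch of the clustered-primary-key and page-compression points from the list above; the schema is invented purely for illustration:

    -- InnoDB: the PRIMARY KEY is automatically the clustered index, so rows for
    -- one conversation are stored physically together; page compression is per table.
    CREATE TABLE chat_message (
        conversation_id BIGINT NOT NULL,
        message_id      BIGINT NOT NULL,
        body            TEXT,
        PRIMARY KEY (conversation_id, message_id)  -- clustered: defines physical row order
    ) ENGINE=InnoDB
      ROW_FORMAT=COMPRESSED   -- page-level compression
      KEY_BLOCK_SIZE=8;       -- target compressed page size of 8 KB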


> - No need for a vacuum process that can run into trouble with a sustained high write load.

MySQL has a similar mechanism called the purge job. It can lead to similar problems, for example DML slowdowns.

https://mariadb.com/kb/en/innodb-purge/


> - Forwards (and usually also backwards) compatible disk format, meaning version updates don't require more than a few minutes of downtime.

How so? Postgres' data formats are forwards (and mostly backwards) compatible back to release 8.4 from 2009, where effectively only the catalogs need upgrading. Sure, that can be a lot of data, but no different from MySQL or any other RDBMS with transactional DDL.

> - Replication works across version differences

Logical replication has been available since at least 9.6.


Not the dump format. The actual data on disk of the database itself.

If you try upgrading even one major Postgres version and you're not aware of this, you lose all your data (you don't really, as you can roll back to the previous version, but it doesn't even tell you that!).


That is what I'm talking about, yes.

Page format for tables hasn't changed since 8.4, tuple format hasn't changed in a backwards-incompatible manner since 8.4 (probably earlier).

BTree indexes have changed a bit, but ones created in 9.0.x can still be loaded and used in 14.x.

GIN/GiST/SP-GiST/BRIN all don't seem to have had backwards-incompatible changes, looking at https://why-upgrade.depesz.com/show?from=8.4&to=14.2

The only thing that changed is the catalogs, upgradable through pg_upgrade.


Pretty much every major PG release clearly states the on-disk format isn't fixed between major versions, and is subject to change:

https://www.postgresql.org/docs/14/upgrading.html

https://www.postgresql.org/docs/13/upgrading.html

https://www.postgresql.org/docs/12/upgrading.html

https://www.postgresql.org/docs/11/upgrading.html

https://www.postgresql.org/docs/10/upgrading.html

  For major releases of PostgreSQL, the internal data storage format is subject
  to change, thus complicating upgrades.


Yes, that's more 'keeping the option open' than 'we do this regularly', and I can't seem to find any documentation that MySQL gives any better guarantee.

Note that for all of 10, 11, 12, 13 and 14 no changes have been made in the storage format of tables that made it mandatory to rewrite any user-defined schema.

I will admit that some changes have been made that make the data not forward-compatible in those versions; but MySQL seems to do that at the minor release level instead of the major release level. MySQL 8 doesn't even support minor release downgrades; under PostgreSQL this works just fine.


pg_upgrade will take care of that. And you can still run an upgrade from 8.4 to 14 using --link mode, which means no data will be copied - only the system catalogs need to be re-created.


You're totally missing the point. It's like you're saying "other than that, how did you like the play, Mrs. Lincoln?"

pg_upgrade should not be needed at all! It isn't for mysql. Or if it is needed it should be automatic.

Just the command line description and the manual of pg_upgrade makes you want to run away screaming. It's not ok.


>> Just the command line description and the manual of pg_upgrade makes you want to run away screaming. It's not ok.

Why is it not ok? This is pg_upgrade:

> Major PostgreSQL releases regularly add new features that often change the layout of the system tables, but the internal data storage format rarely changes. pg_upgrade uses this fact to perform rapid upgrades by creating new system tables and simply reusing the old user data files. If a future major release ever changes the data storage format in a way that makes the old data format unreadable, pg_upgrade will not be usable for such upgrades. (The community will attempt to avoid such situations.)

This is mysql_upgrade:

> Each time you upgrade MySQL, you should execute mysql_upgrade, which looks for incompatibilities with the upgraded MySQL server:

> * It upgrades the system tables in the mysql schema so that you can take advantage of new privileges or capabilities that might have been added.

> * It upgrades the Performance Schema, INFORMATION_SCHEMA, and sys schema.

> * It examines user schemas.

> If mysql_upgrade finds that a table has a possible incompatibility, it performs a table check and, if problems are found, attempts a table repair.

The description seems comparable, and if anything pg_upgrade looks saner by not attempting silly things like "table repair".

>> pg_upgrade should not be needed at all! It isn't for mysql. Or if it is needed it should be automatic.

It isn't for MySQL since 8.0.16, where this became an automatic process. Personally, I prefer having an explicit step to perform the upgrade, instead of running a new binary to start the database process and also optionally perform the upgrade at the same time.


> Clustered primary keys. Without these, some types of queries can become rather slow when your table doesn't fit in memory. (Example: a chat log containing many billions of messages, which you're querying by conversation-id.)

https://use-the-index-luke.com/blog/2014-01/unreasonable-def...


Clustered indexes can be overused, but they are sorely missing from PG. Having had them forever in SQL Server, it was another surprise to see they don't exist in PG; there is even a confusing CLUSTER command that reorders the heap table but does not actually create a clustered index.

Clustered indexes are great for storage space and I/O for common write and retrieval patterns. If your table is accessed in one way most of the time, and that way needs to be fast and therefore requires an index, a clustered index saves write I/O (only the table is written, no secondary index), disk space (no secondary index redundantly storing the indexed columns), and retrieval time (no indirection when querying the clustered index).

This is great for certain kinds of tables: for instance, log tables that need to be written quickly but also queried quickly (usually by time range), grow quite large, and are append-only. GIS tables, which can also get quite large and are keyed by points, can be packed really tight and rows found quickly. For entity tables that are mostly retrieved via primary key ID during app usage, you trade a slight performance penalty on secondary-index lookups for maximum performance when retrieving by ID, which affects everything including foreign key checks; this again saves space by not writing the ID twice to disk. And tables that are really an index for something else, kept up to date with a trigger etc., usually exist for a specific access pattern and performance.
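
For contrast, a minimal sketch of the syntax (table and index names invented for the example): SQL Server maintains the clustered order on every write, while Postgres' CLUSTER only rewrites the heap once.

    -- SQL Server: the clustered index defines and maintains the physical order of the table.
    CREATE CLUSTERED INDEX ix_app_log_time ON app_log (logged_at, log_id);

    -- PostgreSQL: CLUSTER rewrites the heap in index order once;
    -- later inserts and updates do NOT preserve that order.
    CLUSTER app_log USING app_log_logged_at_idx;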


I have been working with Oracle for more than 20 years now. I think I only had very few situations where an "index organized table" (=clustered index) was useful or actually provided a major performance benefit over a "heap table". So I never really miss them in Postgres.


It is such a common access pattern that many database engines always have a clustered index (MySQL's InnoDB, SQLite) whether you use it directly or not.

I like having a choice as there is in SQL Server or Oracle, but for many use cases it's a waste, in both space and time, to write to a heap and to an index (which is just a hidden IOT) and then look up in the index and dereference to the heap.


> Well, you can’t do that with an equals filter. But how often do you use non-equals filters like > or < on the primary key?

You can; he's wrong. He missed the vastly more common scenario: your primary key is a composite key and you're filtering on a prefix of that composite key. We do that all the time. No < or > required; you get a range of keys with an equijoin. It's additionally common for the first key in the composite key to be some kind of timestamp or monotonically increasing value, so new writes always go together at the end of the table. This sort of key is absolutely begging to be used with a clustered primary key index.
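
A sketch of that prefix pattern (schema invented for illustration): with a clustered composite primary key, equality on a key prefix selects one contiguous range of rows, no < or > needed.

    -- Composite primary key; in a clustered layout (SQL Server, InnoDB) rows sharing
    -- the same (batch_date, device_id) prefix are physically adjacent.
    CREATE TABLE readings (
        batch_date DATE   NOT NULL,
        device_id  BIGINT NOT NULL,
        seq_no     BIGINT NOT NULL,
        value      DOUBLE PRECISION,
        PRIMARY KEY (batch_date, device_id, seq_no)
    );

    -- Equality on a prefix of the key scans one contiguous key range.
    SELECT *
    FROM   readings
    WHERE  batch_date = DATE '2022-02-25'
      AND  device_id  = 42;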

We have a table with 100+ billion rows that uses this pattern. It uses partition switching to append new data. At this size it is absolutely imperative to reduce the number of indexes to keep insert performance high, and we are always querying multiple consecutive rows based on a prefix equijoin. I'd be a fool to follow the author's advice here.

I suspect the author and I live in very different worlds, and he doesn't know my world exists.


> Replication works across version differences, making upgrades without any downtime at all easier.

Postgres has logical replication these days, which can be used to achieve this
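
For reference, a minimal sketch of that logical-replication upgrade path (names and connection string are placeholders):

    -- On the old (publisher) server:
    CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

    -- On the new, already-upgraded (subscriber) server, after loading the schema:
    CREATE SUBSCRIPTION upgrade_sub
        CONNECTION 'host=old-primary dbname=app user=replicator'
        PUBLICATION upgrade_pub;

    -- Once the subscriber has caught up, switch application traffic over.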


Except logical replication doesn't handle DDL yet. Nor two-phase commit (prepared transactions).


Two phase commit for logical replication is now available in Postgres 14.

The biggest issue I see at the moment is that sequences are not replicated.


I haven't used the features but I thought replication and failover were much easier (out of the box at least).


You also don’t usually need connection pooling with MySQL because it can handle thousands of simultaneous connections without falling over.


I know PG14 significantly improved this situation [0]. I'm not familiar enough with MySQL to know how it compares now though.

[0] https://pganalyze.com/blog/postgres-14-performance-monitorin...


There is Vitess for MySQL, but no such cluster manager for Postgres. Citus is now owned by Microsoft and getting harder to use outside of Microsoft's cloud; e.g., no more Citus on AWS.


SQL Server is good, but note the system is oriented differently than Postgres - Postgres is very friendly towards programming directly in the DB. In the SQL Server world, you can do that, but you'd be much better off with a solution where the logic lives elsewhere (not counting reports or bulk imports, which have MSSQL-native solutions).


Very much this - most of my experience is with Postgres but the rough impression I have is that SQL Server is the best of both Postgres and MySQL, and Postgres and MySQL are improving, but not in any great direction towards what SQL Server offers.


From the operations perspective, SQL Server is definitely way better that Postgres - monitoring is such a pain point in Postgres, for example. From application development/feature perspective, it's much closer, and server-side programming in Postgres (if that's what you like/need) is simply better that in SQL Server.


*better than, not better that.


> if none of those apply to you, CockroachDB is a better default.

Though it scales horizontally, you will need to throw at least 10x or more hardware at it to achieve the same throughput


Do you have benchmarks to back that up? The efficiency per core is worse, but I’ve not seen 10x worse. One thing to be careful of is making sure to have parallelism in your benchmark. Single-threaded Postgres may well be 10x faster on workloads of interest, but that’s not very representative of real use.


We switched to CockroachDB for some of our most heavily loaded tables a few years ago - and the scalability and operational simplicity has been great.

I don't have a like-for-like benchmark, but at least for writes - we're running with 5 replicas of each range, so that's 5x the disk write I/O just from that.

Add in some network overhead, running Raft, etc., and you are going to need a lot more resources than for PG to maintain a similar throughput - but adding resources when you can scale horizontally is so much easier that the tradeoff is worth it in my experience.


I'd go for YugabyteDB instead of CockroachDB but your point of distributed SQL being better is spot on.


Maybe postgres is out of date because even an old version like 9.6 is performant, featureful, and stable?


> In my experience, PG is always the most out-of-date piece of any system.

Yes


> PostgreSQL 14 with extensibility patches,

can you elaborate on these patches? Have you worked on any parser-level hooks? I'm building an extension on PG, and have been trying to extend the parser without touching PG's tree, but it looks impossible at this point.


@akorotkov I'm interested also. I see the forked and patched version here, https://github.com/orioledb/postgres, but I don't see any documentation on what changes you made.


There is not exactly documentation. But there you can see that commits after "Stamp 14.2." are extensibility patches. The commit messages briefly describe the changes: https://github.com/orioledb/postgres/commits/patches14 The patches will be polished and published on pgsql-hackers.


That's amazing! Thanks for making this open source.

Could this be extended to support temporal (and bitemporal) tables? Some storage-level optimizations could benefit performance and ease of use for use cases where temporality is important, like financial services systems.

Thanks


That would be very nice. We 'solve' this now with custom duct-tape triggers and stored procs. Work in banking.


> Reduced maintenance needs. OrioleDB implements the concepts of undo log and page merging, eliminating the need for dedicated garbage collection processes. Additionally, OrioleDB implements default 64-bit transaction identifiers, thus eliminating the well-known and painful wraparound problem.

I love this, no more vacuums?


One of the major differences between Oracle and Postgres was this - Postgres keeps old versions of rows around to be vacuumed, while Oracle moves them to an "undo segment". So they are copying the Oracle design.

That has its own set of tradeoffs - if you're running a long transaction with a higher level of serialization, the database has to look for old data in the undo segment, which is not only likely vastly slower than a normal table, but might not even have the data anymore (any ETL developer has probably seen the "Snapshot too old" error from Oracle).

It probably is a better tradeoff for some workloads, like a high volume of updates with no long-running transactions, but it is more of a different design choice than a "fix" by itself.


> That has its own set of tradeoffs - if you're running a long transaction with a higher level of serialization, the database has to look for old data in the undo segment, which is not only likely vastly slower than a normal table, but might not even have the data anymore (any ETL developer has probably seen the "Snapshot too old" error from Oracle).

I would say "Snapshot too old" isn't an essential drawback of the undo log. The DBMS can keep the undo log records until all snapshots that need them are released. And this is how OrioleDB behaves. This is also a kind of "bloat", but cleanup is cheaper: you just have to cut the unclaimed undo log instead of running an expensive vacuum.


Yes, it made me wary - a design choice with different tradeoffs is portrayed as "better".


Isn't the goal to have multiple storage engines, so you could use the one that best fits your needs? The current storage engine probably won't be removed when this is merged.


Right, no more vacuums. And within typical workloads there should be no bloat.


Sounds really exciting. My wife noticed me reading the notes with such intent she asked me what I was reading.


Finally someone tackling the real problems that everyone glosses over when gushing about postgres. Bravo! This is a really excellent project.


How does this compare to zheap? I believe zheap made some similar design choices (undo based and avoiding the wraparound problems), though the replication features seem unique to oriole.


Additionally, OrioleDB fundamentally changes the design of buffer management and improves vertical scalability a lot. https://www.slideshare.net/AlexanderKorotkov/solving-postgre...

Also, I have to mention that zheap development is inactive for now. And its undo implementation was never complete. Zheap undo works only if you don't update any indexed column. Otherwise, it works as a plain heap with write amplification and bloat.


This is a bit of a digression, but since this topic attracted a lot of eyes interested in Postgres and databases…

Is there any hope of improvements to the database backup (dump) functionality of Postgres? Coming from SQL Server (and it seems to me that not a lot of people have experience with both), I was really surprised by how poorly Postgres dumps perform in terms of both backup and restore speed, as well as backup size for similarly sized databases.


How (if at all) is this related to the ZHEAP development? Sounds like this targets more or less the same problems


Please, see this thread of comments. https://news.ycombinator.com/item?id=30466322


Is this a fork or something intended to be upstreamed?


It's both. There is a PG fork which contains some patches to increase extensibility to certain APIs and then an extension developed against that fork.

The idea, I would imagine, would be to upstream the fork but continue development of the engine out of tree, similar to Citus, Timescale, etc.


Yes, that's it.

1. The first goal is to become a pure extension. That should be done in 2-3 PostgreSQL release cycles.

2. The long-term goal is to become a part of PostgreSQL.


When you say 2-3 release cycles, do you mean like targeting 14.5 (probably 2022), or more like targeting 17.x (probably 2025)?


I mean major releases. So, targeting 17.0.


Awesome. Thank you for helping out the community


For those who understand Russian - there's a great overview of this project on the RuPostgres channel: https://www.youtube.com/watch?v=1GgyEqLNXiM


Thanks for mentioning.

Alexander also presented OrioleDB in English at an EDB event: https://vimeo.com/649616139


wow.... this is amazing!

I didn't get this:

`OrioleDB implements default 64-bit transaction identifiers, thus eliminating the well-known and painful wraparound problem`

I thought that PG's 32-bit transaction identifier and wraparound issue was due to the current MVCC implementation that keeps old versions of rows in page tables, so there is the need to vacuum to clean up the dead rows. So my assumption was that using another approach for MVCC, like the UNDO log, this problem wouldn't exist (since page tables would only have the latest version).

What am I missing here?


PostgreSQL's 32-bit transaction identifiers are cycled in a ring. This is why PostgreSQL needs to "freeze" old transaction identifiers before they start being identified as belonging to the future. This routine is "vacuum freeze", and it has to scan and re-write pages even when they contain no dead rows.
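
For the curious, this is the standard way to watch how close each database is to the wraparound horizon in stock Postgres:

    -- age() reports how many transactions ago datfrozenxid was set; as this
    -- approaches ~2 billion, autovacuum switches to aggressive freezing.
    SELECT datname,
           age(datfrozenxid) AS xid_age
    FROM   pg_database
    ORDER  BY xid_age DESC;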


Insanely impressive. If this was available as an extension, I could see us using it in production asap.


Is it just me or is the database scene the new JavaScript scene, new database every week etc.


Despite the distinct name, this project aims to upstream all changes to PostgreSQL (see Alexander's comment: https://news.ycombinator.com/reply?id=30464060&goto=item%3Fi...)

So rather than a "new database", it's probably better to think of it as an iteration of PostgreSQL


Only the additional "hooks" for Table Access Methods need to get upstreamed into core Postgres. OrioleDB is a separate storage engine, not designed to replace the current engine, and can be used as an adjunct storage option.


My only issue is things calling themselves a database when they are either a wrapper layer around an actual DB (most of the new "databases" I see posted) or, as in this case, a storage engine.

Not trying to take away from the great work done here but the product name and link title would be more clear if it was referred to as a storage engine rather than a DB.


Important distinction - this uses the Postgres Table Access Method API to add an additional storage engine to the Postgres platform. This is NOT a replacement for the current storage engine; it can co-exist and be used on a per-table basis.
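
In practice that per-table choice goes through the USING clause of CREATE TABLE; a sketch assuming the access method is exposed under the name orioledb, as the repo's README suggests:

    -- Load the extension, then pick the storage engine per table.
    CREATE EXTENSION IF NOT EXISTS orioledb;

    CREATE TABLE orders (
        id         bigserial PRIMARY KEY,
        created_at timestamptz NOT NULL DEFAULT now(),
        payload    jsonb
    ) USING orioledb;   -- stored by the new engine

    CREATE TABLE legacy_notes (
        id   bigserial PRIMARY KEY,
        note text
    );                  -- default heap engine, co-existing in the same database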


Good point. Especially since the final destination is to merge into Postgres mainline in the future, it would make more sense to call this Oriole Storage Engine...


There is a sore need for better DBs, so it makes sense that the action is there.

My only gripe is that most are focused on the MOST niche of scenarios and the MOST exotic of deployments (the kind that worry about stuff like multi-master across continents).

RDBMSs need more fundamental improvements at the core (like, for example, supporting algebraic types!).


The IT world is still growing, people need scale and existing solutions might not scale like people want.


This looks insanely impressive. Will have to dive in and give this a spin.


> OrioleDB ... solving some PostgreSQL wicked problems

Why was Postgres forked?

I don't want this to come across as negative towards Postgres (or discouraging toward Alexander), but given that these fixes aren't upstream in Postgres - developed by a known/respected contributor - is this indicative of some type of politics/infighting happening within the Postgres community over the architectural direction of their (amazing) database?

I see some comments in this thread touch a little on this topic but why does this need to wait 2-3 release cycles before it's upstreamed to Postgres? That's going to be roughly 4 years from now.


From my point of view, PostgreSQL development is quite conservative. Every small change may become a subject of heated debate, to say nothing of big changes. This approach to development has advantages and disadvantages. But if we imagine OrioleDB accepted upstream, that would be the most dramatic change since PostgreSQL was released as Open Source. This is why I decided to make a fork and then merge it upstream step-by-step.

It's not that bad to use a fork for 2-3 release cycles given that: 1) it's not a highly divergent fork (the patchset is small); 2) the patchset will smoothly decrease over that period of time. Also, even 2-3 release cycles isn't that bad in comparison with the decades during which we've made little progress on these wicked problems.


It's not a fork. It's an extension. That's in both the docs and the build instructions.

That is a very appropriate way to run ahead and add experimental features that could be folded in later, or even to keep maintaining the extension as part of the out-of-the-box release.

It speaks to the architecture of Postgres that this is possible to implement this way.

I would love to see this be successful. Much cleaner than stacking software in front of and around Postgres, and not just a wire-compatible DB like Cockroach.


Fascinating and outstanding work. I'm looking forward to seeing how well it integrates with Hasura!


How can I contact you to get involved?! This is amazing.


Feel free to contact me at alexander@orioledb.com


Awesome concept!

Any concerns about the name being very similar to OracleDB? At least when I first saw it, I read the name in my head as that (so it could just be me!).



