I tried a similar approach in a previous startup - treat data as documents and store in a JSONB field.
Postgres was awesome and handled this brilliantly, but the lack of schema and typing killed it. We just ended up fighting data quality the whole time. We couldn't assume that any document had all the required fields, or that they were in a format that made sense, e.g. the Price field sometimes had currency symbols, sometimes commas and periods in UK/US format, and sometimes European formatting - sorting by Price involved some complicated parsing of all the records first.
We moved back to relational tables.
I won't say I'd never do this again, but I would definitely not just throw JSON documents to a database and expect good things to happen.
Had a similar experience. What killed it for me is that no statistics can be gathered for JSONB columns. This in turn really messes with the query planner once you do something like `select a.* from a join b on a.id = b.a_id where b.my_jsonb_column ->> 'foo' = 'bar';`.
Given the lack of statistics, the query planner loves going for a nested loop rather than a hash or merge join where those would be appropriate, leading to abysmal performance.
There is a thread[0] on the PostgreSQL mailing list to add at least some statistics for JSONB columns, but it has gone nowhere since 2022.
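One partial workaround, using the made-up names from the query above: an expression index on the extracted key, since Postgres does gather statistics for indexed expressions when the table is analyzed.

```sql
-- Expression index on the extracted key; ANALYZE then collects statistics
-- for the expression itself, unlike the raw JSONB column.
CREATE INDEX b_foo_idx ON b ((my_jsonb_column ->> 'foo'));
ANALYZE b;

EXPLAIN
SELECT a.*
FROM a
JOIN b ON a.id = b.a_id
WHERE b.my_jsonb_column ->> 'foo' = 'bar';
```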
it's kind of dumb that postgres uses a nested loop join instead of a hash join there. hash join almost always has the best worst-case behavior, and without stats it should be the default choice.
If you have schema requirements, why _not_ do it with tables and columns instead? The point of the jsonb column would be precisely to store arbitrary stuff that has no schema.
I usually see this (dynamic scripting langs, schemaless databases) play out as a very good environment to prototype in, with very few barriers to change things or make stuff up as you go. Then the prototype is not discarded but "iterated" to form the actual production code. The (now needlessly so) flexible JSON store is grandfathered in.
I have yet to come across an actual use case for JSON stores in a production app with an established design. What the hell do you mean you have no idea what the data might hold?? Why are you storing unknown, undefined, undefinable things??? Or perhaps there actually is a schema, i.e. fields we rely on being present, but we were too lazy to define them formally?
Well, there are a decent number of those use cases, but the one I find most compelling is storing an exact record of what another service sent me in case I need it later. I pull out and store in my typed schema the relevant bits of, say, an OpenID Connect response, but I'll often store the entire response as well in case I want to add functionality in the future based on additional fields, etc. in that response and roll forward without having to have an intermediate state.
Adjacent data that is very alike for most usage but different in subtle ways.
Classified ads, the additional details for a Car are different than a Shirt, but both would be ads. And adding a nearly infinite number of fields or a flexible system in a set of schema or detail tables is significantly worse than unstructured JSON.
Another would be records from different, but related systems. Such as the transaction details for an order payment. Paypal data will be different from your CC processor, or debit transaction, but you can just store it as additional details for a given "payment" record.
Another still would be in healthcare, or any number of other systems where the structures will be different from one system to another depending on data interchange, where in the storage you don't care too much, only in the application layer will it make any necessary difference.
Data owned by another service/application/org that is written and read without modification.
For example, a healthcare document that wasn't built or technically owned by the application storing it.
For example, a web text editor that serializes it's state as json.
Not json, but a web scraper storing html documents.
These have structure, it's only that the structure is built/maintained outside of the application storing it. You could of course transform it but I think it's a bit obvious where that might not be worth the cost/complexity.
You can always change the source of truth so instead of being the JSON texts it's the properly normalized schema and then you can have a VIEW that returns JSON for apps that need JSON. And if need be you can denormalize a bit and have a JSONB column that stores the document as the serialization of the rows' other columns.
That's what's really nice about SQL: you can perform these sorts of schema surgeries and still retain backwards-compatibility using VIEWs.
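A minimal sketch of that pattern, with invented names: the normalized table stays the source of truth, and a VIEW keeps serving documents to clients that still expect JSON.

```sql
-- Normalized table is the source of truth...
CREATE TABLE orders (
    id          bigint PRIMARY KEY,
    customer    text NOT NULL,
    total_cents integer NOT NULL
);

-- ...and a VIEW serializes each row back into a JSON document for legacy readers.
CREATE VIEW orders_json AS
SELECT id, to_jsonb(orders) AS doc
FROM orders;
```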
A use case for me is I’m calling a lot of different 3rd party APIs that return similar things but in widely different data structures with a lot of information specific to that site. I take what is standard across all the API’s and store that in my relational database but anything I didn’t use I store in a JSONB field just in case I need it some time in the future.
In other words I’m not really using anything in that JSONB field… at least right now.
We're evaluating solutions to ingest business app events and infrastructure logs. A document-ish, JSON-ish "DB"/search engine seems like the best solution, because no one does structured logs properly.
Now we have to pick the stack. ELK, Loki + Grafana, Graylog, or maybe just dump into MongoDB?
I think my learning was "all data has schema, eventually"
We thought it was just a bunch of data documents. But it turned out that to actually use that data in an application it had to have predictability, we had to know certain fields were present and in a fixed format. You know: a schema.
I'm confused about this. Since when can't you enforce a schema on a document you're storing? Did you have no way to change the data before it hit the database?
The view I now have is that for a relational table, yes, you have to suffer through migrations, but at least they are in SQL. For document-based stores you still have to have migrations, but they are just implemented in code.
json documents sound great, especially initially, but end up being a maintenance nightmare
> for document based stores, you still have to have migrations, but they are just implemented in code
The problem with that - in my experience - is that migrating the structure of thousands (if not millions) of documents is way slower than running a DDL command, as it means reading each document, parsing it, modifying it and writing it back. Many DDL commands are just metadata updates to the system catalogs, so they are quite fast (e.g. adding a new column with a default value). With documents you wind up with millions of single-row updates.
This can be mitigated by doing a "lazy" migration when a document with the old structure is first read. But that makes the code much more complicated.
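To make the contrast concrete - names invented, and the fast ALTER assumes Postgres 11+ where adding a column with a constant default is a catalog-only change:

```sql
-- Relational: effectively instant, catalog-only change.
ALTER TABLE events ADD COLUMN source text NOT NULL DEFAULT 'unknown';

-- Document-style: every existing document gets read, modified and rewritten.
UPDATE events
SET doc = jsonb_set(doc, '{source}', '"unknown"', true)
WHERE NOT doc ? 'source';
```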
Or, to put it another way: yes, you have to write and maintain an upfront schema, but a document-based system has a schema too, it's just distributed amongst 50 codebases and 10 people's heads.
I don’t do much backend work at the moment but I found using JSONB fields for prototyping endpoints to be quite handy. Once I had a feature figured out and working, I’d create a proper schema.
I've found Postgres' Hstore type useful for this, too.
I often create a "metadata" hstore field on a table and use it for random bits of data I don't want to create an actual field for yet. When I find that the application needs that bit of data and in a certain format, I'll move it into an actual field.
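For anyone who hasn't tried it, the pattern is roughly this (made-up table; the extension has to be enabled once per database):

```sql
CREATE EXTENSION IF NOT EXISTS hstore;

ALTER TABLE widgets ADD COLUMN metadata hstore NOT NULL DEFAULT ''::hstore;

-- Stash an ad-hoc value...
UPDATE widgets
SET metadata = metadata || hstore('beta_flag', 'true')
WHERE id = 42;

-- ...and read it back until it graduates to a real column.
SELECT id, metadata -> 'beta_flag' AS beta_flag FROM widgets;
```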
I get that, but I think the temptation to add too much structure to the JSONB would be too great; this is just a temporary or prototype field, so it just needs a name and a value. Hstore works great for this.
This is an issue regardless of your type of Document storage. There are different options for verifying document details and versioning when approaching Document oriented databases. A typical usage is to have a per-record version number and pass documents through a verifier either as part of insert, update or query to ensure all migrations to the current version have occurred.
True, but for example with db-level validations I don’t have to worry that a new developer ships a batch job that runs raw inserts for performance reasons and bypasses app layer validations.
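Even for JSONB you can push some of that into the database; a hedged sketch with invented names, using plain CHECK constraints so raw bulk inserts can't bypass them:

```sql
ALTER TABLE profiles
    ADD CONSTRAINT profiles_doc_has_email
        CHECK (doc ? 'email'),
    -- Note: a CHECK is not violated when the expression is NULL,
    -- so pair casts like this with a key-presence check if needed.
    ADD CONSTRAINT profiles_doc_schema_version
        CHECK ((doc ->> 'schema_version')::int >= 1);
```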
> I would generally advocate for no write access outside of the app as well.
FWIW I think OP was referring to app code still, but code opting in to skipping app data validations. Rails for example makes this very easy to do, and there are tons and tons of people out there selling this as a performance improvement (this one trick speeds up bulk operations!!!!). There are times where it’s useful, but it’s a very sharp knife that people can easily misuse.
So yeah, anyway, have both db constraints and app validations.
Thanks for pointing that out, I did misread that and have seen/written such things in rails land. We used active record import which gives the option of running validations.
But yeah, layering your safety nets is generally wise. I would include testing and code review in that as well.
What? You should be exposing your data access layer as a microservice and all your db level validations should be done there...
You can even write a small DSL syntax language to make it easier to use for developers, and perhaps an Obviously Redundant Middleware that sits between them to convert their programming language objects to the DSL. Add some batch support, perhaps transactional locks (using mongo, we want to be webscale after all) and perhaps a small terminal-based client, and voila, no one should ever need to deal with petty integrity in the db again.
Most databases eventually have a human typing an INSERT statement or the moral equivalent. It's not duplication to have your correctness functions protect that, too.
You know I’ve found that LLM’s are awfully good at taking random JSON crap and “converting it” to a consistent format. You don’t even need a large LLM… something like ChatGPT 3.5 or an equivalent class of LLM can do the work just fine.
It’s not perfect and requires a fair amount of effort to nail the prompt but when it works it works.
Neat. When I migrated a project from mongo to postgres I took a similar approach, except I only implemented the mongo feel I needed within my own project instead of building a proper library as done here. I was surprised how much performance improved despite using a hacky wrapper.
Cripes, this is a good blog post. Did it get posted to HN for discussion? If not, it should have been. I especially like this part:
> I'm trying to stay humble. Mongo must be an incredibly big project with lots of nuance. I'm just a solo developer and absolutely not a database engineer. I've also never had the opportunity to work closely with a good database engineer. However I shouldn't be seeing improvements like this with default out of the box PostgreSQL compared to all the things I tried over the years to fix and tune Mongo.
Humbleness: that is rare to see around here. It is impressive that you got such speed-ups for your use case. Congrats, and thank you for sharing the blog post.
EDIT
The Morgan Freeman meme at the end gave me a real laugh. I would say the same about my experience with GridGain ("GridPain").
I remember your post back then and it did not make sense at all; many pointed out it was lacking information and that you probably did something wrong with Mongo.
All the stuff under Mongo Problems is garbage, sorry.
I regularly find the hybrid model is a sweet spot. I keep core fields as regular columns and dynamic data structures as JSONB. It brings the best of both worlds together.
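For anyone who hasn't seen it, the shape is something like this (all names invented):

```sql
CREATE TABLE devices (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name       text NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now(),
    attrs      jsonb NOT NULL DEFAULT '{}'::jsonb  -- the dynamic long tail
);

-- Core fields index like any relational column; the JSONB gets a GIN index
-- for ad-hoc containment queries.
CREATE INDEX devices_attrs_gin ON devices USING gin (attrs);

SELECT * FROM devices WHERE attrs @> '{"vendor": "acme"}';
```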
I do this too with Postgres and it is just the best of both.
A robot is a record. A sensor calibration is a record. A warehouse robot map with tens of thousands of geojson objects is a single record.
If I made every map entity its own record, my database would be 10_000x more records and I’d get no value out of it. We’re not doing spatial relational queries.
It's great when you have no reason EVER to decompose the data.
That being said, when you start going "wait why is one record like this? oh no we have a bug and have to fix one of the records that looks like this across all data" and now you get to update 10,000x the data to make one change.
Small price to pay in my opinion. How often will that happen vs how often the database is used. Migrations like that can be done incrementally over time. It's a solved problem.
It’s also trivial to do. My JSON fields are all backed by JSON Schema. And I just write a data migration that mutates the data in some way and have the migration run by one host in a rate limited manner. It’s not quite as good as a traditional change in schema but it’s such a non-issue.
I am glad it works! I have just been subject to several systems that have grown over time that worked very well until it became a problem (and then a huge one) so I am glad you are taking a disciplined approach.
Yup you’re absolutely right. There is no one size fits all. The more you can plan, and the more clear your case is, the less you need to pay for having flexibility.
Yeah we do this as well. Have previously been a heavy MongoDB user but when we migrated to Postgres (using some relational schema with some JSONB for nested objects etc) it just made querying so much easier and reliability has been great.
It's technologically cool, but I would love a "why" section in the README. Is the idea you're a mongo Dev/love the mongo api and want to use it rather than switch to pg apis? Or want to copy some code over from an old project?
I'm sure there are use cases, I'm just struggling to grasp them. Especially if it's about reusing queries from other projects, AI is pretty good at that
For sure, but it feels really risky. If it's a small codebase, I would be more confident knowing what queries I was using and just switch them. If it's a large codebase, I'd want some really comprehensive test coverage, including performance tests
If I were to start a new project, I would directly use Postgres, and possibly add a JSONB column ONLY FOR OPTIONAL fields that you don't query frequently. Throwing everything in a document is just fomenting chaos and pain. That being said, I do love the syntax and structure of Mongo pipelines over SQL.
At $WORK, we use the same approach for integrations with 3rd party systems.
The data that is common to all integrations is stored as columns in a relational table. Data specific to each integration is stored in JSONB - typically metadata used to manage each integration, which varies from one to the next.
It works great and you get the combination of relational safety and no-schema flexibility where it matters.
As far as I can tell, Pongo provides an API similar to the MongoDB driver for Node that uses PostgreSQL under the hood.
FerretDB operates on a different layer – it implements MongoDB network protocol, allowing it to work with any drivers and applications that use MongoDB without modifications.
I don't want to sound rude, but as a bootstrapped founder it kinda boggles my mind how much money people can raise for a product like FerretDB. I just don't see how it can make VC-level returns without at the very least changing licenses, which seems to be the premise behind creating this MongoDB proxy. I am sure there is a narrative for it though, so best of luck!
Also, check your managed service links on GitHub, half are dead.
If bait-and-switch were our strategy, we would have chosen a different license from the beginning. The Apache license allows everyone to fork FerretDB away and do whatever they like with it. It is unlike MongoDB with their initial AGPL that, in theory, allows everyone to, say, run MongoDB SaaS, but in practice, has enough strings attached to scare people off.
We want to have a piece of a bigger pie, not a bigger piece of an existing pie. Providing alternatives makes the whole market bigger.
> Also, check your managed service links on GitHub, half are dead.
The posted project looks like a client that connects to pg but behaves like Mongo, where Ferret is a server that accepts Mongo client connections and uses pg as backend storage.
Yes, I'm using MongoDB API in Pongo to keep the muscle memory. So, it's a library that translates the MongoDB syntax to native PostgreSQL JSONB queries.
Yes to "The posted project looks like a client that connects to pg but behaves like Mongo, where Ferret is a server that accepts Mongo client connections and uses pg as backend storage."
I use JSONB columns a lot, it has its place. It can fit certain applications, but it does introduce a lot of extra query complexity and you lose out on some ways to speed up query performance that you could get from a relational approach.
Which is to say JSONB is useful, but I wouldn’t throw the relational baby out with the bath water.
I think if a field in a JSONB column is used frequently in the most common query patterns, it might be time to "surface" it up among other non-JSONB relational columns. There will be some additional overhead e.g. it should be made a read-only column for consistency's sake; if you update the value of this field in the JSONB column, this surfaced column should also be updated.
However, a bit of duplication is not a terrible trade-off for significantly improved query performance.
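For what it's worth, on PostgreSQL 12+ a stored generated column can do the surfacing without the manual double-write, since the database keeps it in sync (table and field names invented):

```sql
ALTER TABLE shipments
    ADD COLUMN customer_id bigint
        GENERATED ALWAYS AS ((doc ->> 'customer_id')::bigint) STORED;

CREATE INDEX shipments_customer_id_idx ON shipments (customer_id);
```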
I’ve been doing some reasonably serious playing with the idea of using jsonb columns as a kind of front end to relational tables. So basically, external interactions with the database are done using JSON, which gives end users some flexibility, but internally we effectively create a realtime materialised view of just those properties we need from the json.
Anyone else tried this approach? Anything I should know about it?
My mental model has always been to only use JSONB for column types where the relations within the json object are of no importance to the DB. An example might be text editor markup. I imagine if you start wanting to query within the json object you should consider a more relational model.
My experience has been that users like to include a lot of information that's not relevant to the application I'm working on.
My application couldn't really care less about customer names, for example. But the people who buy my software naturally do care - and what's worse, each of my potential customers has some legacy system which stores their customer names in a different way. So one problem I want to address is, how do I maintain fidelity with the old system, for example during the data migration, while enabling me to move forward quickly?
My solution has been to keep non-functional data such as customer names in JSON, and extract only the two or three fields that are relevant to my application, and put them into a regular SQL database table.
So far this has given me the best of both worlds: a simple and highly customisable JSON API for these user-facing objects with mutable shapes, but a compact SQL backend for the actual work.
I always thought JSONB is a great use case for client-specific customizations. For example - a CRM application where you have structured DB tables like "account", "opportunity", etc., but a client-specific table might be "cohort", which is linked to "account" and has fields "cohort size", "cohort name", etc. You just chuck the cohort table into JSONB, and now the company who uses the CRM with the custom cohort table can change that table without you creating/migrating your RDBMS table structure.
That sounds like something that would be better tailored to well normalized tables and custom row entries. The question I'd ask is if something you're concerned about being a column could instead be a row in a different table.
I do something similar, building a lightweight search index over very large relational datasets.
So the tables are much simpler to manage and much more portable, and I can serve search off scalable hardware without disturbing the underlying source of truth.
The downside is queries are more complex and slower.
I've had extremely disappointing performance with postgres JSONB columns. I've used it a lot for MVPs and betas where it's much easier to use JSONB now and lock in your exact relational schema later.
I've now decided this path is a mistake because the performance is so bad. Even with low thousands/tens of thousands of rows it becomes a huge problem, queries that would take <1ms on relational stuff quickly start taking hundreds of ms.
Optimizing these with hand rolled queries is painful (I do not like the syntax it uses for jsonb querying) and for some doesn't really fix anything much, and indexes often don't help.
It seems that jsonb is just many, many orders of magnitude slower, but I could be doing something wrong. Take for example storing a dictionary of string and int (number?) in a jsonb column. Adding the ints up in jsonb takes thousands of times longer than having these as string and int columns in a standard table.
Perhaps I am doing something wrong; and I'd love to know it if I am!
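For reference, the kind of comparison described above looks roughly like this (invented names); the JSONB version has to expand and cast every key/value pair at query time:

```sql
-- JSONB: expand the object and cast each value before summing.
SELECT SUM(kv.value::int)
FROM measurements, jsonb_each_text(doc) AS kv(key, value);

-- Plain table: sum a typed column directly.
SELECT SUM(value) FROM measurements_flat;
```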
Did you have arrays in your jsonb data? I don't currently use jsonb for production performance sensitive queries, but in the past what I learned was that it's great at automatically indexing everything except arrays. You have to manually declare indexes on your arrays.
When jsonb works, it's incredible. I've had many... suboptimal experiences with mongo, and jsonb is just superior in my experience (although like I said, I haven't used it for performance critical stuff in production). For a long time, it kinda flew under the radar, and still remains an underappreciated feature of Postgres.
An "index on expression" should perform the same regardless of the input column types. All that matters is the output of the expression. Were you just indexing the whole jsonb column or were you indexing a specific expression?
For example, an index on `foo(user_id)` vs `foo(data->'user_id')` should perform the same.
They should learn about b-trees and how indexed queries can be done with them either with or without an explicit query language. Then they can decide what kind of data storage service they need. Understand what's happening inside the black box.
While I’d agree that understanding SQL basics is an important fundamental for novices to learn, I started using MongoDB 11 years ago and haven’t looked back.
I made a joke tweet about this in Nov 2023 (even called it "Pongo"). This is definitely a just a funny coincidence, but I'm going to pretend like I can see into the future:
https://x.com/navbryce/status/1720580136737894661
> What makes mongo mongo is its distributed nature
Since when? Mongo was popular because it gave the false perception that it was insanely fast, until people found out it was only fast if you didn't care about your data, and the moment you ensured writes actually happened it ended up being slower than an RDBMS....
Since forever, sharding, distributing postgres / mysql was not easy. There were a few proprietary extensions. Nowadays it's more accessible.
This was typical crap you had to say to pass a FAANG-style interview - "oh of course I'd use Mongo because this use case doesn't have relations and because it's easy to scale" - while you knew Postgres would give you way fewer problems and let you knock out charts and analytics in 30 minutes when finance comes around.
I made the mistake of picking mongo for my own startup, because of propaganda coming from interviewing materials and I regretted it for the entire duration of the company.
Distributing PostgreSQL still requires proprietary extensions.
The most popular is Citus, which is owned by Microsoft, so questions should definitely remain about how long they'll support it rather than pushing users to Azure.
People like to bash MongoDB but at least they have a built-in, supported and usable HA/Clustering solution. It's ridiculous to not have this in 2024.
Trying to use MongoDB by default for new projects because of the built-in HA; dumped Postgres because the HA story is so bad on bare metal (alright if you are ok burning money on RDS or similar).
Current preference:
1. HA MongoDB
2. HA MariaDB (Galera) or MySQL Cluster
3. Postgres Rube Goldberg Machine HA with Patroni
4. No HA Postgres
I'm planning to add methods for raw JSON path or, in general, raw SQL syntax to enable such fine-tuning and not need to always use MongoDB API. I agree that for many people, this would be better.
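On the PostgreSQL side, such a raw JSON path escape hatch would presumably boil down to jsonpath queries (PostgreSQL 12+; table and field names invented):

```sql
SELECT *
FROM users
WHERE jsonb_path_exists(data, '$.address ? (@.country == "PL")');
```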
So how easy is it to distribute it? I don’t have experience with it but the tutorials look terrible compared to, say, Scylla, Yuga, Cockroach, TiDB etc. Again, honest question?
Relatively easy... though, and I may be out of date, you have to choose replication or sharding at any given level... so if you want horizontal scale plus redundancy, you will wind up with slightly more complexity. My experience trying to recover a broken cluster after 5 hours of Azure going up/down in series was anything but fun.
Would have to test, but the library for this post may well work with CockroachDB if you wanted to go that route instead of straight PostgreSQL. I think conceptually the sharding + replication of other DBs like Scylla/Cassandra and CockroachDB is a bit more elegant and easier to reason with. Just my own take though.
Right if you want to use a sharded and replicated system with a document data model, might as well use the real thing rather than introduce the risk and impedance mismatch of an emulation layer. The whole point here is for folks to not have to have the cognitive and performance downsides of a tabular data model
Pongo seems to be a middleware between your app and Postgres. So it will most certainly work absolutely fine on YugabyteDB, if one’s okay with occasional latency issues.
One could optimise it more for a distributed sql by implementing key partition awareness and connecting directly to a tserver storing the data one’s after.
We evaluated these over the years for different purposes/clients. And I don’t expect the ‘quirky’ part in 2024 I guess; it gets hard of course when the loads get specific; but in general I expect things to be automatic (multi master, sharding and replication) and simple to set up. Point out other nodes and done.
People don't usually distribute Postgres (unless you count read replicas and cold HA replicas). But well, people don't usually distribute MongoDB either, so no difference.
In principle, a cluster of something like Mongo can scale much further than Postgres. In practice, Mongo is full of issues even before you replicate it, and you are better off with something that abstracts a set of incoherent Postgres (or SQLite) instances.
Postgres supports foreign data wrapper (FDW), which is the basic building block for a distributed DB. It doesn't support strong consistency in distributed settings as of yet, although it does provide two-phase commit which could be used for such.
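A minimal postgres_fdw sketch, with invented host, database and table names:

```sql
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER shard_1
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'shard1.internal', dbname 'app');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER shard_1
    OPTIONS (user 'app', password 'secret');

-- Queries against this table are pushed to the remote instance.
CREATE FOREIGN TABLE events_shard_1 (
    id  bigint,
    doc jsonb
) SERVER shard_1 OPTIONS (table_name 'events');
```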
I doubt it ever will. The point of distributing a data store is latency and availability, both of which would go down the drain with distributed strong consistency
I think of the Stripe Mongo install, as it was a decade or so ago. It really was sharded quite wide, and relied on all shards having multiple replicas, as to tolerate cycling through them on a regular basis. It worked well enough to run as a source of truth for a financial company, but the database team wasn't small, dedicated to keeping all that machinery working well.
Ultimately anyone doing things at that scale is going to run a small priesthood doing custom things to keep the persistence layer humming, regardless of what the underlying database is. I recall a project abstracting over the Mongo API, so as to allow for swapping the storage layer if they ever needed to.
> What makes mongo mongo is its distributed nature, without it you could just store json(b) in an RDBMS.
Welllllllll I think that's moving the goalposts. Being distributed might be a thing _now_ but I still remember when it was marketed as the thing to have if you wanted to store unstructured documents.
Now that Postgres also does that, you're marketing Mongo as having a different unique feature. Moving the goalposts.
It doesn't really seem reasonable to accuse someone of moving goalposts that you've just brought into the conversation, especially when they were allegedly set by a third party.
There are separate drivers for Java, node.js, python, REST, etc.
In addition to that, it has Mongo API, which is fully Mongo compatible - you can use standard Mongo tools/drivers against it, without having to change Mongo application code.
Both are for Oracle Database only, and both are free.
Yes, I'm using JSONB underneath and translating the MongoDB syntax to native queries. As those aren't super pleasant to deal with, I thought it'd be nice to offer the MongoDB API that's familiar to many.
Regarding IDs, you can use any UUID-compliant format.
I'd like to know why AWS went with Aurora DB for their DocumentDB backend. Did the Mongo license change trigger a rush to build something Mongo compatible, but not quite MongoDB?
This looks great, I'll definitely give it a try.
As many mentioned already, having classic columns and a JSON(B) column seems to be a common solution.
How do you handle data validation for the JSON documents?
My current project uses Django for metadata. I've been thinking about creating a layer similar to model fields in Django. You would declare a JSON "model" through those fields and assign it to the actual model JSON field.
You can just specify a model/DTO object and serialize it as JSON when saving. Many frameworks do that automatically so you don't need to think about it. At work we just annotate the field in the model as a json-field, and the framework will handle the json-conversion automatically and store the other fields in the model as regular database columns.
Adding fields is not an issue, as it will simply be missing a value when de-serializing. Your business logic will need to handle its absence, but that is no different than using MongoDB or "classic" table columns
That's a very low cost approach, I love it!
I still think the Django ecosystem would benefit from a standardized/packaged approach including migrations. I'll ponder a bit more
And it only costs 75k a seat per year per developer, with free twice-yearly license compliance audits, a million in ops and hardware to get near prod, and all the docs are paywalled. What a deal!
A client had a DB hosted by Oracle. The client was doing most of their compute on AWS, and wanted to have a synchronised copy made available to them on AWS. Oracle quoted them a cool $600k/year to operate that copy, with a 3 year contract.
A client of mine wanted to shift a RAC cluster from some aging SPARC gear into a VMware or OpenStack or whatever farm they had on premise; Oracle demanded they pay CPU licenses for every single CPU in the cluster, as each one could "potentially" run the Oracle database, and quoted them seven figures.
MongoDB has supported the equivalent of Postgres' serializable isolation for many years now. I'm not sure what "with strong consistency benefits" means.
Or is it? Jepsen reported a number of issues like "read skew, cyclic information flow, duplicate writes, and internal consistency violations. Weak defaults meant that transactions could lose writes and allow dirty reads, even downgrading requested safety levels at the database and collection level. Moreover, the snapshot read concern did not guarantee snapshot unless paired with write concern majority—even for read-only transactions."
That report (1) is 4 years old; many things could have changed. But so far every reviewed version has been faulty with regards to consistency.
Jepsen found a more concerning consistency bug than the above results when Postgres 12 was evaluated [1]. Relevant text:
We [...] found that transactions executed with serializable isolation on a single PostgreSQL instance were not, in fact, serializable
I have run Postgres and MongoDB at petabyte scale. Both of them are solid databases that occasionally have bugs in their transaction logic. Any distributed database that is receiving significant development will have bugs like this. Yes, even FoundationDB.
I wouldn't not use Postgres because of this problem, just like I wouldn't not use MongoDB because they had bugs in a new feature. In fact, I'm more likely to trust a company that is paying to consistently have their work reviewed in public.
FWIW, the latest stable release is 7.0.12, released a week or so ago: https://www.mongodb.com/docs/upcoming/release-notes/7.0/. (I'm not sure why the URL has /upcoming/ in it, actually: 7.0 is definitely the stable release.)
MongoDB had "strong consistency" back in 2013 when I studied it for my thesis. The problem is that consistency is a lot bigger space than being on or off, and MongoDB inhabited the lower classes of consistency for a long time while calling it strong consistency which lost a lot of developer trust. Postgres has a range of options, but the default is typically consistent enough to make most use-cases safe, whereas Mongo's default wasn't anywhere close.
They also had a big problem trading off performance and consistency, to the point that for a long time (v1-2?) they ran in a default-inconsistent mode to meet the numbers marketing was putting out. Postgres has never done this, partly because it doesn't have a marketing team, but again this lost a lot of trust.
Lastly, even with the stronger end of their consistency guarantees, and as they have increased their guarantees, problems have been found again and again. It's common knowledge that it's better to find your own bugs than have your customers tell you about them, but in database consistency this is more true than normal. This is why FoundationDB are famous for having built a database testing setup before a database (somewhat true). It's clear from history that MongoDB don't have a sufficiently rigorous testing procedure.
All of these factors come down to trust: the community lacks trust in MongoDB because of repeated issues across a number of areas. As a result, just shipping "strong consistency" or something doesn't actually solve the root problem, that people don't want to use the product.
It's fair to distrust something because you were burned by using it in the past. However, both the examples you named -- Postgres and FoundationDB -- have had similar concurrency and/or data loss bugs. I have personally seen FoundationDB lose a committed write. Writing databases is hard and it's easy to buy into marketing hype around safety.
I think you should reconsider your last paragraph. MongoDB has a massive community, and many large companies opt to use it for new applications every day. Many more people want to use that product than FoundationDB.
Can you elaborate on why ‘many large companies’ are choosing MongoDB over alternatives and what their use cases are? I’ve been using Mdb for a decade, and with how rich the DB landscape is for optimising particular workloads, I just don’t see what the value proposition for Mdb is compared to most of them. I certainly wouldn’t use it for any data-intensive application when there are other fantastic OLAP dbs, nor for some battle-hardened distributed-nodes use case, so that leaves a ‘general purpose db with very specific queries and limited indexes’. But then why not just use PG, as others say?
This kinda misses my point. By having poor defaults in the past, marketing claims at odds with reality, and being repeatedly found to have bugs that reduce consistency, the result is that customers have no reason to trust current claims.
They may have fixed everything, but the only way to know that is to use it and see (because the issue was trusting marketing/docs/promises), and why should people put that time in when they've repeatedly got it wrong, especially when there are options that are just better now.
Right, I was curious if you put even more time in :)
I see lots of comments from people insisting it's fixed now but it's hard to validate what features they're using and what reliability/durability they're expecting.
Yes, I have worked on an application that pushed enormous volumes of data through MongoDB's transactions.
Deadlocks are an application issue. If you built your application the same way with Postgres you would have the same problem. Automatic retries of failed transactions with specific error codes are a driver feature you can tune or turn off if you'd like. The same is true for some Postgres drivers.
If you're seeing frequent deadlocks, your transactions are too large. If you model your data differently, deadlocks can be eliminated completely (and this advice applies regardless of the database you're using). I would recommend you engage a third party to review your data access patterns before you migrate and experience the same issues with Postgres.
Not necessarily, and not in the very common single-writer-many-reader case. In that case, PostgreSQL's MVCC allows all readers to see consistent snapshots of the data without blocking each other or the writer. TTBOMK, any other mechanism providing this guarantee requires locking (making deadlocks possible).
So: Does Mongo now also implement MVCC? (Last time I checked, it didn't.) If not, how does it guarantee that reads see consistent snapshots without blocking a writer?
Locking doesn't result in deadlocks, assuming that it's implemented properly.
If you know the set of locks ahead of time, just sort them by address and take them, which will always succeed with no deadlocks.
If the set of locks isn't known, then assign each transaction an increasing ID.
When trying to take a lock that is already held: if the lock owner has a higher ID, signal it to terminate and retry after this transaction finishes, and sleep waiting for it to release the lock.
Otherwise, if it has a lower ID, abort the current transaction, wait for the conflicting transaction to finish, and then retry.
This guarantees that all transactions will terminate as long as each would terminate in isolation, and that a transaction will retry at most once for each preceding running transaction.
It's also possible to detect deadlocks by keeping track of which transaction every transaction is waiting for, then signaling either the highest transaction ID in the cycle, or the one that the lowest ID is waiting for, to abort, wait for the transaction it was waiting for to terminate, and retry.
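The "known lock set, sorted order" case has a direct SQL analogue (table name invented); this is the classic way to avoid deadlocks when two transactions touch the same rows:

```sql
BEGIN;
-- Lock the rows in a deterministic order before updating them.
SELECT id
FROM accounts
WHERE id IN (7, 42)
ORDER BY id
FOR UPDATE;
-- ... perform the updates ...
COMMIT;
```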
Yes, I'm aware that deadlock can be avoided if the graph having an edge uv whenever a task tries to acquire lock v while already holding lock u is acyclic, and that this property can either be guaranteed by choosing a total order on locks and only ever acquiring them in that order, or dynamically maintained by detecting tasks that potentially violate this order and terminating them, plus retries.
However, those techniques apply only to application code where you have full control over how locks are acquired. This is generally not the case when feeding declarative SQL queries to a DBMS, part of whose job is to decide on a good execution plan. And even in application code, assuming a knowledgeable programmer, they need to either know about all locks in the world or run complex and expensive bookkeeping to detect and break deadlocks.
The fundamental problem is that locks don't compose the way other natural CS abstractions (like, say, functions) do: https://stackoverflow.com/a/2887324
Would love a C# version of this. I usually use Mongodb for all of our projects. But we need to use Postgres for a project. This would come in very handy.
Yes, a similar idea, but I don't aim to be 100% MongoDB compliant or a full replacement. My goal is to use as many PostgreSQL features as possible. Having the translation at the library level will allow more scenarios, e.g. sharing connections and using any PostgreSQL hosting.
Hint: I'm an ex-Marten maintainer, so the similarity is not accidental ;)
As OP said, not needing to rewrite applications, and being able to reuse the muscle memory from using Mongo, is beneficial. I'm not planning to be strict and support only the MongoDB API; I will extend it when needed (e.g. to support raw SQL or JSON Path). But I plan to keep the shim API-compliant for the above reasons.
MongoDB API has its quirks but is also pretty powerful and widely used.
Oh, so you are, then we can rest assured this will end up being a solid project!
I personally can't stand MongoDB; it's given me a lot of headaches. I joined a company, and the same week I joined we lost a ton of data and the twat who set it up resigned in the middle of the outage. Got it back online and spent 6 months moving to PostgreSQL.
Thanks, that's the goal: to bring the solid and verified approach in Marten to Node.js land. The concept is similar, but the feature set will be different.
MongoDB and DynamoDB are completely different DBs. One is unlimited-scale KV but very expensive; the other is a document NoSQL DB that sells you the idea that "it just works" for lots of features: indexes on anything, aggregation, time series, vector DB, sharding, replicas, etc. It's a very powerful DB for sure.
https://gist.github.com/cpursley/c8fb81fe8a7e5df038158bdfe0f...