Your database skills are not 'good to have' (2023)

ergonaught · 2024-04-01T14:36:42 1711982202

Every time I read someone's "horror story" of "queries that used a lot of joins, like even 3 or 4 tables", I...well, sigh, at this point, I guess. It's been a very long time since I needed advanced SQL to do things that probably shouldn't be happening in the database anyway, but the fundamentals seem largely foreign concepts today (speaking generally), and they are definitely still relevant and enormously helpful.

rhdunn · 2024-04-01T14:58:23 1711983503

Doesn't that happen when you use a highly normalized database table design, though where you create tables to avoid repeating data? Or if you have a lot of 1/many to many relationships in mapping tables. I.e. you'll end up with a lot of tables you need to join on to collate that data (authors, subjects, languages, etc.) to display to the user.

You could break the joins into three parts:

1. searching/filtering -- which tables you need to join on to match the selected queries;

2. sorting -- determining which column(s) of data to sort the results by; and

3. displaying -- which tables you need to join on to retrieve the information needed to display the information to the user.

These may be different or have overlaps.

stouset · 2024-04-01T15:03:28 1711983808

I think they’re saying that JOINing over three or four tables is really just not that big a deal for most situations, particularly if you have the needed indices.

rhdunn · 2024-04-01T15:09:22 1711984162

Yeah. That's something you would expect databases to be good at. Expecially as normalization and many-to-many relationships make multiple JOINs common, which was my point.

ergonaught · 2024-04-01T15:05:59 1711983959

Joins are rarely the problem. See the sigh.

boesboes · 2024-04-01T14:46:28 1711982788

That seems to be the exact point the author makes: you need to know some fundamentals to properly use databases. Even if you are not doing super complex stuff.

ergonaught · 2024-04-01T15:04:45 1711983885

I was certainly not disagreeing with the author.

west0n · 2024-04-01T14:52:24 1711983144

lesson: DONOT join more than 3 tables in MySQL, lol

cess11 · 2024-04-01T18:45:59 1711997159

Why not? I've worked with a database where we needed to join more than double that on some common queries, in part because financial stuff tends to touch several other business domains at once.

The explains can be hairy at first but once you've started systematically analysing a query and know some of the quirks of the MySQL query planner you'll manage to identify missing indices and squeeze out very, very good performance.

That was a database where we had warm tables bigger than fifty million rows. I personally migrated it from single instance MySQL 5 to a MySQL 8 clustered rig and optimised hundreds of queries written by people learning basics on the job, going from horrifying to very reasonable latencies as seen from the web clients.

cogman10 · 2024-04-01T15:05:40 1711983940

Wait... why?

Admittedly I'm mostly using MSSQL, but joining multiple tables seems a pretty natural thing to do (assuming you are joining on well indexed tables).

Izkata · 2024-04-01T15:30:21 1711985421

Table locks.

This is specific to the default MyISAM engine, where it locks all tables used in a query for the duration of the query. So two queries that touch the same table can't run concurrently.

You may never hit this as a problem if your queries are fast enough, but it is something to be aware of.

cogman10 · 2024-04-01T15:35:34 1711985734

Didn't MySQL switch to InnoDB as the default engine a long time ago (my googling says 2010)? Does InnoDB have the same limitations?

Izkata · 2024-04-01T15:39:15 1711985955

I actually didn't realize they switched (we're still on an ancient version and unable to upgrade for now).

InnoDB uses row-level locks, so it's extremely unlikely to cause such issues.

erik_seaberg · 2024-04-01T19:42:17 1712000537

InnoDB also supports transactions.

bearjaws · 2024-04-01T15:30:31 1711985431

You can join dozens of tables with almost no impact?

The only issue is joining with aggregations and sorting, that's where things get ugly with MySQL.

steve_adams_86 · 2024-04-01T15:09:32 1711984172

In my experience (I know nothing about this) if I needed to join a lot, I’d actually create new tables which routinely aggregated that data into tables I could query more efficiently instead. Especially if it was analytical data which was used for things like dashboard reporting.

It seems excessive at first but the performance is so dramatically better that it makes a lot of sense. Especially if the queries are part of your hot path. I always thought of it like having core business logic tables, then utility tables which are essentially derived from those core tables.

west0n · 2024-04-01T15:14:13 1711984453

this is so called "Materialized View"

mewpmewp2 · 2024-04-01T15:04:58 1711983898

What? It should be fine to join that amount and more, given you have proper indexes.

marcosdumay · 2024-04-01T15:42:58 1711986178

More than this.

Joining the tables in the database should perform better than joining them on the application layer. If it doesn't, there's something very odd.

This holds for any production ready DBMS. Even the ones that don't care about performance or preserving your data.

sumtechguy · 2024-04-01T17:47:26 1711993646

As a wizened old grey beard once told me 'almost always with SQL, it depends'. Basically you have to revisit your ground truth all the time. Sometimes keeping stuff client (like a lookup table) is the right way. Most of the time server is better. But not always. 'it depends'.

marcosdumay · 2024-04-02T02:37:46 1712025466

"Odd" is not "wrong".

But wrong is almost always odd, so the normal expectation is that you understand the oddity on your system. Otherwise you are blind to problems.

Anyway, it's also not common for an application to have "reduce total CPU usage" as a goal. Available CPU on the database is way more valuable than on the application server, and so it makes sense to trade them up.

sumtechguy · 2024-04-02T13:09:43 1712063383

In most SQL cases your limiting factor will be your network. So you want to minimize what is coming over (usually). However, that tradeoff is time. If it takes more time for your client to render than it would just to send it you usually let the server do it. For example if I have a 50 row lookup with 10 columns table that is joined to a 300k row table, that for some reason must be returned to the client. Now given those conditions. It might be faster to glue it back together on the client. It just depends on how much is in those 50 rows. Just bunch of ints (probably not). A few 400 byte strings? That could be interesting and faster to do on the client. Like he said 'it depends'.

baq · 2024-04-01T15:22:42 1711984962

This was valid advice in MySQL 3 days.

Twirrim · 2024-04-01T15:45:12 1711986312

People love to keep trotting it out, which is annoying. "Can't use MySQL because <thing-that-hasn't-been-true-for-over-a-decade>", which I guess is arguably adjacent to other points in the article about the stuff databases have added over the years:

Check in on your databases periodically, and refresh your memory. There's a very real chance they've picked up features that'll make your life a lot easier.

frfl · 2024-04-01T15:05:48 1711983948

Can anyone share good resources or codebases that have high quality web backend with optimized queries and data structures (db schema, backend internal data structures, overall backend architecture)?

I feel like a lot of code you see in articles or modern companies if often just following the popular architecture/paradigm du jour. As opposed to what this article highlights in terms of making backends return all the data you need in the minimal number of queries. I feel like I understand what's being described here, but I've never actually had the chance to work on anything non-trivial that does it like this, most places I've worked have gone no-SQL or used an orm or were not more than very thin wrappers around some CRUD queries.

stavros · 2024-04-01T15:13:37 1711984417

There's nothing in the ORM that prevents you from doing what the article describes; hell, the article describes doing it with the ORM, and, double hell, the ORM does the right thing by default, at least as far as avoiding N queries goes.

The solution is to know how a database works, instead of thinking that Product.objects.filter(color="red") and [product for product in Product.objects.all() if product.color = "red"] are exactly the same thing.

frfl · 2024-04-01T15:15:52 1711984552

I understand that. I was asking more about if anyone knows any open source codebases that are high quality, non-trivial and well designed, that one should read and study to better see how others have written good code and systems.

js2 · 2024-04-01T15:40:44 1711986044

Sentry is built on-top of Django. It's a good example of a real-world non-trivial code base that has had to adapt as it grew up and scaled up. I haven't looked at the code in a few years, but I used to run it on-premise and found it fairly approachable and well-written, but with just the sorts of hacks you'd expect to see in a software code base that had evolved over time.

https://github.com/getsentry/sentry

Here's all the database models:

https://github.com/getsentry/sentry/tree/master/src/sentry/m...

Here's an example of hacks in the form of massive caching to improve performance:

https://github.com/getsentry/sentry/blob/master/src/sentry/d...

ultrasaurus · 2024-04-01T19:28:53 1711999733

(Disclaimer: Former Sentry employee)

There's a ton of DB passion at Sentry but most importantly for anyone this deep in the thread, Sentry also detects and helps fix common DB mistakes whether it's you or the ORM: https://docs.sentry.io/product/issues/issue-details/performa...

If you're at the “why it’s slow” runbook, Sentry turns the first 2 steps into 1 click: https://docs.sentry.io/product/performance/queries/

baq · 2024-04-01T15:21:21 1711984881

Yes, the sqlalchemy docs. Dead serious.

frfl · 2024-04-01T15:31:40 1711985500

That's somewhat surprising, but then again, maybe not. Thanks, I'll take note of that.

For anyone else, I also remember PostgresSQL manual/docs being very high quality, so that might be worth reading just to better grok SQL/relational DB topics.

cess11 · 2024-04-01T18:38:26 1711996706

If you're digging into sqlalchemy and enjoying it, you would probably enjoy the source of Ecto, the kinda-sorta-but-not-really ORM that's commonly used with Elixir.

happytoexplain · 2024-04-01T15:21:25 1711984885

As a database novice, what is the difference? I assume one of the two is wastefully fetching all products and then filtering, while the other is omitting the filtered results as part of the fetch, and is therefore more efficient? If so, which is which?

kileywm · 2024-04-01T15:27:02 1711985222

Product.objects.filter(color="red")

This is the efficient one, filtering results upstream, via SQL, before they are returned to the client.

alpha_squared · 2024-04-01T15:28:12 1711985292

Your intuition is correct. Product.objects.filter() will create an optimized query to issue to the database. Meanwhile, Products.objects.all() will retrieve all the records and leaves you with the task of filtering them in-memory.

thedanbob · 2024-04-01T15:27:41 1711985261

Correct. I'm not familiar with that ORM, but the first should produce SQL like "SELECT * FROM products WHERE color = 'red'" and only load the records you're interested in, while the second loads everything and throws away probably most.

Izkata · 2024-04-01T16:13:22 1711988002

It's Django (python) if you weren't sure of the name and wanted a look.

The table naming convention from the model name is a bit different, but impossible to know from just that code snippet. Also Django likes to explicitly list every field instead of *

sgarland · 2024-04-01T17:50:33 1711993833

> Also Django likes to explicitly list every field instead of *

Tbf this is a good pattern. It’s rare that you _need_ SELECT *, and if you do, you may not in six months after DDL has added more columns to the table.

sgarland · 2024-04-01T15:24:22 1711985062

Given that one says Products.objects.all(), which one do you think it is?

Good on you for correctly determining the difference, though!

happytoexplain · 2024-04-01T15:31:09 1711985469

Ah, I see - since I'm not familiar with this language/API, I couldn't really infer what was happening behind the scenes. E.g. '.all()' could be constructing a request, not necessarily executing, and then the 'if' might modify that request intelligently.

Izkata · 2024-04-01T16:22:26 1711988546

In this ORM, both versions construct without executing until the results are accessed. The list comprehension looping over the results is an access, the "if" happens after that.

The one exception to the obvious "results are accessed" intuition is slices (such as [:3] to mean first 3 elements), which thanks to how python implemented them lets the ORM use them to further modify the query by adding LIMIT instead of executing it.

stavros · 2024-04-01T15:50:34 1711986634

You're right, this is Python/Django, where you don't get these semantics (Product is the database model, and it doesn't have access to the "if" to use in the filter). It was more of a general example, though, of "loop vs filtering in the SELECT", so the language doesn't matter much (well, apart from effectively making my point, which it perhaps failed at).

ryanjshaw · 2024-04-01T16:13:52 1711988032

In Microsoft land there's a sample reference architecture for .NET web apps: https://github.com/dotnet/eShop

brightball · 2024-04-01T15:31:33 1711985493

Modern Ruby on Rails makes this really easy with Strict Loading.

https://til.hashrocket.com/posts/l4t5xok8hg-activerecord-str...

One of the biggest perks with the association loads is that it avoids joins by default, which can become a scaling barrier as you grow. Instead, it will capture the associated ids, find all of the related records in a single query and then transparently handle everything for you behind the scenes.

Without any optimizations, you'll get essentially 1 query per loaded association rather than the N+1 "one per association per row".

You can always tune things to utilize joins, subqueries and optimize specific cases later but this is a sensible setting that will avoid the vast majority of issues.

eatonphil · 2024-04-01T14:27:51 1711981671

Good call linking to Use The Index Luke. A great resource.

https://use-the-index-luke.com/

sherburt3 · 2024-04-01T15:08:33 1711984113

His book SQL Performance Explained is also very enlightening as well. I think the most useful takeaway I got from that book is that the effectiveness of an index is heavily determined by how the application will be querying that table. If you want to design an effective index for a table you should experiment with multiple options and benchmark the queries that the application is going to run.

saikatsg · 2024-04-01T14:42:17 1711982537

Thanks for sharing!

throwaway5959 · 2024-04-01T15:27:41 1711985261

The number one mistake is really that the industry continues to pile more and more work on software engineers, including expecting them to be experts at a topic that used to be a whole job (DBA).

pradn · 2024-04-01T15:42:49 1711986169

In the past twenty years, the job of "software engineer" has taken upon ops/QA/DBA/support.

gigatexal · 2024-04-01T14:41:40 1711982500

I love everything about this article. This is the kind of database black arts that I really enjoy.

cogman10 · 2024-04-01T15:18:20 1711984700

Black arts?

Databases, pretty much all of them, are relatively simple creatures. If you can understand how a Tree structure and pointers work, you basically have most of the fundamental knowledge underlying the "black art" concepts of databases.

The primary key/clustered index is how the data is physically stored and sorted, in a btree. Additional secondary indexes are generally quiet literally just secondary tables clustered on the secondary index key with primary key pointers. The job of the optimizer is to take when you have "Select foo from bar where baz=2" and ask questions like "Is there an index on baz. Does that index contain foo?" and then executing based on that. If there is an index on baz, it traverses the baz index looking for 2. If that index covers foo, then the query is done and it just pulls foo from the baz index. If it doesn't, then it queries the foo table with each of the primary keys in the baz index. If no index exists, then the db has to do a linear search through the table looking for "baz=2".

When databases are slow, it's because you are forcing them to do O(n). When they are fast, it's because you use an index to turn that into a O(log n) search. (or in some cases, O(1).)

The black black arts of databases is when you start getting into things like CTEs and windowed functions and their interactions with the indexes/optimizers. That's when it can be almost trivial to introduce O(n^2) performance if you aren't careful.

NoSQL databases generally speaking, are implemented similar to how SQL databases are (just trees) but with less features and less guarantees.

sgarland · 2024-04-01T15:35:08 1711985708

> The primary key/clustered index is how the data is physically stored and sorted, in a btree.

This is only true for RDBMS with a clustered index, like MySQL (assuming InnoDB) and MSSQL. Postgres stores tuples in a heap. The indices are generally B+trees, yes, but there is an extra level of indirection via the visibility map.

There are a million gotchas, is the problem. You declared a partial index on foo as WHERE foo = true, but you’re querying for WHERE foo IS NOT false? I have bad news about the secret third state, WHERE foo IS NULL (not to mention the subtle differences between equality and IS).

You made a UUIDv4 as the PK, thinking that since Postgres doesn’t cluster on it, it won’t suffer the same performance issues? Visibility map just wrecked your IOPS, enjoy the 7x hit to pages.

You have an index and it’s not being used? Could be incorrect equality check (see first example), or inadequate auto-analyze on a write-heavy table leading to incorrect statistics, or any other number of things.

RDBMS are much easier to understand if you know what a B+tree looks like, yes, and I highly recommend any backend dev take the time to do so. But there is so, so much more that can go wrong.

cogman10 · 2024-04-01T15:42:35 1711986155

> RDBMS are much easier to understand if you know what a B+tree looks like, yes, and I highly recommend any backend dev take the time to do so. But there is so, so much more that can go wrong.

I agree, but that stuff that goes wrong isn't (in my experience) super common. At least, not until your tables reach fairly large (as in 10s of GBs of data) sizes. Which is why I'd suggest you still employ DBAs :).

You wouldn't, for example, think about pulling out a filter index if you weren't dealing with fairly large tables where a full index might have an overly large negative impact.

On the flip side, I've seen more than a few tables where a fundamental understanding of how the data is structured would have prevented a multitude of issues like bad or missing indexes. (My favorite that is depressingly common `Create Index blah ON foo(id, thingToIndex)`)

sgarland · 2024-04-01T17:17:27 1711991847

> I agree, but that stuff that goes wrong isn't (in my experience) super common. At least, not until your tables reach fairly large (as in 10s of GBs of data) sizes. Which is why I'd suggest you still employ DBAs :).

As a DBRE who oversees tables in the 100s of GB, some in TB range, these things happen entirely too often. But then, it’s my job. And yes, I whole-heartedly agree that past a certain scale, you need DB folks.

gigatexal · 2024-04-01T15:57:50 1711987070

The art comes from the querying engine and statistics and the time of day and if I’ve sacrificed a goat or not. Okay maybe not those last two but there’s still some nuance to this engine to engine.

gigatexal · 2024-04-01T15:58:50 1711987130

What indexes give you O(n)? IIRC btrees are o log n.

cogman10 · 2024-04-01T16:23:15 1711988595

No index gives you O(n) (technically O(n log n)). I'm talking about doing a table scan instead of a seek.

Or did you mean O(1)? Some RDBMs have Hash indexes which given you (effectively) O(1) searches.

For example: https://www.postgresql.org/docs/current/indexes-types.html#I...

dieselgate · 2024-04-01T14:44:20 1711982660

Can’t believe they mentioned “Autonomy” here which definitely dates the piece - I’ve never ever heard it referenced. Was a part of an Autonomy cms upgrade at a previous job and it’s pretty old and bad - but got to learn a lot.. Am interviewing for a job now and they seem to have Coldfusion as part of their stack - prefer that to autonomy..!

lakomen · 2024-04-01T17:32:05 1711992725

I've been thinking recently, well for a longer time.

The next project will be like this: I will name the endpoints PagenameFunctionality. E.g. /IndexGetMyProfile There will be global scope functions or endpoints but many scoped endpoints.

And there I will do the proper queries just for this use case, and so on for each new use case.

In the context of SPA/PWA/TWA I think this is a good approach.

Graphql is too complex on the client and RESTful too vague.

cess11 · 2024-04-01T20:43:25 1712004205

Can call it Grammarless RPC.

pants2 · 2024-04-01T15:38:05 1711985885

Has Postgres yet enabled any sort of locking of the query plan?

It's been a while since I've used it at scale, but at my previous job half of our incidents were caused by Postgres randomly deciding to change the query plan for a call that was working just fine. Then I'd go in there and re-write the SQL a few times until it figured out what to do, rinse and repeat every few months.

polishdude20 · 2024-04-01T15:49:36 1711986576

I'd love to have that. At a previous gig a query ballooned up to taking 30 seconds up from 1 second. The reason was the query planner decided to use a different plan. Any statistics update didn't work. Finally I tried just swapping the join order. The plan changed all was well. Until the planner decided to change the plan again...

natoliniak · 2024-04-01T15:43:09 1711986189

I'm dealing with a very large partitioned table (several millions of inserts per day) and i solve this by running `vacuum analyze` nightly which somehow solved 99% our read issues

Prior to that, we tried various indexing strategies, de-normalizing some of the data, but ultimately i found that weird plan inconsistency to be solved by the vacuum job

sgarland · 2024-04-01T15:44:08 1711986248

Not natively. There is pg_store_plans [0] which I haven’t used, but looks interesting.

Most of the time, query flips are due to inaccurate statistics over time. Try changing the default statistics target for the affected column.

[0]: https://github.com/ossc-db/pg_store_plans

timacles · 2024-04-01T20:13:48 1712002428

No offense guys, but just get a real DBA.

Postgres doesnt just randomly do things and will always produce the same result given the same circumstances.

Either you werent running routine maintenance and your statistics are messed up, or the statistics arent properly configured or your data is just poorly stored.

ryanjshaw · 2024-04-01T14:37:03 1711982223

I don't understand why they didn't generate a bunch of static pages from the database and avoid a whole lot of stress?

mrkeen · 2024-04-01T14:59:06 1711983546

Faceting.

> A user will land on the search page and have the ability to “drill down” and narrow the results based on those properties. To make things much harder, all of these properties would come with exact counts.

ryanjshaw · 2024-04-01T16:07:52 1711987672

Have a look at the archived page: https://web.archive.org/web/20070519214259/http://nymag.com/...

Click on "Types of Clothing" - unfortunately the link isn't archived but you can see the URL: http://nymag.com/search/fashion-search.cgi?nymbreadcrumb_pus...

Judging from the query params, it looks like you can only filter one item at a time - seems like static pages would work.

renegade-otter · 2024-04-05T11:18:01 1712315881

No, you could have drilled into multiple facets. I don't think the Archive cached the deep-nested results to see that.

So, you could go "Dolce & Gabbana -> Bag -> Red". Then you could remove the middle one, for example, and end up with "Dolce & Gabbana -> Red"

This gives you unlimited permutations of results, so static pages were a no go.

samsk · 2024-04-01T15:03:25 1711983805

This! You come to a fintech to review 'some' performance issues and find a stupid data model and transaction processing completely ignoring how databases work and what limitations that concrete database has ! But hey, look, we used this shiny tech and that newest stack, where is the problem ???

chrsig · 2024-04-01T15:39:56 1711985996

> Imagine if SELECT field1, field2 FROM my_table was faster than SELECT field2, field1 FROM my_table. Why would it do that? I have no idea to this day, and I don’t even want to know.

> I don’t even want to know.

I feel this. I'm working on a team where I have significantly more tenure at the company than the other engineers on the team, and they'll often ask me questions starting with "why".

Normally, I love to encourage this -- curiosity is an amazing quality to have in coworkers. There are still some times when I just need to look at them like I'm dead inside and warn them that it's not worth their time on this earth to find out.

gigatexal · 2024-04-01T16:00:36 1711987236

See this is what I find interesting. Some do others just say the DB is this idiosyncratic thing that sometimes does Z and sometimes does Y.

west0n · 2024-04-01T14:50:57 1711983057

TLTR:

Understand and optimize relational databases (MySQL, PG) rather than introducing a new type of database (e.g., DynamoDB, Kafka, Redis) whenever issues arise. Common mistakes:

- Lack of indexes

- Using ORM (generating too many SQL queries or inefficient queries that return all columns)

Some suggestions:

- Use relational databases where possible, instead of NoSQL

- Use read-only replicas instead of Redis when applicable

- Keep relational databases clean and data size small

The author's message is: Hey, folks, if you don't have the ability to use relational databases properly, then your usage of Kafka and DynamoDB will certainly be a mess as well.

corytheboyd · 2024-04-01T14:59:52 1711983592

You can use an ORM and not explode the number of queries to infinity.

I’ve never seen it in the wild, so I get the blanket ban. In theory though you COULD contain ORM usage behind an intentional data access interface.

I can only really speak for Rails from my own experience, but it most certainly tries exactly 0% to get you to care about this problem. It encourages deeply nested, at all layers of abstraction, through class inheritance and all that garbage, unrestricted access to the database through a very powerful ORM.

You end up tuning performance of queries that end up changing in a month because of this. It’s wild.

cameronh90 · 2024-04-01T15:12:11 1711984331

I'm a big fan of SQLAlchemy. While not without its quirks, it lets you write relational-style queries, but in native Python, and automatically maps the result sets to objects. If you have N+1 issues with SQLAlchemy, you probably would have had them with your own custom-written SQL queries also.

cogman10 · 2024-04-01T15:21:46 1711984906

Yeah, this is why I generally just don't like ORMs. They often just leaky abstractions where you end up not only needing to know how the ORM works, but also SQL when you need to go back and fix the ORM.

There are exceptions, like jooq, which try not to really be an ORM and instead are just a language extension to help write sql (funnily, being an even more leaky abstraction is preferable here :D)

west0n · 2024-04-01T15:15:25 1711984525

so I leaved Rails, lol

corytheboyd · 2024-04-01T18:06:01 1711994761

Don’t blame you!

LargeWu · 2024-04-01T14:26:08 1711981568

Smart developers both know SQL well, AND use ORM's.

Sure ORM's are not going to write the most optimized code by default, and they're not going to build indexes for you, etc. But they are a huge productivity enhancer and work well enough for a lot of the queries you're going to need. Most ORM's these days have advanced options too which enable the more fine-tuned use cases without having to drop into raw SQL.

randomdata · 2024-04-01T14:44:56 1711982696

Your second paragraph reads more like you are talking about query builders rather than anything in the domain of ORM. ORM happens after the query has executed and you have a relation already in hand. Everything that happens before the relation is made available is outside of the ORM purview.

Some ORM toolkits bundle query builders as a secondary feature, but they are still logically distinct features.

itsdrewmiller · 2024-04-01T15:01:57 1711983717

The only reason ORMs are relevant to this conversation at all is because they include query generation. The vast majority for popular ones do that - one that I know of that doesn’t (dapper) labels itself a “micro ORM” because of it.

randomdata · 2024-04-01T15:11:04 1711984264

ORM never includes query generation. Some libraries often include both ORM toolkits and query builders, but that's like calling sorting 'encryption' because libraries often include functionality for both. That would be quite silly, and confusing for the reader.

simonw · 2024-04-01T15:23:45 1711985025

Every ORM I've ever used (or contributed to building) has included a query builder as a core component.

What ORMs have you seen that don't do that?

randomdata · 2024-04-01T15:33:55 1711985635

ORM is the act of mapping relations (sets of tuples) to objects (graph structures). How could query building be a part of that?

There are ORM toolkits that help produce the code to perform that mapping to save you from doing it by hand (although you can do it by hand, if you so wish!), but not even they could shove query building in the middle. Query execution has to happen before ORM can take place. You cannot perform ORM until you have the relation already in hand. (Mutation cases are in reverse, but I know you are capable of understanding that without it spelt out in detail)

What would it even mean for ORM to include query building?

If I'm reading you right, you seem to be saying that database helper libraries often include both query building and ORM toolkit features, usually along with other features like database migration management. Which is very much true. But why would you call one feature by the name of another? If you call query building ORM, is ORM best called database migration?

Izkata · 2024-04-01T15:48:36 1711986516

The mapping is bi-directional. Query generation is object -> relation, result mapping is relation -> object.

randomdata · 2024-04-01T15:50:35 1711986635

> The mapping is bi-directional.

"(Mutation cases are in reverse, but I know you are capable of understanding that without it spelt out in detail)"

Guess you were't capable after all. How do tech people manage to be so out to lunch all the time?

Izkata · 2024-04-01T15:54:07 1711986847

I am the 4th person you've responded to in this thread.

randomdata · 2024-04-01T15:56:09 1711986969

We already know. There are little username thing-ys that communicate that. Is there something you are trying to add?

Izkata · 2024-04-01T16:07:46 1711987666

This only makes sense in a back and forth with one other person:

> Guess you were't capable after all.

randomdata · 2024-04-01T16:12:31 1711987951

No...? The assertion about being able to understand the reverse case went out to anyone reading the comment. This is clearly a community forum, not some kind of private messaging system. Comments are not directed towards anyone in particular.

Stop and think. If you actually did somehow mistakenly believe it was a one-on-one with another person, why would you but your head into the conversation? That would be completely illogical. There is a curious contraction here.

Izkata · 2024-04-01T16:29:02 1711988942

It's the "after all" that implies you thought you were responding to one person the whole time, which is why I corrected that.

randomdata · 2024-04-01T16:39:33 1711989573

You can rephrase it that way, sure: "After all this (being explicitly told that there is an inverse case), you still weren't capable of understanding what was written."

The first half of your comment does not logically precede that, though. There is no such implication.

simonw · 2024-04-01T19:23:53 1711999433

You're the only person I've encountered that has expressed a strong opinion that the query building part of an ORM should be considered separate from the rest of the ORM.

Do you have any examples of open source ORM libraries that stick to the terminology you are advocating here? I've not seen one myself.

Maybe this is partly an ecosystem thing. What ecosystems do you mainly work in? I'm primarily Python with a bit of JavaScript, so I don't have much exposure to ORM terminology in Java or C#.

randomdata · 2024-04-02T05:56:11 1712037371

Irrespective of naming, can we at least agree that there are, at minimum, two completely different features being spoken about here?

1. A feature that prepares a query to send to the database engine.

2. A feature that transforms data structures from one representation to another.

And we agree that these are independent? You can prepare a query without data transformation, and you can perform data transformation without preparing a query? I think we can prove that, if you are still unconvinced.

Okay, so that just leaves naming. What should we call these distinct features?

Here are my suggestions:

1. Query building. Building a query is the operation being performed.

2. Object-relational mapping; ORM for short. Mapping objects to relations (and vice-versa) is the operation being performed.

These descriptive names seem well suited to the task in my opinion, but I am open to better ideas. What have you got?

Now, there is something else in the wild that we haven't really talked about yet, but may be that which that you are alluding to. Another abstraction, or pattern if you will, that rests above (to use my suggested terms, bear with me) both query building and objet-relational mapping to unify them into some kind of cohesive system that actively manages records. This is where it starts to become sensible to consider (again, using my suggested terms for lack of anything better) query building and ORM intertwined.

I would suggest we call that active record. In large part because that's what we already call it. In fact, what is probably the most popular and well known active record library is literally named ActiveRecord, so-named because it was designed after the active record pattern. But, again, open to better ideas.

> What ecosystems do you mainly work in?

Python, Javascript (well, Typescript, if you want to go there).

simonw · 2024-04-02T17:06:20 1712077580

My mental model of what an ORM does is that it provides you with an object-oriented API that disguises some of the underlying mechanics of how the SQL queries work.

Take Django for example:

    Entry.objects.filter(created__year__gte=2024).exclude(tags__tag="python")

That's an object-oriented API that builds a query that looks something like this:

    SELECT
      "blog_entry"."id",
      "blog_entry"."created",
      "blog_entry"."slug",
      "blog_entry"."metadata",
      "blog_entry"."search_document",
      "blog_entry"."import_ref",
      "blog_entry"."card_image",
      "blog_entry"."series_id",
      "blog_entry"."title",
      "blog_entry"."body",
      "blog_entry"."tweet_html",
      "blog_entry"."extra_head_html",
      "blog_entry"."custom_template"
    FROM
      "blog_entry"
    WHERE
      (
        "blog_entry"."created" >= "2021-01-01 00:00:00"
        AND NOT (
          EXISTS(
            SELECT
              1 AS "a"
            FROM
              "blog_entry_tags" U1
              INNER JOIN "blog_tag" U2 ON (U1."tag_id" = U2."id")
            WHERE
              (
                U2."tag" = 'python'
                AND U1."entry_id" = ("blog_entry"."id")
              )
            LIMIT
              1
          )
        )
      )
    ORDER BY
      "blog_entry"."created" DESC

After executing the query it uses the returned data to populate multiple Entry objects (sometimes with clever tricks to fill in related objects so they don't have to be fetched separately, see select_related() and prefetch_related().

But the bit that builds the SQL SELECT query and the bit that populates the returned objects is pretty tightly coupled.

randomdata · 2024-04-02T17:27:23 1712078843

That is what was described originally by the active record paper. Traditionally it has been called the active record pattern, but if we'd rather call that ORM today, despite the misnomer, so be it. I couldn't care less.

But, I do care about being able to precisely communicate with other people. What are we going to call the other things?

gorjusborg · 2024-04-01T15:26:31 1711985191

> Smart developers both know SQL well, AND use ORM's.

> But they are a huge productivity enhancer and work well enough

I disagree with the implication that using an ORM is the default that 'smart developers' always choose.

Good developers choose the right tool for the job, so the tool selected depends on the job.

Also, what the term ORM means (and its effects on the design) vary based on language ecosystem and library. For instance, if the application doesn't need to dynamically navigate relations, and instead can get by with just treating the database as a set of structs, an ORM like Hibernate would probably introduce more complexity than it gives benefit. In such cases, a library like Jooq or JDBI could make the overall design simpler, perform better, and make it easier to reason about the behavior of the system. I would argue those are not ORMs, yet they can yield a better product, depending on the situation.

onlyrealcuzzo · 2024-04-01T14:29:39 1711981779

If performance is mission critical - ORMs seem like extra hoops to jump through.

Am I missing something?

ivanhoe · 2024-04-01T14:44:41 1711982681

It's Pareto principle at play, in RL you usually need to highly optimize just a very small portion of the queries - let's call it 20% for the sake of tradition, though it's realistically <5% IME - to get almost all of the speed-ups worth the trouble. ORMs then help you to write the other 80% quickly and in a maintainable fashion.

phailhaus · 2024-04-01T14:34:21 1711982061

ORM's allow you to make changes without having to refactor every single SQL statement you've written manually. They can also be typed, unlike raw SQL strings, which helps a lot during development. It's another abstraction, just like the one between your language of choice and assembly code.

ryanjshaw · 2024-04-01T14:42:35 1711982555

Modern ORMs are pretty efficient [1]. Some developers use the ORM for the "write model" and native SQL for reads/reports, if you are read-heavy.

[1] https://michaelscodingspot.com/npgsql-dapper-efcore-performa...

kayodelycaon · 2024-04-01T14:50:38 1711983038

Manually writing queries has a lot of overhead cost. Your entire application now has the UI directly tied to what the query returns. You can make intermediate classes to separate the two but eventually you'll end up making a worse ORM you need to test and maintain.

What an ORM gives you is a tunable level of abstraction. I know Ruby on Rails best so I'll use it as an example. ActiveRecord has a lot of tuning you can do, like partial selects and snippets of sql where you need it. If that isn't enough, there is find_by_sql as an escape hatch. If that is still not fast enough, you can call ActiveRecord::Base.connection and skip the ORM entirely.

This will keep complexity limited to places where it needs to be complex.

In my experience, every level of optimization you do comes with decreasing maintainability.

randomdata · 2024-04-01T14:57:52 1711983472

> Your entire application now has the UI directly tied to what the query returns.

Not really. Without ORM, your entire application is now tied to sets of tuples (i.e. relations) instead of rich graph structures (i.e. objects), but the relation need not be equal to the relation returned by the query. You could still have a transformation from one relation to another.

Not since the PHP days of embedding MySQL calls right in the middle of HTML have I seen anyone carry relations from end-to-end, though. The idea that anyone is not doing ORM these days seems like a straw man.

jimbokun · 2024-04-01T19:13:27 1711998807

Depends. Clojure uses objects rarely. But still has really powerful abstractions for writing applications quickly.

lelanthran · 2024-04-01T17:39:16 1711993156

> Not really. Without ORM, your entire application is now tied to relations (i.e. sets of tuples) instead of rich graph structures (i.e. objects), but the relation need not be equal to the relation returned by the query.

And, in some edge-cases, the relation is more useful than the object graph.

BurningFrog · 2024-04-01T14:43:41 1711982621

It can be, but in real applications only a small percentage of queries are performance problems, and you're probably better off optimizing those than writing and maintaining hand written queries.

brandall10 · 2024-04-01T14:39:59 1711982399

It's a higher level abstraction that that boosts dev speed and readability and is often good enough by default, esp. for CRUD work. More advanced ones have the ability to optimize things fairly well within the constraints of their API.

And it's not one or the other. There's always an SQL escape hatch when you need one for complex queries.

evantbyrne · 2024-04-01T15:10:47 1711984247

Yes. Every mature ORM that I'm aware of includes tools to trivialize fetching related records in an efficient manner. All people typically have to do is call the right method. In Django, for example, these methods are select_related and prefetch_related.

Also, they don't have to be slow. When benchmarking my hobby golang ORM, I found the performance cost vs hand writing optimal SQL to be in the microseconds per returned row. Small price to pay for everything it gives you: schema migrations, query builder, object mapping, validation, etc. And if you happen to have a particularly hot endpoint that needs to return tens of thousands of rows at once during a HTTP lifetime, you can always dip back down to the driver level.

It pains me that every time the topic of ORMs come up on HN that the comments are full of people warning not to use such a powerful category of tools. Maybe the tools they are using aren't very mature, maybe the underlying technologies are bit dated, maybe the documentation doesn't emphasize performance enough. I'm not really sure what's going on. Hopefully we can improve things so fewer people have issues and instead see performance improvements from using ORMs.

lrem · 2024-04-01T19:00:22 1711998022

I haven’t seriously used an ORM for the simple reason that I find SQL simple enough to not need to learn an extra DSL to hide it. Furthermore, when I’m done figuring out the query I need in a Jupyter notebook (technically Google colab, should be same), I can just paste it to the C++/Go server or the dashboard config. Raw performance didn’t ever matter that much to me.

With all this: is there an ORM with documentation that’s good at explaining what would I gain over plain old SQL?

evantbyrne · 2024-04-01T20:17:51 1712002671

Have you ever used a rapid development framework like Django or Laravel? A good ORM is basically the core of any decent backend framework that interfaces with a RDBMS. You get to define a schema once which can then be used as a foundation for migrations, validation, queries, serialization, etc. Many ORMs may only help solve one or two of those things well, and in personal experience it is rare to see even okayish JavaScript ORMs, which may contribute to why people hate the category of tools.

lrem · 2024-04-02T05:06:18 1712034378

Yeah, I’ve dabbled a bit in Django. Maybe because of the size of the projects, but it didn’t feel hugely beneficial over Flask.

evantbyrne · 2024-04-02T12:40:34 1712061634

Flask is okayish imo. People tend to use it with an ORM called sqlalchemy. There is also FastAPI as an alternative to Flask and Django in the Python web framework space. Just about anything can be used for small projects in a vacuum. I used to make websites with framework-less PHP files and mysqli.

matthewmacleod · 2024-04-01T14:37:29 1711982249

Absolutely this - the attitude of “never use an ORM” is an immediate red flag for me in a developer, indicating that they don’t understand the capabilities of a wide variety of ORMs and the trade-offs involved.

epgui · 2024-04-01T16:05:04 1711987504

I think “never use X” is probably a bit far. But personally I can’t really think of a situation where I would rather have an ORM.

I used to use them all the time. When I started doing pure FP I just realized I just didn’t really need my old ORM. When I came back to OOP languages (out of necessity, not pleasure), I realized I didn’t really need my old ORM there either.

It’s not a strong opinion and it’s not a religious thing, it’s just a different perspective whereby you see that something isn’t really necessary.

Benefits or not, you do pay quite a heavy price for the ORM, and I think most people don’t see this price because it has been normalized so much.

randomdata · 2024-04-01T14:38:50 1711982330

The bigger red flag is confusing ORM with ORM toolkits.

BurningFrog · 2024-04-01T14:39:56 1711982396

It sounds like that "nerd macho" thing, where your main goal at work is to show how smart you are, not build working software systems.

renegade-otter · 2024-04-05T11:23:33 1712316213

While my article reads like an anti-ORM manifesto, my warning is about relying on ORMs to do all the work. How am I supposed to understand advanced fine-tuning options if I do not understand how the low-level query works?

Copilot may be your little helper in writing code, but it is not going to consistently write the best code that YOU can write.

marcosdumay · 2024-04-01T15:49:47 1711986587

So... Your claim is that the ORM optimizer is better than the DBMS one?

What DBMS do you use? I've used a few where that is true, and really, instead of using an ORM for that, you should stick to Postgres.

bovermyer · 2024-04-01T14:43:40 1711982620

I will use an ORM if it makes sense for the app in question, and then heavily tune it.

Usually, though, I get better performance and less fragility out of leaning on carefully-crafted SQL+indexes to do the heavy lifting.

toomuchtodo · 2024-04-01T14:35:31 1711982131

Do you have any references or resource recommendations for the migration path from ORMs to highly optimized SQL ops?

davidw · 2024-04-01T14:41:14 1711982474

Most good ORM's let you write your own SQL to fetch things when you realize there's a particular operation that requires it.

So write the initial system with an ORM and then optimize as needs be. That might be obvious from the start, or might become evident later.

xxs · 2024-04-01T14:40:36 1711982436

a very good point. The downside is that I find most people just consider ORM a auto-sql, and the (data) modeling remains in a similar fashion.

ORM benefits mostly from navigation + caching.