Your database skills are not 'good to have' (2023) (renegadeotter.com)
112 points by saikatsg 7 months ago | 122 comments



Every time I read someone's "horror story" of "queries that used a lot of joins, like even 3 or 4 tables", I...well, sigh, at this point, I guess. It's been a very long time since I needed advanced SQL to do things that probably shouldn't be happening in the database anyway, but the fundamentals seem largely foreign concepts today (speaking generally), and they are definitely still relevant and enormously helpful.


Doesn't that happen when you use a highly normalized database table design, though, where you create tables to avoid repeating data? Or if you have a lot of one-to-many/many-to-many relationships in mapping tables. I.e. you'll end up with a lot of tables you need to join on to collate that data (authors, subjects, languages, etc.) to display to the user.

You could break the joins into three parts:

1. searching/filtering -- which tables you need to join on to match the selected queries;

2. sorting -- determining which column(s) of data to sort the results by; and

3. displaying -- which tables you need to join on to retrieve the information needed to display the results to the user.

These may be different or have overlaps.


I think they’re saying that JOINing over three or four tables is really just not that big a deal for most situations, particularly if you have the needed indices.


Yeah. That's something you would expect databases to be good at. Especially as normalization and many-to-many relationships make multiple JOINs common, which was my point.


Joins are rarely the problem. See the sigh.


That seems to be the exact point the author makes: you need to know some fundamentals to properly use databases. Even if you are not doing super complex stuff.


I was certainly not disagreeing with the author.


lesson: DO NOT join more than 3 tables in MySQL, lol


Why not? I've worked with a database where we needed to join more than double that on some common queries, in part because financial stuff tends to touch several other business domains at once.

The EXPLAIN output can be hairy at first, but once you've started systematically analysing a query and know some of the quirks of the MySQL query planner, you'll manage to identify missing indices and squeeze out very, very good performance.
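
For anyone who hasn't done this, the workflow is roughly the following (table, column and index names are made up for illustration):

    EXPLAIN
    SELECT o.id, c.name, p.title
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    JOIN products p  ON p.id = o.product_id
    WHERE o.created_at >= '2023-01-01';

    -- if the plan shows a full scan on orders, add an index for the filter:
    CREATE INDEX idx_orders_created_at ON orders (created_at);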

That was a database where we had warm tables bigger than fifty million rows. I personally migrated it from single instance MySQL 5 to a MySQL 8 clustered rig and optimised hundreds of queries written by people learning basics on the job, going from horrifying to very reasonable latencies as seen from the web clients.


Wait... why?

Admittedly I'm mostly using MSSQL, but joining multiple tables seems a pretty natural thing to do (assuming you are joining on well indexed tables).


Table locks.

This is specific to the default MyISAM engine, where it locks all tables used in a query for the duration of the query. So two queries that touch the same table can't run concurrently.

You may never hit this as a problem if your queries are fast enough, but it is something to be aware of.
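
If you suspect this is biting you, the usual check/fix looks something like this (table name made up):

    SHOW TABLE STATUS LIKE 'orders';   -- the Engine column tells you MyISAM vs InnoDB
    ALTER TABLE orders ENGINE=InnoDB;  -- switch to row-level locking (rewrites the table, so plan for it)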


Didn't MySQL switch to InnoDB as the default engine a long time ago (my googling says 2010)? Does InnoDB have the same limitations?


I actually didn't realize they switched (we're still on an ancient version and unable to upgrade for now).

InnoDB uses row-level locks, so it's extremely unlikely to cause such issues.


InnoDB also supports transactions.


You can join dozens of tables with almost no impact?

The only issue is joining with aggregations and sorting; that's where things get ugly with MySQL.


In my experience (I know nothing about this), if I needed to join a lot, I'd actually create new tables that routinely aggregated that data so I could query it more efficiently instead. Especially if it was analytical data which was used for things like dashboard reporting.

It seems excessive at first but the performance is so dramatically better that it makes a lot of sense. Especially if the queries are part of your hot path. I always thought of it like having core business logic tables, then utility tables which are essentially derived from those core tables.


This is the so-called "materialized view".
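
In Postgres that looks roughly like this (names made up); MySQL has no native materialized views, so there you'd maintain the summary table yourself:

    CREATE MATERIALIZED VIEW daily_sales AS
    SELECT product_id,
           date_trunc('day', created_at) AS day,
           sum(amount) AS total
    FROM orders
    GROUP BY product_id, date_trunc('day', created_at);

    -- refresh on a schedule (cron, pg_cron, etc.)
    REFRESH MATERIALIZED VIEW daily_sales;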


What? It should be fine to join that amount and more, given you have proper indexes.


More than this.

Joining the tables in the database should perform better than joining them on the application layer. If it doesn't, there's something very odd.

This holds for any production ready DBMS. Even the ones that don't care about performance or preserving your data.


As a wizened old grey beard once told me 'almost always with SQL, it depends'. Basically you have to revisit your ground truth all the time. Sometimes keeping stuff client (like a lookup table) is the right way. Most of the time server is better. But not always. 'it depends'.


"Odd" is not "wrong".

But wrong is almost always odd, so the normal expectation is that you understand the oddity on your system. Otherwise you are blind to problems.

Anyway, it's also not common for an application to have "reduce total CPU usage" as a goal. Available CPU on the database is way more valuable than on the application server, and so it makes sense to trade them up.


In most SQL cases your limiting factor will be your network, so you (usually) want to minimize what is coming over the wire. However, the tradeoff is time: if it takes more time for your client to reassemble the data than it would to just send it all, you usually let the server do it. For example, say I have a 50-row, 10-column lookup table that is joined to a 300k-row table which, for some reason, must be returned to the client. Given those conditions, it might be faster to glue it back together on the client. It just depends on how much is in those 50 rows. Just a bunch of ints? Probably not. A few 400-byte strings? That could be interesting and faster to do on the client. Like he said, 'it depends'.


This was valid advice in the MySQL 3 days.


People love to keep trotting it out, which is annoying. "Can't use MySQL because <thing-that-hasn't-been-true-for-over-a-decade>", which I guess is arguably adjacent to other points in the article about the stuff databases have added over the years:

Check in on your databases periodically, and refresh your memory. There's a very real chance they've picked up features that'll make your life a lot easier.


Can anyone share good resources or codebases that have high quality web backend with optimized queries and data structures (db schema, backend internal data structures, overall backend architecture)?

I feel like a lot of code you see in articles or modern companies is often just following the popular architecture/paradigm du jour. As opposed to what this article highlights in terms of making backends return all the data you need in the minimal number of queries. I feel like I understand what's being described here, but I've never actually had the chance to work on anything non-trivial that does it like this; most places I've worked have gone NoSQL, or used an ORM, or were not more than very thin wrappers around some CRUD queries.


There's nothing in the ORM that prevents you from doing what the article describes; hell, the article describes doing it with the ORM, and, double hell, the ORM does the right thing by default, at least as far as avoiding N+1 queries goes.

The solution is to know how a database works, instead of thinking that Product.objects.filter(color="red") and [product for product in Product.objects.all() if product.color == "red"] are exactly the same thing.


I understand that. I was asking more about if anyone knows any open source codebases that are high quality, non-trivial and well designed, that one should read and study to better see how others have written good code and systems.


Sentry is built on-top of Django. It's a good example of a real-world non-trivial code base that has had to adapt as it grew up and scaled up. I haven't looked at the code in a few years, but I used to run it on-premise and found it fairly approachable and well-written, but with just the sorts of hacks you'd expect to see in a software code base that had evolved over time.

https://github.com/getsentry/sentry

Here's all the database models:

https://github.com/getsentry/sentry/tree/master/src/sentry/m...

Here's an example of hacks in the form of massive caching to improve performance:

https://github.com/getsentry/sentry/blob/master/src/sentry/d...


(Disclaimer: Former Sentry employee)

There's a ton of DB passion at Sentry but most importantly for anyone this deep in the thread, Sentry also detects and helps fix common DB mistakes whether it's you or the ORM: https://docs.sentry.io/product/issues/issue-details/performa...

If you're at the “why it’s slow” runbook, Sentry turns the first 2 steps into 1 click: https://docs.sentry.io/product/performance/queries/


Yes, the sqlalchemy docs. Dead serious.


That's somewhat surprising, but then again, maybe not. Thanks, I'll take note of that.

For anyone else, I also remember the PostgreSQL manual/docs being very high quality, so that might be worth reading just to better grok SQL/relational DB topics.


If you're digging into sqlalchemy and enjoying it, you would probably enjoy the source of Ecto, the kinda-sorta-but-not-really ORM that's commonly used with Elixir.


As a database novice, what is the difference? I assume one of the two is wastefully fetching all products and then filtering, while the other is omitting the filtered results as part of the fetch, and is therefore more efficient? If so, which is which?


Product.objects.filter(color="red")

This is the efficient one, filtering results upstream, via SQL, before they are returned to the client.


Your intuition is correct. Product.objects.filter() will create an optimized query to issue to the database. Meanwhile, Products.objects.all() will retrieve all the records and leave you with the task of filtering them in memory.


Correct. I'm not familiar with that ORM, but the first should produce SQL like "SELECT * FROM products WHERE color = 'red'" and only load the records you're interested in, while the second loads everything and throws away probably most.


It's Django (Python) if you weren't sure of the name and wanted a look.

The table naming convention from the model name is a bit different, but impossible to know from just that code snippet. Also Django likes to explicitly list every field instead of *


> Also Django likes to explicitly list every field instead of *

Tbf this is a good pattern. It’s rare that you _need_ SELECT *, and if you do, you may not in six months after DDL has added more columns to the table.


Given that one says Products.objects.all(), which one do you think it is?

Good on you for correctly determining the difference, though!


Ah, I see - since I'm not familiar with this language/API, I couldn't really infer what was happening behind the scenes. E.g. '.all()' could be constructing a request, not necessarily executing, and then the 'if' might modify that request intelligently.


In this ORM, both versions construct without executing until the results are accessed. The list comprehension looping over the results is an access, the "if" happens after that.

The one exception to the obvious "results are accessed" intuition is slices (such as [:3] to mean the first 3 elements), which, thanks to how Python implements them, lets the ORM use them to further modify the query by adding LIMIT instead of executing it.


You're right, this is Python/Django, where you don't get these semantics (Product is the database model, and it doesn't have access to the "if" to use in the filter). It was more of a general example, though, of "loop vs filtering in the SELECT", so the language doesn't matter much (well, apart from effectively making my point, which it perhaps failed at).


In Microsoft land there's a sample reference architecture for .NET web apps: https://github.com/dotnet/eShop


Modern Ruby on Rails makes this really easy with Strict Loading.

https://til.hashrocket.com/posts/l4t5xok8hg-activerecord-str...

One of the biggest perks with the association loads is that it avoids joins by default, which can become a scaling barrier as you grow. Instead, it will capture the associated ids, find all of the related records in a single query and then transparently handle everything for you behind the scenes.

Without any optimizations, you'll get essentially 1 query per loaded association rather than the N+1 "one per association per row".
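
In SQL terms, instead of one big join it issues something roughly like this (tables invented for illustration):

    SELECT * FROM posts WHERE author_id = 7;
    -- then, using the ids collected from the first result set:
    SELECT * FROM comments WHERE post_id IN (1, 2, 3);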

You can always tune things to utilize joins, subqueries and optimize specific cases later but this is a sensible setting that will avoid the vast majority of issues.


Good call linking to Use The Index Luke. A great resource.

https://use-the-index-luke.com/


His book SQL Performance Explained is very enlightening as well. I think the most useful takeaway I got from that book is that the effectiveness of an index is heavily determined by how the application will be querying that table. If you want to design an effective index for a table, you should experiment with multiple options and benchmark the queries that the application is going to run.
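
Concretely, the experiment usually looks something like this (Postgres syntax; table, columns and query are hypothetical):

    CREATE INDEX idx_orders_customer_created ON orders (customer_id, created_at);

    EXPLAIN ANALYZE
    SELECT *
    FROM orders
    WHERE customer_id = 42
    ORDER BY created_at DESC
    LIMIT 20;
    -- repeat with other candidate indexes and compare the actual timings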


Thanks for sharing!


The number one mistake is really that the industry continues to pile more and more work on software engineers, including expecting them to be experts at a topic that used to be a whole job (DBA).


In the past twenty years, the job of "software engineer" has taken on ops/QA/DBA/support.


I love everything about this article. This is the kind of database black arts that I really enjoy.


Black arts?

Databases, pretty much all of them, are relatively simple creatures. If you can understand how a Tree structure and pointers work, you basically have most of the fundamental knowledge underlying the "black art" concepts of databases.

The primary key/clustered index is how the data is physically stored and sorted, in a btree. Additional secondary indexes are generally quite literally just secondary tables clustered on the secondary index key with primary key pointers. The job of the optimizer is to take a query like "Select foo from bar where baz=2" and ask questions like "Is there an index on baz? Does that index contain foo?" and then execute based on that. If there is an index on baz, it traverses the baz index looking for 2. If that index covers foo, then the query is done and it just pulls foo from the baz index. If it doesn't, then it queries the bar table with each of the primary keys found in the baz index. If no index exists, then the db has to do a linear search through the table looking for "baz=2".

When databases are slow, it's because you are forcing them to do O(n). When they are fast, it's because you use an index to turn that into an O(log n) search (or in some cases, O(1)).
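
Sticking with the "Select foo from bar where baz=2" example above, that difference looks like this:

    CREATE INDEX idx_bar_baz ON bar (baz);            -- seek on baz, then look up foo via the primary key
    CREATE INDEX idx_bar_baz_foo ON bar (baz, foo);   -- "covering": foo can be read straight from the index
    EXPLAIN SELECT foo FROM bar WHERE baz = 2;        -- with no index at all, this is a full table scan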

The truly black arts of databases are when you start getting into things like CTEs and window functions and their interactions with the indexes/optimizer. That's when it can be almost trivial to introduce O(n^2) performance if you aren't careful.

NoSQL databases, generally speaking, are implemented similarly to how SQL databases are (just trees) but with fewer features and fewer guarantees.


> The primary key/clustered index is how the data is physically stored and sorted, in a btree.

This is only true for RDBMS with a clustered index, like MySQL (assuming InnoDB) and MSSQL. Postgres stores tuples in a heap. The indices are generally B+trees, yes, but there is an extra level of indirection via the visibility map.

There are a million gotchas, is the problem. You declared a partial index on foo as WHERE foo = true, but you’re querying for WHERE foo IS NOT false? I have bad news about the secret third state, WHERE foo IS NULL (not to mention the subtle differences between equality and IS).
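
A sketch of that first gotcha (Postgres, made-up table):

    CREATE INDEX idx_t_active ON t (id) WHERE foo = true;

    EXPLAIN SELECT id FROM t WHERE foo = true;         -- can use the partial index
    EXPLAIN SELECT id FROM t WHERE foo IS NOT false;   -- cannot: this also matches rows where foo IS NULL,
                                                       -- which the index predicate excludes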

You made a UUIDv4 as the PK, thinking that since Postgres doesn’t cluster on it, it won’t suffer the same performance issues? Visibility map just wrecked your IOPS, enjoy the 7x hit to pages.

You have an index and it’s not being used? Could be incorrect equality check (see first example), or inadequate auto-analyze on a write-heavy table leading to incorrect statistics, or any other number of things.

RDBMS are much easier to understand if you know what a B+tree looks like, yes, and I highly recommend any backend dev take the time to do so. But there is so, so much more that can go wrong.


> RDBMS are much easier to understand if you know what a B+tree looks like, yes, and I highly recommend any backend dev take the time to do so. But there is so, so much more that can go wrong.

I agree, but that stuff that goes wrong isn't (in my experience) super common. At least, not until your tables reach fairly large (as in 10s of GBs of data) sizes. Which is why I'd suggest you still employ DBAs :).

You wouldn't, for example, think about pulling out a filter index if you weren't dealing with fairly large tables where a full index might have an overly large negative impact.

On the flip side, I've seen more than a few tables where a fundamental understanding of how the data is structured would have prevented a multitude of issues like bad or missing indexes. (My favorite, which is depressingly common: `Create Index blah ON foo(id, thingToIndex)`, as sketched below.)
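
That is, something along these lines, assuming id is the primary key:

    CREATE INDEX blah ON foo (id, thingToIndex);   -- id is already unique, so the second column never helps a seek
    -- what was almost certainly intended:
    CREATE INDEX better ON foo (thingToIndex);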


> I agree, but that stuff that goes wrong isn't (in my experience) super common. At least, not until your tables reach fairly large (as in 10s of GBs of data) sizes. Which is why I'd suggest you still employ DBAs :).

As a DBRE who oversees tables in the 100s of GB, some in TB range, these things happen entirely too often. But then, it’s my job. And yes, I whole-heartedly agree that past a certain scale, you need DB folks.


The art comes from the querying engine and statistics and the time of day and whether I've sacrificed a goat or not. Okay, maybe not those last two, but there's still some nuance to this, engine to engine.


What indexes give you O(n)? IIRC btrees are O(log n).


Having no index gives you O(n) (technically O(n log n)). I'm talking about doing a table scan instead of a seek.

Or did you mean O(1)? Some RDBMSs have hash indexes which give you (effectively) O(1) searches.

For example: https://www.postgresql.org/docs/current/indexes-types.html#I...
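
From those docs, roughly (made-up table):

    CREATE INDEX idx_users_email_hash ON users USING HASH (email);

    SELECT * FROM users WHERE email = 'a@example.com';  -- equality lookups can use the hash index
    SELECT * FROM users WHERE email > 'a';              -- range predicates cannot; that needs a btree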


Can’t believe they mentioned “Autonomy” here, which definitely dates the piece - I’ve never ever heard it referenced. Was part of an Autonomy CMS upgrade at a previous job and it’s pretty old and bad - but I got to learn a lot. Am interviewing for a job now and they seem to have ColdFusion as part of their stack - prefer that to Autonomy..!


I've been thinking recently, well for a longer time.

The next project will be like this: I will name the endpoints PagenameFunctionality, e.g. /IndexGetMyProfile. There will be some global-scope functions or endpoints, but many scoped endpoints.

And there I will do the proper queries just for this use case, and so on for each new use case.

In the context of SPA/PWA/TWA I think this is a good approach.

GraphQL is too complex on the client and REST too vague.


Can call it Grammarless RPC.


Has Postgres yet enabled any sort of locking of the query plan?

It's been a while since I've used it at scale, but at my previous job half of our incidents were caused by Postgres randomly deciding to change the query plan for a call that was working just fine. Then I'd go in there and re-write the SQL a few times until it figured out what to do, rinse and repeat every few months.


I'd love to have that. At a previous gig a query ballooned to taking 30 seconds, up from 1 second. The reason was the query planner decided to use a different plan. Updating the statistics didn't help. Finally I tried just swapping the join order. The plan changed and all was well. Until the planner decided to change the plan again...


I'm dealing with a very large partitioned table (several million inserts per day) and I solve this by running `vacuum analyze` nightly, which somehow solved 99% of our read issues.

Prior to that, we tried various indexing strategies and de-normalizing some of the data, but ultimately I found that weird plan inconsistency to be solved by the vacuum job.
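
For reference, the nightly job is essentially just (table name made up):

    VACUUM ANALYZE big_partitioned_table;
    -- ANALYZE alone refreshes planner statistics; VACUUM also clears dead tuples and updates the visibility map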


Not natively. There is pg_store_plans [0] which I haven’t used, but looks interesting.

Most of the time, query flips are due to statistics becoming inaccurate over time. Try changing the default statistics target for the affected column, as sketched below.
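
    -- Postgres; table and column names are hypothetical
    ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 1000;  -- default target is 100
    ANALYZE orders;  -- rebuild statistics with the new target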

[0]: https://github.com/ossc-db/pg_store_plans


No offense guys, but just get a real DBA.

Postgres doesn't just randomly do things and will always produce the same result given the same circumstances.

Either you weren't running routine maintenance and your statistics are messed up, or the statistics aren't properly configured, or your data is just poorly stored.


I don't understand why they didn't generate a bunch of static pages from the database and avoid a whole lot of stress?


Faceting.

> A user will land on the search page and have the ability to “drill down” and narrow the results based on those properties. To make things much harder, all of these properties would come with exact counts.


Have a look at the archived page: https://web.archive.org/web/20070519214259/http://nymag.com/...

Click on "Types of Clothing" - unfortunately the link isn't archived but you can see the URL: http://nymag.com/search/fashion-search.cgi?nymbreadcrumb_pus...

Judging from the query params, it looks like you can only filter one item at a time - seems like static pages would work.


No, you could have drilled into multiple facets. I don't think the Archive cached the deep-nested results to see that.

So, you could go "Dolce & Gabbana -> Bag -> Red". Then you could remove the middle one, for example, and end up with "Dolce & Gabbana -> Red"

This gives you unlimited permutations of results, so static pages were a no go.


This! You come to a fintech to review 'some' performance issues and find a stupid data model and transaction processing that completely ignores how databases work and what limitations that particular database has! But hey, look, we used this shiny tech and that newest stack, so where is the problem???


> Imagine if SELECT field1, field2 FROM my_table was faster than SELECT field2, field1 FROM my_table. Why would it do that? I have no idea to this day, and I don’t even want to know.

> I don’t even want to know.

I feel this. I'm working on a team where I have significantly more tenure at the company than the other engineers on the team, and they'll often ask me questions starting with "why".

Normally, I love to encourage this -- curiosity is an amazing quality to have in coworkers. There are still some times when I just need to look at them like I'm dead inside and warn them that it's not worth their time on this earth to find out.


See, this is what I find interesting. Some do; others just say the DB is this idiosyncratic thing that sometimes does Z and sometimes does Y.


TL;DR:

Understand and optimize relational databases (MySQL, PG) rather than introducing a new type of database (e.g., DynamoDB, Kafka, Redis) whenever issues arise. Common mistakes:

- Lack of indexes

- Using ORM (generating too many SQL queries or inefficient queries that return all columns)

Some suggestions:

- Use relational databases where possible, instead of NoSQL

- Use read-only replicas instead of Redis when applicable

- Keep relational databases clean and data size small

The author's message is: Hey, folks, if you don't have the ability to use relational databases properly, then your usage of Kafka and DynamoDB will certainly be a mess as well.


You can use an ORM and not explode the number of queries to infinity.

I’ve never seen it in the wild, so I get the blanket ban. In theory though you COULD contain ORM usage behind an intentional data access interface.

I can only really speak for Rails from my own experience, but it most certainly tries exactly 0% to get you to care about this problem. It encourages unrestricted access to the database through a very powerful ORM, deeply nested at all layers of abstraction, through class inheritance and all that garbage.

You end up tuning performance of queries that end up changing in a month because of this. It’s wild.


I'm a big fan of SQLAlchemy. While not without its quirks, it lets you write relational-style queries, but in native Python, and automatically maps the result sets to objects. If you have N+1 issues with SQLAlchemy, you probably would have had them with your own custom-written SQL queries also.


Yeah, this is why I generally just don't like ORMs. They're often just leaky abstractions where you end up not only needing to know how the ORM works, but also SQL when you need to go back and fix the ORM.

There are exceptions, like jOOQ, which try not to really be an ORM and instead are just a language extension to help write SQL (funnily, being an even more leaky abstraction is preferable here :D)


so I left Rails, lol


Don’t blame you!


Smart developers both know SQL well AND use ORMs.

Sure, ORMs are not going to write the most optimized code by default, and they're not going to build indexes for you, etc. But they are a huge productivity enhancer and work well enough for a lot of the queries you're going to need. Most ORMs these days have advanced options too, which enable the more fine-tuned use cases without having to drop into raw SQL.


Your second paragraph reads more like you are talking about query builders rather than anything in the domain of ORM. ORM happens after the query has executed and you have a relation already in hand. Everything that happens before the relation is made available is outside of the ORM purview.

Some ORM toolkits bundle query builders as a secondary feature, but they are still logically distinct features.


The only reason ORMs are relevant to this conversation at all is because they include query generation. The vast majority of popular ones do that - one that I know of that doesn't (Dapper) labels itself a "micro ORM" because of it.


ORM never includes query generation. Some libraries often include both ORM toolkits and query builders, but that's like calling sorting 'encryption' because libraries often include functionality for both. That would be quite silly, and confusing for the reader.


Every ORM I've ever used (or contributed to building) has included a query builder as a core component.

What ORMs have you seen that don't do that?


ORM is the act of mapping relations (sets of tuples) to objects (graph structures). How could query building be a part of that?

There are ORM toolkits that help produce the code to perform that mapping to save you from doing it by hand (although you can do it by hand, if you so wish!), but not even they could shove query building in the middle. Query execution has to happen before ORM can take place. You cannot perform ORM until you have the relation already in hand. (Mutation cases are in reverse, but I know you are capable of understanding that without it spelt out in detail)

What would it even mean for ORM to include query building?

If I'm reading you right, you seem to be saying that database helper libraries often include both query building and ORM toolkit features, usually along with other features like database migration management. Which is very much true. But why would you call one feature by the name of another? If you call query building ORM, is ORM best called database migration?


The mapping is bi-directional. Query generation is object -> relation, result mapping is relation -> object.


> The mapping is bi-directional.

"(Mutation cases are in reverse, but I know you are capable of understanding that without it spelt out in detail)"

Guess you weren't capable after all. How do tech people manage to be so out to lunch all the time?


I am the 4th person you've responded to in this thread.


We already know. There are little username thing-ys that communicate that. Is there something you are trying to add?


This only makes sense in a back and forth with one other person:

> Guess you were't capable after all.


No...? The assertion about being able to understand the reverse case went out to anyone reading the comment. This is clearly a community forum, not some kind of private messaging system. Comments are not directed towards anyone in particular.

Stop and think. If you actually did somehow mistakenly believe it was a one-on-one with another person, why would you butt your head into the conversation? That would be completely illogical. There is a curious contradiction here.


It's the "after all" that implies you thought you were responding to one person the whole time, which is why I corrected that.


You can rephrase it that way, sure: "After all this (being explicitly told that there is an inverse case), you still weren't capable of understanding what was written."

The first half of your comment does not logically precede that, though. There is no such implication.


You're the only person I've encountered that has expressed a strong opinion that the query building part of an ORM should be considered separate from the rest of the ORM.

Do you have any examples of open source ORM libraries that stick to the terminology you are advocating here? I've not seen one myself.

Maybe this is partly an ecosystem thing. What ecosystems do you mainly work in? I'm primarily Python with a bit of JavaScript, so I don't have much exposure to ORM terminology in Java or C#.


Irrespective of naming, can we at least agree that there are, at minimum, two completely different features being spoken about here?

1. A feature that prepares a query to send to the database engine.

2. A feature that transforms data structures from one representation to another.

And we agree that these are independent? You can prepare a query without data transformation, and you can perform data transformation without preparing a query? I think we can prove that, if you are still unconvinced.

Okay, so that just leaves naming. What should we call these distinct features?

Here are my suggestions:

1. Query building. Building a query is the operation being performed.

2. Object-relational mapping; ORM for short. Mapping objects to relations (and vice-versa) is the operation being performed.

These descriptive names seem well suited to the task in my opinion, but I am open to better ideas. What have you got?

Now, there is something else in the wild that we haven't really talked about yet, but which may be what you are alluding to. Another abstraction, or pattern if you will, that rests above (to use my suggested terms, bear with me) both query building and object-relational mapping to unify them into some kind of cohesive system that actively manages records. This is where it starts to become sensible to consider (again, using my suggested terms for lack of anything better) query building and ORM intertwined.

I would suggest we call that active record. In large part because that's what we already call it. In fact, what is probably the most popular and well known active record library is literally named ActiveRecord, so-named because it was designed after the active record pattern. But, again, open to better ideas.

> What ecosystems do you mainly work in?

Python, Javascript (well, Typescript, if you want to go there).


My mental model of what an ORM does is that it provides you with an object-oriented API that disguises some of the underlying mechanics of how the SQL queries work.

Take Django for example:

    Entry.objects.filter(created__year__gte=2024).exclude(tags__tag="python")
That's an object-oriented API that builds a query that looks something like this:

    SELECT
      "blog_entry"."id",
      "blog_entry"."created",
      "blog_entry"."slug",
      "blog_entry"."metadata",
      "blog_entry"."search_document",
      "blog_entry"."import_ref",
      "blog_entry"."card_image",
      "blog_entry"."series_id",
      "blog_entry"."title",
      "blog_entry"."body",
      "blog_entry"."tweet_html",
      "blog_entry"."extra_head_html",
      "blog_entry"."custom_template"
    FROM
      "blog_entry"
    WHERE
      (
        "blog_entry"."created" >= "2021-01-01 00:00:00"
        AND NOT (
          EXISTS(
            SELECT
              1 AS "a"
            FROM
              "blog_entry_tags" U1
              INNER JOIN "blog_tag" U2 ON (U1."tag_id" = U2."id")
            WHERE
              (
                U2."tag" = 'python'
                AND U1."entry_id" = ("blog_entry"."id")
              )
            LIMIT
              1
          )
        )
      )
    ORDER BY
      "blog_entry"."created" DESC
After executing the query it uses the returned data to populate multiple Entry objects (sometimes with clever tricks to fill in related objects so they don't have to be fetched separately, see select_related() and prefetch_related()).

But the bit that builds the SQL SELECT query and the bit that populates the returned objects is pretty tightly coupled.


That is what was described originally by the active record paper. Traditionally it has been called the active record pattern, but if we'd rather call that ORM today, despite the misnomer, so be it. I couldn't care less.

But, I do care about being able to precisely communicate with other people. What are we going to call the other things?


> Smart developers both know SQL well, AND use ORM's.

> But they are a huge productivity enhancer and work well enough

I disagree with the implication that using an ORM is the default that 'smart developers' always choose.

Good developers choose the right tool for the job, so the tool selected depends on the job.

Also, what the term ORM means (and its effects on the design) vary based on language ecosystem and library. For instance, if the application doesn't need to dynamically navigate relations, and instead can get by with just treating the database as a set of structs, an ORM like Hibernate would probably introduce more complexity than it gives benefit. In such cases, a library like Jooq or JDBI could make the overall design simpler, perform better, and make it easier to reason about the behavior of the system. I would argue those are not ORMs, yet they can yield a better product, depending on the situation.


If performance is mission critical - ORMs seem like extra hoops to jump through.

Am I missing something?


It's the Pareto principle at play: in real life you usually need to highly optimize just a very small portion of the queries - let's call it 20% for the sake of tradition, though it's realistically <5% IME - to get almost all of the speed-ups worth the trouble. ORMs then help you write the other 80% quickly and in a maintainable fashion.


ORMs allow you to make changes without having to refactor every single SQL statement you've written manually. They can also be typed, unlike raw SQL strings, which helps a lot during development. It's another abstraction, just like the one between your language of choice and assembly code.


Modern ORMs are pretty efficient [1]. Some developers use the ORM for the "write model" and native SQL for reads/reports, if you are read-heavy.

[1] https://michaelscodingspot.com/npgsql-dapper-efcore-performa...


Manually writing queries has a lot of overhead cost. Your entire application now has the UI directly tied to what the query returns. You can make intermediate classes to separate the two but eventually you'll end up making a worse ORM you need to test and maintain.

What an ORM gives you is a tunable level of abstraction. I know Ruby on Rails best so I'll use it as an example. ActiveRecord has a lot of tuning you can do, like partial selects and snippets of sql where you need it. If that isn't enough, there is find_by_sql as an escape hatch. If that is still not fast enough, you can call ActiveRecord::Base.connection and skip the ORM entirely.

This will keep complexity limited to places where it needs to be complex.

In my experience, every level of optimization you do comes with decreasing maintainability.


> Your entire application now has the UI directly tied to what the query returns.

Not really. Without ORM, your entire application is now tied to sets of tuples (i.e. relations) instead of rich graph structures (i.e. objects), but the relation need not be equal to the relation returned by the query. You could still have a transformation from one relation to another.

Not since the PHP days of embedding MySQL calls right in the middle of HTML have I seen anyone carry relations from end-to-end, though. The idea that anyone is not doing ORM these days seems like a straw man.


Depends. Clojure uses objects rarely. But still has really powerful abstractions for writing applications quickly.


> Not really. Without ORM, your entire application is now tied to relations (i.e. sets of tuples) instead of rich graph structures (i.e. objects), but the relation need not be equal to the relation returned by the query.

And, in some edge-cases, the relation is more useful than the object graph.


It can be, but in real applications only a small percentage of queries are performance problems, and you're probably better off optimizing those than writing and maintaining hand written queries.


It's a higher level abstraction that boosts dev speed and readability and is often good enough by default, esp. for CRUD work. More advanced ones have the ability to optimize things fairly well within the constraints of their API.

And it's not one or the other. There's always an SQL escape hatch when you need one for complex queries.


Yes. Every mature ORM that I'm aware of includes tools to trivialize fetching related records in an efficient manner. All people typically have to do is call the right method. In Django, for example, these methods are select_related and prefetch_related.

Also, they don't have to be slow. When benchmarking my hobby golang ORM, I found the performance cost vs hand writing optimal SQL to be in the microseconds per returned row. Small price to pay for everything it gives you: schema migrations, query builder, object mapping, validation, etc. And if you happen to have a particularly hot endpoint that needs to return tens of thousands of rows at once during a HTTP lifetime, you can always dip back down to the driver level.

It pains me that every time the topic of ORMs comes up on HN, the comments are full of people warning not to use such a powerful category of tools. Maybe the tools they are using aren't very mature, maybe the underlying technologies are a bit dated, maybe the documentation doesn't emphasize performance enough. I'm not really sure what's going on. Hopefully we can improve things so fewer people have issues and instead see performance improvements from using ORMs.


I haven’t seriously used an ORM for the simple reason that I find SQL simple enough to not need to learn an extra DSL to hide it. Furthermore, when I’m done figuring out the query I need in a Jupyter notebook (technically Google colab, should be same), I can just paste it to the C++/Go server or the dashboard config. Raw performance didn’t ever matter that much to me.

With all this: is there an ORM with documentation that's good at explaining what I would gain over plain old SQL?


Have you ever used a rapid development framework like Django or Laravel? A good ORM is basically the core of any decent backend framework that interfaces with a RDBMS. You get to define a schema once which can then be used as a foundation for migrations, validation, queries, serialization, etc. Many ORMs may only help solve one or two of those things well, and in personal experience it is rare to see even okayish JavaScript ORMs, which may contribute to why people hate the category of tools.


Yeah, I’ve dabbled a bit in Django. Maybe because of the size of the projects, but it didn’t feel hugely beneficial over Flask.


Flask is okayish imo. People tend to use it with an ORM called sqlalchemy. There is also FastAPI as an alternative to Flask and Django in the Python web framework space. Just about anything can be used for small projects in a vacuum. I used to make websites with framework-less PHP files and mysqli.


Absolutely this - the attitude of “never use an ORM” is an immediate red flag for me in a developer, indicating that they don’t understand the capabilities of a wide variety of ORMs and the trade-offs involved.


I think “never use X” is probably a bit far. But personally I can’t really think of a situation where I would rather have an ORM.

I used to use them all the time. When I started doing pure FP I just realized I just didn’t really need my old ORM. When I came back to OOP languages (out of necessity, not pleasure), I realized I didn’t really need my old ORM there either.

It’s not a strong opinion and it’s not a religious thing, it’s just a different perspective whereby you see that something isn’t really necessary.

Benefits or not, you do pay quite a heavy price for the ORM, and I think most people don’t see this price because it has been normalized so much.


The bigger red flag is confusing ORM with ORM toolkits.


It sounds like that "nerd macho" thing, where your main goal at work is to show how smart you are, not build working software systems.


While my article reads like an anti-ORM manifesto, my warning is about relying on ORMs to do all the work. How am I supposed to understand advanced fine-tuning options if I do not understand how the low-level query works?

Copilot may be your little helper in writing code, but it is not going to consistently write the best code that YOU can write.


So... Your claim is that the ORM optimizer is better than the DBMS one?

What DBMS do you use? I've used a few where that is true, and really, instead of using an ORM for that, you should stick to Postgres.


I will use an ORM if it makes sense for the app in question, and then heavily tune it.

Usually, though, I get better performance and less fragility out of leaning on carefully-crafted SQL+indexes to do the heavy lifting.


Do you have any references or resource recommendations for the migration path from ORMs to highly optimized SQL ops?


Most good ORMs let you write your own SQL to fetch things when you realize there's a particular operation that requires it.

So write the initial system with an ORM and then optimize as needs be. That might be obvious from the start, or might become evident later.


A very good point. The downside is that I find most people just consider an ORM to be auto-SQL, and the (data) modeling remains in a similar fashion.

ORM benefits mostly from navigation + caching.



