Yeah so I'm at this point in our main service, which is a rather large REST API. For certain endpoints — naturally, lists of things — we run into the classic N+1 queries problem: the query count grows with the size of the result set and compounds with every nested relation. We query for a list of entities which, when serialized, each trigger a query for some child entity, and on and on it goes. We've been able to stop some of the bleeding by refactoring queries to use `select_related` and `prefetch_related` (it's a Django app) so that we limit the number of queries, but this only gets us so far.
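To make it concrete, here's roughly the shape of the problem and the fix we've applied so far (models are made up for illustration, not our actual schema):

```python
# Hypothetical models, purely to illustrate the N+1 pattern.
from django.db import models

class Author(models.Model):
    name = models.CharField(max_length=200)

class Book(models.Model):
    title = models.CharField(max_length=200)
    author = models.ForeignKey(Author, on_delete=models.CASCADE)
    tags = models.ManyToManyField("Tag")

class Tag(models.Model):
    label = models.CharField(max_length=50)

def naive_listing():
    # N+1: one query for the books, then one query per book for its author.
    return [(b.title, b.author.name) for b in Book.objects.all()]

def tuned_listing():
    # Two queries total, regardless of how many books come back:
    # one JOIN for the FK, one extra query for the M2M prefetch.
    books = Book.objects.select_related("author").prefetch_related("tags")
    return [(b.title, b.author.name, [t.label for t in b.tags.all()]) for b in books]
```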
I'm now at a point where I'm not sure what the next step should be. I believe caching may be the next logical step, but I've been hesitant due to concerns about the staleness problem, knowing when to bust the cache, etc.
And then even if we do go the caching route, I'm unsure at what level we want to cache. Do we cache individual objects or do we cache the large, fully hydrated models (the ones that require many queries to assemble)? Then there's the problem of having to mix and match fetching objects from the cache in the context of other arbitrary DB queries that can be constructed.
How do people generally go about introducing caching? Do you try to limit the scope of caching as much as possible (like to the lowest/smallest objects) and see whether that's sufficient to get the performance you want and then increase the caching surface area from there?
Also, what approaches do people take on the technical side? We already use Redis for some things, so I implemented a quick proof-of-concept that seemed to work at small scale. I've also seen people use Elasticsearch to keep fully hydrated/rich models in there and then use ES as the read store. Are there merits to this approach? Crazy?
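Roughly the kind of thing I mean by a per-object cache in Redis (simplified sketch; the model, serializer, and key scheme are placeholders, not our real code): cache individual serialized objects with a TTL and delete keys on write.

```python
# Sketch of a per-object Redis cache; Book/BookSerializer are placeholders.
from django.core.cache import cache  # configured with a Redis backend
from django.db.models.signals import post_save

CACHE_TTL = 300  # seconds of staleness we're willing to tolerate

def get_serialized_book(book_id):
    key = f"book:v1:{book_id}"
    data = cache.get(key)
    if data is None:
        book = Book.objects.select_related("author").get(pk=book_id)
        data = BookSerializer(book).data  # the "hydrated" representation
        cache.set(key, data, timeout=CACHE_TTL)
    return data

def invalidate_book(sender, instance, **kwargs):
    # The part I'm nervous about: knowing every write path that must bust the key.
    cache.delete(f"book:v1:{instance.pk}")

post_save.connect(invalidate_book, sender=Book)
```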
Really keen to hear anyone's thoughts or guidance on this!
The number of queries is not the problem in itself; what you're ultimately concerned with is the total query time the frontend experiences.
An API is designed to aggregate and format data from various data sources. The speed of your REST API will ultimately come down to the SQL queries that need to be executed and how your RDBMS executes them. Leaning on naive query generation (Django's ORM, auto-generated GraphQL resolvers, and the like) typically leads to poor performance because you lack control over query execution.
Have you written your SQL from scratch and benchmarked slow JOINs? Sometimes doing multiple queries is faster than JOINs (when you only need to fetch related data for a small number of entities for example).
Are you using covering indices to accelerate entity lookups? This is probably the main way of speeding up your DB. De-normalizing your schema in certain cases is also a hack to be aware of.
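For example (assuming Postgres and a reasonably recent Django, with made-up models), a covering index that lets lookups by author be served entirely from the index might look like this:

```python
# Hypothetical covering index: lookups by author_id can be answered from the
# index alone (index-only scan), no heap fetches. Postgres INCLUDE, Django 3.2+.
from django.db import models

class Book(models.Model):
    title = models.CharField(max_length=200)
    published = models.DateField()
    author = models.ForeignKey("Author", on_delete=models.CASCADE)

    class Meta:
        indexes = [
            models.Index(
                fields=["author"],
                include=["title", "published"],
                name="book_author_cover_idx",
            ),
        ]

# Roughly equivalent raw SQL:
#   CREATE INDEX book_author_cover_idx
#       ON book (author_id) INCLUDE (title, published);
```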
Have you tuned caching in your RDBMS? You don't really need Redis when the RDBMS can do the caching for you (hint: in-memory tables are awesome, and RAM is cheaper than dev time).
Have you tried stored procedures? If you have complicated queries this may be helpful.
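As a rough sketch of what that can look like (hypothetical Postgres function and table names; the SQL would live in a migration so it stays in the codebase):

```python
# A Postgres function does the aggregation server-side; the app just calls it.
from django.db import connection

# Would be applied via a migration (e.g. migrations.RunSQL) rather than at runtime.
CREATE_FN_SQL = """
CREATE OR REPLACE FUNCTION book_summary(p_author_id integer)
RETURNS TABLE (book_id integer, title text, tag_count bigint) AS $$
    SELECT b.id, b.title, count(bt.tag_id)
    FROM book b
    LEFT JOIN book_tags bt ON bt.book_id = b.id
    WHERE b.author_id = p_author_id
    GROUP BY b.id, b.title;
$$ LANGUAGE sql STABLE;
"""

def author_book_summaries(author_id):
    with connection.cursor() as cursor:
        cursor.execute("SELECT * FROM book_summary(%s)", [author_id])
        return cursor.fetchall()
```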
There's a whole list of things you can do to get great performance out of your RDBMS. Start there instead of introducing another layer of abstraction/tech that will complicate the architecture. I can guarantee that good schema & query design can yield massive improvements.
Thanks for the thorough reply, very helpful! I have looked into some of these, but not all of them and not for all cases. I really do need to look at constructing some of the queries by hand and comparing with the queries constructed by Django's ORM - I know that in many cases the ORM generates terribly inefficient queries.
Stored procedures are not something I've tried yet, but I will look into them. Any more color or examples of how this might work in an application?
As far as RDBMS caching goes, you're referring to ordinary query caching, related to things like shared buffers in PG, etc.?
Just want to make sure I'm clear on all the suggestions you raised. Thank you!
You should be able to achieve a 10X improvement in speed just with good query and schema design, but ultimately everything depends on your data layout.
Avoid ORMs, GraphQL, etc. If you care about performance, write raw SQL specific to the views that your clients need.
Low-hanging fruit:
- connection pool: your DB can almost certainly execute many simple PK lookups concurrently; make sure you've got a large enough pool of DB connections to let it do this
- covering indices: if you're computing over columns A, B, C but looking up by PK, you can add a covering index over (PK, A, B, C), which means no heap lookups. Make sure all queries are using an index!
- joins: ORMs often generate inefficient joins or fall back to multiple queries. You need to strike the right balance; for HN, for example, it could be: get the PKs of the top 30 articles in one query (one covering index), then look up their attributes in another query (a join across N tables where PK IN (1,2,3), which doesn't use a covering index); there's a rough sketch of this pattern after the list. The reason: covering indices are expensive, so you minimise the data they hold by splitting up your queries.
- de-normalization: if you have very complex joins, de-normalizing your schema may help
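Sketch of the two-step pattern from the joins bullet (HN-flavoured; table and column names are made up):

```python
# Step 1: cheap ranking query that a small covering index can serve.
# Step 2: hydrate only those rows with whatever joins are needed.
from django.db import connection

def top_articles(limit=30):
    with connection.cursor() as cursor:
        cursor.execute(
            "SELECT id FROM article ORDER BY score DESC LIMIT %s", [limit]
        )
        ids = [row[0] for row in cursor.fetchall()]
        if not ids:
            return []
        cursor.execute(
            """
            SELECT a.id, a.title, u.username
            FROM article a
            JOIN app_user u ON u.id = a.author_id
            WHERE a.id = ANY(%s)
            """,
            [ids],
        )
        return cursor.fetchall()
```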
I wouldn't worry about stored procedures to begin with, unless you have a very complex schema/query plan. For caching, you just need to make sure you've allocated enough RAM for your DB to keep indices in memory; the more you allocate, the more pages it can cache in RAM. Focus on schema + SQL.
Raw SQL is not a cure-all; it’s an exchange of design simplicity and team velocity for application performance.
Prisma could handle the query generation part somewhat intelligently, at least in the future.
Denormalization is not a low hanging fruit. It may seem so because the mass of all the tangled fruits makes the top branch droop down where it can be munched on, but never quite satisfactorily picked.
I don't agree. Extra layers of abstraction like Prisma / ORMs only add complexity.
Sure it means you don't have to learn SQL at all, and that's really the only advantage. But time spent learning an API/ORM is better spent learning SQL.
Word of advice: avoid stored procedures unless you can store them in your codebase and have a CI/CD pipeline integrating their deployment seamlessly with the rest of the DB/API/app.
Storing code in a database is a terrible idea unless the entire workflow can support developing, debugging, deploying and monitoring that code.
My advice would be to avoid attempting to solve the problem by introducing a cache layer or any auxiliary query cache, and instead remove the problem at storage level by leveraging a database solution that 1) stores relationships as links 2) allows queries to resolve relationships to an arbitrary level without specifying join conditions. These two things make sure your entity resolution can happen with direct lookups instead of index lookups (or pog forbid, table scans).
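To illustrate what I mean by resolving relationships to arbitrary depth without join conditions, here's a rough sketch in Cypher via the `neo4j` Python driver (hypothetical labels and credentials; other graph-capable stores have similar traversal syntax):

```python
# Variable-length pattern: follows HAS_COMMENT links to any depth, no join conditions.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def article_with_comment_tree(article_id):
    query = """
    MATCH (a:Article {id: $id})
    OPTIONAL MATCH (a)-[:HAS_COMMENT*1..]->(c:Comment)
    RETURN a, collect(c) AS comments
    """
    with driver.session() as session:
        return session.run(query, id=article_id).single()
```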
If you absolutely cannot do that, cache at the entity-resolution level. That said, it doesn't really solve the actual problem; it just makes it more bearable by offloading primary-DB workload to another DB, so why not run a read replica instead?
Thanks for the reply. Regarding the type of DB you referenced, what are some examples of DBs that allow relationships as links and querying to an arbitrary level?
With regards to using a read replica, we do have one that we use for some queries, but that doesn't really resolve the problem of having to do many queries for certain endpoints. We can only eliminate so many queries before the backend code becomes unmanageable, because we effectively forgo Django's ORM in favor of our own handcrafted queries, which are hard to reuse.
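To give a flavor of the kind of handcrafted query I mean (illustrative names only, not our real schema), we end up tucking raw SQL into manager methods so it's at least somewhat reusable:

```python
# Illustrative: raw SQL wrapped in a custom manager, bypassing the ORM's query builder.
from django.db import models

class BookManager(models.Manager):
    def with_tag_counts(self, author_id):
        # Must select the primary key so Django can map rows back to instances.
        return self.raw(
            """
            SELECT b.id, b.title, count(bt.tag_id) AS tag_count
            FROM book b
            LEFT JOIN book_tags bt ON bt.book_id = b.id
            WHERE b.author_id = %s
            GROUP BY b.id, b.title
            """,
            [author_id],
        )
```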
Potentially anything that supports Cypher or has Apache TinkerPop integration: Neo4j with GraphQL, PostgreSQL with Apache AGE, EdgeDB, SurrealDB (very fresh), and Dgraph (almost but not quite production grade).
Got it, thank you. We are currently using PostgreSQL, so I will look into AGE, Edge, and Surreal. Any opinions on one vs. the other? I guess some of that depends on which one has the best packages for Python.
No real opinion there, my usual stack is different — those are just some of the players that came to mind. AGE has a Python driver, so maybe that’s a good place to start.
Note that depending on your data model, you may not need “complex” edges, just “simple” links directly from vertex to vertex. Not all databases support this.
Then again, you might already have N:M tables that correspond exactly to edges in a graph model. This distinction is something to keep in mind.
Thanks, I will keep that in mind. EdgeDB looks really nice. It would be a rather large project to migrate our PG over to it, but maybe not so bad. I hopped on their Discord to ask if there are any tools to aid with migration. They said:
>>> Tools to make migrations from Django (and other ORMs) are possible but we haven't yet started working on them. So you'd have to write the migration script manually.
Your comment on our Discord actually prompted an internal discussion on how we can implement this, and we now have a pretty good idea. :) No ETA or anything at this point, but we'll be looking into this soon.
Haha, great. It would be hugely helpful to anyone thinking about migrating, but I'm sure you have your work cut out for you already and there are lots of competing things to prioritize.