Paranoid SQL Execution on Postgres (ardentperf.com)
97 points by luu on Feb 25, 2022 | 20 comments



#0 should be: for anything other than DDL, run in a transaction. Immediately after executing, do sanity checks on data that you know should have been touched, and on data you know shouldn't have been touched. Ideally you're pair programming and your partner is the one doing those tests. Then you can roll back if you accidentally forgot a WHERE clause and destructively updated more than expected. Of course, every second you keep a transaction open contributes to bloat, so don't be slow. But this has saved me countless times.
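A minimal sketch of that workflow (table and predicates are made up for illustration):

    BEGIN;

    UPDATE orders SET status = 'cancelled' WHERE customer_id = 42;

    -- Sanity checks: rows you expect to have changed...
    SELECT count(*) FROM orders WHERE customer_id = 42 AND status = 'cancelled';
    -- ...and rows that should NOT have been touched:
    SELECT count(*) FROM orders WHERE customer_id <> 42 AND status = 'cancelled';

    -- If anything looks wrong: ROLLBACK;  otherwise:
    COMMIT;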


> other than DDL

Especially, definitely, also DDL (where possible). Only a few DDL statements cannot be run transactionally (most notably [CREATE | RE]INDEX CONCURRENTLY), but the rest of DDL is best tested before committing to the result, especially when migrating between versioned schemas: you don't want half-applied migrations.
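For example, a migration can be tried and inspected before it's made permanent (schema, table, and migrations bookkeeping here are hypothetical):

    BEGIN;

    ALTER TABLE app.users ADD COLUMN last_login timestamptz;
    UPDATE app.schema_migrations SET version = 42;

    -- inspect with \d app.users in psql, run sanity queries, then
    COMMIT;  -- or ROLLBACK to leave the schema untouched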


Absolutely run DDL in a transaction. There are some nuances, but as a general rule it's a great one.


> run in a transaction

Or just use PL/pgSQL stored functions, which are inherently transactional [1]

No bad thing anyway, as it has the side benefit of helping the fight against SQL injection attacks etc.

[1] https://www.postgresql.org/docs/14/plpgsql-transactions.html
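A minimal sketch (function and table names are made up): the function body executes atomically within the calling transaction, and raising an exception rolls back its effects.

    CREATE FUNCTION deactivate_user(p_user_id bigint)
    RETURNS void
    LANGUAGE plpgsql AS $$
    BEGIN
      UPDATE app.users SET active = false WHERE id = p_user_id;
      IF NOT FOUND THEN
        -- aborts the function and undoes its changes
        RAISE EXCEPTION 'no user with id %', p_user_id;
      END IF;
    END;
    $$;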


Are these runtime tests or only for CI/CD?

How do you test what wasn't changed? How does that work? Do you select before and after and compare certain rows or tables, and if so, random ones or which ones? A small example would be awesome.


PostgreSQL has a RETURNING clause for DML queries. It's very effective for this case because you get all affected rows and their new values.
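Roughly like this, combined with the transaction approach above (table is hypothetical):

    BEGIN;

    UPDATE accounts
       SET balance = balance - 100
     WHERE id = 42
    RETURNING id, balance;

    -- eyeball the returned rows; COMMIT if they match expectations, else ROLLBACK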


why do you need to run SELECT with a transaction?


Surprised nobody has mentioned Row Level Security (RLS) for Postgres. It allows tight, domain-specific access control on each table through security policies you design. It's so effective at controlling access that PostgREST (postgrest.org) uses it to completely manage the access control for the HTTP API it exposes over the underlying Postgres instance.

What I personally like about RLS is that I can test every access scenario for a user's session in unit tests, and I don't have to worry about application bugs allowing one user to access another user's data or accessing tables that their session token/api key doesn't allow them access to.
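A minimal sketch of a per-owner policy (the table and the app.current_user_id session setting are hypothetical):

    ALTER TABLE documents ENABLE ROW LEVEL SECURITY;

    CREATE POLICY documents_owner ON documents
      USING (owner_id = current_setting('app.current_user_id')::bigint);

The application would set app.current_user_id per session (e.g. with SET LOCAL after authenticating), and non-superuser roles then only ever see their own rows, no matter what the query says.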


When it comes to security and SQL in general, I think stored procedures are underappreciated. Bring back the DBA role and make procedures for everything. Then you don't even have to build that API layer: SQL is your API. And for things like password hashes, there doesn't even have to be a user that can read the hash column; let a stored procedure compare them for you and there's no risk of leaking password hashes (unless the whole server gets compromised, of course).
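A sketch of that pattern, assuming pgcrypto installed in the public schema and a hypothetical app.users table and app_role:

    CREATE FUNCTION check_password(p_username text, p_password text)
    RETURNS boolean
    LANGUAGE sql
    SECURITY DEFINER
    SET search_path = pg_catalog, pg_temp  -- see the search_path trap below
    AS $$
      -- crypt(candidate, stored_hash) re-hashes with the stored salt;
      -- returns NULL (not true) for an unknown username
      SELECT public.crypt(p_password, password_hash) = password_hash
        FROM app.users
       WHERE username = p_username;
    $$;

    REVOKE ALL ON app.users FROM app_role;
    GRANT EXECUTE ON FUNCTION check_password(text, text) TO app_role;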


I think there's a little momentum in that direction. I've noticed one of the "cool" startup stacks right now is TypeScript, serverless functions, and Postgres, and some places do a lot with SQL functions, stored procedures, and the PL languages.

Overall I agree with you, but the process/tooling is really the issue. Not inherently; it's just less developed and less standardized than for other kinds of code. DBAs aren't necessarily a silver bullet here either: they have a different, somewhat overlapping focus, and while they have tools for managing changes to SQL-as-code, those tools usually aren't built around the assumption that devs are shipping diffs to SQL as part of their normal day-to-day work.

Not insurmountable complexities by any means but unless you are or work closely with an expert in your chosen DB, it might be difficult to accurately estimate what you're getting into.

---

This probably is more secure overall, but not in a magical way. There's nothing really special about a user here; something will eventually have to have a role that can read that column. For example, a classic trap that has caused plenty of security breaches is needing to write a function as `SECURITY DEFINER` to account for RLS policies but forgetting to exclude user-writable schemas from its search_path. No users necessary for that leak.
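The usual mitigation, shown here on a hypothetical function, is to pin search_path on the definition so name resolution inside the function can't be hijacked through a user-writable schema:

    ALTER FUNCTION app.lookup_secret(bigint)
      SET search_path = pg_catalog, pg_temp;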


If you want to see paranoid SQL for Postgres, look at the output of pg_dump (with the default text format). For every strange-looking bit of SQL syntax that pg_dump prefers over the "usual" spelling, there's a story behind that choice, usually one involving some customization a tenant or DBA could make to the DB that would break the semantics of the regular SQL.
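For instance, recent pg_dump versions open the dump with roughly this preamble (exact settings vary by version), emptying search_path so nothing in the dump can resolve against an attacker-created object:

    SET statement_timeout = 0;
    SET lock_timeout = 0;
    SET client_encoding = 'UTF8';
    SET standard_conforming_strings = on;
    SELECT pg_catalog.set_config('search_path', '', false);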


Great suggestions. Also:

Parse the SQL and make sure it is what you think it is. Some parsers will give you a list of all tables accessed. I use node-sql-parser.

Run in a read-only transaction: SET SESSION CHARACTERISTICS AS TRANSACTION READ ONLY;
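With that set, any write fails fast (hypothetical table; the error text is what Postgres emits):

    UPDATE accounts SET balance = 0;
    -- ERROR:  cannot execute UPDATE in a read-only transaction

    BEGIN READ ONLY;  -- per-transaction variant of the same guard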


It is also missing:

- qualify all type names that you might be using in casts


Interesting. By this, do you mean select '2020-01-01'::date; is slow, but select '2020-01-01'::pg_catalog.date; is fast?

If so, that is very surprising. Any ideas why that is slow?


No, just that an adversary might create a type `date` in a schema that's on your search_path, so that any data goes through that type's input and output functions / the semantics of the query change.

It's quite similar to why you fully qualify table names and column names.


Can you give a short example? So you don’t trust your own administrator?


> So you don’t trust your own administrator?

Not per se, but it might just as well be that you share the database with several other applications, which might do DDL. Fully qualifying all identifiers is the easiest way to guarantee that you don't have negative dependencies to worry about (i.e. depending on the non-existence of some identifier in a certain schema); the issue will most likely only happen when a) an adversary gets the ability to CREATE types in the database, or b) you use custom types and a name that's in use starts to be shadowed.

Examples:

    CREATE DOMAIN "bigint" AS pg_catalog.text;

    CREATE TYPE "text" AS ENUM ();

Though, in all earnestness, this is also an issue with custom operators:

    CREATE OPERATOR = (FUNCTION = always_false, LEFTARG = int, RIGHTARG = int);

You can schema-qualify operators ( Col1 OPERATOR(pg_catalog.=) Col2 ), but in doing that you lose the normal operator precedence.


A good corollary might be: "use very specific names for custom types."


I am not sure at the moment, but don't you need superuser access to create operators? In that case not just "any application" could make your application faulty.


Nope:

> To be able to create an operator, you must have USAGE privilege on the argument types and the return type, as well as EXECUTE privilege on the underlying function. If a commutator or negator operator is specified, you must own these operators.

(https://www.postgresql.org/docs/14/sql-createoperator.html)



