Hacker News new | past | comments | ask | show | jobs | submit login
Postgres 9.5 feature highlight: row-level security and policies (otacoo.com)
153 points by waffle_ss on Oct 3, 2014 | hide | past | favorite | 61 comments



Wow. This is super exciting. What does this mean for multi tenant applications? Does this mean we can push our multi-tenancy logic down to the DB (if not logic, at least the last line of "data defence"). Can we create one DB user for each app-level account, ensure that all tables have the account identifier, and then enforce (at a DB level), that one account should be unable to access another account's data?

What does this mean for DB pooling? Is it possible to maintain a set of open DB connections where the account credentials can be applied, query executed, and the account credentials removed, with the connection going back into the pool?


> Is it possible to maintain a set of open DB connections where the account credentials can be applied, query executed, and the account credentials removed, with the connection going back into the pool?

SET ROLE/RESET ROLE can be used for that, however for role A to be able to switch to role B, you need to "GRANT B TO A" which will lead to a combinatory explosion.

It'd probably be better to have a user with all roles granted as NOINHERIT (so the only thing it can do is SET ROLE) and all connections defaulting to that, then when you get a connection from the pool you "SET ROLE current_user" and you "RESET ROLE" the connection before storing it back.

I have not tested it.


i do just that and it works like a charm.


Also, what kind of hit does this have on performance?


Based purely on the syntax they are showing, it seems like in the worst case scenario it should just be the equivalent of having a couple extra where clauses for each potentially matched row, so the impact should be nominal. The real question is whether the optimizer can actually make the queries faster than they would have been before if there are a huge number of rows, but only a small number that the user has permission to access. E.g. can it create an indexed column on the permissions.


http://www.postgresql.org/docs/devel/static/ddl-rowsecurity....

I found the dev documentation to be much clearer than this blog post.


I find the Postgres dev documentation to be much clearer than most things, honestly.


Thanks for this. Much better readability, and it helped me see that you need to add policies after enabling RLS. At first I was scratching my head thinking it was magic


I open every "New Postgresql features" article in the hope that it will contain "Simple cluster setup, you just start nodes and they will figure out the rest." Unfortunately it didn't happen so far...

Is it such a hard problem, that no one is able to solve it, or it is not a concern or interest of the parties that are developing Postgresql or the tools around it?

(And by hard problem, I don't mean surviving an aphyr-level diagnostics without any issues, but it it would be certainly nice to see such too.)

Edit: I'd settle for a multi-master replicated cluster, where the node failures and startups (and the migration of the data) is handled transparently in the cluster. I know that there are many other aspects, but even this basic case is painfully hard to achieve (much harder than with MySQL).


It probably will never happen. Postgres follows an ideology of explicit and safe...it never tries to assume that you meant to do one thing if there is a possibility that you meant another.

With clusters, there are far too many variables for it to be plug and play. For example, do you want replication, or distributed partitions? If replication, do you want master-slave, or master-master? If distributed partitions, what are you going to trade off: consistency or availability? (That being said, Postgres will likely never change from its CA stance, you'll have to use an alternative distribution like PostgresXL).

Furthermore, clustering isn't likely to be much more than a pareto-tail problem for the next decade or so. It is useful for millions of use cases, and something that only Google's and Twitter's can benefit from is not one of them. If you have the data to justify extensive clustering requirements, I would hope you also have the resources to contribute your patches to Postgres. So far, they all seem to be content solving their own niche problems with niche solutions that slowly bleed down to us mere mortals (Ex. Cassandra), or just buying a solution like EnterpriseDB or Oracle.


Can't agree. You don't need a ton of data to run into perf issues if you are running on slow commodity hardware for example in Azure. Nor does the lack of synchronous replication with automatic failover make your job easy. Considering the lack of features in PG in this area, running it in Azure seems like a no go to me. We currently had to pick SQL Server which has failover and so on which works great but costs a lot.


It's a hard problem, and the Postgres team's careful approach to engineering means that they are solving it slowly, in comparison to the current fashion for ship first, patch later.

They bring stuff in piece by piece, eg. in 9.4 they added logical changeset extraction/logical log streaming replication, which is another step down the road to clustering. On top of this they are building bi-directional replication. And so it continues.

This is the same reason they haven't added upsert or merge yet.


It will happen eventually. There are use cases and postgres can do more to address them. The existence of third-party solutions is evidence of that.

What makes it really hard is solving the problem in a way that doesn't lead people into a trap. MongoDB offers solutions for some users, but a lot of users walk away very disappointed when the "solution" didn't do what they think it would do (often after running into problems in production).

But it still needs to be done. A general solution is impossible, so we need specific solutions. Each of those will need to be designed in a way that the user knows as early as possible whether it will work for them, or they need to use one of the other approaches.

What's happening right now is work on very powerful infrastructure for logical replication. Postgres invested in binary replication before, and the results have been great. But binary replication is not a good foundation for multi-master; so now there's investment in logical. There is also early-stage investment in parallel query, which I believe can pay off with mutli-machine parallelism, which is another form of scale-out.


The "trap" of MongoDB was more due to the exaggerated marketing around its features and data reliability out-of-the-box. Since Postgres already has plenty of mindshare, it wouldn't need to resort to such tactics, so it could describe its features with "giant red exclamation points" describing the exact pitfalls and extent of the use cases. So I'd be very much in favor of its developers prioritizing specific clustering solutions.


Documenting is one part, but that doesn't really answer the "simple setup" requirement if people have to read a lot.

Ideally, even someone who doesn't read the documents would be gently and intuitively guided toward the right solution and away from costly mistakes. Easier said than done, but let's try to get as close as we can.


You never want node failover to be transparent or automagic. Doing that is begging to end up in a "split-brain" scenario, and have fun cleaning up from that.

EDIT: Yes I'm simplifying; saying "never" overstates somewhat. If your cluster management tool can account for STONITH, you're probably safe — or at least safer. If you don't know what STONITH is without googling, you shouldn't be playing with multi-master, highly available databases and failover betwixt them.


There are robust ways to do this. The fact that some choose to implement their own buggy protocols for it does not mean that I do not want it. I use automatic failover with SQL Server today (and has been for years). I don't like the idea of getting up in the middle of the night to acknowledge some hardware error to initiate a failover.


The fact that a commercial software empire with tens of billions of dollars in revenue can, in conjunction with an OS over which they have complete control, implement working resource fencing does not imply that a mostly-volunteer Open Source project, running on at least a dozen-odd different hardware and software platforms, can too — and certainly not on the same time scale.


Clustering /is/ hard. There are tradeoffs to the various ways to do it, and explicit choices you have to make that have the ability to bite you in the ass later. As one of the stated goals of Postgres is reliability and correctness, they don't want to half-ass it. That being said, there is active work on getting multi-master working well out of the box, which unfortunately didn't land in 9.4. It may make it in 9.5, but that remains to be seen.


This would probably be a nice market for commercial vendors of PostgreSQL solutions.


It is.

I'm currently building this sort of thing and it's a glorious feeling. Clustering is hard, like "search engines before publishing of the page rank algorithm" hard. Sure there are lots of options, some very good ( I have fond memories of HotBot's advanced search, it was my go to search engine for a few years ) but tractability changed drastically after that paper. Now search engine theory and building a basic search engine from scratch is suitable for students instead of postgraduates. It's fun to work on a bleeding edge.


Don't expect it anytime soon. A large number of Postgres' key developers work for EnterpriseDB and that is a feature they charge for as part of Postgres Plus Advanced Server.


That is not a feature of Postgres Plus Advanced Server.


clustered read scalable deployments in minutes with no special skills

Full integration with high availability solutions such as Red Hat's Cluster Suite

With Postgres Plus Advanced Server, you can build sharded systems or other replication architectures with a variety of bundled solutions.

Master-to-Master, and similar styled applications are fully supported

Then their marketing copy is full of shit because they promise what rather sounds like a clustered setup with very little effort, maybe not zero config but doable with no special skills.


Their marketing copy lists many things that are just "postgresql". Postgresql already gives you "clustered read scalable deployments in minutes with no special skills". Setting up a read-only replication slave is trivial.


xDB is part of the Plus product. It's a trigger-based multi-master asynchronous replication system.

http://www.enterprisedb.com/products-services-training/produ...


No it is not a hard problem, it is an impossible, unsolvable problem. Software can not alter its behavior to match your desires if you do not tell it what you desire. Clustering is not some simple monolithic thing where you just "enable clustering" and that's that. There's billions of possible clustering setups. Setting it up is as simple as it can get, you have to pick which of the billions of setups you want.


>> Software can not alter its behavior to match your desires if you do not tell it what you desire

Why not have special types of transaction that would explicitly define consistency expectations across a cluster. Make the normal default mirror all data across all clusters and require a lock across the entire cluster when inserting/updating/deleting. You would then be free to alter expectations and introduce sharding to improve performance as needed.


Yes, there are billions of different setups, but there are a few basic ones, and you could start by solving one or two, before solving the generic case. For example, one could provide just a cluster setup for data replication. No fancy distributed data models, just copy the data around in the cluster, if a new node joins copy that data to it, if something fails, handle the failure. Postgresql claims to have multi-master setups, so this is really about the node handling and copying the data around.

After that, you could introduce locally distributed cluster. Later on you could introduce geographically distributed cluster setups. But just because the later is very complex, does not mean that you can't start with the basic setup.


> if something fails, handle the failure.

This generally involves halting the server or forcing a read-only mode until consistency can be assured.

But in general repmgr[1] can be used to simplify many common activities and configurations.

[1] http://www.repmgr.org


Setting up the few basic ones is already as easy as it can get.


Synchronous replication with automatic failover?


Yes.


Care to provide a link with the easy setup and cluster management? It could be a well-hidden gem that needs more publicity.



Trying to be ironic? That page does not present an easy way to set up what I asked for.


Yes it does. Here, I will click the correct link for you and everything: http://www.repmgr.org/


Yay, RLS! Maybe I'm weird but I've been waiting for this.

This will make per-tenant replication easier too


> Yay, RLS! Maybe I'm weird but I've been waiting for this.

We can be weird together!


What is RLS?


Read the article...


See title.


Looks like Oracle's Virtual Private Database? (or something like that, I cant remember the name)


Yes, it's almost the same


It's a nice example but when would you actually want RLS in the database? In practice you'd want to validate phone number formats etc, so between this and adding a "WHERE row_owner = X" to every query doesn't seem that different.


> so between this and adding a "WHERE row_owner = X" to every query doesn't seem that different.

The difference is you will forget to add the condition to a query at one point.


This will make it easier to replicate with say iPhone clients running Sqlite. Everyone can write to the same db, but only see their own stuff.

You could dramatically reduce SQL Injections by giving each user their own database login with limited rights. Login to web site as foo, which connects you to database as foo. With RLS you can do less damage.

Or let management connect directly to the database via Excel. Use RLS to prevent lower managers from seeing upper managers' salaries.


One example I've had in the past:

Inside governments, there are often individual pieces of data within a larger dataset that are protectively marked. For example, fake identities created by the government, identities of prominent people (like members of the Royal family) or maybe even people at high risk of identity theft like bank managers.

Your normal app just thinks the records for these people are missing or knows that they are present but can't access all of the data for them. These people have to make special arrangements to eg, apply for a driving licence.

The way it works in paper processes is that normal caseworkers will get the file and see that it's protectively marked and then hand over to a caseworker who has security clearance.

Obviously it depends on how much of your problem domain you're reifying in your database but it can be a nice option, particularly as part of defence-in-depth.


Whenever you have tables of data where users should only have access to certain rows and want that enforced at the DB layer rather than the app layer.

It probably only makes sense if your users also have user accounts on your postgres server.


>In practice you'd want to validate phone number formats etc

Uh huh? That's what check constraints are for.


I would really like to see an implementation of MySQL's SQL_CALC_FOUND_ROWS. It's super handy for pagination.


According to http://www.percona.com/blog/2007/08/28/to-sql_calc_found_row... (which is a very old thread and may be completely outdated), it's also often much slower than just doing two queries.


Good to know!

Although, like you said, that thread dates back to MySQL 5.0 - I wouldn't hedge my bets on it today...

Is there a "best practices" way to handle pagination of complex queries without running the query twice?


Server not found.



Ah, thank you. :D


Nice. :)


Is there an ETA for the production release of PG9.4 ?


I've been wondering this as well. They've had a release candidate out for a few months now, but no official release and nothing about an expected date on their web page.


Here is a post on their mailing list about it:

""" Since we bit the bullet and changed the on-disk format for JSONB data, the core committee feels that we should put out a new 9.4 beta release as soon as possible. Accordingly, we plan to wrap 9.4beta3 on Monday (Oct 6) for release Thursday Oct 9.

			regards, tom lane
"""

http://www.postgresql.org/message-id/26018.1412041228@sss.pg...


I'd guess by end of october or earlier december. They ran into some late issues with jsonb.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: