Pg_later: Asynchronous Queries for Postgres

davea37 · on Aug 18, 2023

It's so good, it gave the elephant an extra leg.

luto · on Aug 18, 2023

I don't quite understand why blog writers care to include a random AI at the top of their posts. Just give me the content, I don't want to scroll past a random elephant first.

brightball · on Aug 18, 2023

Because social sharing seems to work better with an image and hand crafting something for every post is exhausting.

weinzierl · on Aug 18, 2023

Everyone says they that and has been before AI. What they always miss is the little word "relevant". A relevant picture can be helpful, a random picture is just distracting and annoying.

roywiggins · on Aug 18, 2023

The image is there for social media shares, which grab a prominent image to put beside the link. Even if it's a bit distracting in the post, if it grabs your eye enough to click the link, then that's job done.

Links without an image are just physically smaller on Twitter or Facebook, they don't stand out as well.

mosselman · on Aug 18, 2023

I don't find it that random in a post about Postgres to show a picture of an elephant. If it had been a badger or something, then yes, but an Elephant is pretty relevant as animals go.

DaiPlusPlus · on Aug 18, 2023

> hand crafting something for every post is exhausting.

Spending hours doing trial-and-error with image-generation prompts is also exhausting.

Are we at the point where authors can feed their entire article text into an image-generator and it repeatedly (95%?) produces appropriate, if not very apt, artwork?

brightball · on Aug 21, 2023

I will say, when I have used them I haven't spent more than a few minutes on the prompts. But I'm typically more focused on the writing and the image just needs to be "good enough" unless it's something specifically relevant to the content.

TylerE · on Aug 18, 2023

Most likely some SaaS that is ultimately mechanical turking the work out to some country with cheap labor.

mhd · on Aug 18, 2023

When your LLM is fed by blind Buddhist monks.

revskill · on Aug 18, 2023

Why Buddhist ? Why not Christians ?

rrauenza · on Aug 18, 2023

https://en.wikipedia.org/wiki/Blind_men_and_an_elephant

"The earliest versions of the parable of blind men and elephant is found in Buddhist, Hindu and Jain texts, as they discuss the limits of perception and the importance of complete context. The parable has several Indian variations [...]"

anotherevan · on Aug 19, 2023

Well it turns out an elephant's nose isn't the only appendage that is prehensile…

DaiPlusPlus · on Aug 18, 2023

If I understand the article correctly, the two main features of pg_later is that a SQL batch won't be aborted when your Postgres connection dies - and all results are retained until you retrieve them, and this is all mediated by Tembo's Postgres message-queue extension.

The thing is, both of those things can be done today without any extensions: just modify your SQL scripts to run under an Agent account, and change every `SELECT` into a `SELECT INTO` statement, all dumping into a tempdb - this technique also works on SQL Server and Oracle too.

(On the subject of Agents, I'm surprised pgAgent isn't built-in to Postgres; while MSSQL and Oracle have had it since the start)

dventimi · on Aug 18, 2023

I'm confused. Can I already do this today without extensions or do I need to install pgAgent first?

_sojh · on Aug 18, 2023

Idk about pgagent but any table is a resilient queue with the multiple locks available in pg along with some SELECT pg_advisory_lock or SELECT FOR UPDATE queries, and/or LISTEN/NOTIFY.

Several bg job libs are built around native locking functionality

> Relies upon Postgres integrity, session-level Advisory Locks to provide run-once safety and stay within the limits of schema.rb, and LISTEN/NOTIFY to reduce queuing latency.

https://github.com/bensheldon/good_job

> |> lock("FOR UPDATE SKIP LOCKED")

https://github.com/sorentwo/oban/blob/8acfe4dcfb3e55bbf233aa...

notatoad · on Aug 18, 2023

i think the parent commenter has confused "without any extensions" and "without installing any more extensions than the ones i've already installed"

DaiPlusPlus · on Aug 18, 2023

To my knowledge, pgAgent isn't an extension, it's just a client.

dventimi · on Aug 18, 2023

Thanks. That helps.

chuckhend · on Aug 18, 2023

Hi HN, lead dev on pg_later here. Reach out if you want to try pg_later or pgmq on Tembo Cloud. email me, adam at tembo dot io or join our slack: https://join.slack.com/t/tembocommunity/shared_invite/zt-20d...

why-el · on Aug 18, 2023

Do the queries still have to be serial? I imagine so. The README is not clear but it can be interpreted as me being able to fire more than one query (say 2, but the first one is still running, like pg_background). Is that the case?

chuckhend · on Aug 18, 2023

There's currently a single background worker reading jobs from a single queue, so processing is limited to serial at the moment. But we're planning on making both of those configurable, so that you can have priority queues, and multiple working pulling from the same queue, for example.

reacharavindh · on Aug 18, 2023

Not so much a big user of Postgres. But, out of curiosity, does Postgres have a way to run async queries _and_ have them run in the background with lowest priority such that it doesn’t affect other important queries and transactions?

As an example, a way to run an upsetting query about a cosmetic change in big table without locking rows for other transactions we care about on the same table..?

andruby · on Aug 18, 2023

There's no concept of nice or priority in Postgres.

Queries usually stay out of each other's way, unless they're modifying the same data, causing lock contention.

What I've done in the past for "less important background queries" is use very short lock_timeout and short statement_timeout values. The query will fail if it can't acquire the lock quickly (and in turn won't hold extra locks), so you put it in a loop with a sleep.

https://www.postgresql.org/docs/15/runtime-config-client.htm...

samaysharma · on Aug 18, 2023

There isn't an explicit concept of priority / resource allocation between queries in Postgres.

However, Postgres does that automatically for certain background processes like autovacuum, background worker etc. by allowing you to configure how fast / slow they go.

You could implicitly influence how fast / slow something goes by setting per role / database parameters and giving less resources to certain types of queries (https://www.postgresql.org/docs/15/sql-alterrole.html) or by using explicit locks + lock_timeout to create some kind of a priority.

gavinray · on Aug 18, 2023

I tested the following and it seems to work:

    CREATE OR REPLACE FUNCTION run_with_adjusted_settings(query_text text)
    RETURNS SETOF record
    LANGUAGE plpgsql
    AS $$
    DECLARE 
        result_record record;
    BEGIN
        -- set_config ( setting_name text, new_value text, is_local boolean ) → text
        -- Sets the parameter setting_name to new_value, and returns that value.
        -- If is_local is true, the new value will only apply during the current transaction.
        PERFORM set_config('statement_timeout', '10s', true);
        PERFORM set_config('work_mem', '1MB', true);
        PERFORM set_config('maintenance_work_mem', '1MB', true);
        PERFORM set_config('max_parallel_workers_per_gather', '1', true);

        -- Execute the provided query dynamically and return the results
        FOR result_record IN EXECUTE query_text
        LOOP
            RETURN NEXT result_record;
        END LOOP;

        RETURN;
    END;
    $$;

    -- Example usage:
    SELECT *
    FROM run_with_adjusted_settings('SELECT 1 as id, false as some_bool;') AS (id int, some_bool boolean);

drsim · on Aug 18, 2023

I'm not aware of a way to prioritize one process over another.

The way I've done it is, as you allude to, by managing locks. You set the transaction isolation level as appropriate.

You can also batch statements by using a cursor, rather than having a single large blocking transaction.

netcraft · on Aug 18, 2023

Ive wanted something like this for a while.

Years ago on db2 on AS/400 we could submit queries to "batch". It would save the results in a physical file and you could come back and query them later. We were running plenty of things that took hours or sometimes days to finish. Being able to have that running on the server, not tied to any given client and set their priority (and change the priority during the run) was a huge benefit.

Theres not too many cases where I need to run hour long queries anymore, but still would be a great feature to have for long running, lower priority jobs.

Just need this to become available for RDS / Aurora.

pramsey · on Aug 18, 2023

Similar to https://github.com/vibhorkum/pg_background

MuffinFlavored · on Aug 18, 2023

I'm having a hard time understanding. I see that it's basically a queue. Request your query to be inserted, get a job_id back, poll that job_id until it is done.

Under the hood, is the query ran like a normal query?

I guess I don't understand the distinction between "run queries async" or "put query in a queue, run them sync one a time, poll until the one you want is done"

I get how the latter is "async".

chuckhend · on Aug 18, 2023

It is similar functionality to https://docs.snowflake.com/en/developer-guide/sql-api/refere..., the `async` flag more specifically. You submit the query and are given the job id. A background worker runs it like a normal query, but the caller can go about their business while Postgres handles it for them.

zekenie · on Aug 18, 2023

Interesting! I’ve used materialized reviews for this in the past. I’m curious what the advantages to pg_later are over that.

chuckhend · on Aug 18, 2023

I don't think you'd replace a materialized view with pg_later, but it might help you populate or update your materialized view if you are trying to do that asynchronously. pglater.exec() works with DDL too!

I use it a lot for long running queries when doing data science and machine learning work, and a lot of times when executing queries from a jupyter notebook or CLI. That way if my jupyter kernel dies, my query execution continues even if the network or my environment has an issue. I've started using it a bit more with https://github.com/postgresml/postgresml for model training tasks too, since those can be quite long running depending on the situation.

magden · on Aug 18, 2023

It should be helpful for maintenance and DDL tasks. Such as I need to build a secondary index over a large table and do something once the index is ready. Start the index creation with pg_later, check the status later and do what you need to do next. Probably, they can support triggers so that once a task is completed I can be notified.

samaysharma · on Aug 18, 2023

We are planning to add support for "Push notifications for completed and failed jobs". That's in our roadmap listed here: https://tembo.io/blog/introducing-pg-later/#up-next

PaulHoule · on Aug 18, 2023

arangodb has had this for a long time

https://www.arangodb.com/docs/stable/http/general.html

chuckhend · on Aug 18, 2023

Snowflake has too! I'm not familiar with arangodb but I'll check it out.