Hacker News new | past | comments | ask | show | jobs | submit login
Pg_later: Asynchronous Queries for Postgres (tembo.io)
187 points by revskill on Aug 18, 2023 | hide | past | favorite | 37 comments



It's so good, it gave the elephant an extra leg.


I don't quite understand why blog writers care to include a random AI at the top of their posts. Just give me the content, I don't want to scroll past a random elephant first.


Because social sharing seems to work better with an image and hand crafting something for every post is exhausting.


Everyone says they that and has been before AI. What they always miss is the little word "relevant". A relevant picture can be helpful, a random picture is just distracting and annoying.


The image is there for social media shares, which grab a prominent image to put beside the link. Even if it's a bit distracting in the post, if it grabs your eye enough to click the link, then that's job done.

Links without an image are just physically smaller on Twitter or Facebook, they don't stand out as well.


I don't find it that random in a post about Postgres to show a picture of an elephant. If it had been a badger or something, then yes, but an Elephant is pretty relevant as animals go.


> hand crafting something for every post is exhausting.

Spending hours doing trial-and-error with image-generation prompts is also exhausting.

Are we at the point where authors can feed their entire article text into an image-generator and it repeatedly (95%?) produces appropriate, if not very apt, artwork?


I will say, when I have used them I haven't spent more than a few minutes on the prompts. But I'm typically more focused on the writing and the image just needs to be "good enough" unless it's something specifically relevant to the content.


Most likely some SaaS that is ultimately mechanical turking the work out to some country with cheap labor.


When your LLM is fed by blind Buddhist monks.


Why Buddhist ? Why not Christians ?


https://en.wikipedia.org/wiki/Blind_men_and_an_elephant

"The earliest versions of the parable of blind men and elephant is found in Buddhist, Hindu and Jain texts, as they discuss the limits of perception and the importance of complete context. The parable has several Indian variations [...]"


Well it turns out an elephant's nose isn't the only appendage that is prehensile…


If I understand the article correctly, the two main features of pg_later is that a SQL batch won't be aborted when your Postgres connection dies - and all results are retained until you retrieve them, and this is all mediated by Tembo's Postgres message-queue extension.

The thing is, both of those things can be done today without any extensions: just modify your SQL scripts to run under an Agent account, and change every `SELECT` into a `SELECT INTO` statement, all dumping into a tempdb - this technique also works on SQL Server and Oracle too.

(On the subject of Agents, I'm surprised pgAgent isn't built-in to Postgres; while MSSQL and Oracle have had it since the start)


I'm confused. Can I already do this today without extensions or do I need to install pgAgent first?


Idk about pgagent but any table is a resilient queue with the multiple locks available in pg along with some SELECT pg_advisory_lock or SELECT FOR UPDATE queries, and/or LISTEN/NOTIFY.

Several bg job libs are built around native locking functionality

> Relies upon Postgres integrity, session-level Advisory Locks to provide run-once safety and stay within the limits of schema.rb, and LISTEN/NOTIFY to reduce queuing latency.

https://github.com/bensheldon/good_job

> |> lock("FOR UPDATE SKIP LOCKED")

https://github.com/sorentwo/oban/blob/8acfe4dcfb3e55bbf233aa...


i think the parent commenter has confused "without any extensions" and "without installing any more extensions than the ones i've already installed"


To my knowledge, pgAgent isn't an extension, it's just a client.


Thanks. That helps.


Hi HN, lead dev on pg_later here. Reach out if you want to try pg_later or pgmq on Tembo Cloud. email me, adam at tembo dot io or join our slack: https://join.slack.com/t/tembocommunity/shared_invite/zt-20d...


Do the queries still have to be serial? I imagine so. The README is not clear but it can be interpreted as me being able to fire more than one query (say 2, but the first one is still running, like pg_background). Is that the case?


There's currently a single background worker reading jobs from a single queue, so processing is limited to serial at the moment. But we're planning on making both of those configurable, so that you can have priority queues, and multiple working pulling from the same queue, for example.


Not so much a big user of Postgres. But, out of curiosity, does Postgres have a way to run async queries _and_ have them run in the background with lowest priority such that it doesn’t affect other important queries and transactions?

As an example, a way to run an upsetting query about a cosmetic change in big table without locking rows for other transactions we care about on the same table..?


There's no concept of nice or priority in Postgres.

Queries usually stay out of each other's way, unless they're modifying the same data, causing lock contention.

What I've done in the past for "less important background queries" is use very short lock_timeout and short statement_timeout values. The query will fail if it can't acquire the lock quickly (and in turn won't hold extra locks), so you put it in a loop with a sleep.

https://www.postgresql.org/docs/15/runtime-config-client.htm...


There isn't an explicit concept of priority / resource allocation between queries in Postgres.

However, Postgres does that automatically for certain background processes like autovacuum, background worker etc. by allowing you to configure how fast / slow they go.

You could implicitly influence how fast / slow something goes by setting per role / database parameters and giving less resources to certain types of queries (https://www.postgresql.org/docs/15/sql-alterrole.html) or by using explicit locks + lock_timeout to create some kind of a priority.


I tested the following and it seems to work:

    CREATE OR REPLACE FUNCTION run_with_adjusted_settings(query_text text)
    RETURNS SETOF record
    LANGUAGE plpgsql
    AS $$
    DECLARE 
        result_record record;
    BEGIN
        -- set_config ( setting_name text, new_value text, is_local boolean ) → text
        -- Sets the parameter setting_name to new_value, and returns that value.
        -- If is_local is true, the new value will only apply during the current transaction.
        PERFORM set_config('statement_timeout', '10s', true);
        PERFORM set_config('work_mem', '1MB', true);
        PERFORM set_config('maintenance_work_mem', '1MB', true);
        PERFORM set_config('max_parallel_workers_per_gather', '1', true);

        -- Execute the provided query dynamically and return the results
        FOR result_record IN EXECUTE query_text
        LOOP
            RETURN NEXT result_record;
        END LOOP;

        RETURN;
    END;
    $$;

    -- Example usage:
    SELECT *
    FROM run_with_adjusted_settings('SELECT 1 as id, false as some_bool;') AS (id int, some_bool boolean);


I'm not aware of a way to prioritize one process over another.

The way I've done it is, as you allude to, by managing locks. You set the transaction isolation level as appropriate.

You can also batch statements by using a cursor, rather than having a single large blocking transaction.


Ive wanted something like this for a while.

Years ago on db2 on AS/400 we could submit queries to "batch". It would save the results in a physical file and you could come back and query them later. We were running plenty of things that took hours or sometimes days to finish. Being able to have that running on the server, not tied to any given client and set their priority (and change the priority during the run) was a huge benefit.

Theres not too many cases where I need to run hour long queries anymore, but still would be a great feature to have for long running, lower priority jobs.

Just need this to become available for RDS / Aurora.



I'm having a hard time understanding. I see that it's basically a queue. Request your query to be inserted, get a job_id back, poll that job_id until it is done.

Under the hood, is the query ran like a normal query?

I guess I don't understand the distinction between "run queries async" or "put query in a queue, run them sync one a time, poll until the one you want is done"

I get how the latter is "async".


It is similar functionality to https://docs.snowflake.com/en/developer-guide/sql-api/refere..., the `async` flag more specifically. You submit the query and are given the job id. A background worker runs it like a normal query, but the caller can go about their business while Postgres handles it for them.


Interesting! I’ve used materialized reviews for this in the past. I’m curious what the advantages to pg_later are over that.


I don't think you'd replace a materialized view with pg_later, but it might help you populate or update your materialized view if you are trying to do that asynchronously. pglater.exec() works with DDL too!

I use it a lot for long running queries when doing data science and machine learning work, and a lot of times when executing queries from a jupyter notebook or CLI. That way if my jupyter kernel dies, my query execution continues even if the network or my environment has an issue. I've started using it a bit more with https://github.com/postgresml/postgresml for model training tasks too, since those can be quite long running depending on the situation.


It should be helpful for maintenance and DDL tasks. Such as I need to build a secondary index over a large table and do something once the index is ready. Start the index creation with pg_later, check the status later and do what you need to do next. Probably, they can support triggers so that once a task is completed I can be notified.


We are planning to add support for "Push notifications for completed and failed jobs". That's in our roadmap listed here: https://tembo.io/blog/introducing-pg-later/#up-next


arangodb has had this for a long time

https://www.arangodb.com/docs/stable/http/general.html


Snowflake has too! I'm not familiar with arangodb but I'll check it out.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: