
There were changes made to idle sessions in postgres 14.0 that were supposed to reduce the resource usage of open but idle connections.

Crunchy Data mentioned it on their blog a while back (https://www.crunchydata.com/blog/five-tips-for-a-healthier-p...), and the PG 14 release notes mention a few changes to idle sessions (https://www.postgresql.org/docs/release/14.0/).

I don't know if they were enough to make pgbouncer unnecessary; I haven't had a need to try it.
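If you want to see how much this matters for your workload, one quick check is how many open-but-idle sessions the server is holding, which is what pgbouncer (or the PG 14 improvements) is meant to help with. A minimal sketch using the node-postgres `pg` client; the connection string is a placeholder:

```typescript
import { Client } from "pg";

// Count server-side sessions that are open but idle (placeholder connection string).
async function idleConnectionCount(): Promise<number> {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();
  try {
    const res = await client.query(
      "SELECT count(*) AS n FROM pg_stat_activity WHERE state = 'idle'"
    );
    return Number(res.rows[0].n);
  } finally {
    await client.end();
  }
}
```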


Redshift speaks the postgres protocol, so you might be able to use postgres as a stand-in. There are a few purpose-built docker images (googleable) that may replicate Redshift slightly better than just `docker run -p 127.0.0.1:5439:5432 postgres:latest`, but if you're at the point of having a test suite for your data warehouse code, you're likely using Redshift-specific features and postgres won't suffice.

I have seen teams give each developer a personal schema on a dev cluster, to ensure their Redshift SQL actually works. The downside is that now your tests are non-local, so it's a real tradeoff. In CI you probably connect to a real test cluster.
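For the local-postgres route, a minimal smoke-test sketch might look like the following; the credentials are made up and the published port matches the docker command above:

```typescript
import { Client } from "pg";

// Point the test at a local postgres container published on Redshift's default
// port (5439), e.g. started with:
//   docker run -p 127.0.0.1:5439:5432 -e POSTGRES_PASSWORD=test postgres:latest
async function smokeTest(): Promise<void> {
  const client = new Client({
    host: "127.0.0.1",
    port: 5439,
    user: "postgres",
    password: "test",
    database: "postgres",
  });
  await client.connect();
  try {
    const res = await client.query("SELECT 1 AS ok");
    if (res.rows[0].ok !== 1) throw new Error("unexpected result");
  } finally {
    await client.end();
  }
}

smokeTest().catch((err) => {
  console.error(err);
  process.exit(1);
});
```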


I thought about that a while back: how would I build an index on top of Durable Objects?

For sorted indexes like a b-tree in a database, I think you would partition into objects by value, so (extremely naive example) values starting with a-m would be in one object and n-z in another. You'd end up needing a metadata object to track the partitions, and some reasonably complicated ability to grow and split partitions as you add more data, but this is a relatively mature and well-researched problem space; databases do this for their indexes.
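A minimal Worker-side routing sketch for that naive two-partition split could look like this; the `INDEX_PARTITION` binding name and the hard-coded split point are assumptions, and a real version would look up split points in the metadata object:

```typescript
// Types like DurableObjectNamespace come from @cloudflare/workers-types.
export interface Env {
  INDEX_PARTITION: DurableObjectNamespace;
}

function partitionFor(key: string): string {
  // Static split: a-m in one object, n-z in the other.
  return key.toLowerCase() < "n" ? "a-m" : "n-z";
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);
    const key = url.searchParams.get("key") ?? "";
    const id = env.INDEX_PARTITION.idFromName(partitionFor(key));
    const stub = env.INDEX_PARTITION.get(id);
    // Forward the request to the partition that owns this key range.
    return stub.fetch(request);
  },
};
```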

For full text search, particularly if you want to combine terms, you might have to partition by document, though. So you'd have N durable objects which comprise the full text "index", and each would contain 1/N of the documents you're indexing, and you'd build the full text index in each of those. If you searched for docs containing the words "elasticsearch" and "please" you would have to fan out to all the partitions and then aggregate responses.
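A rough sketch of that fan-out, assuming a fixed partition count and a made-up `/search` contract on each durable object; each partition evaluates the whole query against its own documents and returns matching doc ids, and the caller just merges:

```typescript
export interface Env {
  FTS_PARTITION: DurableObjectNamespace;
}

const PARTITIONS = 8; // arbitrary for the sketch

export async function search(env: Env, terms: string[]): Promise<string[]> {
  const query = encodeURIComponent(terms.join(" "));
  const responses = await Promise.all(
    Array.from({ length: PARTITIONS }, (_, i) => {
      const id = env.FTS_PARTITION.idFromName(`partition-${i}`);
      return env.FTS_PARTITION
        .get(id)
        .fetch(`https://fts/search?q=${query}`)
        .then((r) => r.json() as Promise<string[]>);
    })
  );
  // Each partition returns the doc ids it matched; concatenate and dedupe.
  return [...new Set(responses.flat())];
}
```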

You could go the other way and partition by value again, but that makes ANDs (for example) more challenging; those would have to happen at response-aggregation time in some way.

You'd do the stemming at index time and at search time, like Solr does.
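Something like this, where the stemmer is just a stand-in rather than a real algorithm; the important part is that the same analysis runs on documents at index time and on the query at search time:

```typescript
// If "searching" stems to "search" when indexing, the query must stem the
// same way or it will never match.
const stem = (token: string): string => token.replace(/(ing|ed|es|s)$/, "");

const analyze = (text: string): string[] =>
  text.toLowerCase().split(/\W+/).filter(Boolean).map(stem);

// Index side: analyze(doc.body) -> terms stored in the posting lists.
// Query side: analyze(userQuery) -> terms looked up in those same lists.
```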

I have no idea what the right number of documents per partition would be; it would probably depend on the size of the documents, the number of documents, and how heavily you'll be searching them, since each durable object is single-threaded. Adding right truncation or left+right truncation will blow up the index size, so that would probably drive up the partition count. You might be better off doing trigrams or something like them at that point, but I'm not as familiar with those.
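For reference, a quick sketch of the trigram idea (the padding and casing choices here are arbitrary): index each document under its 3-character substrings, so truncated matches become intersections of posting lists instead of an exploded prefix/suffix index.

```typescript
function trigrams(text: string): Set<string> {
  const padded = `  ${text.toLowerCase()} `; // pad so word edges produce trigrams too
  const grams = new Set<string>();
  for (let i = 0; i + 3 <= padded.length; i++) {
    grams.add(padded.slice(i, i + 3));
  }
  return grams;
}

// trigrams("elasticsearch") -> { "  e", " el", "ela", "las", ..., "ch " }
```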

This is where optimizing would be hard. I don't think you can get from Durable Objects the kind of detailed CPU/disk IO stats you really need to optimize this kind of search engine data structure.


You’re better off creating Lucene segments in R2 and letting Lucene access them remotely (if Lucene could run in a Worker as WASM), or something very like Lucene but compiled to WASM.

You’d also need to manage the Lucene segment or Solr/Elasticsearch shard metadata in Workers KV. You’d need a pool of Workers acting as coordination nodes, another pool as “data nodes”/shards, and a non-Workers pool creating and uploading Lucene segments to R2.

It shouldn’t be so hard to do, actually. Cloudflare would need more granular knobs for customers to fine-tune R2 replication so it’s colocated with the Worker execution locations and is therefore really fast.
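A sketch of what the Worker-side read path might look like, with made-up binding names, KV key layout, and metadata shape; the WASM Lucene-like reader that would consume the bytes is out of scope:

```typescript
// Shard metadata in Workers KV points at Lucene segment files stored in R2;
// the Worker range-reads only the slice of the segment the reader asked for.
export interface Env {
  SEGMENTS: R2Bucket;
  SHARD_META: KVNamespace;
}

interface ShardMeta {
  shard: string;
  segmentKeys: string[]; // object keys in R2, e.g. "index-v3/_42.cfs" (illustrative)
}

export async function readSegmentSlice(
  env: Env,
  indexName: string,
  offset: number,
  length: number
): Promise<ArrayBuffer | null> {
  const meta = await env.SHARD_META.get<ShardMeta>(`shards:${indexName}:0`, "json");
  if (!meta || meta.segmentKeys.length === 0) return null;

  const obj = await env.SEGMENTS.get(meta.segmentKeys[0], {
    range: { offset, length },
  });
  return obj ? obj.arrayBuffer() : null;
}
```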


Uber Engineering open-sourced Kraken [1], their peer-to-peer docker registry. I remember it originally used the BitTorrent protocol, but their readme now says it is "based on BitTorrent" due to different tradeoffs they needed to make.

As far as I know, there aren't any projects doing peer-to-peer distribution of container images to servers, probably because it's useful to be able to use a stock docker daemon on your server. The Kraken page references Dragonfly [2], but I haven't grokked it yet; it might be doing that.

It has seemed strange to me that the docker daemon I run on a host can't also act as a registry, if I want to enable that feature.

It's also possible that in practice you'd want your CI nodes optimized for compute because they're doing a lot of work, your registry hosts for bandwidth, and your servers again for compute. Having one daemon to rule them all seems elegant but is actually overgeneralized; specialization is better.

[1] https://github.com/uber/kraken

[2] https://d7y.io/


Like -punk, it has been suffixed onto other words to essentially mean "genre".


I think it's useful to think of studies like this as an MVP. They're relatively cheap and fast. If they give a null result, you've failed fast. If they give a non-null result, then you can invest in further study and iterate.


As opposed to an MVP, such a study can give you any result you want.

