
jakebol · on Aug 7, 2020

How is ancestry.com a startup at this point? It was founded in 1996.

jakebol · on July 20, 2020

Most every (analytic) RDBMS can model sparse arrays. A sparse array is modeled by defining a clustered index on the table's "array" dimensions and defining a uniqueness constraint on that clustered index. This works well with columnar storage because the data needs to have (and is assumed to naturally have) a total sort order on the dimensions. Ex. Vertica, ClickHouse, BigQuery... all allow you to do this. TileDB allows for efficient range queries through an R-tree-like index on the specified dimensions.

Most real-world data, though, is messy, and defining a uniqueness constraint upfront (upon ingestion) is often limiting, so for practical use cases this gets relaxed to a multi-set rather than a sparse array model for storage, with uniqueness imposed in some way after the fact (if required).
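
To make the modeling concrete, here is a minimal sketch in Python rather than SQL. It assumes a 2-D array whose cells are kept sorted by coordinate, with the coordinate pair acting as the unique clustered key. The class and method names (`SparseArray2D`, `insert`, `range_x`) are hypothetical and are not taken from Vertica, ClickHouse, BigQuery, or TileDB; the `allow_duplicates` flag stands in for the relaxed multi-set model.

```python
import bisect


class SparseArray2D:
    """Cells kept sorted by (x, y); the coordinate pair acts as the clustered key."""

    def __init__(self, allow_duplicates=False):
        # allow_duplicates=True relaxes the array to a multi-set (illustrative flag).
        self.allow_duplicates = allow_duplicates
        self.coords = []  # sorted list of (x, y) tuples
        self.values = []  # values aligned with self.coords

    def insert(self, x, y, value):
        key = (x, y)
        i = bisect.bisect_left(self.coords, key)
        exists = i < len(self.coords) and self.coords[i] == key
        if exists and not self.allow_duplicates:
            # This is the uniqueness constraint on the dimensions.
            raise ValueError(f"duplicate coordinate {key}")
        # Insert after any equal keys so duplicates keep their arrival order.
        j = bisect.bisect_right(self.coords, key)
        self.coords.insert(j, key)
        self.values.insert(j, value)

    def range_x(self, lo, hi):
        """Return all cells with lo <= x <= hi, relying on the total sort order."""
        start = bisect.bisect_left(self.coords, (lo, float("-inf")))
        stop = bisect.bisect_right(self.coords, (hi, float("inf")))
        return list(zip(self.coords[start:stop], self.values[start:stop]))
```

With the constraint on, a second write to the same `(x, y)` is rejected; with `allow_duplicates=True` both cells are stored and a range query returns both.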

Shelnutt2 · on July 20, 2020

I agree that in many use cases of sparse data, uniqueness of the dimensions can't be guaranteed, or you might not want to enforce it. With the recent TileDB 2.0 release we introduced support for duplicates in sparse arrays, which adds support for multi-sets[1].

[1] https://github.com/TileDB-Inc/TileDB/pull/1504

jakebol · on July 20, 2020

Just to note that the language "duplicates in sparse arrays" doesn't make sense: if you allow for duplicates, it is no longer an array by definition.

DaiPlusPlus · on July 20, 2020

Arrays (vectors) can have duplicates - don’t you mean a set?

evanpw · on July 20, 2020

I think the duplicates are in the coordinate dimension, like x[1] having more than one value, rather than x[1] and x[2] having the same value.
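
A small sketch of that distinction, using made-up `(coordinate, value)` cells. The helper names are hypothetical, and "last write wins" is just one assumed way to impose uniqueness after the fact.

```python
# Hypothetical cells as (coordinate, value) pairs.
value_dupes = [(1, 5.0), (2, 5.0)]  # x[1] and x[2] hold the same value: still an array
coord_dupes = [(1, 5.0), (1, 7.0)]  # x[1] has two values: a multi-set


def has_coordinate_duplicates(cells):
    """True if any coordinate appears more than once."""
    seen = set()
    for coord, _ in cells:
        if coord in seen:
            return True
        seen.add(coord)
    return False


def dedupe_last_write_wins(cells):
    """Impose uniqueness after the fact by keeping the last value per coordinate."""
    latest = {}
    for coord, value in cells:
        latest[coord] = value
    return sorted(latest.items())
```

Only `coord_dupes` trips the check; repeated values at different coordinates are fine in an ordinary array.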

jakebol · on July 19, 2020

Unfortunately "it's just geography" is kind of one of the talking points for not really addressing the problem. Although true, concerted reductions in pollution have happened when there was political will to make it happen (mostly through the federal gov. / EPA clean air regulations).

Ogden and Provo are some of the worst offenders for per-household air pollution emissions. Like many western cities they have longish commutes (everywhere) in large cars (trucks / SUVs), with a high number of cars per household and an almost non-functional public transport system. For the Salt Lake metro area, per capita carbon emissions doubled between 1980 and 2015 because of increasing sprawl. Air regulations here are spotty for personal vehicles and, I'm guessing, almost non-existent for commercial vehicles. Oh, and the state government's solution to this is to push a publicly subsidized "inland port" that will bring increased truck and rail traffic to the valley. The leaders of these tech companies are starting to point out that terrible air pollution for parts of the year is hurting recruitment, so it seems like, as the money flows into this sector, maybe there will be political will on the state and local side to address some of these issues.

jakebol · on June 15, 2020

This is a good description, except that TileDB (the open source client) is not transactional but eventually consistent at least for S3 and other object stores.

I like your point about consuming S3 cleverly; it's often difficult to get good out-of-the-box performance from S3, so abstracting that to the degree possible is good for end users. The cloud vendors, though, are always one or two steps ahead of companies that build upon their services. AWS Redshift, for instance, can already pre-index objects stored on S3 to accelerate queries at the storage layer. It's difficult as a third-party vendor to compete with that.

biggestlou · on June 15, 2020

This is a very interesting development that I'd like to learn more about. Whenever I've played around with writing databases (just as toy projects) I've always done so using RocksDB or something similar as a backend. This "thick client" model, though, seems to have a lot of potential benefits, most notably no need to worry about disk space or volumes (so say goodbye to a bunch of config parameters) and no need for a tiered storage setup or S3 migration tools (already accomplished!). Not ideal for most use cases but intriguing for some!

jakebol · on June 15, 2020

There are a lot of issues with S3, though: latency, poor performance for small reads / writes, timeouts, API rate limits, API costs, and consistency issues poorly understood by third-party developers.

A "thick-client" also doesn't perform well unless that client is located on a node in the same region. I think as with everything it works well in some cases and not well in others.

manigandham · on June 15, 2020

It's not so difficult if you control the data. Snowflake offers a relational datawarehouse built on EC2/S3 (and now other clouds) with its own column-oriented data format (a hybrid called PAX). It can seek to the right columns and rows by getting the exact byte ranges from an S3 object.
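
A toy sketch of the byte-range idea, under loud assumptions: this is not Snowflake's actual format or PAX, the layout (column chunks, then a JSON index, then an 8-byte footer length) is invented for illustration, and `ranged_get` is a stand-in that slices an in-memory blob instead of issuing an HTTP GET with a `Range: bytes=start-end` header against an object store.

```python
import json
import struct


def write_object(columns):
    """Pack int64 columns into one blob: [column chunks][JSON index][8-byte index length]."""
    body = b""
    index = {}
    for name, values in columns.items():
        chunk = struct.pack(f"<{len(values)}q", *values)  # little-endian int64
        index[name] = [len(body), len(body) + len(chunk) - 1]  # inclusive byte range
        body += chunk
    footer = json.dumps(index).encode()
    return body + footer + struct.pack("<Q", len(footer))


def ranged_get(blob, start, end):
    """Stand-in for a ranged GET on an object (end is inclusive, as in HTTP Range)."""
    return blob[start:end + 1]


def read_column(blob, name, size):
    """Fetch one column with three small reads instead of downloading the object."""
    # 1) The last 8 bytes give the index length.
    (footer_len,) = struct.unpack("<Q", ranged_get(blob, size - 8, size - 1))
    # 2) Read only the index.
    index = json.loads(ranged_get(blob, size - 8 - footer_len, size - 9))
    # 3) Read only the requested column's bytes.
    start, end = index[name]
    raw = ranged_get(blob, start, end)
    return list(struct.unpack(f"<{len(raw) // 8}q", raw))
```

Against real S3 the same three reads would each be a ranged request (for example boto3's `get_object` accepts a `Range` argument), which is also where the small-read latency and per-request costs come into play.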

jakebol · on June 15, 2020

This is true (and a property of all? cloud formats: Delta, Hudi, Iceberg, Parquet, etc.)

I was referring more to the fact that the cloud vendors can co-design their infrastructure and software to support their database services.

jakebol · on June 1, 2020

Underrated it is not; just search for Cottonwood Canyon traffic jams to see what skiing really is like here when the snow flies. What is 30 minutes with no traffic can easily be 3+ hours now.

gboss · on June 1, 2020

I'll take three hours over 12 hours which is how long it can take to get to Tahoe from SF on a Friday when it's snowing.

2trill2spill · on June 1, 2020

Exactly. A normal drive from my apartment to Brighton or Snowbird is about 45-50 minutes, but several times this past season it was just short of 3 hours one way.

jakebol · on May 6, 2020

Moral of the story was that his dreams came true?

https://twitter.com/mikeandersonKSL/status/12577783769258024...

jakebol · on Jan 24, 2020

they already provide this as a service

jakebol · on Jan 9, 2020

Funny that he mentions Salt Lake / the Salt Lake Valley as the only other experience with significant levels of air pollution. I often wonder how "silicon slopes" companies are able to attract people here; they must never interview during the winter (particulate) or summer (ozone / smoke).

jakebol · on Oct 20, 2019

Made the jump from OSX. For scientific users who don't want to endlessly chase a working laptop setup, PopOS and its NVIDIA driver support (on a ThinkPad) have been fantastic. Highly recommend as well.

jakebol · on Aug 20, 2019

Usually original investors get stock in the split-up companies, correct? In the long run the aggregate value of the split-up / smaller basket of independent companies could be larger than that of the original quasi-monopoly.