jakebol's comments

How is ancestry.com a startup at this point? It was founded in 1996.


Almost every (analytic) RDBMS can model sparse arrays. A sparse array is modeled by defining a clustered index on the table's "array" dimensions and a uniqueness constraint on that clustered index. This works well with columnar storage because the data needs to have (and is assumed to naturally have) a total sort order on the dimensions. Vertica, ClickHouse, BigQuery, etc. all allow you to do this. TileDB allows for efficient range queries through an R-Tree-like index on the specified dimensions.
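A minimal sketch of that modeling, using SQLite from the Python standard library as a stand-in (it is not a columnar analytic engine). The table name `cells` and the dimensions `x` and `y` are made up for illustration. In SQLite a `WITHOUT ROWID` table is stored as a clustered index on its primary key, so the primary key supplies both the sort order and the uniqueness constraint on the dimensions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Dimensions (x, y) form the primary key. WITHOUT ROWID stores the rows
# in primary-key order, i.e. as a clustered index on the dimensions.
conn.execute(
    """
    CREATE TABLE cells (
        x INTEGER NOT NULL,
        y INTEGER NOT NULL,
        value REAL,
        PRIMARY KEY (x, y)
    ) WITHOUT ROWID
    """
)

# Only the non-empty cells are stored, which is what makes the array sparse.
conn.executemany(
    "INSERT INTO cells VALUES (?, ?, ?)",
    [(5, 1, 0.5), (0, 3, 1.5), (2, 2, 2.5)],
)

# A second value at an existing coordinate violates the uniqueness constraint.
try:
    conn.execute("INSERT INTO cells VALUES (2, 2, 9.9)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Range query over the leading dimension; rows come back in dimension order.
window = conn.execute(
    "SELECT x, y, value FROM cells WHERE x BETWEEN 0 AND 2 ORDER BY x, y"
).fetchall()
```

The systems named above have their own syntax for sort keys and constraints; this only shows the shape of the idea.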

Most real-world data is messy, though, and defining a uniqueness constraint upfront (upon ingestion) is often limiting. So for practical use cases this gets relaxed to a multi-set rather than a sparse array model for storage, with uniqueness imposed in some way after the fact (if required).
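A rough sketch of the relaxed model, in plain Python. The records and the "last write wins" rule are assumptions chosen for illustration; any other resolution rule (first write, max, average) would fit the same shape:

```python
from collections import defaultdict

# Hypothetical ingested records as (coordinates, value), in arrival order.
# The coordinate (2, 2) appears twice, so this is a multi-set, not an array.
records = [((0, 3), 1.5), ((2, 2), 2.5), ((2, 2), 9.9), ((5, 1), 0.5)]


def group_by_coords(records):
    """Multi-set view: every coordinate maps to all values ingested for it."""
    cells = defaultdict(list)
    for coords, value in records:
        cells[coords].append(value)
    return dict(cells)


def last_write_wins(records):
    """Impose uniqueness after the fact by keeping the latest value per coordinate."""
    return {coords: values[-1] for coords, values in group_by_coords(records).items()}


multiset = group_by_coords(records)
array = last_write_wins(records)
```

Here `multiset[(2, 2)]` holds both ingested values, while `array` has exactly one value per coordinate.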


I agree that in many use cases of sparse data, uniqueness of the dimensions can't be guaranteed, or you might not want to enforce it. With the recent TileDB 2.0 release we introduced support for duplicates in sparse arrays, which adds support for multi-sets [1].

[1] https://github.com/TileDB-Inc/TileDB/pull/1504


Just to note that the language "duplicates in sparse arrays" doesn't make sense: if you allow duplicates, it is no longer an array by definition.


Arrays (vectors) can have duplicates - don’t you mean a set?


I think the duplicates are in the coordinate dimension, like x[1] having more than one value, rather than x[1] and x[2] having the same value.


Unfortunately "it's just geography" is kind of one of the talking points for not really addressing the problem. Although it's true, concerted reductions in pollution have happened when there was political will to make them happen (mostly through the federal gov. / EPA clean air regulations).

Ogden and Provo are some of the worst offenders for per-household air pollution emissions. Like many western cities they have longish commutes (everywhere) in large cars (trucks / SUVs), with a high number of cars per household and an almost non-functional public transport system. For the Salt Lake Metro area, per capita carbon emissions doubled between 1980 and 2015 because of increasing sprawl. Air regulations here are spotty for personal vehicles and, I'm guessing, almost non-existent for commercial vehicles. Oh, and the state government's solution to this is to push a publicly subsidized "inland port" that will bring increased truck and rail traffic to the valley. The leaders of these tech companies are starting to point out that terrible air pollution for parts of the year is hurting recruitment, so it seems like as the money flows into this sector maybe there will be political will on the state and local side to address some of these issues.


This is a good description, except that TileDB (the open source client) is not transactional but eventually consistent, at least for S3 and other object stores.

I like your point about consuming S3 cleverly. It's often difficult to get good out-of-the-box performance from S3, so abstracting that to the degree possible is good for end users. The cloud vendors, though, are always one or two steps ahead of companies that build upon their services. AWS Redshift, for instance, can already pre-index objects stored on S3 to accelerate queries at the storage layer. It's difficult as a third-party vendor to compete with that.


This is a very interesting development that I'd like to learn more about. Whenever I've played around with writing databases (just as toy projects) I've always done so using RocksDB or something similar as a backend. This "thick client" model, though, seems to have a lot of potential benefits, most notably no need to worry about disk space or volumes (so say goodbye to a bunch of config parameters) and no need for a tiered storage setup or S3 migration tools (already accomplished!). Not ideal for most use cases but intriguing for some!


There are a lot of issues with S3, though: latency, poor performance for small reads / writes, timeouts, API rate limits, API costs, and consistency issues that are poorly understood by third-party developers.

A "thick-client" also doesn't perform well unless that client is located on a node in the same region. I think as with everything it works well in some cases and not well in others.


It's not so difficult if you control the data. Snowflake offers a relational data warehouse built on EC2/S3 (and now other clouds) with its own column-oriented data format (a hybrid called PAX). It can seek to the right columns and rows by getting the exact byte ranges from an S3 object.
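To make the byte-range idea concrete, here is a toy sketch. It is not Snowflake's format: the layout (fixed-width int64 columns packed back to back, with a name-to-offset index) and every function name are assumptions for illustration. An in-memory `bytes` object stands in for the S3 object, and `ranged_get` stands in for an HTTP ranged GET:

```python
import struct

WIDTH = 8  # bytes per int64 cell in this toy layout


def build_object(columns):
    """Pack int64 columns back to back; return (blob, index of name -> byte offset)."""
    blob = b""
    index = {}
    for name, values in columns.items():
        index[name] = len(blob)
        blob += struct.pack(f"<{len(values)}q", *values)
    return blob, index


def ranged_get(blob, start, end):
    """Stand-in for a ranged GET (Range: bytes=start-end, end inclusive)."""
    return blob[start:end + 1]


def read_cells(blob, index, column, first_row, last_row):
    """Fetch only the bytes for rows first_row..last_row of one column."""
    start = index[column] + first_row * WIDTH
    end = index[column] + (last_row + 1) * WIDTH - 1
    raw = ranged_get(blob, start, end)
    return list(struct.unpack(f"<{len(raw) // WIDTH}q", raw))


blob, index = build_object({"id": [1, 2, 3, 4], "price": [10, 20, 30, 40]})
cells = read_cells(blob, index, "price", 1, 2)  # reads 16 bytes, not the whole object
```

The point is that once the writer controls the layout and keeps an index of offsets, a reader can compute the exact range it needs and skip everything else in the object.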


This is true (and a property of all? cloud formats: Delta, Hudi, Iceberg, Parquet, etc.)

I was referring more to the fact that the cloud vendors can co-design their infrastructure and software to support their database services.


Underrated it is not. Just search for Cottonwood Canyon traffic jams to see what skiing is really like here when the snow flies. What is 30 minutes with no traffic can easily be 3+ hours now.


I'll take three hours over 12 hours, which is how long it can take to get to Tahoe from SF on a Friday when it's snowing.


Exactly. A normal drive from my apartment to Brighton or Snowbird is about 45-50 minutes, but several times this past season it was just short of 3 hours one way.


Moral of the story was that his dreams came true?

https://twitter.com/mikeandersonKSL/status/12577783769258024...


they already provide this as a service


Funny that he mentions Salt Lake / the Salt Lake Valley as the only other experience with significant levels of air pollution. I often wonder how "silicon slopes" companies are able to attract people here; they must never interview during the winter (particulate) or summer (ozone / smoke).


Made the jump from OSX. For scientific users who don't want to endlessly chase a working laptop setup, PopOS and its NVIDIA driver support (on a ThinkPad) have been fantastic. Highly recommend as well.


Usually original investors get stock in the split-up companies, correct? In the long run the aggregate value of the split-up / smaller basket of independent companies could be larger than that of the original quasi-monopoly.

