
My biggest question is why on earth did they give you 25 TB of TSV genetic data? :-)

I'm not sure what your sample was, but it seems like it would have been better to use one of the specialized binary file formats for genetic data. You wrote SNP chips, but to get to 25 TB I assume there must be imputed calls, so something like BGEN might have been a lot easier.

This is speculation of course, I'm not sure exactly what your situation was.




You're correct that, for the final discrete or probabilistic variant calls, there are far better data formats. However, it's clear that Nick's lab currently wants to work with raw intensity readings.

My main practical recommendation for Nick is to become familiar with bgzip and zstd. bgzip sacrifices a little bit of compression efficiency relative to plain gzip, but in exchange it solves the more important problems of (i) letting you take advantage of all your cores when decompressing and (ii) supporting random-access reads with an appropriate index, while remaining compatible with all .gz-reading programs. When backward compatibility is unimportant, zstd tends to compress and decompress much faster than gzip at the same compression ratio.
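
To make the bgzip trade-off concrete, here is a rough Python sketch of the underlying idea: compress the data as many small, independent gzip members and keep an index of where each one starts. This is not the real BGZF layout (bgzip records block sizes inside the gzip header and uses its own index files), and the names write_blocked_gzip, read_at and BLOCK_SIZE are made up for illustration. It only shows why a blocked file stays readable by ordinary gzip tools while also allowing random access.

```python
import bisect
import gzip
import io

BLOCK_SIZE = 64 * 1024  # uncompressed bytes per block (illustrative choice)


def write_blocked_gzip(data: bytes, block_size: int = BLOCK_SIZE):
    """Compress `data` as concatenated gzip members.

    Returns (blob, index), where index is a list of
    (uncompressed_offset, compressed_offset) pairs, one per block.
    """
    out = io.BytesIO()
    index = []
    for start in range(0, len(data), block_size):
        index.append((start, out.tell()))
        # Each block is a complete, independent gzip member.
        out.write(gzip.compress(data[start:start + block_size]))
    return out.getvalue(), index


def read_at(blob: bytes, index, offset: int, length: int) -> bytes:
    """Return `length` uncompressed bytes starting at `offset`,
    decompressing only the blocks that overlap the requested range."""
    if not index or length <= 0:
        return b""
    starts = [u for u, _ in index]
    first = max(bisect.bisect_right(starts, offset) - 1, 0)
    end = offset + length
    chunks = []
    i = first
    while i < len(index) and index[i][0] < end:
        c_start = index[i][1]
        c_end = index[i + 1][1] if i + 1 < len(index) else len(blob)
        chunks.append(gzip.decompress(blob[c_start:c_end]))
        i += 1
    joined = b"".join(chunks)
    rel = offset - index[first][0]
    return joined[rel:rel + length]


if __name__ == "__main__":
    # Synthetic TSV-like rows, purely as stand-in data.
    rows = b"".join(f"snp{i}\t{i * 0.5}\n".encode() for i in range(50000))
    blob, index = write_blocked_gzip(rows)
    # A plain gzip reader still sees one continuous stream.
    assert gzip.decompress(blob) == rows
    # Random access touches only the blocks it needs.
    assert read_at(blob, index, 100000, 50) == rows[100000:100050]
```

Because every block is independent, the blocks can also be handed to separate worker processes for decompression, which is where the multi-core benefit comes from. The cost is the slightly worse compression ratio mentioned above, since each block starts with no shared history.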


If you unpack all of https://files.pushshift.io/reddit/comments/ you have many TB of JSON files that are just dumps of API responses whose schema slowly changes over the years. It's also an incredibly useful dataset.

In the end CPUs are fast enough and compression algorithms good enough that I would argue it doesn't really matter what format you use for storage, as long as it's reasonably easy to read back.


In genomics, there are by now decades of work behind high-performance file formats, and large ecosystems of tools have grown up around them. A lot of bioinformatics is really just manipulating these files, so using a supported file format makes a big difference.


It's a plain chip, just one specially made for our institution by Illumina. As to why they would deliver it as TSVs, that I can't answer.



