
TL;DR: Nanopore data has historically been lower quality than current gold-standard methods, but it is by no means "not viable" in a genomics pipeline. Their newer-chemistry flowcells are reportedly competitive with the current gold standard (but I've not seen that with my own eyes in the lab yet, due to the limited release).

There are two components that drive sequencing error rate. 1) The chemistry behind the sequencing (for nanopore sequencing this is the "feeding DNA through a pore" bit) 2) the method to convert raw signal into DNA sequence (this is called "base calling").

The gold standard in terms of error profile for sequencing is currently the Illumina short-read platform. Illumina machines are really just microscopes (TIRF scopes, for optics folks) that sequence DNA by visualizing the incorporation of dye-labeled nucleotides into the sequenced molecule(s) (imagine a really slow PCR [1]). Each base is labeled with a different color, so when a molecule incorporates the matching base it makes a colored spot on the slide that the machine can read (see here for more info & details of newer chemistries that use fewer colors [2]). This whole process is mediated by DNA polymerase, which itself has a very low error rate. Another important point is that DNA sequenced on the Illumina platform (called a "library") tends to be from "amplified" template DNA, meaning the DNA will have been processed and will potentially be missing chemical modifications on the bases that could be present in the organism. This works to Illumina's advantage, because when trying to answer the question "what is the DNA sequence?" we want the ground-truth DNA, not the modification state.

In contrast, Nanopore sequencing works by feeding a long strand of DNA through a pore and measuring the change in electrical current through the pore (watch the cool video [3]). For the current set of nanopore flowcells, 8 bases of DNA sit in the pore at a time, meaning the current at each timestep is determined by 8 nucleotides in aggregate. This also means that the pore "sees" each base 8 times, but always in the context of an additional 7. Basecalling from the raw signal is not as easy as saying "blue = A"; instead, you have to deconvolve each base from a complex signal. As you might imagine, the folks at Oxford Nanopore & the broader research community have turned to machine learning-based basecallers to solve this problem, and they work quite well [4]. But they are not perfect. Deconvolving runs of the same base (e.g. "AAAAAAA") is difficult because, without well-defined signal changes between bases, the caller has a hard time deciding how many bases it has seen. So a common error mode for nanopore sequencing is to create insertions/deletions at places in the genome with low nucleotide diversity. Another interesting source of error is that most Nanopore library preps are performed on unamplified DNA, so in addition to the normal A/T/G/C nucleotides, the template DNA can also contain bases with chemical modifications. For example, in bacteria, A's are often methylated, and in humans, C can have all kinds of different modifications (5-methyl-cytosine, 5-hydroxymethyl-cytosine, etc.), and each different modification affects the signal in the nanopore. Therefore, basecallers that weren't trained on modified bases will produce basecalling errors in the presence of base modifications.
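A toy Python sketch of the "8 bases in the pore" idea, in case it helps. It only slides a window over a string; a real pore produces a noisy, continuous current, and the function names here are made up for illustration. The window size of 8 is just the figure from the paragraph above.

```python
def kmers(seq: str, k: int = 8) -> list[str]:
    """All k-base windows, in the order they would sit in the pore."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def windows_containing(seq: str, pos: int, k: int = 8) -> int:
    """How many windows include the base at index `pos`."""
    first = max(0, pos - k + 1)
    last = min(pos, len(seq) - k)
    return max(0, last - first + 1)


def visible_transitions(seq: str, k: int = 8) -> int:
    """Count steps where the window content actually changes.

    Toy assumption: if two consecutive windows are identical, the signal
    does not change, so the caller gets no cue that a base went by.
    """
    w = kmers(seq, k)
    return sum(1 for a, b in zip(w, w[1:]) if a != b)


mixed = "ACGTTGCAGTCAGGCTA"
homopolymer_run = "ACGT" + "A" * 12 + "CGT"

print(windows_containing(mixed, 8))          # an interior base is seen k times
print(visible_transitions(mixed))            # every step changes the window
print(visible_transitions(homopolymer_run))  # some steps are invisible
```

In the homopolymer case several consecutive windows are all "AAAAAAAA", which is the toy version of why the caller struggles to count how many bases went through.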

Both Illumina and Nanopore basecallers assign a quality score to each base that indicates the probability that the basecaller produced an incorrect value. This is called a Q-score, which is defined as "Q = -10 * log10(P)", where P is the probability that the base call is wrong (i.e. Q / 10 = the order of magnitude of the error probability) [5]. For example, a Q-score of 10 means an error rate of 1 in 10, but a Q-score of 50 means an error rate of 1 in 100,000. For Illumina sequencing, >95% of the reads have a Q-score > 30 (i.e. fewer than 1 error in 1,000), while Nanopore reads tend to have lower average Q-scores (~Q20, i.e. 1 error in 100). For genetics, where a 1-base difference can mean the difference between a severe disease allele and a normal variant, 1 in 100 won't cut it.
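If you want to play with the Q-score arithmetic, here's a minimal Python sketch of the conversion in both directions. The function names and the 1,000,000-base example are mine, purely for illustration.

```python
import math


def q_to_error_prob(q: float) -> float:
    """Q-score -> probability that a base call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_q(p: float) -> float:
    """Probability that a base call is wrong -> Q-score."""
    return -10 * math.log10(p)


def expected_errors(q: float, n_bases: int) -> float:
    """Expected number of wrong calls if every base had the same Q-score."""
    return n_bases * q_to_error_prob(q)


for q in (10, 20, 30, 50):
    p = q_to_error_prob(q)
    print(f"Q{q}: about 1 error in {round(1 / p):,} bases")

# Hypothetical 1,000,000 bases of sequence at a uniform quality:
print(expected_errors(20, 1_000_000))
print(expected_errors(30, 1_000_000))
```

The uniform-quality assumption is a simplification: real reads carry a separate Q-score for every base, and quality usually varies along the read.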

The current-gen Nanopore flowcell chemistry (R9.4.1) is what most people mean when they talk about Nanopore error rates. But they've just released a new pore type & made some basecaller upgrades that improve the accuracy to what they call "Q20+", with some claims of Q>30. From the data I've seen, it's impressive; I just haven't got my hands on one yet to see for myself [6]. I think the comment saying "wait 5 years" is an overestimate, but if you want to genotype yourself today, I'd just pay someone for Illumina sequencing, and process the fastq files yourself if you really want to do it as a learning exercise.

I've unintentionally written an essay, so I'll stop here, but real quick on your other point RE: rerunning the sample N times & using the repeats for error correction. This won't work the way you're thinking, because a "sample" is actually a collection of DNA molecules that are sampled randomly by the sequencer. You have no way of knowing that the same read between runs was actually from the same molecule, so you can't error correct this way. Interestingly, a totally different sequencing platform from Pacific Biosciences does use a version of this strategy, by doing some really cool chemistry that lets it read the same molecule multiple times, but I'll spare you the second essay (google "PacBio HiFi" or "circular consensus reads" if you're interested).
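To give a feel for why repeated reads of the *same* molecule help, here's a toy Python sketch of a per-position majority vote. To be clear, this is not how PacBio's consensus calling actually works: I'm assuming the passes are already aligned, are the same length, and contain only substitution errors (no insertions/deletions), and the function name and example reads are made up.

```python
from collections import Counter


def majority_consensus(passes: list[str]) -> str:
    """Per-position majority vote over equal-length, pre-aligned passes."""
    if not passes:
        raise ValueError("need at least one pass")
    if len({len(p) for p in passes}) != 1:
        raise ValueError("toy model: all passes must be the same length")
    # zip(*passes) yields one column of base calls per position.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*passes))


# Five hypothetical passes over one molecule, each with at most one error.
passes = [
    "ACGTAC",
    "ACCTAC",  # error at position 2
    "ACGTAC",
    "ACGTTC",  # error at position 4
    "ACGTAC",
]
print(majority_consensus(passes))
```

The vote only works because every pass comes from one molecule, so the errors are independent noise around a single true sequence. Reads from separate runs of a sample don't give you that guarantee.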

[1] https://en.wikipedia.org/wiki/Polymerase_chain_reaction

[2] https://www.ecseq.com/support/ngs/do-you-have-two-colors-or-...

[3] https://www.youtube.com/watch?v=RcP85JHLmnI

[4] This paper is a tad out of date, but Ryan Wick always writes extremely clear papers: https://genomebiology.biomedcentral.com/articles/10.1186/s13...

[5] https://www.illumina.com/documents/products/technotes/techno...

[6] https://nanoporetech.com/about-us/news/oxford-nanopore-tech-...

Edit: reformatted links for clarity.



I for one am glad you wrote the essay, this was incredibly informative and filled in a bunch of blanks I had after reading what I could scratch together on the MinION product. I think I'm in a partial state of shock at how accessible this is becoming. Thank you!


Thanks - fascinating stuff. I'm now even more convinced I want to give it a try, but I think I'll play around with public data and tutorials before leaping into home sequencing.


You totally should, it's a lot of fun. I'd suggest trying to find some bacterial genome sequencing (like E. coli) done on nanopore if you're interested in those data. I don't have a link handy right now, otherwise I'd post it here, but assembling bacterial genomes is shockingly easy these days and doesn't need nearly as many resources as doing a human genome, so it's great for learning (I love the assembler Flye [1] for this).

And RE: home sequencing, honestly the hardest part for a beginner will likely be the sample prep, since that takes some combination of wet lab experience and expensive equipment. I really wish molecular biology was as simple to get hacking on as writing software. The lag time between doing an experiment and getting a result is so much longer than waiting for things to compile, it just makes improving your skills take longer.

[1] https://github.com/fenderglass/Flye



