I'm really curious about what I could learn by getting my DNA sequenced, but if I had someone else do it for me I'd worry about my right not to have it recorded and shared without my consent - so any advance toward an affordable home test setup is very welcome.
Protection from this comes from laws that ban DNA-based policies, not from being secretive about sequencing. If it is allowed, insurers will have no need to obtain DNA sequences in devious ways; they will just ask, and refuse cover or charge more when clients refuse to get sampled.
“Passed in 2008, a federal law called the Genetic Information Nondiscrimination Act (GINA) made it illegal for health insurance providers in the United States to use genetic information in decisions about a person's health insurance eligibility or coverage.”
Also prevents employment discrimination based on genetics.
Oh sweet summer child. The astute business person will construct a score that happens to correlate with these known genetic defects and then sell it to insurers anyway, with some plausibly innocent correlated data source as cover.
That really isn't how most health insurance works in the US now. As far as I know, there really is no such thing as refusing health insurance to an eligible person. Now other types of insurance, like life, home, and auto, are a different story. But regular health insurance just has to accept your application.
Perhaps health insurance companies can't do this, but I know for absolute certain no one is looking closely enough at every little company's hiring decisions to find out if someone is doing this.
Insurers have auditing requirements to prove what goes into the policy calculation. It is impossible to hide illegal data use at any meaningful scale, and no insurance agency is looking to save a buck on a small number of clients.
I am absolutely sure no one would have called mortgages "unregulated" in 2008. That the regulation was insufficient was only determined later - and way too late.
You resolve part of them, but immediately generate others. Hybrid systems are the way to go.
In Spain, for example, we have a public system, but it is extremely inefficient in some areas (and very good in others). Of course, you can have private insurance, but you still have to pay your social security. Curiously, the only ones who can decide which system they want are the public servants...
You avoid the problem with medical debt, to be precise.
You cannot really avoid the fundamental constraints - anywhere in the world, there are only so many doctors and so much money available for treatments. IDK if USA has a shortage of doctors, but plenty of European countries do. A country like Romania just cannot give its doctors big enough wages to stop them from seeking employment elsewhere, where they will get five to ten times as much (UK, Germany, Switzerland). As a result, local hospitals are seriously understaffed.
Where I live, having personal connections to good doctors gives you an advantage - you will be examined and treated faster. Then there is outright nepotism.
The outgroups are different than in America, but there are always people for whom the system sucks.
That's a pretty poor way of pigeonholing the problem. Looking at the US healthcare system, it's obvious that many doctors' and nurses' talents are wasted doing bureaucratic paperwork. Simultaneously, if there is a genuine lack of healthcare providers, there is no price signal that would encourage more to enter the market.
What you say may be somewhat true in the context of transmuting the US's "private" bureaucracy into bona fide "government". But it's certainly not a "fundamental constraint" that's impossible to solve. Rather it's a failure of organization, whether critiqued in terms of bottom-up market failure or top-down governance failure.
> Looking at the US healthcare system, it's obvious that many doctors' and nurses' talents are wasted doing bureaucratic paperwork.
This is incorrect. Most of the paperwork is done by administrative staff. Paying for that giant staff + the actual medical professionals is why things are so expensive.
Hospitals are not stupid, they won’t waste their most valuable resource (healthcare time) on bureaucratic paperwork.
An oncology department I'm familiar with has an entire "nurse navigator" whose whole job is to submit "prior approval" requests to "insurance" companies justifying why patients need a specific treatment, plus the nurses employed by the "insurance" companies reading those requests. I believe it's similar for any moderately expensive specialty. A common career path is care nurse -> burnout -> administration. Most of the administration is made up of people who could be providing healthcare.
And no, hospitals' most valuable resource is their billing computers. I think when it comes to providing actual healthcare, hospitals are very stupid. You cannot partition any knowledge worker's attention into 10 minute blocks and expect them to achieve anything useful, yet that is what their entire system is designed around. The hospital doesn't have unilateral say of course (an "insurance" company won't pay one doctor the "price" of two if they spend twice as long with a patient), but they're still content optimizing within that status quo outcome - completely scatterbrained care.
And it's not like individual doctors are well rested or happy when you talk to them. The system clearly takes its toll on them (eg disappearing for 5 minutes to go retrieve test results that didn't show up before your appointment). In fact I'd say the vast majority of human talent in the medical system ends up completely wasted.
Capitalism, nepotism, public, private, insurance, nationalised healthcare ... the GP is saying that these are methods of dividing up available care, not methods for creating more available care.
X is the amount of medical care available
Y is the amount of medical care wanted
If Y < X there is no problem with any of the systems. And, obviously, a certain amount of inefficiency doesn't affect patient care. Plus, perhaps relevant today: when shit hits the fan we can scale up available care quickly.
If Y > X it doesn't matter which system you choose, someone will go without. You can change who goes without, but you cannot fix the system by changing the method of dividing care.
Can things be somewhat improved with better organisation? Sure. Probably. But let's not overestimate it either. Let's take a dream scenario: optimal organisation can make 20% more care available. How much more care is wanted? I think we can safely say the US population wants 200% or more of what the current system provides. Whilst nobody's opposed to improving organisations, it cannot fix the problem.
Fixing the problem is something you can only do by doubling the medical training available. That'll be a lot of extra dollars, none of which go anywhere near patient care for at least 10 years, so I would expect a lot of strong opposition from a lot of sides. But it's the only way to fix things.
Looking at spending per capita it's clear that American problems aren't caused by lack of money in the system.
They are caused by high barriers to entry, which in turn are caused by entrenched elites gatekeeping jobs through absurdly high tuition fees (with the expectation that everybody takes on lots of student debt) and by a very litigation-friendly environment. These costs are then passed on to the general population through a byzantine system of health insurance that leaves a lot of people uninsured.
> Let's take a dream scenario: optimal organisation can make 20% more care available.
In 2020 the UK spent 3,278 GBP (~4,400 USD) per capita on healthcare [1]. The USA: 12,530 USD. That's roughly three times as much, a difference of almost 200% [2].
In the UK life expectancy is 81.2 years. In the USA it is 78.79 years.
Spending three times more to get a worse outcome doesn't seem like a "20% difference" to me. Of course there are other factors, but are they enough to overcome a 3x difference? I don't think so.
You cannot compare healthcare systems on an X doctors per Y patients basis, because the outcomes aren't linear. It's orders of magnitude more expensive to treat many health problems if you go to the doctor 2 years too late. And the outcomes are worse despite the higher costs. Guess what happens when people have to pay a lot for each visit - often they go too late.
There should be a director's cut where the mission fails because of Vincent's hidden heart condition.
Gattaca shows eugenics has been so vilified that the audience will root for a character who selfishly commits fraud, risking lives and scientific progress for his own vanity.
The really scary fact is that there would be no need for a police state and segregation. The genetically enhanced would just completely dominate an open and fair competition.
Gattaca shows a society in which eugenics, intended as a shortcut for people with supposedly better genes, devolved to the point where someone without genetic augmentation could outcompete a whole bunch of genetically augmented people.
In the movie, either the genetic augmentation didn't work (as well) as expected, or their advantage caused the augmented people to become lazy because they got covered in undeserved status no matter how little or much they worked, as everything depended on which genes they had been "bred" for. Then someone with supposedly bad genes could run circles around them just by working hard.
Maybe it's some mixture: e.g. in order to protect those kids who fail at the task they have been "bred" for from considering themselves failed humans, Gattaca's society adopted this model where they shower all kids who have the right genes in status. Maybe it's not the kids who are being protected but the companies selling the augmentations.
100% agree. Gattaca actually showed A SPACE PROGRAM running similarly to real space programs. I don't know about now, but the first astronauts needed certain genes or they weren't allowed in. AFAIK that used to be true for fighter pilots too: want to be one, you needed perfect eyesight. Bad genes = selected out.
I'm pretty sure the effect was temporary and he had to do it a second time. It's very important to note that this research is still very new and he was lucky that his genetic code was prime for that test. (I'm not against bio hackers btw. I think they provide a very good service though obviously more risky. No problems when that risk is on yourself but just trying to say "don't try this at home").
Note that you are literally shedding identifiable DNA from your body at all times and a truly motivated adversary would have no problem obtaining enough sample material to do high quality sequencing.
It's not the motivated adversary I am worried about, who actually has to show up where I have physically been. It is the company on the other side of the world in a country with lax legislation, profiling me based on the data I 'shed' online, like a cloud-based DNA sequencing service.
This is my threat model for most things in life. If someone is physically targeting me, I'm fucked. I'm more worried about limiting the casual long-distance attacker since I have more ability to stop them.
If someone steals my DNA I can't stop them. But I can at least avoid being swept up in large scale DNA scanning and tracking efforts.
The data monopolies and abuse originate from people giving these companies data for free. If they had to buy it, or pay goons to collect it, they wouldn't be profitable.
In the near future (or arguably now, depending on your purpose) you don't even need that. Assuming enough of your relatives' sequences are available, the probability of you having certain genes/mutations can be narrowed down so much that having your individual genome doesn't add much.
On April 24, 2018, authorities charged 72-year-old DeAngelo with eight counts of first-degree murder, based upon DNA evidence; investigators had identified members of DeAngelo's family through forensic genetic genealogy.
Let's say there's some rare genetic disorder that only a few hundredths of a percent of the population has. If someone knows that your mother or father has it, you no longer have a few-hundredths-of-a-percent chance of having it. Depending on the disorder, you having it might just be a coin toss.
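To make that concrete, a toy back-of-the-envelope in Python; the prevalence figure and the fully penetrant autosomal dominant inheritance model are illustrative assumptions, not facts about any particular disorder:

    # Toy numbers only: a hypothetical fully penetrant autosomal dominant disorder.
    population_prevalence = 0.0005                 # "a few hundredths of a percent"
    p_child_given_affected_het_parent = 0.5        # affected heterozygous parent -> ~50% per child

    print(f"prior risk (no family info):   {population_prevalence:.2%}")              # 0.05%
    print(f"risk given an affected parent: {p_child_given_affected_het_parent:.0%}")  # a coin toss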
One of the key differences is that in the case of the DNA sequencing services, you're agreeing to ToS that allow them to abuse your data (and thus indirectly the data of any of your blood-relatives), and you directly tie the data to a name and address.
Sure. I've worked with and know people who could carry this out at scale, although obviously individual sample collection isn't highly scalable.
Edit: I used to help Google fund researchers like Joe Derisi and others who develop technology to do this, and some of the people I worked with in my academic career are quite good at identifying serial killers from 30 year old DNA. If you're downvoting because you think I'm making this up, you're wrong. If you're downvoting because you don't think large-scale individual detection using genetic sampling of the environment is possible, you're wrong. If you're downvoting because you think you couldn't do a whole genome sequence of an individual using a sample collected in the wild, you're wrong. If you're downvoting because you think this is a terrible idea (morally, ethically), that's fine but I didn't say anything about my own moral or ethical beliefs about this.
It's simply factually correct to say that large-scale individual sample collection (at order tens of thousands, if not hundreds of thousands of individuals in a country the size of the US) is possible. All the technology is there to do this.
It seems very unlikely to me that there isn't at least *some* genetic information that would be of direct value in advertising. Like if Google took five years of personalized ad performance info and filtered that through the associated individual's known DNA to develop a predictive model.
I read a story[1] about a UK-based Covid testing firm who was planning on collecting and selling their customers' DNA samples.
> Its "research programme information sheet", last updated on October 21, says the company retains data including "biological samples" and "the DNA obtained from such samples", as well as "genetic information derived from processing your DNA sample ... using various technologies such as genotyping and whole or partial genome sequencing".
The policy also says Cignpost may share customers’ DNA samples and other personal information with "collaborators" working with them or independently, including universities and private companies, and that it "may receive compensation" in return.
> L.A. County Sheriff Alex Villanueva .. was briefed by the FBI about “the serious risks associated with allowing Fulgent to conduct COVID-19 testing,” ... the FBI advised him that information is likely to be shared with China, and that the FBI told him DNA data obtained is “not guaranteed to be safe and secure from foreign governments.”
Yes, of course there is. Just like any other medical sample you have ever given. Blood, stool, urine, swabs...
But also all the unintentional donations: every pubic hair you lost on the toilet seat, every tampon you disposed of, every bandage you ripped off and threw away, every mattress you slept on, every chewing gum you've spit out, every ejaculation, every ... you get the idea.
Well crap, there you go, giving the lunatics reasons not to get tested and make the whole thing worse than it already is...
I'm joking, I don't think you did anything wrong, but I'd hate it if a ridiculous argument such as this example gained any traction:
Example: The government / aliens / whoever released the virus so we willingly gave them our DNA to sequence and match with our assigned ID, so they can do XYZ in case the implant in the vaccine doesn't work or if we are smart enough not to get vaccinated.
Given the past incidents, it would probably be more in line with a gov. agency getting direct access to everything that goes through a few selected labs for years/decades, so that a significant number of people getting blood/cells sampled in that period would have a high chance of passing their data through these labs at some point.
At the base of it, if the gov. of the country one lives in is the enemy, it can’t be a matter of refusing vaccines here and there, that’s not the scale they should be thinking about.
It’s not exactly DIY but there are in theory ways to ‘encrypt’ your DNA before it gets sequenced. Something like amplifying/enzymatically modifying the DNA in a way that changes the sequence which you can undo computationally once you get the data back.
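Purely to illustrate the computational "undo" step described above (not the chemistry, which is the hard, unsolved part), here is a toy sketch assuming a hypothetical modification that swaps bases one-to-one:

    # Hypothetical one-to-one base mapping; real enzymatic modifications are not this clean.
    FORWARD = {"A": "G", "G": "A", "C": "T", "T": "C"}
    REVERSE = {v: k for k, v in FORWARD.items()}

    def remap(seq: str, table: dict) -> str:
        # Leave anything outside the table (e.g. N) untouched.
        return "".join(table.get(base, base) for base in seq)

    masked = remap("GATTACA", FORWARD)     # what the sequencing service would see
    recovered = remap(masked, REVERSE)     # what you recover locally from the returned data
    assert recovered == "GATTACA"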
One of the other comment threads indicates that the data you need to do that kind of annotation of the sequence is to some extent available for home use as well: https://news.ycombinator.com/item?id=29695449
I'm really hoping someone will work on an open source "23andme@home" solution that ties all this together in an accessible way.
Years ago I used Ancestry, then requested the .txt file and asked them to delete it from their records. Uploaded it to run a report at https://promethease.com/ that cross-references your SNPs against the existing body of genetic research.
The results have been pretty astounding. I found markers that pointed to poor response to a specific blood thinner my grandfather was put on before he passed. Currently I'm researching the cluster of Bipolar / ADHD / SAD symptoms I experience that all seem to trace back to a certain genotype of circadian rhythm genes I have (thank you, Sci Hub). To boot, some of the studies I've come across have been done on Han Chinese populations that match my ancestry.
Perhaps going too far down this rabbit hole poses a self-diagnosis risk, but the correlations to my family history and my own life experience working with doctors to diagnose and treat symptoms are pretty undeniable. And given that your run-of-the-mill psychiatrist is going to treat you off of a DSM checklist, I feel much more confident knowing there have been genomic studies to back things up, since my doctor isn't up to date on this research, and finding one that would be will be difficult and expensive. I've shared the papers with my doc and he's been supportive, sometimes I feel like I should be getting a discount on services rendered.
Self-diagnosing is not the problem it is made out to be - I live with my symptoms 24/7, the doctor sees me for 5 minutes. The number of times doctors have missed fairly clear signs of trouble in my family is disturbingly high. A simple procedure, done in time, would have saved two people I know.
Unfortunately our educational system teaches you about the mitochondrion, but not the practical difference between ibuprofen and paracetamol, or what CRP is.
That's Promethease, no? They have since been acquired, but prior to that you could upload data anonymously and then browse the analysis. It was very rough though, just linking sequences to risk, and a lot of it was inconclusive.
Note that Oxford Nanopore seems to have very much a "sell the ink/razor/etc" business model with their devices: that $1,000 package comes with one flow cell, which is a consumable and costs $900. They're essentially giving the device away for free.
On some of their larger devices (eg, the PromethION), they've moved outright to a "we lend you the device for free, you buy the consumables" model.
There is some exciting work around these flow cells to create something more durable. It would be really interesting to be able to buy something like that and use it in schools/personal hacks without worrying about small mistakes in the sample.
A Qubit or other fluorometer isn't required. You can use a simple DNA ladder to measure the relative quantity and quality of DNA, which is good enough for nanopore sequencing. I just did a full genome sequence of a novel fungus using this exact approach.
I used a HMW extraction kit on the DNA and ran a gel to estimate the amount of HMW DNA. Yes, you need to be able to run a gel, but I'm not sure what the expectation is from folks; that you just place a random piece of non-sterile tissue on a chip and have it do the extraction, sequencing and assembly? That seems like an unrealistic expectation.
They do work on an extraction flowcell that can be added on top. I’m hoping they can make it as easy as adding sample to a well, at least for blood or saliva.
If you're talking about the voltrax, you still need to pipette reagents in at the right times. They don't really talk about it in their advertising, but it's basically just to make mixing consistent when you have under-trained techs. Definitely will get there eventually, but I don't know if they are working on that right now.
You can get a usable partial genome at home using a minion (provided you have access to basic lab equipment and consumables, like pipettes, a microcentrifuge, gel electrophoresis kit) for about $2.5k, and a fairly decent one for $5k (about 22x coverage - not perfect but plenty for most purposes).
This is very cool. Are there by chance any associated projects that could evolve into something like 23andme but remain entirely within a private network meaning that the data is entirely in the hands of the individual?
yes. if you wanted to annotate your genome you could “easily” do it on your brand new macbook (this is ram intensive, you probably need 32G). you’d need a reference genome, like https://www.nist.gov/programs-projects/genome-bottle
you'd likely have to get the nanopore sequencer in the article or find a lab using Next Generation Sequencing to sequence your DNA and give you "raw data", which are usually fastq files
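As a rough illustration of what that "raw data" looks like, here is a minimal FASTQ reader, assuming the common 4-lines-per-record layout with the sequence and quality string each on a single line; the filename is just a placeholder:

    def read_fastq(path):
        """Yield (read_id, sequence, quality) from a standard 4-line-per-record FASTQ."""
        with open(path) as fh:
            while True:
                header = fh.readline().rstrip()
                if not header:
                    break
                seq = fh.readline().rstrip()
                fh.readline()                      # the '+' separator line
                qual = fh.readline().rstrip()
                yield header.lstrip("@"), seq, qual

    for read_id, seq, qual in read_fastq("reads.fastq"):   # placeholder filename
        print(read_id, len(seq))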
Could you please explain how this mapping works? Why it needs so much RAM? Is it doing a fuzzy search of sorts for known sequences (genes)? Why can't it do so one by one?
bwa specifically performs a burrows wheeler transform of a 3GB string. other mapping algorithms usually rely on some sort of indexing of the genome. the program then loads this into memory and queries that index for each “read” (a dna fragment from the dna sequencer).
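For anyone curious what the transform itself does, here is a deliberately naive sketch; real tools like bwa build the index far more cleverly (suffix-array-style construction) instead of materializing every rotation, which is what lets a ~3 Gbp genome index fit in a few GB of RAM:

    def bwt(text: str) -> str:
        """Naive Burrows-Wheeler transform: sort all rotations, keep the last column."""
        text += "$"                                   # unique, lexicographically smallest sentinel
        rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
        return "".join(rotation[-1] for rotation in rotations)

    print(bwt("GATTACA"))   # -> ACTGA$TA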
Human DNA contains roughly 3.2 billion nucleotides. A 3 GB string suggests an encoding with one byte per nucleotide.
I'm curious: since there are only 4 bases in DNA, for genomic data, this seems rather inefficient. Is there any advantage in encoding the DNA with two bits per nucleotide?
It's very common to use 2 bits per nucleotide despite the human genome having ambiguous bases on top of the 4 letters. These tools typically encode each ambiguous base as a random nucleotide, but keep track of those positions so they can be handled correctly later.
In practice, BWT alignment based tools may use a forward index and a mirror index of the reversed genome string (not reverse complemented). This dual index approach is important for dealing with mismatches. There's a nice example explaining this for an older tool named Bowtie [2]
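A minimal sketch of that 2-bits-per-base packing (ambiguous bases are simply mapped to an arbitrary letter here; real tools do something similar while tracking where they were):

    CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

    def pack(seq: str) -> bytes:
        """Pack a DNA string into 2 bits per base (toy version)."""
        bits = 0
        for base in seq:
            bits = (bits << 2) | CODE.get(base, 0)     # unknown/ambiguous base -> A here
        n_bytes = (2 * len(seq) + 7) // 8
        return bits.to_bytes(n_bytes, "big")

    print(len(pack("ACGT" * 8)))   # 32 bases packed into 8 bytes instead of 32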
With a two bit encoding and both indices it isn't uncommon for a genome index to take up several GB of RAM. For example, BWA uses 2-3 GB for its index [3].
That's not true. I just did a high-quality sequence and assembly of a new species of fungus from my home lab using nanopore. You can see all my code used for assembly and analysis that will be referenced in a paper I plan to publish in Jan here: https://github.com/EverymanBio/pestalotiopsis
Given that the decoder is machine-learned and depends on a training set to go from squiggle -> ATGC..., how do you ensure that sequences which haven't been seen before (not in the training set) are still accurately accounted for?
We used Guppy for basecalling, which is neural-network based and is used to turn raw signal data into the predicted bases. There are no guarantees of accuracy, only tools to determine and assess quality. One major way of assessing accuracy is to compare the subject genome with other similar reference genomes and note the high degree of homology in highly conserved regions.
My question is whether, in the future, we will be able to fully rely on predicted bases for sequencing, or whether there will always be a need to compare with a different sequencing methodology in the case of de novo genetic information that hasn't been seen before (no reference genomes being available in that case).
Is there publicly available information on how accurate Guppy is, as well as how the amount of training data scales with improvements in accuracy?
It didn't seem like these things were mentioned explicitly in the Community Update, other than that it’s expected to continue improving, but a clearer roadmap would definitely be much more helpful.
There are quality checks throughout the entire process, starting from the raw read quality scores returned directly from the sequencer all the way to fully assembled genome completeness. In our paper, one of the tools we used for this is called BUSCO[0] which scored our assembly at 97.9%, a relatively high score for de novo assemblies.
Interested outsider here; I work with a lot of HCLS research customers but don't have a biology-related background. Can you explain the problems with the Nanopore sequencer accuracy in more detail? Basically, I was wondering if I could get one for myself and sequence my own genome, then use the data to learn about life-sciences computing techniques. If I were to buy one of the USB-attachable devices and run it, is the data simply not viable for use in a genomics pipeline, or is it just that the results would be questionable? Also, if accuracy is an issue, what about just running the same sample N times and doing some error correction?
I guess there are limits to ensemble methods if the underlying accuracy doesn't increase. I don't work on gene sequencing algorithms but from what I understand of ML ensemble techniques, there are certain assumptions regarding the underlying independence of the errors. The errors for nanopore should be uniform but I am not sure. Any molecular biologist here care to comment?
I know that the error rate of the oxford nanopore sequencer depends on GC content (guanine/cytosine nucleotides), and that the Pacific Biosciences sequencer uses a polymerase that gets worn down during reading. So there is some non-uniformity in the chemistry.
The instruments do exactly as you say (run the sample N times), but this obviously comes at a cost. Also, keep in mind that sequencing needs to be very, very accurate to be useful. We share most of our DNA, and the small variations make up all the difference.
Yes, those are all relevant costs. There's also a tradeoff between accuracy and the number of reads (how many sequences you can observe), or how much data you can get out of the machine.
Tl;Dr: Nanopore data is historically lower quality than current gold-standard methods, but it is by no means "not viable" in a genomics pipeline. Their newer chemistry flowcells are competitive with current gold-standard (but I've not seen it with my own eyes in the lab yet due to limited release).
There are two components that drive sequencing error rate. 1) The chemistry behind the sequencing (for nanopore sequencing this is the "feeding DNA through a pore" bit) 2) the method to convert raw signal into DNA sequence (this is called "base calling").
The gold-standard in terms of error profile for sequencing is currently the Illumina short read platform. Illumina machines are really just microscopes (TIRF scopes for optics folks) that sequence DNA by visualizing incorporation of dye-labeled nucleotides into the sequenced molecule(s) (Imagine a really slow PCR [1]). Each base is labeled with a different color, then when a molecule has a match it makes a colored spot on the slide that the machine can read (see here for more info & details of newer chemistry that use fewer colors [2]). This whole process is mediated by DNA polymerase which itself has a very low error rate. Another important point is that DNA sequenced on the illumina platform (called a "library") tends to be from "amplified" template DNA, meaning the DNA will have been processed and potentially be missing chemical modifications on the bases that could be present in the organism. This works to Illumina's advantage, because when trying to answer the question of "what is the DNA sequence?" we want the ground-truth DNA, not the modification state.
In contrast, Nanopore sequencing works by feeding a long strand of DNA through a pore and measuring the change in electrical current through the pore (watch the cool video [3]). For the current set of nanopore flowcells, 8 bases of DNA sit in the pore at a time, meaning the current at each timestep is a product of 8 nucleotides in aggregate. This also means that the pore "sees" each base 8 times, but always in the context of an additional 7. In order to basecall from the raw signal, it's not as easy as saying "blue = A", instead, you have to deconvolve each base from a complex signal. As you might imagine, the folks at Oxford Nanopore & broader research community have turned to machine learning-based base callers to solve this problem, and they work quite well [4]. But they are not perfect.
Deconvolving runs of the same base (e.g. "AAAAAAA") is difficult because without well-defined signal changes between bases, the caller has a hard time deciding how many bases it has seen, so a common error mode for nanopore sequencing is to create insertions/deletions at places in the genome with low nucleotide diversity. Another interesting reason is that most Nanopore library preps are often performed on unamplified DNA, and so in addition to normal A/T/G/C nucleotides, the template DNA can also contain bases with chemical modifications. For example, in bacteria, A's are often methylated, and in Humans, C can have all kinds of different modifications (5-methyl-cytosine, 5-hydroxymethyl-cytosine, etc. etc.) and each different modification affects the signal in the nanopore. Therefore, basecallers that weren't trained on modified bases will produce basecalling errors in the presence of base modifications.
For both Illumina and Nanopore basecallers, each base is assigned a quality score that indicates the probability that the basecaller produced an incorrect value. This is called a Q-score, defined as Q = -10 * log10(P), where P is the probability the call is wrong (i.e. Q / 10 is the order of magnitude of the error probability) [5]. For example, a Q-score of 10 means an error rate of 1 in 10, while a Q-score of 50 means an error rate of 1 in 100,000. For Illumina sequencing, >95% of the reads have a Q-score > 30 (i.e. 1 error in 1000), while Nanopore reads tend to have lower average Q-scores (~Q20, i.e. 1 error in 100). For genetics, where 1 base difference can mean the difference between a severe disease allele and a normal variant, 1 in 100 won't cut it.
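A minimal sketch of that conversion, assuming the Phred+33 ASCII encoding used in modern FASTQ files:

    def phred_to_error_prob(qual_char: str) -> float:
        """Convert one Phred+33 quality character to the probability the base call is wrong."""
        q = ord(qual_char) - 33          # ASCII offset 33 (Sanger / Illumina 1.8+ convention)
        return 10 ** (-q / 10)           # Q = -10 * log10(P)  =>  P = 10 ** (-Q / 10)

    print(phred_to_error_prob("I"))      # 'I' is Q40 -> 0.0001, i.e. 1 error in 10,000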
The current gen Nanopore flowcell chemistry (R9.4.1) is what most people are talking about when they talk about Nanopore error rates, but they've just released a new pore type & made some basecaller upgrades that improve the accuracy to what they call "Q20+" and some claims of Q>30, and from the data I've seen, it's impressive, I just haven't got my hands on one yet to see for myself [6]. I think the comment saying "wait 5 years" is an overestimate, but if you want to genotype yourself today, I'd just pay someone for Illumina sequencing and process the fastq files yourself if you really want to do it as a learning exercise.
I've unintentionally written an essay, so I'll stop here, but real quick to your other point re: rerunning the sample N times and using the repeats for error correction. This won't work the way you're thinking, because a "sample" is actually a collection of DNA molecules that are sampled randomly by the sequencer. You have no way of knowing that the same read between runs actually came from the same molecule, so you can't error correct this way. Incidentally, a totally different sequencing platform from Pacific Biosciences does use this strategy by doing some really cool chemistry, but I'll spare you the second essay (google "PacBio HiFi" or "circular consensus reads" if you're interested).
I for one am glad you wrote the essay, this was incredibly informative and filled in a bunch of blanks I had after reading what I could scratch together on the MinION product. I think I'm in a partial state of shock at how accessible this is becoming. Thank you!
Thanks - fascinating stuff. I'm now even more convinced I want to give it a try, but I think I'll play around with public data and tutorials before leaping into home sequencing.
You totally should, it's a lot of fun. I'd suggest trying to find some bacterial genome sequencing (like E. coli) done on nanopore if you're interested in those data. I don't have a link to any handy right now, otherwise I'd post here, but assembling bacterial genomes is shockingly easy these days and doesn't need near as many resources as doing a human genome, so it's great for learning (I love the assembler Flye [1] for this).
And RE: home sequencing, honestly the hardest part for a beginner will likely be the sample prep, since that takes some combination of wet lab experience and expensive equipment. I really wish molecular biology was as simple to get hacking on as writing software. The lag time between doing an experiment and getting a result is so much longer than waiting for things to compile, it just makes improving your skills take longer.
yup. that’s the business model for Illumina. it’s very much akin to video game consoles. Illumina might take a hit on selling the machine but makes it up in selling you proprietary reagents.
What sort of books/videos do you suggest so one can learn more? This stuff is interesting, and I've always seen inexpensive lab equipment on ebay.
If this can sequence flora, fungi and human DNA for about 10k, I'd buy it, just to experiment and deep dive. That is such a low barrier to entry that it is in itself interesting.
Cost/benefit analysis may dictate that, as other posters suggested, you'd be better served to get raw fastq files from a sequencing lab. Even better if you can send the lab a sample and they'll process the extractions for extra $$.
> and i feel like nanopore is the VR of dna sequencing. it’s always just another few years off.
Is this also true for nanopores in protein sequencing? This HN comment from a few weeks back [1] pointed out recent progress but perhaps the tech is still not quite there.
What do you mean by it's always a few years off? Nanopore will allow you to do high-quality genomic sequencing _now_, in a home lab if you wanted, for less than $3K. If you amortize the 3K by the number of genomes you can sequence on the same flow cell, the price per base or per genome falls precipitously, depending on the size of the genome of course.
ya my first thought was how hard are reagents to get, but probably not that hard. i wasn’t in the lab, i was in bioinformatics so i’m generally clueless on reagent acquisition.
Oh God, I would not want a distributed group of actors with limited trust to sequence my DNA. Maybe it's a project for a close group of friends that would be interested?
I wasn't thinking sequencing but rather comparison. Could even hash data for comparison to enforce privacy (unsure how effective that would be)
But this could enable things like finding relatives which is what I got out of the comment about 23andme. Instead of all the data being centralized, storage and comparison could be distributed
Your DNA is almost exactly the same as other people's, just a unique mix.
Music is exactly the same notes, just a unique mix. So why is Sony upset that I want to stream their entire library? But jokes aside...
A few decades ago I fought the military on collecting my DNA. I stalled them long enough to get my honorable discharge and avoid that altogether. It's funny you ask because the commander asked the same thing and joked "Are you afraid we are going to clone you?!" to which I replied, "No sir, you should be afraid you are going to clone me." and we both had a laugh because he knew I was right. The military are not fond of critical/free thinkers. One of me was plenty. I explained that insurance companies were already using this data to retroactively cancel people's policies even if they were not actively afflicted by something. The commander showed me how to use the FOIA request system.
Laws have evolved a little since then, but there are plenty of other risks. For starters, I can't easily change my DNA like I can change my debit card. That data can be used to tie me to others, or for guilt by association, which is undesirable drama. It can also be used to try to sell me things. It can also be used to target biological weapons against specific groups of people. There appears to be an imbalance of data sharing in this regard. [1] Then there is simply the matter of privacy. If I want to share my DNA with some lab that is in turn going to sell it out to hundreds of other companies over and over forever, I should at the very least be getting paid a vast amount of money and land, and have legally binding contracts and NDAs that cover what is and is not allowed to be done with my data and how long it may be retained. That contract and the laws enforcing the contract must have some serious teeth, with very serious ramifications for anyone violating it, whether intentionally or by mistake.
Congrats on your navigation of the military DNA collection.
I'm more curious what the actual threat might look like.
The marginal utility of your particular genome is minuscule. Without deep phenotypic information from biophysical parameters, it is utterly impossible to learn something novel from any single genome. This makes the marginal value of the genome information very low, both to you and to any attacker or user. You would not be paid much for your data even if it was sold over and over, because the rates are like those for plays on Spotify.
There are no fixed differences between human populations, and there are strong balancing-selection pressures that keep diversity concentrated in key genomic regions critical for immune response. This is to say that it would be damn hard to target any single group with a bioweapon. And if you wanted to target a single individual with a genomically targeted bioweapon, you'd also have physical access, making the problem of getting genomic information without consent trivial.
People often talk about insurance risk. I suppose that's an attack vector. It's also one that can be regulated with laws and social norms. Fwiw I wonder how often this is primarily an American concern.
Imagine a public genome data repository. People donate their genomes to science and post them there for the world to use and learn from. In my opinion, it would be better for an individual to share their data than not. The reasoning is that no matter what is done with the data, the net effect will be that society learns more about the individual's particular genome than those of people who haven't contributed. This will yield better adaptation of the society to the individual. Literally this might mean that a treatment for something affecting the individual is slightly better. In expectation, the worst thing that can happen is that the individual gains more information about themselves.
I'm curious about the possible abuse scenarios given the ubiquitous use of PCR-testing for nearly two years, now.
If I'm informed correctly, for a viable sample for NGS you need something like 2 mL of saliva (which sounds like little, but it really takes some time: >1 min), not the trace amounts that usually get collected by the swabs?
A very practical reason not to want your DNA out there, unrestricted, is insurance costs: car insurance, health insurance, mortgage lending rates, life insurance. And while GINA from 2008 is supposed to protect that information, there are loopholes in the interpretation of that law that should give everybody pause.
Using that analogy, all the 1s and 0s in your private key are the same as everyone else's as well. Genetic data can be used for all kinds of things, the worst of which would be things like targeted diseases or planting your DNA at a crime scene.
Actually it's more like your private key is made up of ~1000 pieces of 1 Mb each, which each differ from any other similar piece at a rate of about 1 in 1000. Oh, and the order of the pieces is almost always exactly the same.
No, genomes are not "almost the same" just because they are all base-4 sequences and thus made up of the same 0s, 1s, 2s and 3s.
We are astoundingly similar, even unusually so for a large mammalian species.
Alas, the information presented is an oversimplification of the process.
To actually sequence DNA with this USB thingy you need to prepare a so-called sequencing library - and for that you need a fairly well equipped lab, expensive reagents, and years of practice and skill ... a mid-level biology Ph.D. can prepare these ...
in addition, the flowcell sold by Oxford Nanopore often malfunctions and the whole run is a bust ... (it has behaved like this since 2014 ... so no, the technology does not seem to be improving a whole lot)
On one hand, I would love to learn something new about my body.
On the other hand, what if the results tell me that I am predisposed to some horrible untreatable disease? Will I spend the rest of my days observing every little pain or discomfort and thinking "is this IT?"
1. A completely genetically determined disease; a rare 100%-going-to-happen deal. (Which you would probably know about already, because your mother, or grandfather died from it...)
2. Some significant, but abstract risk modification.
With 1., you would know that you will get sick/die at some point in the future, allowing you to live your life accordingly: die without regrets, prepared, and so on. You can take that into consideration when planning for a family, taking job offers, or procrastinating on the good life with work and retirement plans. Burn bright.
With 2., there is a very, very high chance that lifestyle choices influence the stated risk, as obviously not everybody who has the polymorphism gets sick. So you can get your ass up, exercise, quit smoking and drinking, reduce stress, get regular check-ups, ..., and avoid getting sick or reduce the impact/progression in case you do.
I think, logically, knowing is always better than not knowing. But I understand how anxiety does tell a different story.
well, build a whitelist of the conditions you are interested in knowing. then just run the report through a sed filter so that it strips out all the information you’re not interested in. destroy the original report. problem solved: infohazards avoided.
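In the same spirit, but with Python instead of sed, a tiny whitelist filter; the report format and the condition names are hypothetical:

    # Hypothetical: assumes a plain-text report with one finding per line.
    WHITELIST = {"caffeine metabolism", "lactose intolerance"}   # conditions you do want to see

    def filter_report(lines, whitelist=WHITELIST):
        for line in lines:
            if any(term in line.lower() for term in whitelist):
                yield line

    with open("report.txt") as f:          # hypothetical filename
        for line in filter_report(f):
            print(line, end="")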
Knowing something about your prospects doesn't doom you to negative thoughts. In fact, the way the human mind works is often the opposite.
"Inaction breeds doubt and fear. Action breeds confidence and courage. If you want to conquer fear, do not sit home and think about it. Go out and get busy."
--Dale Carnegie
"You gain strength, courage and confidence by every experience in which you really stop to look fear in the face. You are able to say to yourself, 'I have lived through this horror. I can take the next thing that comes along.' You must do the thing you think you cannot do."
--Eleanor Roosevelt
"Fear is the path to the Dark Side. Fear leads to anger, anger leads to hate, hate leads to suffering."
--Yoda
"The brave man is not he who does not feel afraid, but he who conquers that fear."
--Nelson Mandela
"Nothing in life is to be feared. It is only to be understood.'
--Marie Curie
"The key to change... is to let go of fear."
--Roseanne Cash
"He who is not everyday conquering some fear has not learned the secret of life."
--Ralph Waldo Emerson
"We should all start to live before we get too old. Fear is stupid. So are regrets."
--Marilyn Monroe
"Fear keeps us focused on the past or worried about the future. If we can acknowledge our fear, we can realize that right now we are okay. Right now, today, we are still alive, and our bodies are working marvelously. Our eyes can still see the beautiful sky. Our ears can still hear the voices of our loved ones."
--Thich Nhat Hanh
Nanopore sequencing is a really interesting technology. It utilizes fundamentally the same apparatus as a Coulter Counter [1], which is a general method of counting and sizing arbitrary particles that's frequently used in flow cytometry. Applying it to sequencing by drawing unwound DNA through the pore was a really excellent logical leap, and we're only now starting to see the benefits, even though it was first ideated over 30 years ago.
the nanopore units are awesome! although if i recall, most of the device is a replaceable one time use consumable and the cost of that consumable is quite expensive (at least hundreds, if not thousands).
when i looked i was interested, but was turned off when i saw that the cost far outstripped commercial sequencing services.
An idea just popped into my head reading your comment:
What if you could take the (binary) data file of your DNA and use it as input in the (recently remastered) Monster Rancher games to generate a monster?
Apparently those games use external user-provided data (like music CDs, game discs etc.) to generate the monsters the player would then train and use (something I only recently learned about through gaming livestreams).
I'd actually like to see the level of jank that would come out of something like that.
Many years. We still have problems simulating even a single protein folding correctly. If we don't find some new algorithm for simulating cells, we would need computers that are billions of times faster than our current ones.
Also, your DNA is bootstrapped from your mother's cells. And the prenatal environment has quite a large effect on development, so your simulation might end up quite different from you if we only started with your DNA.
It's likely that you don't have to simulate even a single cell at high resolution to be able to simulate how an organism would grow. There are numerical shortcuts.
For example today we can already predict the color of the eyes and other phenotype from the DNA.
If you are able to observe enough samples of cell growth and their associated DNA, you can probably model and predict the statistics of a cell from its DNA. Because the cell is itself the result of a lot of chemical processes, the law of large numbers will help smooth those statistics.
Given that we have a lot of cells, the collective behavior is probably entirely governed by these statistics.
You seriously underestimate the continuous growth of computer power. And quantum computers after that, which are perfect for simulating chemical reactions.
What was unthinkable 50 years ago, playing chess better than a human, is now trivial for a $100 device.
And it's not necessarily required that to simulate the growth of a human you'll need to simulate the entirety of chemical reactions in all 50 trillion cells and all that.
It's possible I underestimate, but I have worked in all the relevant fields of simulation, ~20 years of running various simulations on large HPC, built the largest instance of folding@home using idle cycles inside google data centers, published papers simulating proteins, developed infrastructure to process the voluminous data, etc, etc. Quantum computing remains fantasy (in terms of being useful for science).
It's unlikely even if we improved computing hardware many orders of magnitude beyond all reasonable predictions, that the calculations would be able to simulate all the necessary details; most of our simulations now are based on many approximations due to hardware limitations.
As to the question of "what level of fidelity is required to turn a FASTQ of somebody's genome into an accurate model of the resulting human, with some sort of realistic environment also provided", that's so far beyond what is even remotely comprehensible it's not worth speculating about in terms of science fact; it's just fiction.
A researcher mentions using a compact index based on the Burrows-Wheeler Transform to fit things in less memory compared to using a huge hashtable.
I see open-source implementations of BWT-based indexes (FM-Index/FMtree) out there. Out of curiosity, does anyone know of anything using BWTs for compact indexes in more everyday uses (like full-text search), or alternately reasons it doesn't really work outside the genome-alignment use case? Likely it only 'pays for itself' if you really need the space savings (like, it's what makes an index fit in RAM) or else we'd see it in use more places. It'd still be kinda neat to actually see those tradeoffs.
There was some interest in the information retrieval research community 10-15 years ago, but I don't think anyone ever found a good application for it. Some limitations of the BWT always got in the way.
The BWT sees strings as integer sequences. Either "ABC" and "abc" are two unrelated strings, or you normalize before building the index and lose the ability to distinguish between the two.
Search proceeds character-by-character backwards, jumping arbitrarily around the BWT using the same LF-mapping function as when inverting the BWT. You get cache misses for every character. (There's a toy backward-search sketch after this comment.)
BWT construction is expensive, because you want a single BWT for the entire string collection. There is a ridiculous number of papers on BWT construction, as well as on updating and merging existing BWTs, but the problem has still not been solved adequately. If your data is measured in gigabytes, you can just pay the price and build the index, but a few terabytes seems to be the practical upper limit for the current approaches.
You can of course partition the data and build multiple indexes, but then you have to search for each pattern in each index. There is no way to partition the data in a way that different indexes would be responsible for different queries.
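To make the LF-mapping / backward-search point above concrete, here is a toy FM-index-style count query in Python (naive BWT, no rank sampling or compression, linear-scan Occ; purely illustrative):

    def bwt(text: str) -> str:
        text += "$"                                              # sentinel, smallest character
        return "".join(r[-1] for r in sorted(text[i:] + text[:i] for i in range(len(text))))

    def fm_count(text: str, pattern: str) -> int:
        """Count occurrences of pattern in text via backward search over the BWT."""
        L = bwt(text)
        C = {c: sum(1 for x in L if x < c) for c in set(L)}      # chars smaller than c
        occ = lambda c, i: L[:i].count(c)                        # occurrences of c in L[:i]
        sp, ep = 0, len(L)
        for c in reversed(pattern):
            if c not in C:
                return 0
            sp, ep = C[c] + occ(c, sp), C[c] + occ(c, ep)
            if sp >= ep:
                return 0
        return ep - sp

    print(fm_count("GATTACATTACA", "TTACA"))   # -> 2

Each query character only touches the C table and the Occ counts at arbitrary BWT positions, which is exactly the access pattern behind the cache-miss point above.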
Have you ever encountered any insurance implications from it? eg: questioned whether you have ever had a genomic test etc. and had to answer yes and then them wanting to see results?
I guess in your case where nothing actionable is found it's benign. It will be the cases where there are risk factors for late onset things - cancer, diabetes, heart disease etc. where it would get sticky.
No, my health insurance company doesn't care about my whole genome data. Health Insurance companies are already quite skilled at (and profitable due to) their ability to model life expectancy and health issues without genomic data, and they are legally prohibited from using this data, in my country anyway. Life insurance is different (they are allowed to incorporate much more information) but I've never been asked for anything like that.
As for the case where nothing actionable is found- it's not benign. It's absence of information, not information of absence.
Initially, the DNA is brought near the pore through diffusive (Brownian) motion plus any small attraction it has to the membrane. Close to the pore, a combination of the electrophoretic and electro-osmotic effects draws the DNA molecules through. The application of an external electric field causes the charged DNA molecules to migrate along the field (electrophoresis). This is independent of the fluid, and happens to any ions under voltage. The electro-osmotic flow, on the other hand, is a motion of the fluid itself, pulling the DNA molecules along with it. EOF is a really interesting phenomenon caused by the interaction between the surface chemistry (vis-a-vis charge distribution) and the concentration gradient of charge carriers in the fluid. I'd recommend Fundamentals and Applications of Microfluidics by Nguyen et al if you're looking for a good primer on electrically induced flows in microfluidics.
> Why not make the software into a proprietary product? ... There’s such a race there that it’s hard to commercialize the software for the long term.” Schatz continues, “Plus our work is largely funded through government sponsored grants, so this is one of the important ways for us to give back to society.”
In some people's thoughts, making a better society is the first and most obvious thing to do with technology like this, not an accidental consequence of inconvenience. Fortunately, enough of those people are active in the world to make Main Street different to Wall Street, at least sometimes.
It’s a weird quote anyway since there is commercial, proprietary software for DNA sequence analysis. Just a few examples of companies in this space are Sentieon, Edico (acquired by Illumina) and Parabricks (acquired by Nvidia). And Michael knows this (they’re sufficiently well known, and his own research laid some of the earliest foundations that Parabricks would ultimately build upon) so I’m assuming the quote was taken out of context or he was talking specifically about his own lab.
Maybe at our local library we should be able to check out these nanopore sequencers, or even other simple & robust medical devices like handheld ultrasound probes that plug into iPads?
There is a 3+ year old London-based project, partnered with an established genome sequencing company, doing something highly interesting.
They sell swab kits directly, or via NFT purchase, for ~$500 for a 30x near complete sequencing (that's 30 passes for over 99.9% vs 0.2% for 23andme et al). The results are stored in an encrypted AMD SEV-E vault to be accessed by big pharma or individuals, only for specific markers, in exchange for the $GENE token paid directly to the genome owner. Figures touted are $50-80 per request. This token is burned as kits are sold, can be staked, offers rewards like DAO membership, can be gifted to charities researching specific diseases in various populations. It can act as a form of UBI in unbanked populations and puts your DNA back in your control.
To me it's the best use of web3 tech I've come across, so disclaimer: I am invested and a DAO member, but it's early in the project still. They are not quite ready for mass marketing. They are moving over to Polygon for very low transaction fees in January, and will be launching the first joint NFT/kit sale (the next season might include personal genetically generated art) to fill the vaults with 10k sequenced genomes. They are already over halfway there through work with charities, but 10k is the magic number before big pharma can start making queries. Right now they are quietly building and preparing before marketing plans kick in later in Q1.
Take a look at https://genomes.io where everything is explained in more detail, the team are presented and the tokenomics set out.
TL;dr - for $500 right now you can get your entire genome sequenced, stored in a vault to earn you passive income, if you agree to each query. But wait for the NFT vs buying directly, it will have more perks.