The complete sequence of a human Y chromosome (nature.com)
498 points by birriel on Aug 25, 2023 | 220 comments



For those who don't recall: Back in the Dark Ages, there was a race to decode the human genome. The leading (and wealthiest) competitors were Celera Genomics and the Human Genome Project. After some time, Celera (headed by Craig Venter) announced they had done the deed. However, what Celera had actually done was use what they called a "shotgun method": they took small samples here and there, then built a model of the genome with various statistical shenanigans. About five years later, after all the hype, they admitted they had not sequenced the entire genome.

Ref: https://www.technologyreview.com/2007/09/04/223919/craig-ven...


As noted, the shotgun sequencing wasn't any worse than the traditional chromosome walking method used by the other human genome project, and neither created a truly complete sequence, which is only possible today with improvements in technology.

And it is worth noting that despite all the critiques of shotgun sequencing at the time as "cheating", chromosome-walking sequencing is dead, and shotgun is what we use today. If anything replaces it, it will be long-read technology like Oxford Nanopore sequencing, although we aren't really there yet (you still need to assemble that data and don't really get end-to-end sequences yet).


I feel a joke coming on.

Mother: "Amy, dear, you hardly touched any of your vegetables. Most of them are still on your plate."

Amy: "That's not true, Mom; I completely sequenced them into my tummy using the shotgun method."


Someone submit this to smbc comics !


yes! or XKCD


Not true now with PacBio Revio and new Oxford Nanopore chemistry and flow cells. We still use a bit of older Illumina data for polishing, but long-read methods are now just as accurate. Hi-C and Bionano are more important as supplements than 150-nucleotide paired-end reads.

And the cost of a T2T (telomere-to-telomere) assembly using a combination of these technologies is already well under $10,000 in reagent costs. The assembly is still a complex art, especially for a messy chromosome like Y.


TIL that. I was 100% sure the DNA was completely sequenced. Even have a friend with a sequencing company. It is incredible how ignorant one can be. Thank you 10g1k and HN.


FWIW, the DNA was 100% sequenced. However, a proportion of the reads that came back from the sequencing machine were too similar to each other, so it was impossible to assemble them into a coherent, full set of contiguous chromosomes.

Imagine it like putting together a jigsaw puzzle, except that the picture on the front of the puzzle has lots of repeating motifs, so you end up with multiple pieces that look identical. You won't be able to tell how the whole thing fits together, but you can assemble the bits that are well-behaved and unique.

Modern technology gives us larger jigsaw pieces, which allows us to distinguish between almost-identical parts of the puzzle better. But I would note that the project linked did a huge amount of sequencing using very expensive methods to be able to resolve the whole thing.


For anyone interested, the problem mostly boils down to the sequence length we can read.

Modern sequencing techniques allow us to read 200 to 500 bases at a time. So after that we need to find a way to arrange these short sequences into a single sequence. And this can be pretty hard, especially when you are doing 'de novo' assembly[0].

Besides that, there is the fact that some regions of DNA are repeated[1].

[0] - https://en.wikipedia.org/wiki/Third-generation_sequencing#De...

[1] - https://en.wikipedia.org/wiki/Repeated_sequence_(DNA)
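
A toy illustration of why those repeats hurt (a minimal sketch with made-up sequences, not real assembler code): reads drawn from different copies of a repeat can be byte-for-byte identical, so an assembler has no way to tell which locus they came from.

    from collections import Counter

    # Toy "genome" with the same 10-base repeat in two different places.
    genome = "ACGTACGT" + "TTTTTTTTTT" + "GGCCGGCC" + "TTTTTTTTTT" + "AACCAACC"

    def shred(seq, read_len=10, step=2):
        """Chop a sequence into short overlapping reads."""
        return [seq[i:i + read_len] for i in range(0, len(seq) - read_len + 1, step)]

    reads = shred(genome)
    # Reads that occur more than once came from the repeated region;
    # nothing in the read itself says which copy it belongs to.
    print([r for r, n in Counter(reads).items() if n > 1])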


You are a bit behind on the available read lengths. The Oxford Nanopore can produce reads as long as 300kb (k as in kilo) and the PacBio HiFi can produce 15 to 20kb reads. Finally, a now-defunct technology by 10x Genomics called linked-read sequencing also permitted longer regions to be sequenced using short reads.


Thanks for pointing that out.

My knowledge cut-off in this domain is around 2018. So it's not surprising that things have moved on.


So if one orders WGS at 30x or 100x from someplace like Nebula, what do you actually get?


Unless they're explicitly saying that you'll be getting long read sequencing, what you'll be getting is paired-end short read sequencing, likely Illumina (although similar output can be achieved using BGI and Element machines as well). You'll be getting fragments of DNA which have been size-selected to around 400-500bp in length and then sequenced 150bp from both ends, with an unsequenced gap of unknown length in the middle (and possibly an overlap if the fragment is smaller than 300bp), sufficient to cover the mappable parts of the reference genome with around 30 or 100 sequence reads on average.

That data can be supplied in a couple of FASTQ files (usually gzipped). They may offer to align that against a reference genome for you, where you'll get a BAM file - make sure you have the same reference genome handy to compare against. They may also offer to detect variants, that is places in the sequenced DNA that are different to the reference, in which case you'll get a VCF file.

Because the sequencing uses short reads, it will not be able to resolve parts of the genome that are repetitive with a repeat unit longer than ~150bp. You'll have all the jigsaw pieces from the puzzle, but you won't be able to reconstruct some parts of the picture. Long read sequencing can help with those, but that's more expensive.
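
For anyone curious what that deliverable looks like on disk, here is a minimal sketch of walking paired-end FASTQ files (the gzipped four-line-per-record layout is the usual convention; the R1/R2 file names are just assumptions for illustration):

    import gzip
    from itertools import islice

    def fastq_records(path):
        """Yield (header, sequence, quality) tuples from a gzipped FASTQ file."""
        with gzip.open(path, "rt") as fh:
            while True:
                chunk = [line.rstrip("\n") for line in islice(fh, 4)]
                if len(chunk) < 4:
                    return
                header, seq, _plus, qual = chunk
                yield header, seq, qual

    # Paired-end data ships as two files; read i in R1 pairs with read i in R2,
    # i.e. the two ~150bp ends of the same ~400-500bp fragment.
    pairs = zip(fastq_records("sample_R1.fastq.gz"),
                fastq_records("sample_R2.fastq.gz"))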


Found out you were wrong on the day you became right!


Thought so too, turns out we're right now


I'm not really sure what it even means to be 100% sequenced.

Does that mean that every possible polymorphism is accounted for?

Does that mean that 100% of possible polymorphic sites are known?

What about weird edge cases. You have a tandem repeat that is known, but in some small population of humans, there's two tandem repeats with an island of something else in there.


Yes, but the HGP didn't make a full sequence either, using their scaffold-based contig assembly. Both groups declared a truce and announced they were "finished with the first draft".

35 years ago: https://www.nytimes.com/1987/12/13/magazine/the-genome-proje...

33 years ago: https://www.nytimes.com/1990/06/05/science/great-15-year-pro...

29 years ago: the competition gets fierce https://www.nytimes.com/1994/02/22/science/scientist-at-work...

24 years ago: the race is alive! https://www.nytimes.com/1999/03/23/science/who-ll-sequence-h...

23 years ago: draft is complete https://archive.nytimes.com/www.nytimes.com/library/national...

20 years ago: https://www.nytimes.com/2003/04/15/science/once-again-scient...

2 years ago: https://archive.nytimes.com/www.nytimes.com/library/national...

The Celera approach using shotgun was tricky because assembling the bits required a great deal of computational finesse and horsepower; "The final assembly computations were run on Compaq’s new AlphaServer GS160 because the algorithms and data required 64 gigabytes of shared memory to run successfully."

At the time the GS160 was the monster machine, a classic big iron UNIX, which could be tightly clustered, with a unified IO/filesystem between all the machines. The primary author of the shotgun assembly was Gene Myers, who previously had written BLAST, and invented the Suffix Array with Udi Manber.

The public project had its own issues, as many of the teams assigned to work on it still had the "cottage industry/artisanal/academic" approach, then Eric Lander came along and turned it into an industrial process, and parlayed that into running the Broad Institute, a privately funded MIT/Harvard research institute in Boston.

These days petabytes of sequence data are generated every day and stored in clouds. The genome has been a fundamental tool for shaping our studies of humans, although its true potential for understanding complex phenotypes remains elusive.


Let's not forget how Perl Saved The Human Genome Project [1], which I remember seeing in a print magazine at the time.

As someone who is in the genomics world now as a software person, I find it amusing that I've gotten more into perl over the last year or so. It has its place, in the way that grep/sed/awk/etc do.

[1] https://news.ycombinator.com/item?id=30327812


Lincoln Stein is great, and perl was certainly critical to many processes, but it was and remains fairly niche in genomics, which used much more C++, Java, and later Python.

IMHO the person who "saved the genome project" was WJ Kent, who developed the assembler, BLAT, that the public project needed. I strive to point out that he wasn't a sole hero, nor was Lincoln. What I really like about BLAT is that while Celera was using Big Iron UNIX (massive 64-bit 64GB machines with 10s of terabytes of central high performance storage), BLAT ran on a cluster of Linux machines, right around the time that people were waking up to the fact that Linux was becoming a useful tool for scientific data processing. BLAT's design allowed it to work on a cluster of cheaper/smaller machines, while Celera's algorithm really needed a massive shared memory single machine. It was sort of a microcosm of the larger battle being fought between big iron UNIX and little Intel Linux at the time.


I had the privilege of working a co-op term at Stein's lab at the OICR. I encountered quite a bit of Perl during my short time there, and have yet to see it elsewhere in the modern enterprise tech world. BioPerl in particular stands out as a fairly substantial project in the bioinformatics space.


Amazon still has some Perl kicking around, mainly on the main website. I don't think there's any new development though, just maintenance and gradual replacement.


I agree from the human genome sequencing perspective and the superlatives of saving the HGP. But at the time I was a standard software dev who thought the topic was cool, as was this newfangled language named Perl.

However, I got into the genomics world in the early aughts, and the perl hung on and on and on. I remember starting a new job in the mid-teens, and one of the first tasks I had was to port over a legacy perl script. Such is the world of scientific software. 99% of the people I know who have touched perl since ~2003ish are in the bioinformatics space.


Small correction: BLAT is a local alignment tool Jim Kent also wrote. I think his assembler you're referring to is GigAssembler (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC311095/).


I also love that all of their genomic tools are released as a virtualbox virtual machine and their entire site can basically be run locally with personal genomic data. I think it’s called genome in a box.

And if I remember correctly BLAT was so useful because it could be run on machines with less cpu power by loading more data into memory…or was it the other way around?


I know hn isn't the place for memes, but, I'm obligated. Imagine what they could do with a beowulf cluster of pentium pros?


There was a NYTimes profile at the time of Kent's efforts, and they IIRC literally got the UC Santa Cruz administration to hand over desktop computers destined to go to various people for a few months to assemble into a cluster. So it was likely a bunch of pentium pros or similar. Not a Beowulf cluster, of course, but whatever clustering software Kent had hand rolled out of C for the purpose.


They weren't using Pentium Pros, they were using Alphas, because the memory interconnect between a Beowulf cluster of Pentium Pros would be orders of magnitude slower. So likely 'the same, but much slower'.


oh, ha. yeah, that was sort of the root of the joke. companies like Celera, or whoever had the money would write a (large) check for big iron.

smaller organizations were realizing they could get slower computers pretty cheap, and they could get a whole lot of them. There was a little flurry of activity with flocking algorithms, and Objective-C had a little renaissance due to swarm computing. Probably the most famous and lasting was MapReduce, from Alphabet, but they were called Google back then.

there were a bunch of clever little tricks, like channel bonded network cards to make multiple cards look like one fast card, so you could double or triple bandwidth.

The beowulf cluster joke was kind of the spirit of the scrappy, make something cool out of junk approach while calling companies like Sun, HP, and Compaq dinosaurs. And it continued for a while with people building stuff like this - https://ncsa30.ncsa.illinois.edu/2003/05/ncsa-creates-sony-p... I think there was some weirdness with export controls of PS2s for this kind of stuff.

I don't remember what pc chip was the new hotness back then, it was a good 20 years ago and my memory is dim.

but that, I think, captures the gist of the meme.


www.clustercompute.com


I feel like this is the one domain where the beowulf cluster meme is still funny


And Jim Kent!


Re: Perl, I very much agree. I've mostly switched to Ruby and `grep -P` but Perl really does have a place. `perl -pie` is still the most expressive/powerful option in some cases. Even if not directly used much anymore, I think most people vastly underappreciate how much Perl affected its successors. Even for its historical significance it's worthy of admiration.


Hi, I'm in software and looking to get into genomics/biotech. I'd like to learn more about how you made the move. Is there a way I could dm you? I saw no email in your profile.


  > The primary author of the shotgun assembly was Gene Myers
No conflict of interest there!


His wife, Chromosome, was an uncredited contributor.


> The primary author of the shotgun assembly was Gene Myers, who previously had written BLAST...

So... um... BLAST processing?


I think BLAST is for mapping sequences onto existing genomes, which uses similar algorithms but applied for a different purpose.


My recollection of the contentious issues at the time are that a) the Celera project needed the information being made public by the NIH/international Human Genome Project in order to assemble sequences, but was keeping secret its own results, and b) it was intent on acquiring IP rights before the information entered the public domain.


Yes this is correct. Shotgun sequencing needs to be aligned to a reference which in this case was publicly free to access.


This comment is needlessly negative. What they did had value and they did not hide what they did either, at least that wasn't my impression. The statistical "shenanigans" is better termed as an innovation.


Pretty sure that the announcement was that the sequence was "essentially complete" (or something like that). Here's a press report from the time that provided the caveat:

http://news.bbc.co.uk/2/hi/science/nature/2940601.stm

> The remaining tiny gaps are considered too costly to fill and those in charge of turning genomic data into medical and scientific progress have plenty to be getting on with.


I don't think this was a secret. I remember reading an article (IIRC in Reader's Digest) some 20 years ago which specifically referred to it as a "shotgun method". I wasn't (and still am not) enough of an expert to have known that was a less authentic/accurate method, but at least at the time the tone was not that they were doing anything illicit.


If you go over the maths on the shotgun method, not taking into account systematic difficulty in sequencing (like tandem repeats), the likelihood of incomplete coverage vanishes much more rapidly than you'd expect, and after about 5-10x coverage volume, coverage is basically linear in the total amount of sequencing.
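
A minimal sketch of that arithmetic, using the idealized Lander-Waterman model (random read placement, no repeats, no bias - all assumptions): the expected fraction of bases left uncovered at average depth c is roughly e^-c, which collapses fast.

    import math

    # Expected uncovered fraction of a genome at average coverage depth c,
    # under the idealized model (reads land uniformly at random).
    for c in (1, 2, 5, 8, 10):
        print(f"{c:>2}x coverage -> ~{math.exp(-c):.3%} of bases expected uncovered")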


Shotgun sequencing could be better explained than "statistical shenanigans": imagine blasting a huge collection of identical long strings with a shotgun, each breaking into different fragments, and realizing overlaps in the fragments, collectively, can be enough to uniquely restore one copy of the original string.
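
A minimal sketch of that reconstruction step, assuming error-free fragments and unambiguous overlaps (real assemblers are vastly more sophisticated): keep merging the two fragments with the largest suffix/prefix overlap until one string remains.

    def overlap(a, b):
        """Length of the longest suffix of a that is also a prefix of b."""
        for k in range(min(len(a), len(b)), 0, -1):
            if a.endswith(b[:k]):
                return k
        return 0

    def greedy_assemble(fragments):
        frags = list(fragments)
        while len(frags) > 1:
            # Find the ordered pair with the largest overlap and merge it.
            k, i, j = max((overlap(a, b), i, j)
                          for i, a in enumerate(frags)
                          for j, b in enumerate(frags) if i != j)
            merged = frags[i] + frags[j][k:]
            frags = [f for idx, f in enumerate(frags) if idx not in (i, j)] + [merged]
        return frags[0]

    # Overlapping "shotgun" fragments of one original string:
    print(greedy_assemble(["the_quick_bro", "ck_brown_fox", "own_fox_jumps"]))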


Anybody familiar with news sources which questioned Celera's accomplishment before they admitted how they did it?


Parent is weird in claiming Celera was hiding things. The second half of the 90s was a very vocal spat between Celera and the HGP about sequencing methods and Celera saying shotgun sequencing was superior. And the HGP was publishing updates to their genome every 24 hours while Celera published their sequence on completion of the initial draft, which showed their coverage. The fact that there wasn’t 100% coverage wasn’t hidden from anyone who read the published papers. (The abstract for the Celera paper even states “reduced the size and number of gaps”, which plainly indicates there were still gaps…)


Huh, narration matters. Thanks.


http://news.bbc.co.uk/2/hi/science/nature/2940601.stm

> The decoding is now close to 100% complete. The remaining tiny gaps are considered too costly to fill and those in charge of turning genomic data into medical and scientific progress have plenty to be getting on with.

OP is being silly. Nobody was fooling anyone.


[flagged]


I'm sorry, are you saying that the media, whose job it is to report and not to serve as experts, are useless ignoramuses, or that Celera, which lied to the public, are useless idiots?

If the first, time to let that dead dog lie. It's a tired trope.


Not a tired trope because that dog isn't dead. Until the media either 1. has no influence or 2. stops being dishonest then it needs to constantly be called out and berated.


What if it is apparent that there are other groups pushing the narrative that the media is untrustworthy so that they themselves can own the narrative?

Any given group of humans will have a mixture of deceptive and trustworthy participants. Singling one group out for constant castigation, held against an impossible standard, makes it easier for dishonest members of other groups to avoid the spotlight.


It doesn’t matter, because “the media” doesn’t work as a concept. It’s a bunch of amateurs trained on nothing more than hiding the fact that they are not an expert.

How much of the reporting from the media revealed that Theranos was a fraud? Anything that wasn’t was fucking useless because it’s just a megaphone for corporate PR, despite what they think.

The entire concept of a weekly news is risible, let alone daily news. A fully investigated story actually worthy of the term journalism requires probably 2 employee years worth of actual work.

Look at how many articles the NYTimes publishes daily, multiply it by 700, and then compare that number with the actual number of journalists they have. It’s a complete farce.


Aren't we all a bunch of amateurs hiding our lack of true expertise? Each passing decade I feel that all the more. I think you've described the human condition.


"The media" isn't a uniform entity, but when people refer to it in this context they typically mean the large consolidated media corporations that own major print and television outlets. And those entities have essentially eroded their own credibility to the point that trusting anything they say is foolish. If they say the sky is blue, go out and check.

They could do better, but until they do, it makes sense to castigate them when they err, because avoiding that reputational harm is their incentive to do better. Or, if they fail to do better, the thing that should ultimately kill them so they can be replaced with something that can do better.


The larger news outlets are some of the few remaining places that actually hold their news rooms to some kind of factual standard and facsimile of objective reality. If NYT or the Economist or the BBC tells me the sky is blue, I'm pretty sure it is. If they report on some scientific topic the reporter is not an expert on, I understand that there's probably some signal lost. If they publish something on a contested political topic, I understand there might be some slant in the direction of their overall political leaning. I also understand that they have an interest in the direction of their corporate ownership and host country. But there are plenty of other outlets trying (and succeeding) at capturing audiences who have all those same conflicts of interest, and more, but do not hold themselves to a factual standard.

While media companies have indeed "eroded their own credibility", in that there's a steady stream of falseness alongside truth, I would add with equal measure that "their credibility has been undermined by others". The loudest voices that castigate the media, in America at least, tend to be the ones that profit the most from being the sole arbiter of truth for their own audiences.

The business model of having an autonomous news room that holds itself to factual standards and is largely sequestered from advertising is shrinking and shrinking. I mourn its death, for I see no replacement coming along.


> The business model of having an autonomous news room that holds itself to factual standards and is largely sequestered from advertising is shrinking and shrinking.

In large corporations it's effectively already dead, because many of the "trusted" names have already abandoned it, and you can't even tell which ones they are without being inside the newsroom to know if stories there get spiked on behalf of advertisers.

Where you can still get this is the likes of Substack, where you're paying a subscription to someone who doesn't have advertisers.

> The loudest voices that castigate the media, in America at least, tend to be the ones that profit the most from being the sole arbiter of truth for their own audiences.

That's to be expected when it benefits them. But their criticisms would sure have a lot less weight if they weren't accurate.


I don't wholly disagree with you, and in fact think you have a very good and thoughtful perspective (FWIW), but I do have some thoughts.

I think there's too much focus on advertising and corporate interests as the thing that undermines journalism. There's such a thing as too little focus, but right now there's too much focus.

For example, I have seen people lionize particular doctors, or researchers, or climate scientists, or what-have-you, who hold contrarian opinions on particular subjects. Clearly those people, in the minds of those who lionize them, have fewer conflicts of interest than advertising-driven journalists or powerful government officials.

The problem is that there are powerful motivations for people to lie or deceive that are equal in power to advertising money. Power through influence. Power through self-righteousness.

For example: I've looked into various vaccine conspiracy theories that have been touted online, and some of them come from highly credentialed people, sometimes with excellent historical track records in medical fields, who none-the-less are playing fast and loose with the data, or drawing conclusions that cannot be backed by research. The people who tout those theories look to those originators, those highly credentialed people, and say, "Look, there's Dr. So and So, they aren't part of Big Pharma, so they must be on to something". Then I look into Dr. So and So's blog, and look at the research they're citing, realize they're misinterpreting, or exaggerating, or what-have-you. In short, they are deceiving the public. They aren't financially motivated to do so. So why do they do it? Well, there's plenty of reasons. A desire to be right. A desire to be relevant and respected. A desire to have an audience. I think these things are easily as powerful, and often more powerful, than advertising money held over a journalist at a large media company.

And I see benefits to journalists working in groups with editorial control. I've seen great authors go astray in terms of quality when they start self publishing, or somehow become influential enough that they can shrug off their editors. I believe any journalist would be subject to the same forces of ego, influence, etc. If a journalist is making their money through Substack subscriptions, that may be even more of an existential threat than one faced by a journalist at the NYT, who can more readily absorb the financial hit of a story that goes nowhere. The subscribers, after all, now have expectations of what they will get from that journalist. Juicy stories -- at any cost?


> If NYT or the Economist or the BBC tells me the sky is blue

Since these journalists are spending their time on Twitter and never set foot outside in the real world, no, I don't trust anything they say.


I think Hanlon's razor applies here. Ignorance is not dishonesty.


I'd rather these ideas spread to others, potentially younger readers, than never have to read a trope again. I'm not sure it will ever change if we just accept it and never discuss it again.


In a sibling comment I give about 6 articles published over 3 decades by a single outlet, the Old Grey Lady (NY Times), and the competition was widely covered at the time.


Could someone explain exactly what it means to "completely sequence" the human genome when all humans have distinct genetic makeup (i.e., different sequences of nucleobases in their DNA/RNA)?


The public Human Genome Project used a group of people but most of the sequence library was derived from a single individual in Buffalo, NY. The Celera project also used a group of people but it was mostly Venter's genome.

https://www.nytimes.com/2002/04/27/us/scientist-reveals-secr...

I believe more recent sequencing projects have used a wider pool of individuals. I think some projects pool all the individuals and sequence them together, while others sequence each individual separately. This isn't really so much of a problem, since the large-scale structure is highly similar across all humans and we have developed sophisticated approaches to model the variation in individuals; see https://www.biomedcentral.com/collections/graphgenomes for an explanation of the "graph structure" used to represent alternatives in the reference, which can include individual single-nucleobase differences as well as more complex ones, from large deletions in one individual to rearrangements and even inversions.
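
A minimal sketch of what such a graph looks like (hand-rolled dictionaries, not any real pangenome tool or file format): a single-base difference becomes a "bubble" of two alternative nodes, and each haplotype is a path through the graph.

    # Reference ...ACGT[A]TTCA..., with a known A->G variant at the bracketed site.
    nodes = {1: "ACGT", 2: "A", 3: "G", 4: "TTCA"}
    edges = {1: [2, 3], 2: [4], 3: [4], 4: []}

    def spell(path):
        """Concatenate node labels along a path to recover one haplotype."""
        return "".join(nodes[n] for n in path)

    print(spell([1, 2, 4]))  # ACGTATTCA  (reference allele)
    print(spell([1, 3, 4]))  # ACGTGTTCA  (alternate allele)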


We really should say "a human genome". Reference genomes serve as a Rosetta Stone of genomics. So we can take DNA/RNA sequences from other individuals and align (pattern match) them to the reference as a way of understanding and comparing individuals.

It is not perfect, as a reference can be missing DNA regions or have large variability in them. The goal of the Human Pangenome Reference Consortium (HPRC) https://humanpangenome.org/ is to sequence individuals from different populations to address this issue. We are also working to develop new computational models to support analysis of data across populations.


They mean they have obtained the complete sequence for a particular Y chromosome that is considered to be a "reference" chromosome. This is similar to what was done for all the other chromosomes.


I've never understood this either. I assume the genome is many megabytes of [ATCG]+. If we have that sequence, what does it tell us? Do we look at it and say "Ah, yes, ...ATGCTACGACTACGACTAGCG... very interesting?"


Many genes are highly conserved or consistent enough. E.g.: if there's a 1% difference between two people, then it's a bit like two very unique sentences that have a couple of small typos. They're still recognisable, and it's also still pretty obvious that they're the "same".

A gene sequence allows researchers to determine the amino acids that are coded for, and from those, which proteins match which genes.

This can be matched up with genetic diseases. If you know that damage to a certain location in a chromosome causes a problem with a certain biological process, then ergo, the associated protein is needed for that process!

So: genetic illness -> gene sequence -> protein -> role in the body

Without sequencing, that chain can't be built.


But you can only know that by having a large sample of very “stable” (have few genetic irregularities) gene samples compared to a large pool of samples from people with very narrow and pronounced gene irregularities, right?

Is this why it’s so hard? This feels more like a healthcare records keeping people and less like an “actually reading the data problem”.

I can’t help but feel like some form of single payer healthcare is truly the way out of this problem. One where all disease record keeping is uniform and complete.


Single payer healthcare here (UK) is still subject to privacy controls in a way which would make it very difficult to do that.

(Also our health system's IT is a hellscape, but one reason for that is that people would literally rather not have a working system at all, than one with less than impeccable privacy controls.

Personally I'd gladly sacrifice a fair bit of medical privacy in return for giving scientists greater insight into disease processes, but the average citizen here wants advanced healthcare without giving their data to research scientists. /facepalm )


I trust the scientists, the problem isn't them. Look at the whole abortion data scandal in the U.S.


The problem there is the US' insane theocrat-conservatives (or just misogynist assholes hiding behind a thin veneer of religious justification, as the case may be).

I'm not saying a health IT system should have no privacy controls either. But the requirements for such controls need to be balanced against having a system that actually works, and that means having some people who actually understand the tech, and the workings of hospitals, having a role in requirements conversations. Instead it was dominated by MPs, "patient advocacy" groups and privacy campaigners, none of whom know or care anything about how to build a workable system.


> But you can only know that by having a large sample of very “stable” (have few genetic irregularities) gene samples compared to a large pool of samples from people with very narrow and pronounced gene irregularities, right?

No. A "gene" isn't a single A/G/C/T; it's a sequence of 1,000-1,000,000 base pairs. Each gene has a well-defined start/stop sequence called a start/stop codon. When people have genetic differences, one (an SNP, a single nucleotide polymorphism) of the tens of thousands of base pairs in that gene is different. Even for genes that are entirely "missing" in some people, they're really just different in a way that makes them nonfunctional.

Does that make it obvious how sequencing all those genes is useful, even if everyone has different genes? It tells us 99.999% of how proteins are coded, even if individual variation is the other .001%.
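
A minimal sketch of what those well-defined start/stop signals buy you computationally (a toy single-strand ORF scan; real gene finding also has to handle the reverse strand, introns, and much more):

    STOP_CODONS = {"TAA", "TAG", "TGA"}

    def find_orfs(dna, min_codons=10):
        """Scan one strand for ATG...stop stretches in all three reading frames."""
        orfs = []
        for frame in range(3):
            i = frame
            while i + 3 <= len(dna):
                if dna[i:i + 3] == "ATG":                     # start codon
                    j = i + 3
                    while j + 3 <= len(dna) and dna[j:j + 3] not in STOP_CODONS:
                        j += 3
                    if j + 3 <= len(dna) and (j - i) // 3 >= min_codons:
                        orfs.append((i, j + 3))               # include the stop codon
                    i = j + 3
                else:
                    i += 3
        return orfs

    print(find_orfs("ATG" + "GCT" * 12 + "TAA", min_codons=5))  # [(0, 42)]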


It’s actually about 3 gigabases (ATCG). There are some recurrent features of the genome whose function we’ve worked out. For example the TATA box is a classic sequence that typically indicates the start of a part of the genome that codes for a protein. The vast majority of the genome doesn’t code for proteins. The function of these genome regions is much more murky. Some of these regions function like scaffolds for proteins to assemble into complexes. These protein complexes then start transcribing the genome into mRNA. So the genome regulates its own expression, in a sense. Many of the sequences that function in this way are known. There are also just a bunch of parts of the genome that probably don’t do anything. There are also many regions of the genome that are basically self replicating sequences. They code for proteins that are capable of inserting their own genetic sequence back into the genome. These are transposons.

In short, a lot of very painstaking genetics and molecular biology work has gone into characterizing the function of certain sequences.


Also interesting are HERVs - human endogenous retroviruses which integrated into the human or our ancestor species’ genomes. They have degraded over time so none of the human HERVs seem to be capable of activating, but there are some in other mammals that can fully reactivate.

In humans, even though HERVs don’t reactivate into infectious viruses, they have been implicated in both harmful (senescence during aging[0]) and beneficial (protection from modern retroviruses)[1] activities in the body.

They might be up to 8% of the human genome.

0: https://www.cell.com/cell/pdf/S0092-8674(22)01530-6.pdf

1:https://www.microbe.tv/twiv/twiv-956/


For the same reason Monsanto sequences basically anything: Because we can tell what proteins are encoded in there, and what is near them, and we can have good ideas of what proteins are expressed together. When dealing with genetic modification, we get to see whether our modification went in, and where it landed: Having a protein in a genome isn't enough. Its expression might be having an effect on other things, depending on where it is.

When we have baselines, we can compare different individuals, and eventually make predictions of how they are going to be based solely on the genetic code. If I know that a certain polymorphism is tied to some trait I want, I might not have to even bother spending the time growing a plant: I know that it's not what I want, and discard it as a seed.

With humans we are probably not going to see much modification soon, but just being able to detect genetic diseases, risk factors for other diseases that have genetic components, or allow for selection of embryos in cases of artificial insemination is already quite valuable.

It's not source code that we are all that good at understanding just yet, but there's already some applications, and we have good reason to think there's a lot more to come


It's just about 3 gigabytes (each byte a letter). Pretty mind-blowing, if you ask me.


It's a slight exaggeration of the information content to report the data size using an ASCII encoding. Since there are 4 bases, each can be encoded using 2 bits, rather than 8. So we're really talking 750 megabytes. But still mind-blowing.
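
A minimal sketch of that 2-bits-per-base packing (toy encoder; real formats like 2bit or CRAM also have to deal with N bases, case, and metadata):

    CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

    def pack(seq):
        """Pack an ACGT string into bytes, four bases per byte."""
        out = bytearray()
        for i in range(0, len(seq), 4):
            byte = 0
            for base in seq[i:i + 4]:
                byte = (byte << 2) | CODE[base]
            out.append(byte)
        return bytes(out)

    print(len(pack("ACGT" * 1000)))  # 1000 bytes for 4000 bases, i.e. 4x smaller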


And since the data is highly redundant the 750MB can be compressed down even further using standard approaches (DEFLATE works well, it uses both huffman coding and dictionary backreferences).

Or, you could build an embedding with far fewer parameters that could explain the vast majority of phenotypic differences. the genome is a hierarchical palimpsest of low entropy.

My standard interview question- because I hate leetcode- walks the interviewee through compressing DNA using bit encoding, then using that to implement a rolling hash to do fast frequency counting. Some folks get stuck at "how many bits in a byte", others at "if you have 4 symbols, how many bits are required to encode a symbol?", and other candidates jump straight to bloom filters and other probabilistic approaches (https://github.com/bcgsc/ntHash and https://github.com/dib-lab/khmer are good places to start if you are interested).
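
For the curious, a minimal sketch of the exact-counting version of that exercise (2-bit encoding plus a rolling hash; the bloom-filter/ntHash variants linked above are the probabilistic next step):

    from collections import Counter

    BASE = {"A": 0, "C": 1, "G": 2, "T": 3}

    def count_kmers(seq, k=21):
        """Count k-mers with a rolling 2-bits-per-base hash (no N handling - toy code)."""
        counts = Counter()
        mask = (1 << (2 * k)) - 1            # keep only the low 2k bits
        h = 0
        for i, base in enumerate(seq):
            h = ((h << 2) | BASE[base]) & mask   # slide the window one base to the right
            if i >= k - 1:
                counts[h] += 1
        return counts

    print(count_kmers("ACGTACGTACGTACGTACGTACGT", k=4).most_common(2))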


I'm curious if these 750MB + the DNA of mitochondria + the protein metagenomics contain all the information needed to build a human, or if there's extra info stored in the machinery of the first cell.

That is, if we transfer the DNA to an advanced alien civilization, would they be able to make a human?


This is a complex question. The cocktail soup in a gamete (sperm or egg) and the resulting zygote contains an awful lot of stuff that would be extremely hard to replace. I could imagine that if the receiving civilization was sufficiently advanced and had a model of what those cells contained (beyond the genomic information) they could build some sort of artificial cell that could bootstrap the genome to the point of being able to start the development process. it would be quite an accomplishment.

If they just received the DNA without some information about the zygote, I don't think it would be practical for even advanced alien civilization (LR5 or LR6) but probably an LR7 and definitely an LR8 could.


I’m just pondering this, and it’s not clear to me that there is anything intrinsic in the genome itself that explicitly ’says’ “this sequence of DNA bases encodes a protein” or even “these three base-pairs equate to this amino acid”.

I wonder if that information could ever really be untangled by a civilisation starting entirely from scratch without access to a cell


If you knew what DNA was and had seen a protein you could easily figure out start/stop codons. If you had only seen something similar it would be harder. If you had nothing similar, I don't know.

Coding DNA and non-coding DNA looks very different. Proteins are full of short repetitive sequences that form structural elements like alpha helixes: https://en.wikipedia.org/wiki/Alpha_helix

Once you've identified roughly where the protein-coding genes are it would be trivial to identify 3'/5' as being common to all those regions. You could pretty easily imagine a much more complicated system with different transcription mechanisms and codon categories, but earth genomes are super simple in that respect. Once you have those you just have the (incredibly complex) problem of creating a polymerase and bam, you'll be able to print every single gene in the body.

Without the right balance of promoters/factors/polymerase you probably won't get anything close to a human cell, but you'd be able to at least work closer to what the natural balance should be, and once you get closer to building a correct ribosome etc the cell would start to self-correct.


It’s an interesting question. Naively, I would expect it to be about like reverse engineering a CPU from a binary program. Which sounds daunting but maybe not impossible if you understand the fundamentals of registers, memory, opcodes, etc.

But… doing so from first principles without a mental model of how all (human) CPUs work? I guess it comes down to whether the recipients had enough context to know what they’re looking at.


Yes, it's intrinsic in the genome but implemented through such a complicated mechanism that attempting to understand these things from first principles is impractical, not impossible.

In genomic science we nearly always use more cheaply available information rather than attempt to solve the hard problem directly. For example, for decades, a lot of sequencing only focused on the transcribed parts of the genome (which typically encode for protein), letting biology do the work for determining which parts are protein.

If you look at the process biophysically, you will see there are actual proteins that bind to the regions just before a protein, because the DNA sequences there match some pattern the protein recognizes. If you move that signal in front of a non-coding region, the apparatus will happily transcribe and even attempt to translate the non-coding region, making a garbage protein.


What do you mean by "LR"? I queried an LLM but no results there either.


It's likely just a typo. LR5 "civilisation"/"civilization" brings up nothing on google. I don't know why you would expect an LLM to know more.

Based on the way the person is using it, it does not seem to equate to the Kardashev scale, as my peer stated


Since the cat is out of the bag, no, it's not a typo. it's related to Kardashev but is oriented around the common path most galactic civilizations follow on the path to either senescence (LR8.0) or singularity (LR8.1-4). Each level in LR is effectively unaware of the levels above it, basically because the level above is an Outside Context Problem.

Humans are currently LR2 (food security) and approaching LR3 (artificial general intelligence, self genetic modification). LR4 is generally associated with multiplanetary homing (IE, could survive a fatal meteor strike on the home planet) and LR5 with multisolar homing (IE, could survive a fatal solar incident). LR6 usually has total mastery of physical matter, LR7 can read remote multiverses, and LR8.2 can write remote multiverses. To the best of LR8's knowledge, there is no LR9, so far as their detectors can tell, but it would be hard to say, as LR9 implies existence in multiple multiverses simultaneously. Further, faster than light travel and time travel both remain impossible, so far as LR8 can tell.

“An Outside Context Problem was the sort of thing most civilisations encountered just once, and which they tended to encounter rather in the same way a sentence encountered a full stop.” ― Iain M. Banks, Excession

“Unbelievable. I’m in a fucking Outside Context situation, the ship thought, and suddenly felt as stupid and dumb-struck as any muddy savage confronted with explosives or electricity.” ― Iain M. Banks, Excession

“It was like living half your life in a tiny, scruffy, warm grey box, and being moderately happy in there because you knew no better...and then discovering a little hole in the corner of the box, a tiny opening which you could get your finger into, and tease and pull apart at, so that eventually you created a tear, which led to a greater tear, which led to the box falling apart around you... so that you stepped out of the tiny box's confines into startlingly cool, clear fresh air and found yourself on top of a mountain, surrounded by deep valleys, sighing forests, soaring peaks, glittering lakes, sparkling snow fields and a stunning, breathtakingly blue sky. And that, of course, wasn't even the start of the real story, that was more like the breath that is drawn in before the first syllable of the first word of the first paragraph of the first chapter of the first book of the first volume of the story.” ― Iain M. Banks, Excession


If we're at LR2, and each level is effectively unaware of the levels above it, how do we know what LR3/4/5/6/7/8/9 are or might be?

Or do you mean that a civilization at a particular level will always be unaware of civilizations above? That doesn't seem to make sense either; I see no reason why a LR4 civ couldn't have knowledge of a LR5 civ, for example.


Because it's an authorial construct used as a plot device, and thus has only small value of mapping onto the real world.


I'd imagine it might be beneficial to you to read more than one book


oops i've said too much


Or you've said too little?


The code how to build a sperm and an egg is inside the human DNA, isn't it?


Yes, but it currently requires developmentally mature individuals to build the gametes, and the "code" is so complex you couldn't really decipher it from first principles.


The code to build mitochondria is not.


Given code written for unknown hardware... can you execute it?


Given that the code contains the instructions for how to make the hardware: if one is very smart, then yes.


It would not necessarily be possible, because it's incremental instructions on how to make the hardware, based on already existing, unspecified and very complex hardware. So the first instruction would be something like "take the stuff you have on your left and fuse it with the stuff you have on your right", both stuffs being unspecified, very complex proteins assumed to be present.


Imagine a machine shop that has blueprints of components of the machines they use in the shop, and processes to assemble machines from the components. When a machine shop grows large and splits in two, each inherits a half of shop with the ongoing processes and a copy of the blueprints. https://m.youtube.com/watch?v=B7PMf7bBczQ&pp=QAFIAQ%3D%3D

DNA is the blueprints. There are infinite possibilities what to do with them. The advanced civilization would need additional information, like the fact that they are supposed to create a cell from the components to begin with, and a lot of detailed information about how exactly to do that.

Edit: improved clarity


"if we transfer the DNA to an advanced alien civilization - would they be able to make a human."

I'm really surprised that in all these responses to your question no one's mentioned the womb or the mother, who (at least with current technology) is still necessary for making a human.

That's not to mention the necessity of the egg.

We're not just DNA.


This is a question about theoretical possibilities and what you're saying seems to be a rigid belief in an answer "no". But you provided no evidence or justification, except for "with current technology", which answers nothing about the theoretical question.


Artificial wombs have come quite a long way! It is not inconceivable to imagine that you could bring a zygote to term in an artificial womb.


Instructions on how to make a womb and an "egg" are contained within the human DNA.


It is known that that is not true, due to the distinct genetic code of mitochondria and known epigenetic influences of mothers on their children in utero.

You could say “well that's the last 10% of the details, maybe 90% is in the DNA,” but I think I would be suspicious that it's that high, because one of the things we know about humans is that we are born with all of the ova that we will ever have, rather than deferring the process until puberty. I should think that if it could be deferred it would have been, “you will spend the energy to make these 15 years before you need to for no real reason” seems very unlike evolution whereas “my body is going to teach you how to make these eggs, just the same as my mother's body taught me,” sounds quite evolutionarily reasonable.


But maybe you needed a pre-human womb to bootstrap the first human, and we no longer have the blueprint for that...


> That is, if we transfer the DNA to an advanced alien civilization, would they be able to make a human?

You'd need a cell to start the process, with the various nucleic acids distributed correctly and proteins/energy with which to create further proteins using the information encoded by the DNA. Thus the civilization would need information about cells and a set of building blocks before being able to use the DNA.


The DNA contains all the code that creates and regulates the proteins.


Including code for the proteins that read DNA to produce proteins. You might hit similar problems trying to understand C given the source code for a C compiler - a non-standard environment could reproduce itself given the source code, meaning the code alone doesn't strictly determine the output.


I'll torture this DNA and C source code analogy a bit.

Epigenetics is missing in this discussion about reproducing a human from just the DNA. These are superficial modifications (e.g. methylation, histone modification, repressor factors) to a strand of DNA that can drastically alter how specific regions get expressed into proteins. These mechanisms essentially work by either hiding or unhiding DNA from RNA polymerases and other parts of the transcription complex. These mechanisms can change throughout your lifetime because of environmental factors and can be inherited.

So it's like reading C source code, except there are so many of these inscrutable C preprocessor directives strewn all throughout. You won't get a successful compilation by turning on or off all the directives. Instead, you need to get this similarly inscrutable configuration blob that tells you how to set each directive.

I guess in a way, it's like the weights for an ML model. It just works, you can't explain why it works, and changing this weight here produces a program that crashes prematurely, and changing a weight there produces a program with allergic reactions to everything.


There's also postprocessing: varying modification of the RNA, RNA interference, glycosylation and other post transcriptional modifications.


And how will you decode it?


I can’t wait until we can bootstrap a human from a stage 3 tarball.


Yes, there is extra information in the first cells, in particular regulatory elements such as miRNAs. The headline here is epigenetics.


There's also some interesting work on understanding the role of loops in the physical structure of the DNA storage on gene expression. [0] The base sequence of the DNA isn't everything; it may also matter how the DNA gets laid out in space---a feature which can be inherited.

[0] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2638769/


Our DNA does not contain the mitochondria nor the gut bacteria so the raw data would most certainly not be enough to build a working copy


It's a bit like: if I have the source code of Linux (think DNA), can I build a machine running Linux (think cell)? No, you can't; you need to have a machine that can run the code.

i.e. "software" without a "machine" to run it on is kind of useless.


Yes, and if you gzip it it's even smaller. But the big takeaway is that the amount of info that fully defines a human, is what we consider "not much data," even in its plainest encoding.


We don't know that it fully defines a human until we can create one without the starting condition of being inside another human. It's prototype-based inheritance.


Some of the research about being able to make simple animals grow structures from other animals in their evolutionary “tree” by changing chemical signaling—among other wild things like finding that memories may be stored outside the brain, at least in some animals—makes me think you need more than just the “code” to get the animal that would have been produced if that “code” were in its full context (of a reproductive cell doing all sorts of other stuff). Even if the dna contains the instructions for that reproductive cell, too, in some sense… which instructions do you “run”? There might be multiple possible variants, some of which don’t actually reproduce the animal you took the dna from.


My favorite trivia here is that flamingos aren't actually "genetically" pink but "environmentally" pink because they pick up the color from eating algae.

Except of course "genetics" and "environment" aren't actually separate things; sure, people's skin color isn't usually affected by their food, but only because most people don't eat colloidal silver.

https://en.wikipedia.org/wiki/Paul_Karason


AFAIK most poisonous frogs also aren’t “naturally” poisonous—they get it from diet. Ones raised in captivity aren’t poisonous unless you go out of your way to feed them the things they need to become poisonous.


bzip2 is marginally better, and then genome-specific compressors were developed, and then finally, people started storing individual genomes as diffs from a single reference, https://en.wikipedia.org/wiki/CRAM_(file_format)

Since raw read files contain more data than just ATGC (typically a comment line, then a DNA line, then a quality score line), and each of those draws from a different distribution, DEFLATE on a FASTQ file doesn't reach the full potential of the compressor because the huffman table ends up having to hold all three distributions, and the dictionary backlookups aren't as efficient either. It turns out you can split the file into multiple streams, one per line type, and then compress those independently, with slightly better compression ratios, but it's still not great.
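
A minimal sketch of that per-stream trick on FASTQ-style text (toy code using plain zlib; CRAM and friends go much further with reference-based diffs and per-stream codecs):

    import zlib

    def whole_vs_split(fastq_text):
        """Compare compressing a FASTQ blob whole vs as separate header/base/quality streams."""
        lines = fastq_text.splitlines()
        headers, bases, quals = lines[0::4], lines[1::4], lines[3::4]
        whole = len(zlib.compress(fastq_text.encode()))
        split = sum(len(zlib.compress("\n".join(s).encode()))
                    for s in (headers, bases, quals))
        return whole, split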


You could say exactly the same of all data; it's just 1s and 0s, but when I look I just see blonde, brunette.


If you think like an ML engineer, the genome is a feature vector 3B bases (or 6B binary bits) long that is highly redundant (many sections contain repeats and other regions that are correlated to other regions), and the mapping between that feature vector and an individual's specific properties (their "phenotype", which could be their height at full maturity, or their eye color, or hair properties, or propensity to diseases, etc) is highly nonlinear.

If you had a list of all the genomes of all the people in the world, and all their phenotypes (height, eye color, hair type, etc), you could take all their genomes as input variables and treat all their phenotypes as output variables, and make embeddings or other models that mapped from genomes to phenotypes. The result would be a predictive model that could take a human genome, and spit out a prediction of what that person looks like and other details around them (up to the limits of heritability).
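
A minimal sketch of that framing with made-up numbers (a toy linear polygenic model on random genotypes; nothing like the nonlinear models described here, and the fit is in-sample only):

    import numpy as np

    rng = np.random.default_rng(0)
    n_people, n_snps = 2000, 500

    # Toy genotypes: 0/1/2 copies of the alternate allele at each SNP.
    genotypes = rng.integers(0, 3, size=(n_people, n_snps)).astype(float)

    # Toy phenotype: "height" driven by many small genetic effects plus noise.
    true_effects = rng.normal(0, 0.05, size=n_snps)
    height = genotypes @ true_effects + rng.normal(0, 1.0, size=n_people)

    # Ridge regression as a stand-in for a polygenic predictor.
    X, y = genotypes - genotypes.mean(0), height - height.mean()
    beta = np.linalg.solve(X.T @ X + 10.0 * np.eye(n_snps), X.T @ y)
    print("in-sample correlation:", np.corrcoef(X @ beta, y)[0, 1])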

A good example is height. If you take a very large diverse sample of people, and sequence them, you will find that about 50% of the variance in height can be traced to the genomic sequence of that individual (other things, such as socioeconomic status, access to health care, pollution, etc, which are non-genomic, contribute as well). Originally many geneticists believed that a small number of genes- tiny parts of the feature vector- would be the important features in the genome that explained height.

But it didn't turn out that way. Instead, height is a nonlinear function of thousands of different locations (either individual bases, entire genes, or other structures that vary between individuals) in the genome. This was less surprising to folks who are molecular biologists (mainly based on the mental models geneticists and MBers use to think about the mapping of genotype to phenotype), and we still don't have great mechanistic explanations of how each individual difference works in concert with all the others to lead to specific heights.

When I started out studying this some 35 years ago the problem sounded fairly simple, I assumed it would be easy to find the place in my genome that led to my funny shaped (inherited) nose, but the more I learn about genomics and phenotypes, the more I appreciate that the problem is unbelievably complex, and really well suited to large datasets and machine learning. All the pharma have petabytes of genome sequences in the cloud that they try hard to analyze but the results are mixed.

I spent my entire thesis working on ATGCAAAT, by the way. https://en.wikipedia.org/wiki/Octamer_transcription_factor is a family of proteins that are incredibly important during growth and development. Your genome is sprinkled with locations that contain that sequence- or ones like it- that are used to regulate the expression of proteins to carry out the development plan.


> If you had a list of all the genomes of all the people in the world, and all their phenotypes (height, eye color, hair type, etc), you could take all their genomes as input variables and treat all their phenotypes as output variables, and make embeddings or other models that mapped from genomes to phenotypes. The result would be a predictive model that could take a human genome, and spit out a prediction of what that person looks like and other details around them (up to the limits of heritability).

Would such a predictive model really be possible? As far as I'm aware there is contradicting research on whether a specific phenotype distinctly originates from a SNP/genotype.


I can't technically say with 100% confidence that it would be possible. It does seem extremely likely based on all the evidence I've seen over the past 30 years.

The model would be highly nonlinear and nonlocal, at the very least.


Fascinating, are there lots of people looking at genetics with this ML kind of lens?


Sure, although I'm not aware of anybody who is contemplating quite the level I believe is necessary to really nail the problem into the ground. When I worked at Google, I proposed that Google build a datacenter-sized sequencing center in Iowa or Nebraska near its data centers, buy thousands of sequencers, and run industrial-scale sequencing, push the data straight to the cloud over fat fiber, followed by machine learning, for health research. I don't think Google wants to get involved in the physical sequencing part but they did listen to my ideas and they have several teams working on applying ML to genomics as well as other health research problems, and my part of my job today (working at a biotech) is to manage the flows of petabytes of genomic data into the cloud and make it accessible to our machine learning engineers.

The really interesting approaches these days, IMHO, combine genomics and microscopic imaging of organoids, and many folks are trying to set up a "lab in the loop", in which large-scale experiments run autonomously by sophisticated ML systems could accelerate discovery. It's a fractally complex and challenging problem.

Statistics has been key to understanding genetics from the beginning (see Mendel, Fisher) and so at a big pharma you will see everything from Bayesian bootstrappers using R to deep learners using pytorch.


Guys at Verily are working on Terra.bio with the Broad institute and others. Genomics England in the UK is also experimenting with multimodal data and machine learning applied to whole genome sequences [1].

[1] https://www.genomicsengland.co.uk/blog/data-representations-...


But why Google? This is what big pharma are doing. Also you can outsource the data collection part. See for example UK Biobank. Their data are available to multiple companies after some period so it makes it more cost efficient.


Why Google? Because this is a big data problem and Google mastered big data and ML on big data a long time ago. Most big pharma hasn't completely internalized the mindset required to do truly large-scale data analysis.


I have spent the better part of the past year poring obsessively over cancer genomics papers, and I've grown very fond of the field.

Are there any positions at Google or other companies you would suggest I look into? I'm coming from algo trading / ML research, with an MSc in ML.


You could try Calico. They are an Alphabet company that specifically studies aging, and they have a good number of machine learning roles. However, biotech typically pays less than finance or software.

https://calicolabs.com/careers/


Thanks!


Yes. For example, when word2vec came out, people immediately tried similar approaches on protein sequences. These days transformers work better.


The genetic code maps nucleotide sequences (DNA) to amino acid sequences (proteins). Every three bases (say AGT) maps to one amino acid. So you can literally read a sequence of ACGTs and decode it into a protein. A sequence that encodes a protein is called a gene.

Almost all variations that humans have in their genomes (compared to each other or to a reference genome) are tiny, mostly one-base differences called single nucleotide polymorphisms (SNPs). These tiny changes encode who you are. The rest of it just makes you a carbon-based organism, a eukaryote, an animal, a mammal, etc., just like a whole load of other organisms.
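To illustrate the codon-to-amino-acid mapping described above, here's a minimal sketch. Only a handful of the 64 codons are listed (the full table is standard); unknown codons map to "X":

    # Read a coding DNA sequence three bases (one codon) at a time and map
    # each codon to an amino acid letter.
    CODON_TABLE = {
        "ATG": "M",  # methionine, the usual start codon
        "AGT": "S",  # serine (the example codon mentioned above)
        "GGC": "G",  # glycine
        "AAA": "K",  # lysine
        "TGA": "*",  # stop
    }

    def translate(dna):
        protein = []
        for i in range(0, len(dna) - 2, 3):
            aa = CODON_TABLE.get(dna[i:i + 3], "X")
            if aa == "*":        # a stop codon ends the protein
                break
            protein.append(aa)
        return "".join(protein)

    print(translate("ATGAGTGGCAAATGA"))   # -> MSGK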



I always make this mistake too as a computational biologist; when talking about DNA it’s megabases not megabytes.


What you are getting at is now called a "pangenome assembly". There were several high-profile papers earlier this year, including one by Guarracino and Garrison in Nature.

A pangenome is a complex graph model that weaves together hundreds or more genomes/haplotypes—usually of one species, but the idea can extend across species too, or even cells within one individual (think cancer pangenomes).

On the idealized human pangenome graph each human is represented by two threads along each autosome, plus threads through Chr X, Y, and the mitochondrial genome.


While you are correct, the differences between different people's DNA are tiny, less than 1% at most. So this information is still very valuable. This article is about the first time one person's Y chromosome has been sequenced in full.


> While you are correct, the differences between different people's DNA are tiny, less than 1% at most

How do we know this, if we have only sequenced the chromosome of one individual?


We have sequenced the genome using different sampling and statistical models for a long time.


Traditional sequences of the Y chromosome (and other chromosomes) were missing parts, particularly highly repetitive regions such as the centromeres, telomeres, and large satellite arrays. This is different from the issue of individual variation (although the authors do provide a map of known variations as well).


The title of the original article and as submitted here seems quite clear that it's of a specific individual: "The complete sequence of *a* human Y chromosome"


Thank you for putting into words exactly the thing I wanted to understand but couldn’t figure out how to ask.


Good question. In practice they call their complete sequence a 'reference sequence', which can be thought of as a baseline against which the complete spectrum of human Y chromosome genetic variation can be compared, so at least people have a standard to work from. The line in the abstract "mapped available population variation, clinical variants" is about the only mention of the issue.

Ideally we'd have hundreds if not thousands of complete genomes, which in total would reveal the population diversity of the human species as it currently exists, but this is a big ask. "Clinical variants" are of particular interest, as those are regions of the genome associated with certain inherited diseases, although the promises of individual genomic knowledge leading to a medical revolution have turned out to be wildly overblown.

Since the paper is paywalled, there's not much else to say other than that they have a reference sequence (fairly arbitrary in origin, i.e. it could have come from any one individual or possibly even a chimera of several individuals) to which other specific human Y chromosomes can be compared, eventually leading to a larger dataset from many individuals that will reveal the highly conserved and highly variable regions of the chromosome, population-wise.


It means that they are trying to find a baseline from which they can eventually clone a human being.

Let's not pretend that this is not an end goal. It always was.


I had the same question. Perhaps this will help you.

https://en.m.wikipedia.org/wiki/DNA_sequencing


That page describes the human genome as having been sequenced back in 2003. *confusion intensifies*


the project started around 1990; they announced a draft in 2000 and "completion" in 2003 (this was more a token announcement based on a threshold than a true milestone). Even then the scientists knew that major parts of the centromeres, telomeres, and highly repetitive regions were not fully resolved, and that was fully admitted. The work by Karen Miga at UCSC and others is more of a mop-up now that genome sequencing is a mature technology and we have much better ways of getting at those tricky regions.

another "completion" happened 3 years ago, before this announcement. but this is the last one. I promise.


As I understand it this was hard because most mechanisms we have for sequencing DNA work by first splitting long stretches of it into random chunks and sequencing each chunk individually. These chunks then need to be arranged properly to form the entire genome. This is possible at all because we can get overlap between the chunks, but when a sequence has repeating sections, overlap still isn't enough to stitch everything together correctly.

There's also a newer technique involving nanopores: the DNA is literally pulled through a little hole, and the electrical properties change depending on which base is inside the pore. Sadly, it's liable to errors in a deterministic way depending on the sequence. I don't have access to the full paper so I'm not quite sure what was done here, but I have heard people talk about combining the two approaches above.
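For the curious, here's a toy sketch of the overlap idea (exact matches only, hypothetical reads; real assemblers build overlap or de Bruijn graphs and tolerate sequencing errors):

    # Greedily merge the two fragments with the longest exact suffix/prefix
    # overlap until one sequence remains.
    def overlap(a, b, min_len=3):
        """Length of the longest suffix of a that matches a prefix of b."""
        for length in range(min(len(a), len(b)), min_len - 1, -1):
            if a.endswith(b[:length]):
                return length
        return 0

    def greedy_assemble(reads):
        reads = list(reads)
        while len(reads) > 1:
            best = (0, None, None)
            for i, a in enumerate(reads):
                for j, b in enumerate(reads):
                    if i != j:
                        olen = overlap(a, b)
                        if olen > best[0]:
                            best = (olen, i, j)
            olen, i, j = best
            if olen == 0:       # no overlaps left; assembly stays fragmented
                break
            merged = reads[i] + reads[j][olen:]
            reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
        return reads

    # Hypothetical reads shredded from the sequence "ATGCGTACGTTAGC"
    print(greedy_assemble(["ATGCGTAC", "GTACGTTA", "GTTAGC"]))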


They relied mainly on PacBio HiFi reads (which I think is a truly underrated, revolutionary technology in genomics), then used Nanopore sequencing to link together the HiFi contigs and cover the truly gnarly regions of the chromosome.

And yeah, you're right: All modern large-scale sequencing is shotgun sequencing where the DNA is randomly broken up, each fragment is sequenced, and then the individual segments (reads) are assembled using a genome assembler.


And the PacBio system they used in this paper is already old news: it has been replaced by the PacBio Revio system, which has higher throughput, is cheaper per base, and is somewhat more accurate.


Here is the "dumb question" I've always had about recording the human genome.

We all have different DNA. So is "the human genome" some kind of "average" DNA, or is it the DNA of whoever they sampled, or is it maybe an overview of what is common for all of us?


They are talking about one 'reference genome'. The variation from human-to-human is relatively small (a few million bases out of 3 billion). The reference genome has historically been some kind of average/mosaic of several individuals (this has obvious disadvantages), good enough to put reads in the right place (mostly), and call 'variants' - the differences that make the test genome unique.

The latest/greatest end-to-end T2T reference is built on the CHM13 cell line, with the Y chromosome in this paper coming from a separate donor ('HG002'), made possible partly by new information derived from long-read technologies.
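As a toy illustration of what "calling variants" against a reference means at the simplest level (SNPs only, hypothetical sequences; real variant calling works on aligned reads and handles indels, quality scores, and diploid genotypes):

    # Compare a sample sequence to the reference position by position and
    # report the differences.
    reference = "ACGTTAGCCATG"
    sample    = "ACGTAAGCCTTG"   # hypothetical individual's sequence

    variants = [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample))
        if ref_base != alt_base
    ]
    for pos, ref_base, alt_base in variants:
        print(f"SNP at position {pos}: {ref_base} -> {alt_base}")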


Actually, the level of variation per genome relative to a reference is still not completely known, because we do not have more than a handful of truly complete assemblies. It is clear, though, that it is much higher than a few million base pairs, perhaps as high as tens of millions, depending on your alignment parameters. Most of the differences are in regions we have not been able to sequence and assemble until the past few years; this paper is a key example. If two males have different versions of large repetitive arrays on the Y, then they will already differ by much more than "a few million" base pairs.


I assume it's one person's DNA. The last time there was talk about this in popular media was the DNA of one person.


Kinda. The Human Genome Project's reference genome is from 4 people. 70% of it is from one person, because that sample happened to be very good.


What would make a sample good? Purely in terms of the quantity of material needed, there must be no shortage of cadavers or even discarded body parts?


In this case it's one haplotype of one cell line.

If you want a population model of genomes you need a pangenome.

See "pangenome graph", "variation graph", and the human pangenome project.


This news should really be dated to last December, not now.

The T2T team published a preprint [1] last December and released the data [2] in March. However, due to the peer review process, the findings have only just been formally published in Nature. The publication timeline can indeed be slow, and in cases like this one, the question is: what's the point when all scientists interested in the topic already know about it and are working with this assembly?

[1] https://www.biorxiv.org/content/10.1101/2022.12.01.518724v1.... [2] https://github.com/marbl/CHM13


> what's the point when all scientists interested in the topic already know about it and are working with this assembly?

In this context, I would say the point of a press release is getting the news out generally to non-scientists.


Is this like: "we have a working keyboard driver" or more like "we identified all 104 keys on a standard 104-key US QWERTY layout"?


They now have the raw data for the CAD files used for manufacturing the keyboard.

We understand some CNC instructions, but there's no good understanding of the control mechanisms or sequencing.

One day we'll identify the keys, but we are very far from the driver, and more importantly, from identifying what each key does.

Oh, we did discover that changing specific bits in the raw data causes the keyboard to fail in interesting ways.


Not answering your question, but I have to say, this is one of the best analogies I've encountered in a while.


The keyboard driver means knowing what effect every key has?


Wow, thanks!


Sounds more like they've been using the 87-key QWERTY so far, but now they also found the numpad.


This is actually kind of a huge deal, since it means that all 24 chromosomes have now been fully sequenced. As the abstract notes, the Y chromosome had proved difficult to sequence until now because of its complex repeat structure.


I thought the human genome project finished in 2003, but now that I look it up again, apparently "finished" meant 92%. And then it was finished in 2022. And now it's finished again. Are there any remaining milestones left toward maybe finishing it again? Was the Y chromosome really just considered an exception until now?


human_genome_final_FINAL_3.txt


What makes the Y chromosome more difficult?


All of the chromosomes have difficult bits in them. The Y chromosome in particular has huge sections that are difficult.

By difficult, imagine a jigsaw puzzle. The difficult bits of the puzzle are where you have the same sub-image repeated over and over again, or where the same image section is scattered repeatedly across the wider image. Puzzle pieces from these bits are hard to place because you can't tell which part of the image such a small piece comes from: it matches multiple places. It's technically impossible to resolve repetitive features where the repeating unit is larger than the pieces you are trying to assemble.

Modern technology gives us long-read sequencing, where sequenced sections of DNA may be 100 kbp or larger (maybe up to 1 Mbp), whereas older sequencing methods (still heavily used because they are cheaper) give us sequenced sections of DNA between 100 and 300 bp long. (bp stands for base pairs; each base is one of A, C, G, or T.) These larger puzzle pieces allowed the whole picture to be assembled without ambiguity.
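To make that concrete, here's a toy sketch (hypothetical sequences, nothing from the actual genome): two different "genomes" built from the same unique blocks around a repeat that is longer than the read length produce exactly the same collection of short reads, so short reads alone cannot tell them apart, while longer reads can.

    def reads(seq, read_len):
        # All overlapping windows of length read_len, sorted so the
        # comparison below is a multiset comparison.
        return sorted(seq[i:i + read_len] for i in range(len(seq) - read_len + 1))

    R = "ATATATATAT"                                 # 10 bp repeat
    X, Y, Z, W = "GGAC", "TTCA", "CCGT", "AAGG"      # unique blocks (made up)

    genome_a = X + R + Y + R + Z + R + W
    genome_b = X + R + Z + R + Y + R + W             # middle blocks swapped

    print(reads(genome_a, 6) == reads(genome_b, 6))    # True: 6 bp reads can't tell them apart
    print(reads(genome_a, 20) == reads(genome_b, 20))  # False: 20 bp reads span the repeat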

However, this isn't why the Y chromosome was solved last. The reason for this is that the other chromosomes were analysed using a completely homozygous hydatidiform mole, which is where cells generate two copies of their entire genome from just one copy during conception, and therefore the two copies are identical. It makes the sequencing a lot easier if you don't have to deal with having two copies of the DNA that are slightly different. The side-effect of that is the hydatidiform mole doesn't have a Y chromosome, so they had to analyse a different sample later on to get a Y chromosome.


What does "mole" mean in this context?



It's on line 1 of the linked article: "The human Y chromosome has been notoriously difficult to sequence and assemble because of its complex repeat structure that includes long palindromes, tandem repeats and segmental duplications" and comes with 3 citations.


Maybe someone can correct the details since it's been a few years since I did this, but we sequence DNA with the help of PCR. Roughly: (1) break it up into small pieces and split the strands, (2) mix it with an enzyme that completes each single strand, (3) repeat 1 and 2 a bunch of times to multiply the strands many times over and turn the solution into a dense DNA juice, (4) pass it through a machine that sequences thousands of these small strands, and (5) align these short DNA sequences with software that matches unique sequences.

I did it with the COI gene, which is just a short sequence (1000-ish base pairs in our snails, IIRC) of essentially random ATGC base pairs. Lots of unique subsequences make the short strands easy to match: get a bunch of 10-15 bp bits and you can match the whole thing.

Now if your chromosome is 62M bp of repeating palindromic sequences, you can imagine how hard it would be to align the random pieces you've sequenced, since it's very hard to find unique sequences to match.
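As a toy illustration of that "unique bits to match" point (hypothetical sequences; k=12 chosen arbitrarily), you can count how many k-mers occur exactly once. Unique k-mers are what let you place a read unambiguously; a repeat-rich sequence has almost none of them:

    from collections import Counter
    import random

    def unique_kmer_fraction(seq, k=12):
        counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        unique = sum(1 for c in counts.values() if c == 1)
        return unique / (len(seq) - k + 1)

    random.seed(0)
    random_seq = "".join(random.choice("ACGT") for _ in range(10_000))  # COI-like, mostly unique
    repeat_seq = "GGAATTCC" * 1250                                      # 10 kb of pure repeat

    print(round(unique_kmer_fraction(random_seq), 3))   # close to 1.0
    print(round(unique_kmer_fraction(repeat_seq), 3))   # essentially 0.0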


We don't use PCR anymore! It's direct sequencing of the primary DNA. We can read single molecules. That's the quiet revolution in nanotechnology that's driving all these complete assemblies.


As others have asked in this thread, what does "fully sequenced" mean to the layman?


From the preprint, it looks like they sequenced the Y chromosome of HG002, one of the long-standing Genome in a Bottle reference samples (the son of the Personal Genome Project's Ashkenazi trio), still held in deep freeze at a number of biobanks.

Short-read sequencing data is a notoriously bad datatype for reconstructing the low-complexity / repetitive regions of genomes, so up until recently the most commonly used reference genomes have left many of these regions "dark". According to the preprint, the Y chromosome has the highest density of these low-complexity regions. It's also something of a bioinformatic nuisance when constructing a generic human reference genome, as it's only present in 50% of the population.


Isn't the problem the absence of random-looking DNA?

I wouldn't call random data 'complex', but it is easy to assemble from short reads.


It provides a complete baseline/reference DNA "map". Common "problems" show up on specific parts of this map. So you can sequence small subsets of a patient's DNA and compare it to these "problem areas" to detect genetic diseases.


You can tell it's the Y chromosome because of the way it is.


That's pretty neat!


For a very entertaining and educational book that is tangentially related, I can highly recommend Y - The Descent of Man by Steve Jones [0]

I am thrilled to see more chromosomes being mapped/sequenced. Please excuse my high-school level of biology knowledge here, but have we definitively progressed beyond correlation when it comes to genes, gene expression, and how they all interact?

Take "Blue eyes" as an example, we know that [1] OCA2 is responsible for brown/blue eye colour. BUT are we sure that none of the 20,000 others are involved/needed?

In layman's terms, I would guess that getting "blue eyes" requires several genes to ALL be on/active/present as well. More like a recipe than a simple on/off switch located in one position.

[0] https://www.goodreads.com/work/editions/436071-y-the-descent... [1] https://www.nature.com/articles/s41433-021-01749-x


Eye color is subtle. How do you define blue eyes? It's not just one phenotype; there is a range of blue eyes, whose differences correspond to fairly subtle variations between people. The article you linked [1] is quite good, but requires quite some time to absorb completely.


I would imagine all sorts of genes are necessary to have blue eyes depending on how you look at it. After all, you need eyes in order to have blue eyes, and a whole bunch of other machinery to open them to check the color...


First I would like to state that human genetics is very, very complex and we still do not have a complete understanding of it. For example, only relatively recently has the field of epigenetics, which (broadly speaking) studies how a cell's behavior can be changed WITHOUT its DNA being changed [1], made major advancements. This raises the age-old questions about environmental impact vs genetic impact in even greater detail. Anyways, sidebar over.

No, we are not completely sure how many genes play a role in determining eye color. However, we do have a pretty good guess. The most recent estimates I found put the number at 16. OCA2 and HERC2 have the largest impact on eye color, but there are many OTHER genes that also have smaller impacts [2]. That article I cited is actually amazing, but it does touch on some subjects that are more introductory college-level or AP Biology than standard high school bio, e.g. gene regulation, introns and exons, etc.

To answer your more general question: in my (admittedly only mildly less basic, introductory college-level biology) opinion, it is unlikely that we will, anytime soon, reach a point where anything in genetics can be completely, 100% definitive. This is not to say that we haven't made amazing advancements in the field of genetics and biology more broadly. BUT it would be a mistake to underestimate the complexity of the human genome. There are most definitely things we still do not understand about the genome, and will not for some time.

But practically speaking, while other genes may have an impact on even simple phenotypic traits such as eye color, we can generally make an accurate guess based on only a few genes. For eye color, for example, one study was able to predict it from only 6 genes with about 75% accuracy [3].

The crazy part is, sometimes the genes that affect a phenotypic trait don't actually store genetic material that determines the trait. In other words, they don't DIRECTLY determine the trait at all. Rather, they only affect OTHER genes, which then affect the expression of the trait more directly. You can imagine that this becomes very complex very quickly when you have multiple genes affecting other genes which affect other genes, not to mention accounting for environmental and demographic biases when doing these studies, and you begin to see why genetics is such a difficult field of study.

Apologies for the long post and rambling. Hopefully I was still able to provide you with some mediocre introductory college-level biology.

[1] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2791696/ [2] https://www.nature.com/articles/jhg2010126 [3] https://doi.org/10.1016/j.cub.2009.01.027


This is exciting!

BTW, if you want to know about the applications of this work, have a look at this ACM SIGPLAN keynote: https://youtu.be/JTU3JYp3JYc?si=jOZz611ATQar3Gec (it helped me understand DNA more than all the biology classes at my high school)


Thanks for that ACM SIGPLAN talk link, it was very fascinating and engaging.


So, did they find the reason why BMW drivers are always tailgating me on the Autobahn?


Yes, move out of the left lane.


Because you stay in the left lane when you shouldn't?


Are you driving an Audi?



1) What we had was not complete because sequencing happened in (relatively) small chunks. Due to overlap the sequence could be reconstructed, but not for areas with a lot of repeats.

2) This incomplete sequence has nevertheless been VERY useful, because these repeats mostly are outside of protein encoding regions ('genes').

3) 'The' human genome is of course non-existent, because you and I have different ones. What is meant is that this has been done for a 'bunch' of individual humans.

4) Because we're much more alike than we are different in our genome, this is still very informative.


Could we in theory just build up any kind of creature from scratch if we write out its genome? How would we compile it into organic matter?


No, this is not technically possible. Cells are simply too complex to just build using humanity's current level of technology. The closest thing we have done is to fully synthesize a bacterium's genome, remove the genome from a bacterial cell, then transplant the synthetic genome into the cell and have the cell survive: https://en.wikipedia.org/wiki/Mycoplasma_laboratorium

That's insanely cool, but very far from what you are asking. An analogy: We have re-installed the operating system of an existing computer, when you are asking whether we can manufacture a computer.


There's a bootstrapping problem. You need something to run the code. An egg cell and sperm cell contain loads of other stuff in addition to the genome and the only way to get that stuff is from the genome...

It's like if you had a program to 3D print a 3D printer but you don't have a 3D printer that can read that program.

It's all ridiculously complicated. To get an idea about it you could learn about how simple RNA viruses like influenza or HIV replicate. It's easy enough for anyone to understand. But no single person can fully understand humans (or other mammals, plants etc, we're not special in that respect).


A good article about the achievement from UC Santa Cruz, where one of the lead scientists of the Telomere-to-Telomere project works:

https://news.ucsc.edu/2023/08/t2t-y-chromosome.html


I guess NIH will update this soon, since the current assembly was gapless except for the Y chromosome: https://www.ncbi.nlm.nih.gov/assembly/GCF_009914755.1/


Does it make it any easier to snapshot chromosomes and see them evolve with age, and perhaps better correlate that with the onset of diseases? The correlation is still possible, but will having multiple snapshots enable better association?


Dumb question: what does it mean to sequence "the" X?

I presume our DNA is not identical. So do they sequence a particular person's Y chromosome? Whose?

How similar are two Y chromosomes from different persons?


Very similar. The difference between you and a chimp is only about 4%; the largest difference between two individual humans is about a tenth of that.

The person they picked is known pseudonymously as HG002, an Ashkenazi man who took part in the project and consented to commercial distribution of his genome.


Interesting they chose Ashkenazi, given this ancestry is fairly unique.

https://gnomad.broadinstitute.org/news/images/2018/10/gnomad...


Geneticists like insular populations because they improve the signal-to-noise ratio. Another favorite population is a Utah population of Mormons, for similar reasons. African genomes, for example, are trickier for parsing out cause and effect, as there is a lot more genetic diversity (and therefore noise that makes it difficult to identify the true signal), but they too are sometimes studied for precisely that reason. The Yoruba are pretty well sequenced. Among Europeans, a good resource is the Icelandic genome project, not only due to the sample size but also the bottlenecked nature of the population.


Scary cowboy on Sesame Street: I WANNA KNOW Y!

Scientists: here you go


…and mine has been found sorely wanting


What are the implications of this? What's now possible with this feat completed?


So I have my whole genome sequenced by Nebula. What do I have to do to match it up?


You could download the raw sequence reads (FASTQ files) and map them to the new T2T-Y. It probably makes more sense to wait for a new release of the full T2T human genome reference, as mapping to a subset of the genome increases false positives. It helps to have some genome bioinformatics knowledge to handle those T2T references correctly.


Can someone explain how "I have my whole genome sequenced by Nebula" relates to the news just now that "The human Y chromosome has been completely sequenced"?

How can someone have their whole (!) genome sequenced already, when until now we weren't able to fully sequence the Y chromosome? And this person seems to have a Y chromosome.


because commercial "whole" genome sequences aren't really whole. But they normally deliver the raw reads to you in a 50+ GB file, so I suppose you could take the reads in that file that don't map to the previous reference and try to map them to the new one. Unless you are an expert, it's unlikely you would get any actionable results.
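If you wanted to try it anyway, here's a rough sketch using pysam (just one common way to read BAM files, not something mentioned in the thread; file names are placeholders) to pull out the reads that didn't map to the old reference so they could be re-mapped to the new T2T assembly with your aligner of choice:

    import pysam

    n_unmapped = 0
    # BAM of reads previously mapped to the old reference (placeholder name).
    with pysam.AlignmentFile("old_reference_alignment.bam", "rb") as bam, \
            open("unmapped_reads.fastq", "w") as out:
        for read in bam.fetch(until_eof=True):   # until_eof also yields unmapped reads
            if read.is_unmapped and read.query_qualities is not None:
                n_unmapped += 1
                quals = pysam.qualities_to_qualitystring(read.query_qualities)
                out.write(f"@{read.query_name}\n{read.query_sequence}\n+\n{quals}\n")

    print(f"wrote {n_unmapped} unmapped reads for re-mapping")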


BWA-MEM is a good alignment tool. Then you can view the .bam file in IGV.


omg and with the new tech around generative AI...

This is shaping up to be a very interesting decade to be in tech.

We might actually be able to generate new sequences and test them against all the diseases to remove them.


Why do they use a paywalled service to publish their research?


TIL the human Y chromosome hadn't been completely sequenced until now.


Does this literally mean every single base pair with all the junk genome and everything else? Or is it some statistical model/extrapolated etc.


It is a completely gapless assembly from the start of each chromosome all the way to its end ("telomere to telomere"). I haven't read it in enough detail, but I would imagine there are still some slightly fuzzy regions that are sort of a best guess, although those would be quite small.


How can they sequence a thing that does not exist?

It's the only way that 2023 Gender Science can be true, and all good community members know that it is true.

All Glory to the Commune Leader!

/s



