I worked for a while on extremophile archaeal viruses - the type that infect organisms that manage to live in volcanic hot springs, for instance. These are ecological niches that are old, and extremely divergent. There's little genetic exchange between life around the hot springs, and life within them.
The typical route of discovering those viruses was first genetic. When you get a genome (especially back when this work was initiated), you'd BLAST all the gene sequences against all known organisms to look for homologs. That's how you'd annotate what the gene does. Much more often than not, you'd get back zero results - these genes had absolutely no sequence similarity to anything else known.
My PI would go through and clone every gene of the virus into bacteria to express the protein. If the protein was soluble, we'd crystallize it. And basically every time, once the structure was solved, if you did a 3D search (using Dali Server or PDBe Fold), there would be a number of near-identical hits.
In other words, these genes had diverged entirely at the sequence level, but without changing anything at the structural (and thus functional) level.
Presumably, if AlphaFold is finding the relationship, there's some information preserved at the sequence level - but that could potentially be indirect, such as co-evolution. Either way, it's finding things no human-guided algorithm has been able to find.
> Presumably, if AlphaFold is finding the relationship, there's some information preserved at the sequence level
This is not my area of expertise, and maybe I'm misunderstanding this, but I thought that what AlphaFold does is extrapolate a structure from the sequence. The actual relationship with the other existing proteins would have been found by the investigators through other, more traditional means (like the 3D search you mentioned).
I'm not sure about that. The way AlphaFold works involves transforming the protein from a vector space representing the sequence to a different vector space representing the folded structure and back again as it performs iterative refinement. Presumably you could perform a comparison in the structure space to find homologs that have completely different sequences - they would just have a high cosine similarity.
Checking sub-regions of the structure would be more difficult, but depending on how the structural representation works it could just be computationally intensive.
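To make the cosine similarity idea concrete, here's a rough sketch. It assumes you already have a fixed-length vector per protein that summarizes its structure; whether AlphaFold exposes anything usable like that is an assumption on my part, and the numbers below are made up for illustration, not real model output.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical structure embeddings (made-up numbers, not AlphaFold output).
viral_protein = [0.9, 0.1, 0.4, 0.0]
known_fold = [0.8, 0.2, 0.5, 0.1]
unrelated = [0.0, 0.9, 0.0, 0.7]

# A structural homolog should score close to 1, an unrelated fold much lower.
print(cosine_similarity(viral_protein, known_fold))
print(cosine_similarity(viral_protein, unrelated))
```

Nothing here looks at the sequence at all, which is the point: two proteins with no detectable sequence similarity could still land close together in such a space.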
This is a very big misconception about AlphaFold. It's not generating a structure totally de novo from sequence. Instead it's primarily finding relationships on the sequence level to other solved structures. If those structure/sequence relationships didn't exist somewhere, AF wouldn't work because it doesn't really have much information about protein folding from first principles. There are some small de novo elements, but nothing really groundbreaking. Where AF's true strength lies is in its ability to detect relationships we have been unable to detect with any other method.
Wow, that makes sense. Thank you for explaining this -- it makes AlphaFold a little less inexplicable magic and a little more science/engineering in my mind.
What about convergent evolution? Are you ruling that out because you reason that there are many possible structures that could do the same job so it's too much of a coincidence how close it matches?
IANAB, but from what I do understand, it depends on what you mean by different genes. Information-wise, DNA is a string of base-4 digits (nucleotides) read in groups of 3 digits; these groups are called codons. Each codon corresponds to a specific amino acid*. A protein is made up of a bunch of different amino acids chained together. The gene determines which amino acids are chained together and in what order. This long chain of amino acids tends to fold up into a complex 3-dimensional structure, and this 3-dimensional structure determines the protein's function.
Now, there are a couple ways a gene could be different without altering the protein's function. It turns out multiple codons can code for the same amino acid. So if you switch out one codon for another which codes for the same amino acid, obviously you get a chemically identical sequence and therefore the exact same protein. The other way is you switch an amino acid, but this doesn't meaningfully affect the folded 3D structure of the finished protein, at least not in a way that alters its function. Both these types of mutations are quite common; because they don't affect function, they're not "weeded out" by evolution and tend to accumulate over evolutionary time.
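* except for the stop codons, which mark the end of a gene's coding sequence and don't code for an amino acid. The start codon (usually ATG) marks where translation begins, but it does also code for an amino acid, methionine.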
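A small sketch of the first case (synonymous codons), in case it helps. The table below is only a hand-picked subset of the standard genetic code, just enough for the toy genes, and the two gene sequences are invented for illustration.

```python
# Subset of the standard genetic code (DNA codons -> one-letter amino acids).
# "*" marks a stop codon. Only the codons used below are listed.
CODON_TABLE = {
    "ATG": "M",                                      # methionine (start)
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",  # alanine
    "AAA": "K", "AAG": "K",                          # lysine
    "TAA": "*", "TAG": "*", "TGA": "*",              # stop
}

def translate(dna):
    """Translate a DNA coding sequence codon by codon until a stop codon."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

# Two made-up genes that differ at three nucleotide positions.
gene_a = "ATGGCTAAATAA"
gene_b = "ATGGCGAAGTGA"

differences = sum(1 for x, y in zip(gene_a, gene_b) if x != y)
print(differences, translate(gene_a), translate(gene_b))
```

Both genes translate to the same three-residue peptide even though a quarter of their nucleotides differ. The second case (swapping an amino acid without disturbing the fold) can't be shown this simply, since it depends on the 3D structure.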
You could build houses from bricks, timber or poured concrete that all looked the same in the end. Their internal structures and methods of construction would be different, but they would have the same form.
For a given output, you could write a program in wildly different programming languages, or even use the same language but structure it in wildly different ways.
If there's no match for the source code (genes), then find a match for the output (protein).
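A toy version of the programming analogy, with two made-up functions whose source has almost nothing in common but whose output is the same:

```python
def total_loop(n):
    """Add 1..n one term at a time."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

def total_formula(n):
    """Same result via the closed form n(n+1)/2."""
    return n * (n + 1) // 2

# Comparing the source text finds no match; comparing the outputs does.
same_output = all(total_loop(n) == total_formula(n) for n in range(100))
print(same_output)
```

A text diff of the two function bodies would tell you very little, much like a BLAST search on heavily diverged genes; you only see the relationship by comparing what they produce.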
In terms of 3D fold - i.e. the general abstract shape of the protein in 3D - you can make loads of amino acid substitutions without changing it, generally as long as each replacement stays within the same class of amino acid (e.g. swapping one hydrophobic residue for another).
It's not until you compare the 3D shape that you see the relationship.