Let me give you the scope of the problem.

The genetic code is a redundant code: small differences in the code can still yield the same information. There are 64 triplets. Three of them are stop signals, and the other 61 code for the 20 amino acids, with one of those (ATG, methionine) doubling as the start signal. So 64 triplets carry only 21 distinct meanings.
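If you want to check those counts yourself, here is a minimal Python sketch. It assumes the standard genetic code (NCBI translation table 1) written in DNA letters; the names `BASES`, `AMINO` and `CODON_TABLE` are mine, not from any library.

```python
from collections import Counter
from itertools import product

# Standard genetic code (assumption: NCBI translation table 1), DNA alphabet.
# Codons are enumerated in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Codons that terminate translation.
stops = sorted(codon for codon, aa in CODON_TABLE.items() if aa == "*")

# How many codons map to each amino acid (the degeneracy of the code).
degeneracy = Counter(aa for aa in CODON_TABLE.values() if aa != "*")

print("codons:", len(CODON_TABLE))
print("stop codons:", stops)
print("amino acids:", len(degeneracy))
print("codons per amino acid:", dict(degeneracy))
```

The redundancy is uneven: leucine, serine and arginine have six codons each, while methionine and tryptophan have only one.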

There is a direct similarity (the raw letters) and a similarity of information (what the letters code for). For a redundant code the former is useless: you can have a direct similarity of 30% and a similarity of information of 100%. (That figure assumes a 1:3 redundant code at its worst; DNA performs much better.)
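To make the difference concrete, here is a toy Python sketch under stated assumptions: the standard genetic code (NCBI translation table 1), two made-up 9-letter sequences of equal length, and hypothetical helper names `translate` and `identity`. It is an illustration, not how any real comparison tool works.

```python
from itertools import product

# Standard genetic code (assumption: NCBI translation table 1), DNA alphabet.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate a DNA string codon by codon, starting at position 0."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def identity(a, b):
    """Fraction of matching positions; only defined here for equal lengths."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Two made-up sequences that both code for Leu-Ser-Arg with different codons.
seq_a = "TTATCACGT"
seq_b = "CTGAGCAGA"

print("direct similarity:     ", identity(seq_a, seq_b))
print("information similarity:", identity(translate(seq_a), translate(seq_b)))
```

Here the letters agree in only 2 of 9 positions (about 22%), yet both sequences translate to the same three amino acids. Serine shows the extreme case: the codons TCA and AGC share no letter at any position and still mean the same thing.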

There is also a third layer of redundancy, still under investigation, where certain sequences of triplets can be permuted and still yield the same result. For some big projects the order of assembly is redundant as well, which allows the code to be permuted in chunks too.

So we are looking for the similarity of a redundant code that allows for permutations on two scales.

Unless you say how the similarity was measured, something being X percent identical means absolute BUNK. It cannot be a direct comparison of the code.

SARS-CoV-2 ~ 30,000 bases, HIV ~ 10,000 bases, Spike ~ 4,000 bases

SARS-CoV-2 has three times the code of HIV, so you are also dealing with different sizes.

But one can cut everything between start and stop, translate it to amino acids and permute the result into oblivion. Whatever is left can be compared by a huge variety of measures.
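A rough Python sketch of that recipe, to show the idea and nothing more. Assumptions: the standard genetic code (NCBI translation table 1), forward strand only, and a deliberately crude measure (overlap of the sets of translated stretches, which ignores their order). The function names `open_reading_frames` and `orf_set_similarity` and the toy sequences are made up for illustration.

```python
from itertools import product

# Standard genetic code (assumption: NCBI translation table 1), DNA alphabet.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def open_reading_frames(dna, min_codons=2):
    """Translate every ATG...stop stretch on the forward strand, all 3 frames."""
    proteins = []
    for frame in range(3):
        current = None  # amino acids of the stretch being read, or None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            aa = CODON_TABLE.get(codon, "X")  # "X" for unknown letters
            if current is None:
                if codon == "ATG":
                    current = ["M"]
            elif aa == "*":
                if len(current) >= min_codons:
                    proteins.append("".join(current))
                current = None
            else:
                current.append(aa)
        # A stretch that never reaches a stop codon is dropped.
    return proteins


def orf_set_similarity(dna_a, dna_b):
    """Jaccard overlap of the translated stretches, ignoring their order."""
    a, b = set(open_reading_frames(dna_a)), set(open_reading_frames(dna_b))
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Toy "genomes": the same two genes, in swapped order, with synonymous codons.
genome_a = "ATGTTATCATAA" + "ATGCGTGGTTGA"
genome_b = "ATGAGAGGCTGA" + "ATGCTGAGCTAG"

print(open_reading_frames(genome_a))
print(open_reading_frames(genome_b))
print(orf_set_similarity(genome_a, genome_b))
```

The two toy genomes differ letter by letter and carry their genes in a different order, but the set comparison treats them as identical. Real tools use far more careful measures; the point is only that the number you get depends entirely on which measure you pick.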

But when working with it, you just sequence your stuff, feed it into the commercial software and click compare.