Could you please explain how this mapping works? Why does it need so much RAM? Is it doing a fuzzy search of sorts for known sequences (genes)? Why can't it do so one by one?
BWA specifically performs a Burrows-Wheeler transform of a 3 GB string. Other mapping algorithms usually rely on some sort of indexing of the genome. The program then loads this index into memory and queries it for each “read” (a DNA fragment from the DNA sequencer).
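To make the "build an index once, then query it per read" idea concrete, here is a minimal sketch of a Burrows-Wheeler index with exact-match backward search. It is a toy under stated assumptions, not how BWA is implemented: the function names `bwt_index` and `find` are made up for illustration, the suffix array is built by naive sorting, and nothing is compressed or sampled.

```python
def bwt_index(text):
    """Build a naive BWT index for `text` (hypothetical helper, demo only)."""
    text += "$"  # sentinel that sorts before A, C, G, T
    # Naive suffix array: fine for a demo, far too slow and large for a genome.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT = the character preceding each sorted suffix (wraps around at i == 0).
    bwt = "".join(text[i - 1] for i in sa)
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    # C[c]: number of characters in the text strictly smaller than c.
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return sa, C, occ


def find(read, index):
    """Return sorted start positions of exact matches of `read`."""
    sa, C, occ = index
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    for ch in reversed(read):  # backward search: last character first
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


index = bwt_index("GATTACAGATTACA")
print(find("TTAC", index))  # [2, 9]
```

Each lookup takes a number of steps proportional to the read length, not the genome length, which is why the whole index is kept in RAM and reads are streamed against it.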
Human DNA contains roughly 3.2 billion nucleotides. A 3 GB string suggests an encoding with one byte per nucleotide.
I'm curious: since there are only 4 bases in DNA, one byte per nucleotide seems rather inefficient for genomic data. Is there any advantage in encoding the DNA with two bits per nucleotide?
It's very common to use 2 bits per nucleotide, even though the human genome contains ambiguous bases on top of the 4 letters. These tools typically encode each ambiguous base as a random nucleotide but keep track of its position so that it can be converted back later.
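A rough sketch of that scheme, assuming a simple layout of four bases per byte and a side table for ambiguous positions. The names `pack` and `unpack` and the dictionary used for bookkeeping are illustrative choices, not the on-disk format of any particular tool.

```python
import random

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq, rng=None):
    """Pack a DNA string into 2 bits per base, remembering ambiguous symbols."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the demo is repeatable
    packed = bytearray((len(seq) + 3) // 4)  # 4 bases per byte
    ambiguous = {}  # position -> original symbol (e.g. N, R, Y)
    for i, base in enumerate(seq.upper()):
        if base not in CODE:
            ambiguous[i] = base
            base = rng.choice(BASES)  # stand-in nucleotide stored in the packed data
        packed[i // 4] |= CODE[base] << (2 * (i % 4))
    return packed, ambiguous, len(seq)


def unpack(packed, ambiguous, length):
    """Decode the packed bases and restore the recorded ambiguous symbols."""
    out = [BASES[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)]
    for pos, symbol in ambiguous.items():
        out[pos] = symbol
    return "".join(out)


packed, ambiguous, n = pack("ACGTNACGRT")
print(len(packed), ambiguous)        # 3 {4: 'N', 8: 'R'}
print(unpack(packed, ambiguous, n))  # ACGTNACGRT
```

Ten bases fit in three bytes here instead of ten, and the side table stays small as long as ambiguous bases are rare.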
In practice, BWT-based alignment tools may use a forward index and a mirror index of the reversed genome string (reversed, not reverse-complemented). This dual-index approach is important for dealing with mismatches. There's a nice example explaining this for an older tool named Bowtie [2].
With a two-bit encoding and both indices, it isn't uncommon for a genome index to take up several GB of RAM. For example, BWA uses 2-3 GB for its index [3].
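As a back-of-the-envelope check on those numbers, the raw sequence sizes can be worked out directly. This only counts the packed bases, using the roughly 3.2 billion nucleotide figure mentioned earlier in the thread; the helper `size_gb` is a made-up name, and a real index also carries auxiliary structures (such as occurrence counts and suffix array samples) that this ignores.

```python
GENOME_BASES = 3_200_000_000  # approximate human genome length from the thread


def size_gb(bases, bits_per_base, copies=1):
    """Raw storage in GB (10^9 bytes) for `copies` of a sequence."""
    return bases * bits_per_base * copies / 8 / 1e9


print(size_gb(GENOME_BASES, 8))            # one byte per base: 3.2
print(size_gb(GENOME_BASES, 2))            # two bits per base: 0.8
print(size_gb(GENOME_BASES, 2, copies=2))  # forward + mirror copies: 1.6
```

So two 2-bit copies come to about 1.6 GB before any auxiliary data, which is in the same ballpark as the 2-3 GB quoted above.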