Would have been nice to have a Julia version too. Some time ago I suggested [1] creating a Julia flavor of the Biostar Handbook [2]. And now there is an initiative [3] to create a similar, but open-source, book instead. So anyone can already contribute.
If by chance anyone is not aware, the namesake is Rosalind Franklin [1] who made seminal contributions in the fields of X-ray crystallography and electron microscopy.
So, I picked one at semi-random - http://rosalind.info/problems/prtm/ and found a usability problem (a popup that doesn't work; in FF or Safari) and a wrong example answer. Here's the description.
> Given: A protein string P of length at most 1000 aa.
> Return: The total weight of P. Consult the monoisotopic mass table.
This is NOT the correct answer because as the expanded text says, "the mass of a protein is the sum of masses of all its residues plus the mass of a single water molecule."
The table says "the monoisotopic mass of water is considered to be 18.01056" so
This site was developed by Pavel Pevzner, who teaches bioinformatics at UCSD. We used this site as the main curriculum in one of our final bioinformatics classes, and after solving ~10-15 problems a week for 10 weeks, I don't recall a single time where the error was in the problem set solutions.
Re: the problem - not a hundred percent on this, but I think the issue is that they are vague on the fact that this is a theoretical question, not a practical one. The key is that the question itself does not mention the addition of the water molecule, just that you have a sequence P with a dictionary of weights.
Edit 1: If memory serves me correctly, after the initial ionization phase of mass spectroscopy, the additional water molecule is discarded, making it insignificant in the analysis of your peptide sequences.
Edit 2: If anyone is interested in following through this site, I would highly recommend using the existing problem tracks
http://rosalind.info/problems/list-view/?location=bioinforma...
These will help lay out the problems in a logical order and ensure you have the skills you need to progress. Alignment problems are a great way to learn dynamic programming and will allow you to move on to some of these other problems (like mass spec and HMMs) more reasonably (at least, in my experience!). Good luck!
I was thinking about this some more. You wrote "after the initial ionization phase of mass spectroscopy".
In high school I tried to build a mass spectrometer. It didn't work - I couldn't get a high enough vacuum, and only a few years later, as a physics undergrad, did I find that that was just one of several problems I had. It was fun to try though.
But I do know that the ionized particle has a charge, and that electron affects the overall mass by about 1/1836 Dalton. That's 0.00054 Dalton, while the table lists masses to an even finer precision, like 71.03711.
The example output gives a value down to 3 decimal digits, so at that precision there's a 50% chance that the electron mass will affect the result.
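A rough check of that estimate, as a sketch only. The electron mass below is the CODATA value in daltons (an outside constant, not from the problem page), the function name is made up for illustration, and the calculation assumes the unrounded mass is equally likely to sit anywhere inside a rounding interval.

```python
# Electron rest mass in daltons (CODATA value; an outside constant,
# not something taken from the Rosalind problem page).
ELECTRON_MASS_DA = 5.48579909e-4


def chance_rounding_changes(shift: float, decimals: int) -> float:
    """Probability that adding `shift` changes a value rounded to `decimals`
    places, assuming the position within a rounding interval is uniform."""
    step = 10.0 ** -decimals
    return min(1.0, abs(shift) / step)


if __name__ == "__main__":
    print(f"1/1836 Da          = {1 / 1836:.5f}")
    print(f"electron mass (Da) = {ELECTRON_MASS_DA:.5f}")
    print(f"chance at 3 digits = {chance_rounding_changes(ELECTRON_MASS_DA, 3):.2f}")
```

Under those assumptions the chance comes out a little above one half, which is in line with the "50%" figure.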
Isn't this problem therefore implicitly teaching an excessive trust in significant digits?
Now, I suspect that the mass spectrometers they use aren't that accurate. But it's bugging me now.
As mbreese wrote elsewhere here, I'm (clearly) reading too much into the problem. I don't think bioinformatics is the right field for me.
IANAP (really, IA(BARELY)AB), but I am not exactly clear where these additional electrons are sneaking in. Either way, I'm not sure the change would be significant enough to move the spectral peak outside of the range for a specific residue. There are lots of false peaks, and a lot of pre / post processing goes into managing those complexities.
Re bioinformatics / residues / lingo / life - bioinformatics is a catch-all term for a set of loosely related tools being applied across the full field of disciplines. The early pioneers in bioinformatics (like Pevzner himself) were all physicists and electrical engineers who moved into the realm of biology. Your exact terminology is generally dependent on the specific problem you are solving and may be slightly different than the more general uses of the same words. Either way, I hope that if you are interested in the underlying topics that bioinformatics deals with, you won't let some issues in vernacular trip you up!
https://en.wikipedia.org/wiki/Dalton_(unit)#Measurement says "Although relative atomic masses are defined for neutral atoms, they are measured (by mass spectrometry) for ions: hence, the measured values must be corrected for the mass of the electrons that were removed to form the ions, and also for the mass equivalent of the electron binding energy,"
> The mass of the molecular ion corresponds to the nominal or monoisotopic mass of the molecule, with the mass of the electron added or lost usually consequential.
However, I can't help but think that last word should be "inconsequential."
I did find https://onlinelibrary.wiley.com/doi/full/10.1002/rcm.8478#rc... which shows it's possible to measure the difference between ¹⁶O¹⁸O⁺⁺ and ¹⁷O⁺ ("The interference of ¹⁶O¹⁸O⁺⁺ can be detected 0.002 mass units before the larger ¹⁷O⁺ peak starts.") That must be due to binding energy as the naive charge/mass ratio is identical.
That's only 4x more than the 0.00054 mass units I estimated for an electron.
I looked at some of the other example problems. I do not think I am interested in this field.
A closer reading shows that I got tripped up by what "residue" means. But I'm not sure the author of the question got it right either? At the very least, I'm confused by it.
The first paragraph of the expanded question text has: "every pair of adjacent amino acids has lost one molecule of water, meaning that a polypeptide containing n amino acids has had n−1 water molecules removed"
The second paragraph has: "Thus, the mass of a protein is the sum of masses of all its residues plus the mass of a single water molecule."
The fifth paragraph has: "The mass of a protein is the sum of the monoisotopic masses of its amino acid residues plus the mass of a single water molecule"
And the monoisotopic mass table says "Note: the monoisotopic mass of water is considered to be 18.01056 Da."
So I thought that the water molecule was important in the calculation.
However, the last paragraph (which I only now closely read) says it isn't important, with "In the following several problems on applications of mass spectrometry, we avoid the complication of having to distinguish between residues and non-residues by only considering peptides excised from the middle of the protein. This is a relatively safe assumption because in practice, peptide analysis is often performed in tandem mass spectrometry."
Since it didn't mention "water", and instead used the specialist term "residue", I missed the connection earlier.
That said, the text seems to use "residue" inconsistently. There's the definition "a residue is a molecule from which a water molecule has been removed; every amino acid in a protein are residues except the leftmost and the rightmost ones."
but there's also the usage: "the mass of a protein is the sum of masses of all its residues plus the mass of a single water molecule"
Surely that should be "the mass of a protein is the sum of masses of all its residues plus the mass of its leftmost and rightmost amino acids minus the mass of a single water molecule", yes?
So I looked up the definition of "amino acid residue". It appears to be https://goldbook.iupac.org/terms/view/A00279 "α-Amino-acid residues are therefore structures that lack a hydrogen atom of the amino group (–NH–CHR–COOH), or the hydroxyl moiety of the carboxyl group (NH2–CHR–CO–), or both (–NH–CHR–COO–); all units of a peptide chain are therefore amino-acid residues".
https://en.wikipedia.org/wiki/Protein_sequencing#Whole-mass_... also agrees that "residue" includes the two amino acids at the ends, saying "The protein’s whole mass is the sum of the masses of its amino-acid residues plus the mass of a water molecule and adjusted for any post-translational modifications"
Which means ... I don't think the author uses the term "residue" correctly?
Or, more likely, I'm confused by the specialist terminology. Can someone clear up my confusion?
> Or, more likely, I'm confused by the specialist terminology. Can someone clear up my confusion?
I think you're just reading too much into the problem. The terminology can be a bit daunting and it can be very hard for bioinformatics people to realize just how much specialist terminology we end up using. I mean, we have both computer science and molecular biology jargon that we use every day. It's sometimes hard to recognize just how much assumed knowledge there is in this field.
Unfortunately, bioinformatics is full of edge cases like this.
In this example, if you're talking about the size of a full protein, you need to add the mass of a water (to account for the lack of a bond at the N and C terminus). However, if you're talking about the size of a peptide fragment (a sub-sequence), then you don't need to account for the extra water molecule [0]. And as the explanation says -- peptide identification is the most common use-case for this analysis, so we'll just use that. This is also why the table lists the "residue" weights -- because it's easier to calculate the `sum(residue_weight) + water` as opposed to `sum(amino_acid_weight) - (n-1) * water`.
I think the extra explanation isn't entirely clear, but I wouldn't get hung up on that part or let that bother you too much. The problem really boils down to: calculate a weighted sum of a string using these weights for the letters. That's it... no more, no less.
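As a sketch of that weighted sum: the table below is deliberately partial. Only the 71.03711 entry and the water mass are quoted in this thread; the other values are assumed from the commonly published monoisotopic residue table, and the peptide string and function name are made up for illustration.

```python
# Partial monoisotopic residue mass table in daltons. Only A (71.03711) and
# the water mass are quoted in this thread; the other entries are assumed
# from the commonly published table, so check them before relying on them.
RESIDUE_MASS = {
    "G": 57.02146,
    "A": 71.03711,
    "S": 87.03203,
    "K": 128.09496,
}
WATER_MASS = 18.01056


def weighted_sum(peptide: str, add_water: bool = False) -> float:
    """Sum the residue masses of `peptide`; optionally add one water
    to model a free, full-length chain rather than an internal fragment."""
    total = sum(RESIDUE_MASS[aa] for aa in peptide)
    return total + WATER_MASS if add_water else total


if __name__ == "__main__":
    peptide = "GASK"  # hypothetical input, not the problem's sample
    print(f"{weighted_sum(peptide):.3f}")
    print(f"{weighted_sum(peptide, add_water=True):.3f}")
```

The `add_water` flag is the whole disagreement in this thread: leave it off for a fragment from the middle of a chain, turn it on for a complete protein.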
Congrats on working through the exercise. This is honestly a great resource for learning about the field from a coding perspective. As others have mentioned, this is from the people who literally wrote the book about much of the field.
[0] Note: In the protein, you aren't actually adding an intact water molecule itself, but accounting for the extra H at one end, and an extra OH at the other end. Which is the same weight as HOH, or 18.01056 Da.
I think it means that I don't want to work in bioinformatics, if there's such ambiguity over what seems to be a basic term.
Does the term "residue" include the ex-amino acid components at the end of the protein?
The text is clear that it doesn't. The popup says that "a residue is a molecule from which a water molecule has been removed". But it's internally inconsistent, and the IUPAC definition says that it does include the ends.
I would follow the IUPAC Gold book definition. That is written by a committee that seeks exact wording. The other source seems a little sloppy with their definition.
As to whether you might consider working in the field of bioinformatics, I can only pass on a comment from a researcher in the field. She says that they are being "swamped with data". Lots of data, but a lack of computer nerds to work on it.
PS I also noticed the reference to post-translational modification in one of your comments. This is one of the factors that makes for the amazing variety in protein/enzyme functioning. Humans have about 20 000 genes to encode about 75 000 different enzymes, plus a multitude of structural, storage, hormonal, etc. proteins. So you can see that translation of the gene is only the beginning of the story; post-translational modifications are needed. https://www.sciencedirect.com/topics/neuroscience/posttransl... It's an unlimited area of study - one I stepped out of and wish I had stayed in. Of course, to make any sense of a complex system, one has to concentrate on a small part of it. Focus on what takes your interest (and you can get paid for).
Years ago I got the advice to not go into software development for science. The reasoning was that those companies are mostly run by people with PhDs who took 6 years to get their degree, and are getting paid a pittance. These people find it hard to believe that a non-PhD software developer should be paid significantly more than a PhD.
I don't know if it's true, but when I see "Lots of data, but a lack of computer nerds to work on it." ... I read it as saying there's no money to hire those computer nerds.
The same person advised me that the software developers who go into science often do so for the love of the topic, and are willing to be paid less. I don't think I can get that interested in bioinformatics.
> I think it means that I don't want to work in bioinformatics, if there's such ambiguity over what seems to be a basic term.
Yes, biology is a mess of exceptions, "good enough" and "hey, it works". It does have its charm as well, though.
One of the first things I learned from that world is that if you ask a microbiologist if something is always the case, the answer will be "Yes" if it is true more than ~80% of the time.
Think of the normal meaning of the word. The residue is what's left over after whatever was going to happen has finished happening.
For amino acids, the interesting thing is that they get joined into chains that fold up into proteins (which do all the work). The residues after that happens look like this:
          R   O               R   O
      H   |   ||          H   |   ||            H
----- N - C - C --------- N - C - C ----------- N etc
          H                   H
The lines - and | are chemical bonds. The Oxygens are double bonded. N= Nitrogen, C=Carbon, O=Oxygen. The R is one of 20 different chemical groups called side chains that make each amino acid different.
When the amino acids are isolated (not chained) the double-bonded carbon on the right carries an extra OH group. The whole right-hand carbon is often written COOH. The nitrogen on the left hand side of each amino acid carries an extra H. As part of the process that joins the amino acids together, the OH from the acid on the left and an H from the acid on the right pair off to form water. This is called condensation, and it happens at every junction between two acids. So if there are n amino acids, there are n-1 junctions and n-1 water molecules that were present (in aggregate) in the constituent amino acids that don't make it into the final protein chain.
Note, though, that one of the methods of chopping up protein chains is adding the water molecules back again, so you should know exactly what you're looking at if you want to count masses precisely.
Jeez. Didn't mean to write all that. Hope it helps.
Just a note - the OH and the H+ don't specifically pair up to form a single water molecule; realistically, there will be a number of water molecules in play throughout the reaction.
Rather nit-picky, though. Thanks for the diagrams :)
It's very kind of you to describe the chemical process, though I think the text description and image in the problem were enough for me.
Let's suppose the amino acids HO-A-H, HO-B-H and HO-C-H came together (I hope I got the sides correct!) to form HO-A-B-C-H.
Are HO-A- and -C-H "residues"? One of them lost an -H and the other an HO-, so they are "what's left over after whatever was going to happen has finished happening."
The text says "every amino acid in a protein are residues except the leftmost and the rightmost ones", so those end ones are not residues.
But the IUPAC definition says it can lose one, or the other, or both, which would include the end ones as part of the definition.
Re: the mass of a protein is the sum of all its residues plus the mass of a single water molecule / your second definition, they are equivalent. You could re-write your second version as (sum of all residues minus left and right most) + (sum of left and right residues + two molecules of water) - (one molecule of water) === sum of residues + one water molecule.
(The key is that residues are created whenever a molecule of water is expelled. In theory, you can expel exactly one water molecule from each amino acid to create a residue, with one hydrogen from the amine group and the hydroxyl from the carboxylic acid. In reality, it takes two amino acids to start the process of peptide synthesis, and each aa contributes only one part (either the proton or the hydroxyl group) of the theoretically expelled water molecule. In reality it's not even a single water molecule, as the hydroxyl group is going to pick up a spare proton from solution and the extra proton on the amine will get picked up by gawd knows what) <-- Downright wrong (at least, in regards to its importance to residues.).
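Spelled out as a short derivation. The symbols are introduced only for this sketch: r_i is the residue mass at position i of an n-unit chain and w is the mass of one water. The first line uses the problem text's narrower reading, where only interior units count as residues and the two end units are counted as whole amino acids.

```latex
\begin{align*}
M &= \underbrace{\sum_{i=2}^{n-1} r_i}_{\text{interior residues}}
   + \underbrace{(r_1 + w) + (r_n + w)}_{\text{two end amino acids}}
   - w \\
  &= \sum_{i=1}^{n} r_i + w
\end{align*}
```

Starting instead from n free amino acids and removing one water per junction gives the same result, since the sum of (r_i + w) over all n units minus (n-1)w is again the sum of the r_i plus a single w.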
The condensation is connected in this instance but not in all instances of residues. I think the usage is more clear if you look at some of the other examples in the biochemistry section of the wiki page
"...a residue refers to a specific monomer within the polymeric chain of a polysaccharide, protein or nucleic acid. One might say, "This protein consists of 118 amino acid residues" or "The histidine residue is considered to be basic because it contains an imidazole ring." (https://en.wikipedia.org/wiki/Residue_(chemistry))
You're right. Sometimes the answer key is wrong. I have to explain this to my professors from time to time, and it's always annoying. And in those cases I have paid money to be graded incorrectly.
I would be happy if I were you though. The point of this exercise is to learn, and I'll bet you'll remember that water molecule for a long time :-)
As an aside - monoisotopic mass is a strange one to use.
In the real world you are a mixture of isotopes, so it's better to use the average mass (average of the different isotope masses, corrected for abundance) if you want to compare to experimentally determined masses - say from mass spec.
It's not as if average mass is more complex - for the sake of these calculations it's still just a number looked up from a table...
ie why oh why use the wrong value when it's just as easy to use the right one (*)?
(*) true it's biology so there isn't a right one in all circumstances - lots of interesting effects eg enzymes having slightly different rates of incorporation for different isotopes - however it's closer to the truth than mono-isotopic.
> The monoisotopic mass is not used frequently in fields outside of mass spectrometry because other fields cannot distinguish molecules of different isotopic composition. For this reason, mostly the average molecular mass or even more commonly the molar mass is used. For most purposes such as weighing out bulk chemicals only the molar mass is relevant since what one is weighing is a statistical distribution of varying isotopic compositions.
> This concept is most helpful in mass spectrometry because individual molecules (or atoms, as in ICP-MS) are measured, and not their statistical average as a whole. Since mass spectrometry is often used for quantifying trace-level compounds, maximizing the sensitivity of the analysis is usually desired. By choosing to look for the most abundant isotopic version of a molecule, the analysis is likely to be most sensitive, which enables even smaller amounts of the target compounds to be quantified. Therefore, the concept is very useful to analysts looking for trace-level residues of organic molecules, such as pesticide residue in foods and agricultural products.
However, for proteins - which, even if broken down to small peptides in the mass spec, have large numbers of C, N, O, H atoms - monoisotopic makes no sense.
> Leftmost peaks in isotopic clusters correspond to molecules containing only the lowest-mass isotopes of all their atoms: all carbon atoms are C-12, all hydrogen atoms are H-1, all nitrogen atoms are N-14, and so on. These peaks are known to those skilled in the art as monoisotopic peaks. While each chemical species of molecule manifests itself in the mass spectrum as an isotopic cluster, it is characterized by only one monoisotopic peak, thus it became common practice to characterize molecules in the mass range of up to approximately 10 kDa by their monoisotopic masses. For example, it became common practice to use monoisotopic masses in protein identification methods based on comparing mass spectral data to databases of masses of protein fragments.
https://www.sciencedirect.com/topics/biochemistry-genetics-a... also disagrees, quoting from "Protein Identification by Peptide Mass Fingerprinting (PMF)", Nachimuthu Saraswathy, Ponnusamy Ramalingam, in Concepts and Techniques in Genomics and Proteomics, 2011
> 13.4 Data analysis and identification of protein
> The peak list is compared with a peak list generated from the database proteins. The commonly used computer search engines are MS-Fit, Mascot, Peptident, Profound, etc. The monoisotopic mass of the each peak, the protease used, the number of missed cleavages in order to account for the possible incomplete digestion are given as input.
That is, it appears that monoisotopic mass makes good sense for small peptides in mass spectroscopy.
Let's take a short tryptic peptide.
LQGIVSWGSGCAQK
Formula is C62 N18 O19 S1 H100
Let's look at C - C12 is around 98.93% natural abundance, N 99.6%, O 99.76%, H 99.98%.
If we forget the others for simplicity of calculation and only look at C, then the probability of getting a peptide with all C12 is 0.9893^62 ~ 0.51, ie only half of the sample will be at the monoisotopic mass - double the length of the peptide and it's down to a quarter - for a full-length protein you are looking at vanishingly small amounts.
The original problem was to calculate masses of things up to 1000aa - something of 1000aa would have a frequency of monoisotopic species of 2.04058E-21 - ie only about a thousand molecules out of the 6x10^23 in a mole.
The value of monoisotopic values decreases as the size and complexity of the molecule goes up.
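The arithmetic above can be reproduced with a few lines. This is a sketch under the same simplification the comment makes (carbon only, each atom an independent draw); the function name is made up, and scaling the carbon count linearly from the 14-residue example to 1000 residues is an assumption.

```python
# Natural abundance of carbon-12, as quoted in the comment above.
C12_ABUNDANCE = 0.9893


def all_c12_fraction(n_carbons: float, abundance: float = C12_ABUNDANCE) -> float:
    """Fraction of molecules whose carbons are all 12C, treating each
    carbon atom as an independent draw and ignoring N, O, H and S."""
    return abundance ** n_carbons


if __name__ == "__main__":
    carbons_per_residue = 62 / 14  # from the 14-residue example peptide
    for n_residues in (14, 28, 1000):
        frac = all_c12_fraction(carbons_per_residue * n_residues)
        print(f"{n_residues:>5} aa: {frac:.3g}")
```

Including the other elements would push every fraction lower still, so these figures are upper bounds.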
You wrote: "The value of monoisotopic values decreases as the size and complexity of the molecule goes up."
Yes, that is in agreement with the text I quoted earlier - "it became common practice to characterize molecules in the mass range of up to approximately 10 kDa by their monoisotopic masses".
Note (from another part of the second link I gave) "This eight amino acid peptide was named GmPep914 (DHPRGGNY), based on its monoisotopic mass."
So, there's plenty of clear evidence that people do use monoisotopic masses for mass spectra analysis of at least some peptides.
What is your point? That this question is poorly written? I think I started this thread to point that out.
My point was that while for small molecules, the mono-isotopic mass makes perfect sense as it's the major species, for larger proteins it isn't and indeed becomes a vanishingly small proportion.
Note that for something around 10 kDa the difference between the average and mono-isotopic mass will be ~6 daltons - with the experimental accuracy around ~1 dalton!
Depends on your definition of small - but I accept the point that if the peptides are small enough it can be useful and I went too far there.
Remember the problem posed was to calculate the mass of proteins up to 1000aa, where the difference between mono-isotopic mass and real average mass would be many tens of daltons - much more than the missing water!
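A sketch of where the ~6 dalton figure comes from, using the elemental formula given earlier for LQGIVSWGSGCAQK. The atomic masses are standard reference values supplied for this example rather than anything from the thread, and the 10 kDa number is a linear extrapolation that assumes a similar elemental composition.

```python
# Atomic masses in daltons: (monoisotopic, average). These are standard
# reference values supplied for this sketch, not taken from the thread.
ATOMIC_MASS = {
    "C": (12.0, 12.0107),
    "H": (1.00782503, 1.00794),
    "N": (14.0030740, 14.0067),
    "O": (15.9949146, 15.9994),
    "S": (31.9720707, 32.065),
}

# Elemental formula quoted earlier for LQGIVSWGSGCAQK.
FORMULA = {"C": 62, "H": 100, "N": 18, "O": 19, "S": 1}


def masses(formula: dict[str, int]) -> tuple[float, float]:
    """Return (monoisotopic, average) mass for an elemental formula."""
    mono = sum(ATOMIC_MASS[el][0] * n for el, n in formula.items())
    avg = sum(ATOMIC_MASS[el][1] * n for el, n in formula.items())
    return mono, avg


if __name__ == "__main__":
    mono, avg = masses(FORMULA)
    gap = avg - mono
    print(f"monoisotopic {mono:.3f} Da, average {avg:.3f} Da, gap {gap:.3f} Da")
    # Linear extrapolation to 10 kDa, assuming similar composition.
    print(f"gap per 10 kDa ~ {gap / mono * 10_000:.1f} Da")
```

For this peptide the gap is a little under 1 Da at roughly 1.4 kDa, which scales to about 6 Da at 10 kDa.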
I think the problem is poorly worded, which lends itself to the confusion we experience, unless we know (presumably from the context of the goals of this project) that these physical details are beyond the scope of the project.
That is, I think "useful" here is meant as "useful in learning to program", not "useful in actual mass spectra analysis."
Eg, you write "calculate the mass of proteins up to 1000aa".
There's a couple of picky details.
1) The text says "total weight of [A protein string] P", where "The standard weight assigned to each member of the 20-symbol amino acid alphabet is the monoisotopic mass of the corresponding amino acid".
(I'm ignoring that "weight" != "mass" because in this context those are synonyms.)
It reads like the text defines "standard weight" in terms of the monoisotopic mass, and asks to compute that weight. So the problem is not asking to "calculate the mass of proteins up to 1000aa" but "calculate the monoisotopic mass of proteins up to 1000aa." It further says "all amino acid masses are assumed to be monoisotopic unless otherwise stated".
(Alas, the explanatory text goes on to say "There are two standard ways of computing the mass", which contradicts the assertion that there's a "the standard weight.")
2) It says "protein string" in the problem, not protein, and clarifies that "In the following several problems on applications of mass spectrometry, we avoid the complication of having to distinguish between residues and non-residues by only considering peptides excised from the middle of the protein."
That is, the problem posed was not to calculate the "[monoisotopic] mass of proteins" but something more like the "[monoisotopic] mass of peptides excised from the middle of the protein, represented as a protein string."
3) In trying to understand this @$%@#$%&%$ topic more, I found https://patentimages.storage.googleapis.com/42/6b/2b/ec3a694... which I believe says that for very accurate mass spectrometers, for large proteins, the most abundant mass may be more useful than the average mass.
It's supposed to be an educational tool - bioinformatics is more than writing programs to add up numbers; it's about understanding the science behind it.
So I didn't like the question because it treated the problem as a simple 'write a program to add up a list of numbers, based on a lookup table', rather than addressing the real science issues around protein mass.
(Here the real challenges actually come from post-translational modifications - which make mass matching a very hard problem indeed - with lots of challenges for anyone in computer science.)
In my top-level comment I asked: How (in)correct are the other answers? I-am-not-a-bioinformatics-programmer.
As a variation, how many other questions do you not like?
http://rosalind.info/problems/hamm/ computes the Hamming distance between two strings, using the simplification that only point mutations are important. This is of course not a true reflection of the science behind comparing two DNA strings.
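For reference, the simplified computation being discussed fits in a few lines. A minimal sketch, with a made-up function name and made-up input strings:

```python
def hamming_distance(s: str, t: str) -> int:
    """Count positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(s, t))


if __name__ == "__main__":
    # Made-up DNA strings for illustration.
    print(hamming_distance("GATTACA", "GACTATA"))
```

Insertions and deletions are outside what this can express, which is exactly the simplification in question.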
Do you therefore also not like that question? Which others don't you like, because of simplifications they don't explain?
As an education tool, is it not useful to discard complexity in the process of bootstrapping towards the full details?
> A lie-to-children (plural lies-to-children) is a simplified explanation of technical or complex subjects as a teaching method for children and laypeople. The technique has been incorporated by academics within the fields of biology, evolution, bioinformatics and the social sciences.
I was going to suggest that the result was bad because of floating point error, but then I reread the value, and it doesn't seem like that amount of variance could be produced by errors introduced in the floating point calculations?
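A quick way to see the scale of the effect, as a sketch: sum a worst-case list of 1000 masses naively and compare with `math.fsum`, which returns the correctly rounded sum of the same floats. The 186.07931 value is assumed to be the largest entry in the commonly published monoisotopic table; it is not quoted in this thread.

```python
import math

# Largest value in the commonly published monoisotopic residue table
# (tryptophan); treated as an assumption here.
HEAVIEST_RESIDUE = 186.07931
N_RESIDUES = 1000  # the problem's stated maximum length

masses = [HEAVIEST_RESIDUE] * N_RESIDUES
naive = sum(masses)        # plain left-to-right float addition
exact = math.fsum(masses)  # correctly rounded sum of the same floats

if __name__ == "__main__":
    print(f"naive - exact = {naive - exact:.3e} Da")
    print("missing water = 18.01056 Da")
```

The accumulated rounding error is many orders of magnitude smaller than a water molecule's mass, so it cannot explain a discrepancy of that size.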
First off, the login page doesn't redirect to the HTTPS version of the page, so it's sending my password in plaintext. What makes this worse is that when I manually go to the TLS page, it gives me a PR_END_OF_FILE_ERROR (I'm running Firefox 72.0.2, on Alpine Linux).
The second thing is picking the first example (the character counting problem). Clicking on the thing, it told me that the important words are highlighted, and that the words 'figure N' refer to the figures on the right -- which felt unnecessary, because it's something that anyone visiting Wikipedia, or browsing a book, would know.
It’s a great site and greatly accelerated my learning of programming.
The form of learning which I call “problem based” learning is a great format for me. You learn from reading up on a topic. You learn from trying different solutions. Finally, you learn from seeing other people’s answers once you’ve solved it.
Also check out:
Hackerrank.com - all around focus
Project Euler - math focus
Leetcode - more oriented towards interview training but still useful and fun.
We used a version of this site for a bioinformatics algorithm class a couple of years ago (we used the site for part of the homework assignments; I guess the auto-grading of code saves the instructors time...)
The problems are interesting and fun to solve, though they didn’t have a lot of context. They seem to have added some at the start of each problem.
I think some sites are like books: they are timeless. Rosalind is one of them. I'd add Philip Greenspun's /books (http://philip.greenspun.com/books/)
[1] https://discourse.julialang.org/t/biostar-handbook-computati...
[2] https://www.biostarhandbook.com/
[3] https://github.com/BioJulia/biojulia_handbook/issues/1