In 2011, I began playing around with ideas on how to automate the sequencing of short, modified base-containing RNA oligomers from data generated by liquid chromatography followed by tandem mass spectrometry (LC/MS/MS). In the end, I picked up a new programming language (.NET), learned quite a bit about nucleic acid mass spectrometry, and learned how a side project that I wanted to “spend a few weeks” on can turn into a challenging, but rewarding, two-year journey.
Background
Many cellular RNA molecules contain more than the four main ribonucleosides: cytidine (C), uridine (U), guanosine (G), and adenosine (A). Modified ribonucleosides extend the functional diversity of the available chemical groups, in much the same way that the different amino acids allow for the nearly endless polypeptide combinations that give rise to a wide range of protein functions. tRNAs are especially rich in modified nucleosides, which contribute to the maintenance of tRNA structure, tRNA isoacceptor identification for proper amino acid charging, and accurate decoding during translation. Determining the sequences of these modified base-containing RNAs is critical for our overall understanding of life, but it is not a trivial pursuit. Even RNA containing the simplest of modifications, such as 1-methylguanosine (m1G), cannot be sequenced using traditional methods of RNA and DNA sequencing (Sanger, next-gen, etc.), as no matching complementary base can be placed opposite the modified base by a reverse transcriptase. Instead, tandem mass spectrometry is the preferred method of analysis. Described simply, it breaks the RNA oligomer into smaller parts (collision-induced dissociation), reads their mass-to-charge ratios, and produces a spectrum of all of the different breakdown products. That spectrum can be used to determine not only the nucleotide composition (modified and unmodified), but also the exact sequence in which the nucleotides lie.
The data generated by this technology is complex. Above is the theoretical spectrum of the simple RNA 4-mer UACGp. This would be an easy data set to sequence, but if we look at real data for the same 4-mer (below), we see that the spectrum is further complicated by several factors: instrument noise, contaminant DNA and RNA, isotope distributions, salt adducts, and likely other issues.
The challenge is in the interpretation of this data. First, knowledge of how RNA oligomers fragment when subjected to collision-induced dissociation is required. Second, the masses of the expected unmodified and modified nucleosides need to be known, and the mass corrections required for each fragmentation product need to be calculated. There are at least 108 nucleosides with unique masses, and the identification of new modified nucleosides continues. Finally, the interpretation must be done manually by slowly building sequences, one fragmentation product at a time, until a sequence fits. Overall, it is a time-consuming effort that requires specialized knowledge.
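To give a feel for the bookkeeping involved, here is a minimal Python sketch of the mass arithmetic for an intact oligomer. It is not taken from RoboOligo; the function and table names are made up for illustration, and it assumes a linear oligomer with a 5'-hydroxyl and a 3'-phosphate (like UACGp) measured in negative ion mode. It only covers the four main nucleosides plus m1G, and it does not attempt the fragment-specific mass corrections.

```python
import re

# Monoisotopic atomic masses in Da.
ATOM = {"H": 1.00782503, "C": 12.0, "N": 14.0030740, "O": 15.9949146, "P": 30.9737620}
PROTON = 1.00727647

# Elemental formulas of the four main nucleosides, plus m1G as one
# modified example (guanosine + CH2). A real table would be much longer.
NUCLEOSIDE = {
    "C": "C9H13N3O5",
    "U": "C9H12N2O6",
    "A": "C10H13N5O4",
    "G": "C10H13N5O5",
    "m1G": "C11H15N5O5",
}


def formula_mass(formula):
    """Monoisotopic mass of a simple formula such as 'C9H12N2O6'."""
    pairs = re.findall(r"([A-Z][a-z]?)(\d*)", formula)
    return sum(ATOM[element] * int(count or 1) for element, count in pairs)


HPO3 = formula_mass("HPO3")
H2O = formula_mass("H2O")


def oligo_mass(nucleosides):
    """Neutral mass of a linear oligomer with a 5'-OH and a 3'-phosphate.

    Assumption: every nucleoside carries one phosphate (n in total) and
    each of the n - 1 phosphodiester links costs one water.
    """
    n = len(nucleosides)
    total = sum(formula_mass(NUCLEOSIDE[x]) for x in nucleosides)
    return total + n * HPO3 - (n - 1) * H2O


def mz_negative(mass, charge):
    """m/z of the ion formed by removing `charge` protons."""
    return (mass - charge * PROTON) / charge


if __name__ == "__main__":
    uacgp = oligo_mass(["U", "A", "C", "G"])
    print(f"UACGp neutral mass: {uacgp:.4f}")
    print(f"[M-H]-   m/z: {mz_negative(uacgp, 1):.4f}")
    print(f"[M-2H]2- m/z: {mz_negative(uacgp, 2):.4f}")
```

Working from elemental formulas rather than a hard-coded mass per nucleoside means a newly identified modification only needs its formula added to the table.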
RoboOligo
When I first began to learn how to interpret this kind of data, I was surprised to find that very few people had worked on developing software that could be used as an analytical aid. There are literally hundreds, approaching thousands, of specialized software packages for analyzing protein mass spectrometry data, which dwarfs the approximately ten that are specific to nucleic acids.
So, I set out to pursue a simple idea about how to automate the sequencing process. Two years later, I had developed a program (RoboOligo) that could reliably determine the de novo sequence of short, complex RNA oligomers and that was especially well suited for analyzing biologically derived tRNAs. The program also features a manual sequencing function that allows the user to sequence data by simply clicking nucleosides from a list, which are then incorporated into the theoretical sequence. Meanwhile, the program calculates all associated CID fragmentation products, searches the spectrum for matching peaks, and labels them.
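The manual mode described above boils down to "extend the sequence, recompute the expected fragments, look for matching peaks." The Python sketch below illustrates that loop in a deliberately simplified form and is not RoboOligo's actual code. As assumptions, it only considers singly deprotonated 5' pieces that keep a 3'-phosphate (it ignores the ion-type-specific corrections and the 3' fragment series), it uses a fixed m/z tolerance, and the peak list in the usage example is synthetic. All names are hypothetical.

```python
from bisect import bisect_left

# Monoisotopic nucleotide residue masses in Da (nucleoside + HPO3 - H2O).
RESIDUE = {"U": 306.02530, "A": 329.05252, "C": 305.04129, "G": 345.04743}
WATER = 18.01056
PROTON = 1.00728


def ladder_mz(sequence):
    """m/z of singly deprotonated 5' pieces (5'-OH ... 3'-phosphate)."""
    ladder = []
    running = WATER
    for i, base in enumerate(sequence, start=1):
        running += RESIDUE[base]
        ladder.append((f"{sequence[:i]}p", running - PROTON))
    return ladder


def label_peaks(sequence, peaks, tol=0.05):
    """Match each theoretical ladder m/z to the nearest observed peak.

    peaks: list of (mz, intensity) tuples.
    Returns [(label, theoretical_mz, observed_mz or None), ...].
    """
    mzs = sorted(mz for mz, _ in peaks)
    labelled = []
    for label, theo in ladder_mz(sequence):
        i = bisect_left(mzs, theo)
        # Only the neighbours on either side of the insertion point
        # can be the nearest peak.
        candidates = mzs[max(i - 1, 0): i + 1]
        best = min(candidates, key=lambda mz: abs(mz - theo), default=None)
        if best is not None and abs(best - theo) > tol:
            best = None
        labelled.append((label, round(theo, 4), best))
    return labelled


if __name__ == "__main__":
    # Synthetic peaks, invented for this example only.
    peaks = [(323.03, 10), (500.0, 5), (652.08, 20), (957.50, 3), (1302.17, 50)]
    for label, theo, obs in label_peaks("UACG", peaks):
        print(label, theo, obs if obs is not None else "no match")
```

Each click in a manual sequencing interface would simply call `label_peaks` again with the longer sequence; a candidate nucleoside that produces no new labelled peak is a hint that the last choice was wrong.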
Once the paper has been submitted for publication and the requisite edits are complete, RoboOligo will be released to the public along with all source code.
Thanks to Dr. Kirk Gaston and Dr. Pat Limbach of the University of Cincinnati for their significant help.
I wonder if I could demo RoboOligo for a few days, just to sequence my Cap fragments and degradation products.
Hannah, please contact me. BioPharma Finder can do this if you acquired the data on an Orbitrap.
robert.ross2@thermofisher.com
If BioPharma Finder can do the de novo sequencing, what kinds of operations are possible?
1) Elemental analysis
2) Fragmentation pattern