RoboOligo

In 2011, I began playing around with ideas on how to automate the sequencing of short, modified base-containing RNA oligomers based on the data generated from liquid chromatography followed by tandem mass spectrometry (LC/MS/MS). In the end, I picked up a new programming framework (.NET), learned quite a bit about nucleic acid mass spectrometry, and learned how a side project that I wanted to “spend a few weeks” on can turn into a challenging, but rewarding, two-year journey.

Background

Many cellular RNA molecules contain more than the four main ribonucleosides: cytidine (C), uridine (U), guanosine (G), and adenosine (A). Modified ribonucleosides extend the range of available chemical groups in much the same way that the different amino acids allow the nearly endless polypeptide combinations behind the wide range of protein function. tRNAs are especially rich in modified nucleosides, which contribute to the maintenance of tRNA structure, tRNA isoacceptor identification for proper amino acid charging, and accurate decoding during translation.

Determining the sequences of these modified base-containing RNAs is critical for our overall understanding of life, but it is not a trivial pursuit. Even RNA containing the simplest of modifications, such as 1-methylguanosine (m1G), cannot be sequenced using traditional methods of RNA and DNA sequencing (Sanger, next-gen, etc.), as no matching complementary base can be placed opposite the modified base by a reverse transcriptase. Instead, tandem mass spectrometry is the preferred method of analysis. Described simply, it breaks the RNA oligomer into smaller parts (collision-induced dissociation), reads their mass-to-charge ratios, and produces a spectrum of all the breakdown products. That spectrum can be used to determine not only the nucleotide composition (modified and unmodified) but also the exact order in which the nucleotides lie.

The breaking up of an RNA oligomer into smaller oligomers by collision-induced dissociation (CID). The most common fragmentation products are labeled c, y, a-B, and w; the mass-to-charge ratios of these products are recorded by the mass spectrometer.

The theoretical CID fragmentation spectrum of the RNA oligomer UACGp. The 5′ to 3′ sequence can be determined by reading the ‘c’ ion series, while the 3′ to 5′ sequence can be determined by reading the ‘y’ ion series. The two, along with the a-B and w ion series, must agree in order to correctly sequence the input RNA.
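Reading an ion series comes down to looking at the gaps between neighboring peaks: each gap should equal the mass of one nucleotide residue. The sketch below illustrates that idea only and is not RoboOligo's actual algorithm. The function name `read_ladder`, the tolerance, and the starting value of the synthetic ladder are assumptions made for illustration; the residue masses are monoisotopic values for a nucleoside monophosphate minus water, and the ladder start is arbitrary rather than a real c ion mass.

```python
# Monoisotopic masses (Da) of nucleotide residues inside a chain,
# i.e. nucleoside monophosphate minus water.
RESIDUE_MASS = {
    "C": 305.04129,
    "U": 306.02530,
    "A": 329.05252,
    "G": 345.04743,
    "mG": 359.06308,  # any singly methylated G; the isomers share this mass
}


def read_ladder(peaks, charge=1, tol=0.01):
    """Translate the gaps between consecutive ladder peaks into residues.

    peaks:  m/z values belonging to one ion series (all at the same charge)
    charge: absolute charge state of the series
    tol:    allowed mass error in Da (an illustrative value)
    """
    calls = []
    ordered = sorted(peaks)
    for lighter, heavier in zip(ordered, ordered[1:]):
        gap = (heavier - lighter) * charge  # convert the m/z gap to a mass gap
        hits = [name for name, mass in RESIDUE_MASS.items()
                if abs(mass - gap) <= tol]
        # Report "?" when the gap matches no residue or more than one.
        calls.append(hits[0] if len(hits) == 1 else "?")
    return calls


# Synthetic ladder: the starting value is arbitrary, NOT a real c ion mass.
ladder = [500.0]
for residue in "ACG":
    ladder.append(ladder[-1] + RESIDUE_MASS[residue])

print(read_ladder(ladder))  # ['A', 'C', 'G']
```

Because only differences are used, the terminal mass correction of the series cancels out. Identifying the first residue in a series still requires that correction, and isomeric modifications (for example, the different methylguanosines) cannot be told apart by mass alone.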

The data generated by this technology is complex. Above is the theoretical spectrum of the simple RNA 4-mer UACGp. This would be an easy data set to sequence, but if we look at real data for the same 4-mer (below), we see that the spectrum is further complicated by several factors: instrument noise, contaminating DNA and RNA, isotope distributions, salt adducts, and likely others.

Real collision-induced dissociation tandem mass spectrometry data for the 4-mer UACGp.

The challenge is in the interpretation of this data. First, knowledge of how RNA oligomers fragment when subjected to collision-induced dissociation is required. Second, the masses of the expected unmodified and modified nucleosides need to be known (there are at least 108 nucleosides with unique masses, and new modified nucleosides continue to be identified), and the mass corrections required for each fragmentation product need to be calculated. Finally, the interpretation must be done manually by slowly building sequences, one fragmentation product at a time, until a sequence fits. Overall, it is a time-consuming effort that requires specialized knowledge.
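To give a feel for the arithmetic involved, here is a minimal sketch that computes the monoisotopic mass of an intact oligomer and its m/z at a given charge. It makes two assumptions that the text above does not state: the oligomer has a 5′-hydroxyl and a 3′-phosphate (as the trailing "p" in UACGp suggests), and the ions are formed in negative-ion mode by loss of protons. The helper names `neutral_mass` and `mz_negative` are hypothetical, and the extra corrections needed for each fragment ion type (c, y, a-B, w) are deliberately left out.

```python
PROTON = 1.007276  # mass of a proton, Da
WATER = 18.010565  # monoisotopic mass of H2O, Da

# Monoisotopic residue masses (Da): nucleoside monophosphate minus water.
RESIDUE_MASS = {
    "C": 305.04129,
    "U": 306.02530,
    "A": 329.05252,
    "G": 345.04743,
}


def neutral_mass(sequence):
    """Mass of a linear RNA, assuming a 5'-OH and a 3'-phosphate."""
    return sum(RESIDUE_MASS[n] for n in sequence) + WATER


def mz_negative(mass, charge):
    """m/z of the ion formed by removing `charge` protons (negative mode)."""
    return (mass - charge * PROTON) / charge


m = neutral_mass("UACG")
print(f"{m:.3f}")                  # 1303.177
print(f"{mz_negative(m, 1):.3f}")  # 1302.170
print(f"{mz_negative(m, 2):.3f}")  # 650.581
```

Supporting a modified nucleoside in this sketch would only require one more entry in the mass table, but every fragment ion type needs its own terminal correction on top of the residue sum.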

RoboOligo

When I first began to learn how to interpret this kind of data, I was surprised to find that very few people had worked on developing software that could be used as an analytical aid. There are hundreds, approaching thousands, of specialized software packages for analyzing protein mass spectrometry data, which dwarfs the approximately ten packages specific to nucleic acids.

So, I set out to pursue a simple idea about how to automate the sequencing process. Two years later, I had developed a program (RoboOligo) that could reliably determine the de novo sequence of short, complex RNA oligomers and that was especially well suited for analyzing biologically derived tRNAs. The program also features a manual sequencing function that allows the user to sequence data by simply clicking nucleosides in a list, which are then incorporated into the theoretical sequence. Meanwhile, the program calculates all associated CID fragmentation products and searches for and labels the matching peaks within the spectrum.
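The peak-labeling step can be pictured as a tolerance search: for each theoretical m/z, find the nearest observed peak and accept it if it lies within a small window. The sketch below is one plausible way to do that, not RoboOligo's implementation. The function name `label_peaks`, the 20 ppm default, and all labels and m/z values in the example are placeholders invented for illustration.

```python
from bisect import bisect_left


def label_peaks(theoretical, observed_mz, ppm=20.0):
    """Match labeled theoretical m/z values to the nearest observed peak.

    theoretical: dict mapping a label to its theoretical m/z
    observed_mz: observed m/z values, sorted in ascending order
    ppm:         relative tolerance in parts per million (illustrative)
    Returns a dict of label -> matched observed m/z; unmatched labels
    are omitted.
    """
    matches = {}
    for label, target in theoretical.items():
        window = target * ppm / 1e6
        i = bisect_left(observed_mz, target)
        # The closest peak sits either just below or just above the target.
        candidates = observed_mz[max(i - 1, 0):i + 1]
        if not candidates:
            continue
        nearest = min(candidates, key=lambda mz: abs(mz - target))
        if abs(nearest - target) <= window:
            matches[label] = nearest
    return matches


# Placeholder values only; these are not real RNA fragment masses.
theoretical = {"ion_a": 400.000, "ion_b": 650.500, "ion_c": 900.250}
observed = [399.995, 512.300, 650.520, 900.251]

print(label_peaks(theoretical, observed))
# {'ion_a': 399.995, 'ion_c': 900.251}
```

Here ion_b goes unlabeled because its nearest peak is about 31 ppm away. A real tool would also need to weigh peak intensity and handle several charge states, which this sketch ignores.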

Once the paper has been submitted for publication and the requisite edits are complete, RoboOligo will be released to the public along with all source code.

The main user interface of RoboOligo.

Thanks to Dr. Kirk Gaston and Dr. Pat Limbach of the University of Cincinnati for their significant help.

