Bioinformatics

We develop new AI methods to address challenging problems in biology.

Deep learning for Top-Down Mass-Spectral Deconvolution

Abstract: Top-down mass spectrometry has become the main method for intact proteoform identification, characterization, and quantitation. Because of the complexity of top-down mass spectrometry data, spectral deconvolution is an indispensable step in spectral data analysis, which groups spectral peaks into isotopic envelopes and extracts monoisotopic masses of precursor or fragment ions. The performance of spectral deconvolution methods relies heavily on their scoring functions, which distinguish correct envelopes from incorrect ones. A good scoring function increases the accuracy of deconvoluted masses reported from mass spectra. In this paper, we present EnvCNN, a convolutional neural network-based model for evaluating isotopic envelopes. We show that the model outperforms other scoring functions in distinguishing correct envelopes from incorrect ones and that it increases the number of identifications and improves the statistical significance of identifications in top-down spectral interpretation.

Abdul Rehman Basharat, Xia Ning, and Xiaowen Liu. EnvCNN: A Convolutional Neural Network Model for Evaluating Isotopic Envelopes in Top-Down Mass-spectral Deconvolution. Analytical Chemistry, 92(11):7778–7785, May 2020. PMID: 32356965. https://pubs.acs.org/doi/10.1021/acs.analchem.0c00903

Ranking-based Convolutional Neural Network Models for Peptide-MHC Class I Binding Prediction

Abstract: T-cell receptors can recognize foreign peptides bound to major histocompatibility complex (MHC) class-I proteins, and thus trigger the adaptive immune response. Therefore, identifying peptides that can bind to MHC class-I molecules plays a vital role in the design of peptide vaccines. Many computational methods, for example, the state-of-the-art allele-specific method MHCflurry, have been developed to predict the binding affinities between peptides and MHC molecules. In this manuscript, we develop two allele-specific Convolutional Neural Network-based methods named ConvM and SpConvM to tackle the binding prediction problem. Specifically, we formulate the problem as optimizing the rankings of peptide-MHC bindings via ranking-based learning objectives. Such optimization is more robust and tolerant to the measurement inaccuracy of binding affinities, and therefore enables more accurate prioritization of binding peptides. In addition, we develop a new position encoding method in ConvM and SpConvM to better identify the most important amino acids for the binding events. We conduct a comprehensive set of experiments using the latest Immune Epitope Database (IEDB) datasets. Our experimental results demonstrate that our models significantly outperform the state-of-the-art methods including MHCflurry with an average percentage improvement of 6.70% on AUC and 17.10% on ROC5 across 128 alleles.

Ziqi Chen, Martin Renqiang Min, and Xia Ning. Ranking-based convolutional neural network models for peptide-MHC Class I binding prediction. Frontiers in Molecular Biosciences, 8:128, May 2021. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8165219/ (Code is available here).
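To make the idea of a ranking-based learning objective from the abstract above more concrete, here is a minimal sketch of a pairwise hinge ranking loss. It is an illustration only and not necessarily the objective used in ConvM or SpConvM; the function name, the `margin` and `min_gap` parameters, and the convention that a higher affinity value means a stronger binder are all assumptions made for this example.

```python
from itertools import combinations


def pairwise_hinge_ranking_loss(scores, affinities, margin=1.0, min_gap=0.0):
    """Mean hinge loss over peptide pairs whose measured affinities differ.

    scores:     model outputs, higher = predicted stronger binder (assumed).
    affinities: measured binding strengths, higher = stronger (assumed scale).
    min_gap:    hypothetical tolerance; pairs whose measurements differ by no
                more than this are skipped as too noisy to order.
    """
    losses = []
    for i, j in combinations(range(len(scores)), 2):
        gap = affinities[i] - affinities[j]
        if abs(gap) <= min_gap:
            continue  # measurement difference too small to trust the order
        stronger, weaker = (i, j) if gap > 0 else (j, i)
        # Penalize the pair unless the stronger binder is scored at least
        # `margin` above the weaker one.
        losses.append(max(0.0, margin - (scores[stronger] - scores[weaker])))
    return sum(losses) / len(losses) if losses else 0.0


affinities = [0.9, 0.5, 0.1]
well_ranked = pairwise_hinge_ranking_loss([2.0, 0.5, -1.0], affinities)
mis_ranked = pairwise_hinge_ranking_loss([-1.0, 0.5, 2.0], affinities)
```

Because the loss depends only on the relative order of each pair, a moderate error in a measured affinity changes the loss only if it flips that order, which is one way to read the robustness argument in the abstract.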

Improving MHC Class I Antigen-Processing Predictions Using Representation Learning and Cleavage Site-Specific Kernels

MHCrank takes a uniform-length N-flank + peptide + C-flank sequence, C-terminal cleavage site (see gray box), and the peptide’s original length before padding or trimming as input. The amino acids comprising the sequence and cleavage site-specific kernel (CSSK) undergo feature embedding. A convolution layer is applied to the embedding of the entire sequence. The remainder of the MHCrank architecture can be split into six components. Component (1) applies a mean pool to the convolution output corresponding to the N-flank. Component (2) applies a mean pool to the convolution output corresponding to the C-flank. The convolution output corresponding to the peptide sequence is forwarded to two stacked convolution layers. Components (3) and (4) each have two outputs (A and B) obtained from the output of these convolution layers. (3A) extracts the output corresponding to the peptide’s N-terminal amino acid. (4A) extracts the output corresponding to the peptide’s C-terminal amino acid. (3B) applies a mean pool to the peptide’s non-N-terminal amino acids. (4B) applies a mean pool to the peptide’s non-C-terminal amino acids. Component (5) applies a global kernel to the embedded CSSK. Component (6) is a single node that takes the peptide’s original length as input. Two dense layers are applied to the concatenated output of each component. The output from the second dense layer enters an output layer that predicts the probability of the input peptide undergoing antigen processing. Note that the layout of this diagram is largely inspired by the presentation of MHCflurry’s architecture (O’Donnell et al., 2020).
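The pooling and concatenation step described in this caption can be sketched in plain Python. This is a simplified illustration, not the MHCrank implementation: the convolution layers, the CSSK global kernel, and the dense layers are omitted, and their outputs are passed in as ready-made vectors. The function names, the argument layout, and the concatenation order are assumptions.

```python
def mean_pool(rows):
    """Average a non-empty list of equal-length vectors position-wise."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def assemble_components(n_flank, peptide, c_flank, cssk_feature, original_length):
    """Concatenate the six component outputs named in the caption.

    n_flank, c_flank: per-position vectors from the first convolution layer.
    peptide:          per-position vectors after the two stacked convolution
                      layers (at least two residues are required here).
    cssk_feature:     vector standing in for the global kernel applied to the
                      embedded CSSK (computed elsewhere in this sketch).
    """
    parts = [
        mean_pool(n_flank),        # (1) mean over the N-flank
        mean_pool(c_flank),        # (2) mean over the C-flank
        peptide[0],                # (3A) N-terminal residue
        mean_pool(peptide[1:]),    # (3B) mean over non-N-terminal residues
        peptide[-1],               # (4A) C-terminal residue
        mean_pool(peptide[:-1]),   # (4B) mean over non-C-terminal residues
        cssk_feature,              # (5) CSSK feature
        [float(original_length)],  # (6) original peptide length
    ]
    # Flatten into the single vector that would feed the dense layers.
    return [x for part in parts for x in part]


features = assemble_components(
    n_flank=[[1.0, 3.0], [3.0, 5.0]],
    peptide=[[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]],
    c_flank=[[0.0, 0.0], [2.0, 2.0]],
    cssk_feature=[7.0],
    original_length=9,
)
```

Keeping the terminal residues separate from the pooled remainder lets the downstream layers treat the cleavage-adjacent positions differently from the interior of the peptide.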

Abstract: In this work, we propose a new deep-learning model, MHCrank, to predict the probability that a peptide will be processed for presentation by MHC class I molecules. We find that the performance of our model is significantly higher than that of two previously published baseline methods: MHCflurry and netMHCpan. This improvement arises from utilizing both cleavage site-specific kernels and learned embeddings for amino acids. By visualizing site-specific amino acid enrichment patterns, we observe that MHCrank’s top-ranked peptides exhibit enrichments at biologically relevant positions and are consistent with previous work. Furthermore, the cosine similarity matrix derived from MHCrank’s learned embeddings for amino acids correlates highly with physiochemical properties that have been experimentally demonstrated to be instrumental in determining a peptide’s favorability for processing. Altogether, the results reported in this work indicate that MHCrank demonstrates strong performance compared with existing methods and could have vast applicability in aiding drug and vaccine development.

Patrick J. Lawrence and Xia Ning. Improving MHC Class I antigen processing predictions using representation learning and cleavage site-specific kernels. Cell Reports Methods, 2(9), 2022. https://www.sciencedirect.com/science/article/pii/S2667237522001758?via%3Dihub
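The abstract above mentions a cosine similarity matrix derived from learned amino acid embeddings. A minimal sketch of that computation follows; the two-dimensional embeddings are made-up toy values, not MHCrank's learned parameters, and the function name is an assumption.

```python
import math


def cosine_similarity_matrix(embeddings):
    """Return nested dicts sim[a][b] = cosine similarity of embeddings a and b.

    embeddings: dict mapping an amino acid symbol to a non-zero vector.
    """
    norms = {a: math.sqrt(sum(x * x for x in v)) for a, v in embeddings.items()}
    return {
        a: {
            b: sum(x * y for x, y in zip(va, vb)) / (norms[a] * norms[b])
            for b, vb in embeddings.items()
        }
        for a, va in embeddings.items()
    }


# Toy embeddings for illustration only (hypothetical values).
toy_embeddings = {"A": [1.0, 0.0], "V": [1.0, 1.0], "D": [-1.0, 0.0]}
similarity = cosine_similarity_matrix(toy_embeddings)
```

Amino acids with similar learned roles end up with similarity close to 1, which is what allows the matrix to be compared against known physicochemical groupings.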

Modeling Path Importance for Effective Alzheimer’s Disease Drug Repurposing

Abstract: Recently, drug repurposing has emerged as an effective and resource-efficient paradigm for Alzheimer’s disease (AD) drug discovery. Among various methods for drug repurposing, network-based methods have shown promising results as they are capable of leveraging complex networks that integrate multiple interaction types, such as protein-protein interactions, to more effectively identify candidate drugs. However, existing approaches typically assume paths of the same length in the network have equal importance in identifying the therapeutic effect of drugs. Other domains have found that same-length paths do not necessarily have the same importance. Thus, relying on this assumption may be deleterious to drug repurposing attempts. In this work, we propose MPI (Modeling Path Importance), a novel network-based method for AD drug repurposing. MPI is unique in that it prioritizes important paths via learned node embeddings, which can effectively capture a network’s rich structural information. Thus, leveraging learned embeddings allows MPI to effectively differentiate the importance among paths and enables enhanced drug repurposing compared to a commonly used baseline method developed by Cheng et al. We observe that among the top-50 ranked drugs, MPI prioritizes 20.0% more drugs with anti-AD evidence compared to the baseline. Finally, Cox proportional-hazard models produced from insurance claims data aid us in identifying the use of etodolac, nicotine, and BBB-crossing ACE-INHs as having a reduced risk of AD, suggesting such drugs may be viable candidates for repurposing and should be explored further in future studies.

Shunian Xiang, Patrick J. Lawrence, Bo Peng, ChienWei Chiang, Dokyoon Kim, Li Shen, and Xia Ning. Modeling path importance for effective Alzheimer’s Disease drug repurposing. In Pacific Symposium on Biocomputing (PSB), 2024.
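The contrast drawn in the abstract above, between treating all same-length paths as equally important and weighting them by learned node embeddings, can be illustrated with a small sketch. The scoring rule below (mean dot product between consecutive node embeddings, followed by a softmax over paths) is an assumption chosen for illustration and is not taken from the MPI paper; the node names and embedding values are hypothetical.

```python
import math


def path_score(path, emb):
    """Mean dot product between embeddings of consecutive nodes on a path."""
    dots = [
        sum(x * y for x, y in zip(emb[u], emb[v]))
        for u, v in zip(path, path[1:])
    ]
    return sum(dots) / len(dots)


def path_weights(paths, emb):
    """Softmax over path scores, so higher-scoring paths get more weight."""
    scores = [path_score(p, emb) for p in paths]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy network: two length-2 paths from a drug to the disease.
emb = {"drug": [1.0, 0.0], "p1": [1.0, 0.0], "p2": [0.0, 1.0], "AD": [1.0, 0.0]}
paths = [["drug", "p1", "AD"], ["drug", "p2", "AD"]]
weights = path_weights(paths, emb)
```

Under a uniform assumption both paths would receive weight 0.5; here the path through `p1` receives more weight because its nodes are closer in the embedding space.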