Plenary and Focus Sessions (Schedule and Abstracts)

Day 1 – Plenary Session #1

Evaluating Publicly Available Personal Health Records for Home Health

Authors: Laura Kneale, Yong Choi, Sean Mikles, George Demiris, University of Washington

Abstract: Personal health records (PHRs) were designed to encourage patient engagement. Frequent utilizers of the healthcare system, such as homebound older adults receiving home health services, may benefit from PHRs; however, PHRs have not been evaluated for use in home health. We identified existing PHRs using a systematic literature review, and we compared the similarities and differences in PHR functionality to evaluate how the existing systems would benefit home health clients.

We initially identified 97 PHRs, of which 22 met our inclusion criteria. Our preliminary findings suggest that significant gaps exist across the PHRs. For example, only 2 (9.1%) PHRs provided role-based proxy access for informal caregivers, 6 (27.3%) allowed users to upload PDF documents from previous clinical encounters, and 4 (18.2%) were flexible in allowing consumers to choose what data elements to track (e.g., weight, diet, clinical values). In addition, we are currently assessing the PHRs’ usability from a home health client, informal caregiver, and home health nurse perspective.

We suggest that available PHRs may be difficult to implement in home health. In this talk, we will provide recommendations to improve utility, and ultimately utilization, of PHRs with home health clients.


Data in Emergency Department Provider Notes at Time of Image Order Entry

Authors: Justin Rousseau, Ivan Ip, Ali Raja, Vlad Valtchinov, Ramin Khorasani, Harvard Medical School, Brigham and Women’s Hospital

Abstract: Objective: Identify opportunities to improve the communication between ordering providers and radiologists at the time of image ordering, which currently is insufficient, posing a patient safety concern. Materials and Methods: We evaluated observational data documented in electronic health record (EHR) notes prior to image ordering from 666 consecutive Emergency Department encounters over an 18-month study period for adult patients with headaches during which head CT was performed. We compared relevant concepts specific to headache extracted via ontology-based natural language processing of notes to image order requisitions. Results: History of present illness (HPI) was initially submitted in 33.9% and completed in 23.4% of encounters prior to image ordering. The number of concepts specific to headache per note was significantly greater than the number of indications per image order requisition (median 3 vs. 1; p<0.0001). There was no significant difference between the number of concepts in HPIs completed prior to image ordering compared to those completed after image ordering (p=0.07). Discussion: EHR documentation provides a source of valuable information that could be used in an automated fashion to facilitate and enhance the imaging ordering process. Conclusion: Future work is needed to assess the utility of EHR data prior to image ordering.

Pediatric ECG Feature Identification

Authors: Emily P Hendryx1, Craig G Rusin2, Beatrice M Riviere1

1 Rice University, Houston, TX, 2 Baylor College of Medicine and Texas Children’s Hospital, Houston, TX

Abstract: Since each part of the electrocardiogram (ECG) corresponds to a different stage in the cardiac cycle, tracking changes in individual ECG features over time can help physicians gain further insight into changes in a patient’s clinical status. However, expecting physicians to fully analyze ECG subtleties in real time while analyzing the rest of the presented patient data is impractical, especially over longer periods of time. The goal of this work, therefore, is to automate the ECG feature-identification process on a beat-by-beat basis.

While some algorithms for identifying individual ECG features exist, these methods typically rely on specific timing thresholds and are derived from adult data. To better serve the pediatric population – specifically those with congenital heart disease – we are developing a library of key pediatric ECG morphologies using data collected from the bedside monitors at Texas Children’s Hospital. Key morphologies for the library are identified via the CUR matrix factorization. This beat selection leads to the definition of morphology classes to be used in conjunction with dynamic time warping in identifying individual ECG features in unlabeled beats. The labeled features can then be considered in the development of predictive models for real-time clinical decision support.

This research was funded by a training fellowship from the Gulf Coast Consortia, on the Training Program in Biomedical Informatics, National Library of Medicine (NLM) T15LM007093, PD – Lydia E. Kavraki.
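As an illustration of the dynamic time warping step described in the abstract above, the following is a minimal sketch and not the authors' implementation. It assumes a single labeled template beat (a stand-in for one library morphology) and transfers its feature labels to an unlabeled beat along the warping path. The function names, the toy signals, and the label positions are all hypothetical.

```python
import math


def dtw_path(a, b):
    """Classic dynamic time warping: return (total cost, alignment path)."""
    n, m = len(a), len(b)
    # cost[i][j] is the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path


def transfer_labels(template, template_labels, beat):
    """Map each labeled template sample onto the beat via the DTW path."""
    _, path = dtw_path(template, beat)
    mapped = {}
    for name, t_idx in template_labels.items():
        matches = [b_idx for (ti, b_idx) in path if ti == t_idx]
        # If several beat samples align to one template sample, take the middle one.
        mapped[name] = matches[len(matches) // 2]
    return mapped


# Toy signals (hypothetical): the beat is a time-stretched copy of the template.
template = [0, 1, 0, 5, 0, 2, 0]
template_labels = {"P": 1, "R": 3, "T": 5}
beat = [0, 0, 1, 0, 0, 5, 0, 2, 0, 0]
labels = transfer_labels(template, template_labels, beat)
```

In this toy case the stretched beat aligns to the template at zero cost, so each label lands on the matching peak. Real bedside-monitor data would need preprocessing and a choice among several morphology classes, which the sketch does not attempt.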


Learning to Diagnose with LSTM Recurrent Neural Networks

Authors: Zachary C Lipton, David C Kale, Charles Elkan, Randall Wetzell, University of California, San Diego

Abstract: Clinical medical data, especially in the ICU, consist of multivariate time series of observations. For each patient visit, sensor data and lab test results are recorded in the patient’s electronic health record. While potentially containing a wealth of insights, the data are difficult to mine effectively, owing to varying length, irregular sampling and missing data. Recurrent neural networks, particularly those using Long Short-Term Memory (LSTM) hidden units, are powerful and increasingly popular models for learning from sequence data. They effectively model varying length sequences and capture long range dependencies. We present the first study to empirically evaluate the ability of LSTMs to recognize patterns in multivariate time series of clinical measurements. Specifically, we consider multilabel classification of diagnoses, training a model to classify 128 diagnoses given 13 frequently but irregularly sampled clinical measurements. First, we establish the effectiveness of a simple LSTM network for modeling clinical data. Then we demonstrate a straightforward and effective training strategy in which we replicate targets at each sequence step. Trained only on raw time series, our models outperform several strong baselines, including a multilayer perceptron, recognizing diabetic ketoacidosis, idiopathic scoliosis, asthma and brain neoplasms all with AUC > .85 and F1 > .5.
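The idea of replicating targets at each sequence step can be sketched as a loss computation. This is a hedged illustration, not the authors' training code: it assumes the network emits a vector of per-diagnosis probabilities at every time step, and it averages the multilabel log loss over all steps with equal weight. The equal weighting and the function names are assumptions; the abstract does not specify them.

```python
import math


def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean multilabel log loss for one time step."""
    total = 0.0
    for p, t in zip(probs, targets):
        total += t * math.log(max(p, eps)) + (1 - t) * math.log(max(1 - p, eps))
    return -total / len(targets)


def final_step_loss(step_probs, targets):
    """Conventional setup: only the last step's output is scored."""
    return binary_cross_entropy(step_probs[-1], targets)


def replicated_target_loss(step_probs, targets):
    """Target replication: score every step against the same episode labels."""
    losses = [binary_cross_entropy(p, targets) for p in step_probs]
    return sum(losses) / len(losses)


# Two time steps, two diagnoses; the early step is still uninformative.
step_probs = [[0.5, 0.5], [0.9, 0.1]]
targets = [1, 0]
loss_final = final_step_loss(step_probs, targets)
loss_replicated = replicated_target_loss(step_probs, targets)
```

Scoring the early, uncertain step raises the loss, so the error signal reaches every step of the sequence rather than only the last one.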


Automatic Detection of Drug-Drug Interactions Between Clinical Practice Guidelines 

Authors: Geoffrey J Tso1,2, Samson W Tu2, Mark A Musen2, Mary K Goldstein1,2

1Dept. of Veterans Affairs VA Palo Alto Health Care System, Palo Alto, CA; 2Stanford University School of Medicine, Stanford, CA

Abstract: Since many patients have multiple chronic conditions, they are commonly prescribed many medications that can potentially have clinically significant drug-drug interactions (DDI). However, these DDIs are rarely discussed in clinical practice guidelines (CPG). Knowing potential interactions between treatment plans is important in point of care clinical decision making and in clinical decision support (CDS) systems for patients with multiple chronic conditions. In this study, we describe and validate a method for automatically detecting DDIs between CPG recommendations. The system extracts drug and drug class recommendations from narrative CPGs, normalizes the terms, creates a mapping of drugs and drug classes, and then identifies occurrences of DDIs between CPG pairs. We analyzed 75 CPGs written by national organizations in the United States that discuss outpatient management of common chronic diseases. Using a reference list of 360 high risk and clinically significant DDIs as determined by an expert panel, our preliminary analysis identifies 108 of these DDIs in 38 CPG pairs (18 unique CPGs). Four of the CPGs contained specific discussion about these possible high risk DDIs. This study identifies important gaps in CPGs and provides a method to prevent clinically significant DDIs in a CDS system supporting multiple chronic conditions.
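The pipeline outlined in the abstract above (normalize terms, map drug classes to drugs, then check guideline pairs against a reference list) can be sketched as follows. This is an illustrative sketch only: the guideline names, the class-to-drug mapping, and the reference DDI list are hypothetical placeholders, not the study's data and not a clinical reference.

```python
from itertools import combinations


def normalize(term):
    """Lowercase and trim a drug or drug-class term."""
    return term.strip().lower()


def expand(recommendations, class_members):
    """Replace drug-class names with their member drugs; keep plain drugs as-is."""
    drugs = set()
    for term in recommendations:
        t = normalize(term)
        drugs |= class_members.get(t, {t})
    return drugs


def find_ddis(cpgs, class_members, ddi_reference):
    """Return {(cpg_a, cpg_b): set of reference DDIs spanning the two guidelines}."""
    hits = {}
    for name_a, name_b in combinations(sorted(cpgs), 2):
        drugs_a = expand(cpgs[name_a], class_members)
        drugs_b = expand(cpgs[name_b], class_members)
        found = set()
        for pair in ddi_reference:
            x, y = sorted(pair)
            # A DDI counts when one drug comes from each guideline.
            if (x in drugs_a and y in drugs_b) or (y in drugs_a and x in drugs_b):
                found.add(pair)
        if found:
            hits[(name_a, name_b)] = found
    return hits


# Hypothetical inputs for illustration.
cpgs = {
    "atrial_fibrillation": ["Warfarin"],
    "osteoarthritis": ["NSAIDs"],
    "hypertension": ["lisinopril"],
}
class_members = {"nsaids": {"ibuprofen", "naproxen"}}
ddi_reference = {
    frozenset({"warfarin", "ibuprofen"}),
    frozenset({"lisinopril", "spironolactone"}),
}
hits = find_ddis(cpgs, class_members, ddi_reference)
```

Storing each interaction as an unordered `frozenset` keeps the check symmetric, so the order in which the two guidelines are compared does not matter.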

Day 1 – Focus Session A1

Conserved Elongation Factor Spt5 Affects Antisense Transcription in Fission Yeast

Authors: Scott P Kallgren1, Ameet Shetty2, Burak H Alver1, Peter J Park1, Fred Winston2

1Department of Biomedical Informatics, 2Department of Genetics, Harvard Medical School, Boston, Massachusetts, USA.

Abstract: Spt5 is the only transcription elongation factor conserved in all three domains of life, but its molecular mechanisms are not yet thoroughly studied genomically. From an inducible depletion strain, we sequenced nascent transcripts (NET-seq), mature mRNA (RNA-seq), and RNA polymerase II-associated chromatin (ChIP-seq) to elucidate general effects of Spt5 on transcription. These show an increase in 5’ CDS antisense transcription by RNA-seq and a general accumulation of RNA Pol II at the 5’ ends of genes by exogenous spike-in-normalized ChIP-seq. We are currently analyzing NET-seq to determine: 1) whether novel antisense transcripts are resulting from new transcription or aberrant decay, 2) whether antisense transcript accumulation affects sense transcription, and 3) how Spt5 affects RNA Pol II pausing positions and magnitude. These results will provide insight into how Spt5 functions to facilitate RNA Pol II elongation in diverse organisms.



Genotype to Phenotype Relationships in Autism Spectrum Disorders

Authors: Jonathan Chang, Columbia University, Sarah R Gilman, Columbia University, Andrew H Chiang, Columbia University, Stephan J Sanders, UCSF & Dennis Vitkup, Columbia University

Abstract: Autism spectrum disorders (ASDs) are characterized by phenotypic and genetic heterogeneity. Our analysis of functional networks perturbed in ASD suggests that both truncating and nontruncating de novo mutations contribute to autism, with a bias against truncating mutations in early embryonic development. We find that functional mutations are preferentially observed in genes likely to be haploinsufficient. Multiple cell types and brain areas are affected, but the impact of ASD mutations appears to be strongest in cortical interneurons, pyramidal neurons and the medium spiny neurons of the striatum, implicating cortical and corticostriatal brain circuits. In females, truncating ASD mutations on average affect genes with 50–100% higher brain expression than in males. Our results also suggest that truncating de novo mutations play a smaller role in the etiology of high-functioning ASD cases. Overall, we find that stronger functional insults usually lead to more severe intellectual, social and behavioral ASD phenotypes.

Longitudinal Metabolome Wide Association Study of Cognitive Decline in Healthy Adults

Authors: Burcu F Darst, Ronald Gangnon, Joshua J Coon, Sterling C Johnson, Corinne D Engelman, University of Wisconsin, Madison

Abstract: Despite being the sixth leading cause of death in the US and its steadily increasing prevalence, little is known about the cause of late onset Alzheimer’s disease (AD). Several metabolomics studies of AD were recently published, but the examination of metabolomic profiles prior to AD diagnosis is important to distinguish predictive versus diagnostic profiles, since the disease process itself influences metabolites. Using longitudinal plasma samples from the Wisconsin Registry for Alzheimer’s Prevention (WRAP), a cohort study enrolling initially asymptomatic participants enriched with a parental history of AD, metabolomic profiles were quantified using mass spectrometry for 28 participants showing cognitive decline and 55 matched cognitively stable participants. A metabolome-wide association study (MWAS) was performed using conditional random effects logistic regression models with strata for gender and age, which participants were matched on. Of the 615 metabolites tested, 20 met statistical significance after adjusting for multiple testing, 10 of which were amino acids that all showed decreased levels in cases. This aligns with recent research suggesting that a lack of essential amino acids could lead to neuronal death in the hippocampus, a hallmark characteristic of AD. Further research is necessary to determine the role amino acids play in the onset of AD.

Day 1 – Focus Session A2

Predicting Required Diagnostic Tests from Patient Triage Data

Authors: Haley Hunter-Zinck, Stephan Gaehde, Department of Veterans Affairs, VA Boston Healthcare System

Abstract: Emergency departments are continuously working to increase patient satisfaction and reduce length of stay. Laboratory tests or imaging procedures are often ordered only after evaluation of the patient by a provider. Accurate prediction of complaint specific diagnostic testing has the potential to allow testing to be initiated immediately after patient triage and reduce length of stay. We investigated whether we could predict patients’ ordered tests from data collected at triage. Using the National Hospital Ambulatory Medical Care Survey from the Centers for Disease Control and Prevention, we extracted from approximately 20,000 patient visits information that would be available upon triage or from previous medical history as well as procedures ordered during the visit. Using a multivariate machine learning framework, we assessed prediction performance and the relative importance of each data feature.

Prediction performance varied greatly depending on the test but mostly due to its frequency of administration. For example, we predicted the order of a complete blood count, administered in 44% of sampled visits, with 78% accuracy. Several variables were important for prediction across all procedures, including arrival by ambulance, acuity score, age, and injury. Overall, we have adequate information in triage data alone to predict relatively common test ordering.

Classification of Literature Derived Drug Side Effect Relationships 

Authors: Justin Mower1,2, Devika Subramanian3, Trevor Cohen1,2

1 Baylor College of Medicine, Houston, TX, 2 University of Texas Health Science Center at Houston, Houston, TX, 3 Rice University, Houston, TX

Abstract: Adverse drug events (ADEs) are one of the leading causes of preventable patient morbidity and mortality. An important aspect of post-marketing drug surveillance involves identifying potential side-effects utilizing ADE reporting systems and/or Electronic Health Records. Due to the inherent noise of these data, identified drug/ADE associations must be manually reviewed by domain experts – a human-intensive process that scales poorly with large numbers of possibly dangerous associations and rapid growth of biomedical literature.

Consequently, recent work has employed scalable Literature Based Discovery methods, which exploit implicit relationships between biomedical entities within the literature to assist in identifying plausible drug/ADE connections. We extend this work by evaluating machine learning classifiers applied to high-dimensional vector representations of relationships extracted from the literature by the SemRep Natural Language Processing system, as a means to identify true drug/ADE connections. Evaluating against a manually curated reference standard, we show that applying a classifier to such representations improves performance over previous approaches. These trained systems are able to reproduce outcomes of the extensive manual literature review process used to create the reference standard, paving the way for assisted, automated review as an integral component of the pharmacovigilance process.

This research was funded by a training fellowship from the Gulf Coast Consortia, on the Training Program in Biomedical Informatics, National Library of Medicine (NLM) T15LM007093, PD – Lydia E. Kavraki.



Assessing the Potential Risk in Drug Prescriptions During Pregnancy 

Authors: Ferdinand Dhombres, Vojtech Huser, Olivier Bodenreider, National Library of Medicine

Abstract: Background: Eighty percent of the pregnant women in the US have at least one drug prescription during pregnancy. In 2015 the FDA introduced new drug labeling regulations, with narrative summaries describing the risk and supporting evidence. Objectives: To assess the potential risk in drug prescriptions during pregnancy, with respect to the new FDA standard. Methods: As a proxy for the FDA standard, we used narrative recommendations from a reference textbook (Briggs, 10th ed. 2015). We analyzed claims data of 159.7M patients from 2003 to 2014. We identified pregnant women by procedure codes for delivery and extracted prescriptions 270 days before delivery. We used the RxNorm API to relate drugs from claims data to the reference. Results: Of the 15,815,624 systemic drugs prescribed to 3,741,743 pregnant women, 93% were covered by the reference. The distribution among 6 broad categories was: “compatible with pregnancy” or “probably compatible” (41.2%), “low risk” (16.2%), “moderate risk” (39.3%), and “high risk” or contraindicated (3.29%). Interestingly, a majority of the risk assessment was supported by evidence from human data. Conclusions: This investigation demonstrates the feasibility of assessing the potential risk in drug prescriptions during pregnancy, with respect to the new FDA standard, as well as stronger evidence.
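The extraction step in the Methods (prescriptions in the 270 days before delivery) can be sketched as a simple date filter. This is a minimal sketch under stated assumptions: claims are represented as hypothetical `(fill_date, drug_name)` tuples, the window includes its first day, and the delivery date itself is excluded. The abstract does not specify these boundary choices.

```python
from datetime import date, timedelta


def prescriptions_before_delivery(delivery_date, prescriptions, window_days=270):
    """Keep prescriptions filled within window_days before the delivery date."""
    window_start = delivery_date - timedelta(days=window_days)
    return [
        (filled, drug)
        for filled, drug in prescriptions
        if window_start <= filled < delivery_date
    ]


# Hypothetical claims for one patient.
delivery = date(2012, 10, 1)
claims = [
    (delivery - timedelta(days=300), "drug_a"),  # before the window
    (delivery - timedelta(days=270), "drug_b"),  # first day of the window
    (delivery - timedelta(days=30), "drug_c"),   # inside the window
    (delivery + timedelta(days=5), "drug_d"),    # after delivery
]
in_window = prescriptions_before_delivery(delivery, claims)
```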


Day 1 – Focus Session A3

Uncertainty Quantification (UQ) in Breast and Ovarian Cancer Risk Prediction Based on Self-Reported Family History

Authors: Lance Pflieger and Julio C Facelli, Department of Biomedical Informatics, University of Utah

Abstract: Risk prediction models, such as BRCAPRO, BOADICEA and Claus, have been developed in order to identify patients’ risk of developing Hereditary Breast and Ovarian Cancer. These models assume that patient family health history is accurate and complete; however, family history information collected in a typical clinical setting is known to be imprecise. Using UQ methodologies, we show substantial uncertainty in risk classifications. For our analysis, we generated binomial distributions using family history accuracies found in the literature. These distributions were used in Monte Carlo simulation to reclassify the lifetime risk of a known pedigree into risk categories defined by the American Cancer Society. We found, on average, that up to 55% of high-risk pedigrees are misclassified into lower risk categories, with large disparities between best- and worst-case accuracy scenarios. Risk was frequently misclassified into a lower risk category as self-reported specificities are generally higher than sensitivities. Our work implies that: i) UQ of the risk prediction needs to be considered when recommending a course of action; ii) better family history collection tools are needed to decrease uncertainty. This study provides a generalizable method for UQ that can be applied to other biomedical fields that use predictive modeling.
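The Monte Carlo procedure described above can be sketched as follows. This is a hedged illustration rather than the study's code: each relative's self-reported status is drawn as a Bernoulli trial governed by an assumed sensitivity and specificity, and the reported pedigree is then scored by a deliberately simplistic, hypothetical rule (`toy_risk_category`) standing in for a real model such as BRCAPRO. The accuracy values and category names are placeholders.

```python
import random


def simulate_reports(true_status, sensitivity, specificity, rng):
    """Draw one self-reported family history from the true affected statuses."""
    reported = []
    for affected in true_status:
        if affected:
            # A true case is reported with probability equal to the sensitivity.
            reported.append(rng.random() < sensitivity)
        else:
            # An unaffected relative is misreported with probability 1 - specificity.
            reported.append(rng.random() >= specificity)
    return reported


def toy_risk_category(reported):
    """Hypothetical stand-in for a risk model: count reported affected relatives."""
    n = sum(reported)
    if n >= 2:
        return "high"
    if n == 1:
        return "moderate"
    return "average"


def reclassification_rates(true_status, sensitivity, specificity, n_draws=10000, seed=0):
    """Fraction of simulated reports that fall into each risk category."""
    rng = random.Random(seed)
    counts = {"average": 0, "moderate": 0, "high": 0}
    for _ in range(n_draws):
        reported = simulate_reports(true_status, sensitivity, specificity, rng)
        counts[toy_risk_category(reported)] += 1
    return {category: c / n_draws for category, c in counts.items()}


# A pedigree that is truly "high" under the toy rule (two affected relatives).
pedigree = [True, True, False, False, False, False]
perfect = reclassification_rates(pedigree, sensitivity=1.0, specificity=1.0)
imperfect = reclassification_rates(pedigree, sensitivity=0.7, specificity=0.98)
```

With perfect reporting every draw stays in the high category. With sensitivity below one, some draws drop to lower categories, which is the direction of misclassification the abstract reports.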


Performance Drift in Clinical Prediction Across Modeling Methodologies

Authors: Sharon E Davis, Thomas A Lasko, Guanhua Chen, and Michael E Matheny, Vanderbilt University

Abstract: Integrating prediction models into real-time electronic health record decision support can enhance patient and provider decision-making. However, model accuracy can degrade over time as clinical practice and patient populations change, limiting the utility and impact of such models. We explore whether and how modeling methodologies exacerbate or alleviate performance drift by comparing temporal performance of models developed using common statistical and machine learning techniques. We modeled acute kidney injury among hospitalized patients in a national dataset of admissions to Veterans Affairs facilities (n=1,841,951). Admissions in 2003 served as the development cohort, and we assessed performance within 3-month quarters in 2004-2012. Across all models, discrimination was maintained and calibration declined during validation years 1 and 3. The event rate and case mix drifted over time, while predictor-outcome associations did not. We hypothesize that settings with pronounced association drift may lead to differential calibration drift across models and are implementing parallel analysis modeling hospital mortality and readmission to assess performance in cohorts affected by different combinations of event rate, case mix, and association drift. Understanding methods-based differences in performance drift may inform implementation strategies balancing the need to maintain acceptable levels of calibration and efficient use of analytic resources.

Sample-Specific Sparsity Adjustment Improves Differential Abundance Analysis of 16S rRNA Data

Authors: Liyang Diao1, Glen Satten2, Hongyu Zhao3

1 Yale University, Department of Medical Informatics, 2 Centers for Disease Control and Prevention, 3 Yale University School of Public Health, Department of Biostatistics

Abstract: The analysis of microbiome data presents many statistical challenges, especially when the data are very sparse. Although various methods have been proposed to normalize data and address data sparsity, their performance is less than satisfactory. While adjusting counts with a simple pseudocount is a relatively common practice, its effects have not been studied in highly sparse data, where they might affect downstream results the most.

We propose two methods to adjust highly sparse data, and compare the performance of these against fixed pseudocount adjustments, specifically focusing on how downstream results are affected by the adjustments combined with various library size normalization methods. We find that our proposed sample-specific adjustment methods can outperform the pseudocount method in both simulated and experimental data sets, improving the ability of researchers to find true differentially abundant bacteria in 16S rRNA data. 
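To make the baseline concrete, the sketch below shows only the fixed pseudocount adjustment combined with total-sum scaling. The abstract does not describe the proposed sample-specific methods, so they are not shown. The count vectors and function names are hypothetical. The example illustrates why the choice of pseudocount matters when a taxon is absent from a sample.

```python
import math


def relative_abundance(counts, pseudocount=0.0):
    """Add a fixed pseudocount to every taxon, then scale to proportions."""
    adjusted = [c + pseudocount for c in counts]
    total = sum(adjusted)
    return [a / total for a in adjusted]


def log2_fold_change(counts_a, counts_b, taxon, pseudocount):
    """Log2 ratio of one taxon's adjusted relative abundance between two samples."""
    pa = relative_abundance(counts_a, pseudocount)[taxon]
    pb = relative_abundance(counts_b, pseudocount)[taxon]
    return math.log2(pa / pb)


# Hypothetical sparse samples: taxon 0 is absent from sample B.
sample_a = [50, 0, 950]
sample_b = [0, 10, 990]
lfc_one = log2_fold_change(sample_a, sample_b, taxon=0, pseudocount=1.0)
lfc_half = log2_fold_change(sample_a, sample_b, taxon=0, pseudocount=0.5)
```

Halving the pseudocount shifts the estimated fold change for the zero-count taxon by about one log2 unit here, even though the observed counts are unchanged.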

Day 1 – Plenary Session #2

Predicting Drug Response Curves in a Large Cancer Cell Line Screen

Authors: Nathan H Lazar, Mehmet Gonen, Shannon McWeeney, Adam Margolin, Kemal Sonmez,

Oregon Health & Science University

Abstract: Precision oncology aims to improve cancer patient outcomes by tailoring treatment to an individual patient’s tumor. In order to find genetic markers that predict response, several large cancer cell line (CCL) screens have been performed measuring the growth of CCLs when treated with a panel of drugs at varying doses. The current computational tools used in this area reduce these data to a single value indicating response for each CCL/drug combination. This simplification eliminates a large amount of the experimental data, cannot produce measures of uncertainty and consequently shows poor agreement across studies.

My method uses a three-dimensional tensor factorization framework to predict the full dose-response curve for each CCL/drug combination. Mutation, copy number and expression data for CCLs as well as target and structural features for drugs are used as predictors and parameters are estimated using Bayesian variational approximation. When applied to the largest data set of this type (907 cell lines, 545 drugs and 16 doses) the method can accurately predict responses for CCLs and drugs that are not included in the training set. Additionally, by using sparsity-inducing priors the model can highlight relationships between the CCL genomics and drug features that govern response.

Aggressive Glioblastoma Phenotype Evolves Over Decade-Long Growing Phase

Authors: Daniel I S Rosenbloom, Jiguang Wang, Erik Ladewig, Sakellarios Zairis, Raul Rabadan

Columbia University Medical Center

Abstract: Longitudinal studies of tumor genomics have revealed that tumor evolution rarely follows a linear order of mutation accrual. Instead, lesions observed at later timepoints can lose mutations relative to earlier timepoints, suggesting that these later lesions are evolutionary “throwbacks” that diverged from an initial clone years before diagnosis. We developed an evolutionary model to quantify this process and estimate timing of events in tumorigenesis. Applying our model to whole-exome sequences of 92 glioblastoma patients, we found that half (45/92) exhibit genetically distinct diagnosis and relapse samples, with no shared subclonal mutations. Genetic substitution rates among these patients were remarkably consistent, with a median [interquartile range] of 0.028 [0.018 – 0.041] substitutions per megabase-year. Most strikingly, the common ancestor of diagnosis and relapse samples was estimated to have preceded diagnosis by over a decade in most patients (median 12.6 years, IQR 7.2 – 22.6 years). This long divergence time, coupled with mutational patterns observed in EGFR, TP53, PDGFRA, and other known driver genes, suggests that accumulation of driver alterations in glioblastoma occurs over a decade(s)-long growing phase. This phase results in a diverse population, each clone capable of experiencing a unique set of genetically driven expansions.



Unsupervised Deep Learning Reveals Prognostically Relevant Subtypes of Glioblastoma

Authors: Jonathan D Young, Chunhui Cai, Xinghua Lu, University of Pittsburgh

Abstract: Understanding the cellular signal transduction pathways that drive cells to become cancerous is fundamental to developing personalized cancer therapies that decrease the morbidity and mortality of cancer. The purpose of this study was to develop an unsupervised deep learning model for finding meaningful, lower-dimensional representations of cancer gene expression data. Ultimately, we hope to use these representations to reveal hierarchical relationships (pathways) involved in cancer pathogenesis.

We downloaded 7,528 gene expression samples (each with 15,404 features) across 17 different cancer types from TCGA. We developed a Python deep learning library, which included an unsupervised implementation of a Stacked Restricted Boltzmann Machine (SRBM) – Deep Autoencoder (DA).

Extensive model selection identified a promising hidden layer architecture for this dataset. Logistic regression to predict the pathological N-stage of the samples, using the final hidden layer representations as input, performed better than a proportionally random or tissue-type based classifier. Consensus clustering of the low-dimensional representations allowed for more robust clustering than clustering the high-dimensional input data. Consensus clustering of glioblastoma samples across all models identified 6 clusters with differential prognosis. Numerous novel and previously reported glioblastoma subtype-specific genes were found to be significantly correlated with each glioblastoma subtype.

An SRBM-DA deep learning model can be trained to represent meaningful abstractions of cancer gene expression data that provide novel insight into patient survival. Ultimately, deep learning and consensus clustering revealed a subclass of the proneural glioblastoma subtype that was enriched with G-CIMP phenotype samples and demonstrated improved prognosis.



Computational Studies of Protein-Protein Interface Mutations

Authors: Jennifer C Gaines, Corey S O’Hern, and Lynne Regan, Yale University

Abstract: Computational methods are invaluable for assessing the significance of patient DNA variants uncovered in clinical DNA sequencing. Despite major advances, current approaches have found limited success in predicting the change in binding due to mutations at protein-protein interfaces. Here, we implement a hard-sphere model for amino acid structure to study natural and designed protein-protein interfaces. We show that a hard-sphere model of amino acids can recapitulate the side chain dihedral angle distributions for amino acids at natural protein-protein interfaces. In addition, we calculate the packing fraction in naturally occurring interfaces and find that it is comparable to dense random packing in protein cores. We then study the effects of mutations at protein-protein interfaces using a dataset of experimentally studied interface mutations. Our model will enable the prediction of the change in binding energy due to mutations at protein-protein interfaces, many of which are involved in disease onset and progression.

Modeling of the Minimally Gained Significant Region of Trisomy 12 in Chronic Lymphocytic Leukemia

Authors: Zachary Abrams1, Lynne Abruzzo2, Kevin Coombes1, Philip Payne1

1Department of Biomedical Informatics; 2Department of Pathology, The Ohio State University

Abstract: Chromosomal abnormalities, gains and losses, are among the strongest independent predictors of rapid disease progression and inferior survival in chronic lymphocytic leukemia (CLL). One common CLL cytogenetic aberration is trisomy 12 (tr12), with the gaining of an additional copy of chromosome 12 (c12). This aberration is difficult to model genetically so the underlying genetic drivers in tr12 CLL cases are unknown.

We utilized a lab-developed karyotype parsing and modeling system, the loss-gain-fusion model, which transforms text-based karyotype data into a binary vector for large-scale analysis. We observed 776 CLL patients’ karyotypes to determine if there are differentially gained regions on c12.

We counted gains by breaking c12 into individual cytogenetic bands, then measuring whether particular sub-bands showed higher gains. We identified band 12q24 as the most gained region on c12 (gained in 22.8% of the population) compared to the rest of c12 (gained in 21.8% of patients). This suggests 12q24 may be the minimally required c12 gain to drive CLL progression.
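The band-level counting described above can be sketched over binary gain vectors. This is a minimal sketch under stated assumptions: each patient is a row of 0/1 flags, one per cytogenetic band, as a simplified stand-in for the output of the loss-gain-fusion model. The band list and the four toy patients are hypothetical and do not reproduce the study's numbers.

```python
def gain_frequency(gain_matrix, bands):
    """Fraction of patients with a gain in each band."""
    n_patients = len(gain_matrix)
    return {
        band: sum(row[k] for row in gain_matrix) / n_patients
        for k, band in enumerate(bands)
    }


def most_gained(frequencies):
    """Band with the highest gain frequency."""
    return max(frequencies, key=frequencies.get)


# Hypothetical data: columns follow the order of `bands`.
bands = ["12p13", "12q13", "12q24"]
gain_matrix = [
    [1, 1, 1],  # whole-chromosome gain
    [0, 0, 1],  # partial gain covering only 12q24
    [1, 1, 1],  # whole-chromosome gain
    [0, 0, 0],  # no gain on chromosome 12
]
frequencies = gain_frequency(gain_matrix, bands)
top_band = most_gained(frequencies)
```

A partial gain that covers only one band is what separates that band's frequency from the rest of the chromosome.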

In 20 cases where the only cytogenetic aberration was tr12 we looked at the mRNA expression profile and mapped c12’s location on each RNA transcript. We then measured if 12q24’s protein coding genes were differentially overexpressed compared to other c12 regions. Thus we identified genes that are overexpressed in tr12 that potentially isolate the minimally gained region on c12 related to CLL progression.

Day 2 – Focus Session B1

A Bioinformatics Approach to Identify Novel Drugs Against Liver Cancer

Authors: Tasneem Motiwala1, Kelly Regan1, Ryan Reyes2, Samson T Jacob2, Philip R O Payne1

1Biomedical Informatics, The Ohio State University, Columbus, Ohio

2Molecular Virology, Immunology and Medical Genetics, The Ohio State University, Columbus, Ohio

Abstract: The high cost and relative inefficiency of traditional drug discovery approaches have led to a growing interest in drug repositioning. By identifying new indications for existing drugs, drug repurposing offers promise in reducing cost, decreasing drug development timeframe and improving success rates in the clinic. Further, it is an important advancement for diseases like liver cancer that do not respond well to standard therapy and are in urgent need of effective therapy. Here, through a connectivity-mapping approach, we identified novel drugs for use as first-line therapy in the treatment of hepatocellular carcinoma (HCC) or following progression on sorafenib. Connectivity mapping uses pattern-matching algorithms to compare genome-wide gene expression changes related to biological states of interest (e.g., tumor vs. normal, or drug-resistant vs. sensitive cells) against a database of gene expression signatures of various cell lines with drug or gene perturbations. Using this approach, we have identified several drugs that could potentially reverse the gene expression signature of primary HCC and/or sorafenib resistance. Two of the drug hypotheses tested in in vitro growth inhibition and colony formation assays validate the specificity of the prediction. Currently, work is underway to explore the mechanisms of the therapeutic effects of these drugs.

Signatures of Accelerated Somatic Evolution on a Genome-wide Scale

Authors: Kyle S Smith, Debashis Gosh, University of Colorado, Anschutz Medical Campus


Abstract: Using a computational method called SASE-hunter we identified a novel signature of accelerated somatic evolution (SASE) marked by a significant excess of somatic mutations localized in a genomic locus, and prioritized those loci that carried the signature in multiple cancer patients. Detection of clinically relevant signatures of somatic evolution in the promoters of known cancer genes in lymphoma raised testable hypotheses whether SASE could be detected in other cancer types as well, and whether these signatures could be detected in non-coding regions outside gene promoters. The current SASE-hunter method is insufficient to meet the need, and a genome-wide assessment requires development of a novel algorithm, which is more advanced than the original SASE-hunter and has sufficient statistical power to detect SASEs at a genome-wide scale. SASE-mapper is a powerful tool for the identification of SASEs on a genome-wide scale. In addition to those signatures of accelerated somatic evolution previously discovered by SASE-hunter, SASE-mapper identifies many regions in the non-coding regions of the genome outside of promoters associated with alterations in gene expression and clinical outcomes. SASE-mapper is written in Python 2.7 and available at

Identification and Validation of CNVs using WGS Data from 274 Individuals


Authors: David Jakubosky, Christopher DeBoever, Angelo Arias, Hiroko Matsui, Naoki Nariai, Agnieszka D’Antonio-Chronowska, He Li, Kelly A Frazer, University of California, San Diego

Abstract: Copy number variants (CNVs) are an important source of inter-individual genetic variation and contribute to quantitative traits and complex diseases. Algorithms utilizing discordant and split read pair information are used to identify smaller CNVs (50bp-3kb) and those using read depth discover larger CNVs (>=2Kb). Thus, a combination of approaches must be used for CNV discovery, adding complexity to obtaining a complete set of CNVs and data quality control. Here we use high read-depth (40X) whole genome sequence (WGS) data to call CNVs in 274 individuals, of which 195 are in families (including 30 trios and 25 sets of monozygotic twins) and 79 are unrelated to anyone else in the collection. We found 16013 CNVs, with a minor allele frequency > 1% and ranging in length from 50bp to 209kb (median length = 3049bp). Based on segregation analysis and concordance between twins we estimate that ~80% of the multi-allelic CNVs and ~99% of the biallelic CNVs are valid. Using transcriptome data generated from induced pluripotent stem cells derived from 215 of these individuals, we found 422 genes with significant CNV associations, including 180 genes with CNV lead variants. We demonstrate that high quality CNVs can be called using high read-depth WGS data.
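
The twin-concordance check mentioned above can be sketched as follows: monozygotic twins should carry the same copy number at each CNV site, so the fraction of matching calls gives a rough validity estimate. The data layout (sample name mapped to a list of copy-number calls, with None for missing calls) and the sample names are assumptions for illustration, not the authors' pipeline.

```python
def twin_concordance(genotypes, twin_pairs):
    """Fraction of CNV calls that match within monozygotic twin pairs.

    genotypes: dict of sample -> list of copy numbers per site (None = missing).
    twin_pairs: list of (sample_a, sample_b) tuples.
    """
    concordant = 0
    compared = 0
    for a, b in twin_pairs:
        for call_a, call_b in zip(genotypes[a], genotypes[b]):
            if call_a is None or call_b is None:
                continue  # skip sites without a call in both twins
            compared += 1
            if call_a == call_b:
                concordant += 1
    return concordant / compared if compared else float("nan")

# Made-up example: one twin pair, four sites, one missing call.
calls = {"T1a": [2, 1, 3, None], "T1b": [2, 1, 2, 2]}
rate = twin_concordance(calls, [("T1a", "T1b")])
```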

Day 2 – Focus Session B2

Computing Geographical Access to Hospitals in Two Countries

Authors: Fabrício S P Kury, Raymonde C Uy, Jessica Faruque, Paul Fontelo, Lister Hill National Center for Biomedical Communications, National Library of Medicine, National Institutes of Health

Abstract: Geographical access to hospitals, here defined as the time it takes to drive a car from a person’s residence to the nearest hospital, has a controversial association with healthcare utilization and outcomes. In this study we demonstrate how to use hospital data, Census data, and modern online-based Geographical Information System (GIS) APIs to compute, with high precision, the percentage of the population that has geographical access to hospitals in two countries: USA and Brazil. We review the availability of data for each country, the magnitude of the computation task, and how we used cloud computing to deliver results in feasible time. We analyze the sociodemographic and economic characteristics of the served and underserved populations under several time thresholds, filter hospitals according to the types of services they provide, and correlate the size of population covered with the volume of utilization of each hospital. We demonstrate that the vast majority of the population resides very near at least one hospital, that this concentration is sharper in Brazil, and how the numbers change after filtering hospitals. We display highly detailed zoom-able maps and demonstrate how misleading their appearance might be. We conclude by reviewing prominent limitations for these analyses in the case of each country.
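
The coverage calculation described above reduces to a population-weighted tally once each residential unit has a drive time to its nearest hospital. The sketch below assumes those drive times are already available (in the study they would come from a GIS API, which is not reproduced here); the function name, input format, and numbers are hypothetical.

```python
def coverage_by_threshold(blocks, thresholds_min):
    """Percentage of population within each drive-time threshold.

    blocks: list of (population, minutes_to_nearest_hospital) tuples,
            e.g. one entry per census block (hypothetical format).
    thresholds_min: iterable of drive-time thresholds in minutes.
    """
    total = sum(population for population, _ in blocks)
    result = {}
    for threshold in thresholds_min:
        covered = sum(pop for pop, minutes in blocks if minutes <= threshold)
        result[threshold] = 100.0 * covered / total
    return result

# Made-up example with three census blocks.
blocks = [(1000, 5.0), (500, 20.0), (500, 45.0)]
coverage = coverage_by_threshold(blocks, [10, 30, 60])
```

Filtering hospitals by service type would change only the drive-time input, not this tally.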

Bursting the Information Bubble: Designing Inpatient-Centered Technology Beyond the Hospital Room 

Authors: Andrew D Miller, Ari Pollack, Wanda Pratt, University of Washington

Abstract: Although hospital care is carefully documented and electronically available, few information systems exist for patients and families to use while inpatient. We present findings from three participatory design sessions conducted with 13 former patients, their parents, and clinicians from a large children’s hospital. Participants discussed challenges they faced getting information while in the hospital, and then designed possible technological solutions. Participants created 9 designs aimed at extending parents’ access to and involvement in patients’ care.

Participants’ designs showed how information technology can allow parents and children to disseminate information from within the hospital room, access information from the hospital room remotely, establish collaborative communication with the clinical care team, and learn about their child’s care throughout the hospital stay. For example, two child participants envisioned a communicator watch that their parents would use to talk with clinicians remotely. A parent/clinician team proposed a shared calendar for parents and clinicians to use throughout the stay. Several parent-designed solutions focused on simplifying intake, reducing repetitive questions and allowing parents and children to add information proactively.

These designs show that patients and caregivers can be more than recipients of health information; they can produce, aggregate, and learn information throughout a hospital stay.

User-Centered Design and Evaluation of RxMAGIC: A System for Prescription Management and General Inventory Control for Low-Resource Settings

Authors: Arielle M Fisher, Lauren Jonkman, Gerald P Douglas, University of Pittsburgh

Abstract: The availability of healthcare services in low-resource settings is limited due to health, economic, and education disparities in underserved populations. Free clinics are critical in providing primary care and pharmaceutical services to these patients; however, they represent an understudied work environment in healthcare. In addition to service-related challenges, such as difficulty in obtaining essential medicines, free clinics are burdened with distinctive organizational challenges.

Ensuring an uninterrupted drug supply is essential to providing healthcare in these settings. Accurate information on current stock counts is necessary to minimize stockouts and wastage due to expiry. Informatics tools have tremendous potential to assist healthcare workers and enhance process efficiency if designed to support user workflow.

We developed a system for Prescription Management and General Inventory Control (RxMAGIC) at the Birmingham Free Clinic (BFC) in Pittsburgh, PA, a walk-in clinic that serves medically vulnerable populations. A mixed-methods approach was employed to identify and quantify process inefficiencies in the dispensary. RxMAGIC is a modular, problem-driven solution designed to mitigate workflow challenges and improve pharmacist efficiency by streamlining the dispensing process and improving inventory control. Although RxMAGIC was developed in the context of the BFC, we believe it may alleviate similar medication management challenges in developing countries.

Day 2 – Focus Session B3

Clinical Decision Support Anomaly Pathways

Authors: Steven Z Kassakian, David A Dorr, Oregon Health and Science University

Abstract: Clinical decision support (CDS) tools are designed to aid decision making with the ultimate goal of improving health outcomes. CDS is a central part of electronic health record (EHR) systems and has been shown to improve a multitude of outcomes. However, in some clinical practice situations, CDS may not improve outcomes and may have detrimental effects on decision making through the increasingly recognized phenomenon of alert fatigue. In many situations, the proper functioning of CDS tools is essential to providing appropriate care and their dysfunction may result in poor care and in some cases harm to patients. The tools are usually built around a complex series of logic based on variables in the EHR. Little is known regarding how to appropriately monitor and detect when CDS tools are not functioning as intended. The field of anomaly detection is focused on finding patterns in data which do not conform to historical or predicted patterns. By applying methods of anomaly detection from other domains, we are exploring the ability to detect broken CDS tools. Our preliminary results have discovered multiple CDS tools that are no longer functioning as designed. Most importantly, we are elucidating the pathways through which these CDS tools fail.
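
One simple way to picture anomaly detection for a CDS tool is to compare recent alert firing counts with a historical baseline and flag large deviations. The z-score rule below is a generic illustration, not the method used by the authors; the weekly counts, threshold, and function name are hypothetical.

```python
from statistics import mean, stdev

def flag_anomalies(history, recent, z_threshold=3.0):
    """Return indices of recent counts that deviate from the historical baseline.

    history: past alert firing counts per period (at least two values).
    recent: counts for the periods being monitored.
    """
    baseline_mean = mean(history)
    baseline_sd = stdev(history)
    flagged = []
    for index, count in enumerate(recent):
        if baseline_sd == 0:
            # A perfectly flat baseline: any change at all is a deviation.
            z = 0.0 if count == baseline_mean else float("inf")
        else:
            z = abs(count - baseline_mean) / baseline_sd
        if z > z_threshold:
            flagged.append(index)
    return flagged

# Made-up example: an alert that suddenly stops firing in the second week.
history = [100, 98, 103, 101, 99, 102, 97, 100]
flagged_weeks = flag_anomalies(history, [101, 0, 99])
```

A drop to zero firings, as in the example, is the kind of pattern that could indicate an alert whose underlying logic has silently broken.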



Medical Entity Recognition: a Meta-Learning Approach with Selective Data Augmentation 

Authors: Asma Ben Abacha and Dina Demner-Fushman, National Library of Medicine

Abstract: With the increasing number of annotated corpora for supervised medical entity recognition (MER), it is worthwhile to study the combination and augmentation of these corpora for the same annotation task. Combining annotated corpora such as clinical texts or scientific articles is a challenging task since it generally lowers the classification performance of supervised systems. We study the combination of different corpora for MER by using a meta-learning classifier that combines the results of individual conditional random fields (CRF) models trained on different corpora. We propose selective data augmentation approaches and compare them with several meta-learning algorithms and baselines. We evaluate our approach using four sub-classifiers trained on four heterogeneous corpora: i2b2, SemEval, Berkeley and NCBI. We show that despite the high disagreements between the individual CRF models on the four test corpora, our selective data augmentation approach improves performance on all test corpora and outperforms the simple combination of individual corpora. Our results confirm that the agreement between label predictions of the pairwise models is an effective metric in selecting relevant sources for data augmentation when used with reliability indicators such as the class balance of each corpus.

Untangling the Structure of High-Throughput Sequencing Data with veRitas

Authors: David M Moskowitz, William J Greenleaf, Stanford University

Abstract: High-throughput sequencing offers unprecedented power in describing genomic and epigenomic changes in biological processes, but effective interpretation requires accounting for variance associated with batches, RNA degradation, and other technical details. In this talk, I will introduce veRitas, a method combining principal component analysis with feature selection to elucidate confounding and technical artifacts. This approach additionally assesses differential expression without parametric assumptions, in contrast to existing methods, which are specific to RNA-seq.

Day 2 – Plenary Session #3

Modeling Neutral Evolution at Small Scales


Authors: Aaron Wacholder, David D Pollock, University of Colorado, Anschutz Medical Campus

Abstract: Developing precise models of neutral genomic evolution will enable sensitive detection of selection in the genome, and thus of function. A large body of research demonstrates that, at the megabase scale, neutral substitution rates are strongly dependent on genomic context, such as the recombination rate, replication timing, and chromatin structure. However, a large fraction of regional substitution rate variation occurs at much smaller scales, and the nature of this variation is largely unknown. Investigation of local substitution rate variation has been hindered because low substitution counts in small regions prevent accurate direct estimation of substitution rates.

We developed a model of substitution rates for each substitution type in 1000 bp windows across the genome, accounting for changes over time and effects at different spatial scales. Applying this model to a whole-genome alignment of the great apes, we find strong effects at all spatial scales that differ across time and among substitution types. We identify a major change in the kilobase-scale substitution process between the human-gorilla and human-chimpanzee divergence, while larger-scale substitution processes have remained relatively stable. These findings provide the starting point for a precise time and space dependent model of neutral substitution rates.
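
The raw input to a window-level analysis like the one above is a count of each substitution type per 1000 bp window. The sketch below shows only that tallying step, which also makes the low-count problem visible; it is not the authors' model. The input format (0-based position paired with a substitution type label) is an assumption.

```python
from collections import Counter

def substitution_counts(substitutions, window_bp=1000):
    """Count substitutions per (window index, substitution type).

    substitutions: iterable of (position, sub_type) pairs, with 0-based
                   positions and labels such as "C>T" (hypothetical format).
    """
    counts = Counter()
    for position, sub_type in substitutions:
        counts[(position // window_bp, sub_type)] += 1
    return counts

# Made-up example: four substitutions spanning two windows.
observed = [(10, "C>T"), (999, "C>T"), (1000, "A>G"), (1500, "C>T")]
counts = substitution_counts(observed)
```

With only a handful of events per window, direct per-window rate estimates are noisy, which is the motivation the abstract gives for a model that shares information across spatial scales.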


EHR-Wide GxE Study using Smoking Information Extracted from Clinical Notes

Authors: Travis J Osterman, Lisa Bastarache, Wei-Qi Wei, Jonathan D Mosley, Joshua C Denny, Vanderbilt University

Abstract: Genotype by environment interaction (GxE) studies provide a method to assess whether genomic and environmental effects are additive or whether there is an additional interaction. We describe here a GxE study to investigate associations between tobacco exposure and genetic risk across 105 diseases.

Patients were identified from Vanderbilt University Medical Center’s (VUMC) de-identified DNA biobank (BioVU), which is linked to electronic health record data. Approximately 15,000 individuals with exome array data were selected for this analysis. Tobacco exposure was ascertained by a novel natural language processing algorithm. Phenotypes were determined by International Classification of Diseases, Ninth Revision (ICD-9) codes.

We analyzed 1750 SNP-phenotype pairs previously reported in the NHGRI catalog. To test for smoking x SNP interaction, we used a logistic regression with age, gender, pack years, SNP, and pack years x SNP terms. We calculated p-values for the smoking x SNP interaction term, controlling for the remaining covariates.
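
Written out, the interaction model described above takes roughly the following form for each SNP-phenotype pair, where Y is the phenotype indicator, PY is pack years, and G is the SNP genotype. The symbol names and the exact genotype coding are assumptions, since the abstract does not state them.

```latex
\[
\operatorname{logit} P(Y = 1)
  = \beta_0 + \beta_1\,\mathrm{age} + \beta_2\,\mathrm{gender}
  + \beta_3\,\mathrm{PY} + \beta_4\,G + \beta_5\,(\mathrm{PY} \times G)
\]
\[
H_0 : \beta_5 = 0
\]
```

The reported interaction p-value corresponds to the test of the null hypothesis on the last coefficient, with the other terms acting as covariates.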

Smoking was strongly associated with a number of expected phenotypes such as lung cancer. The SNP x smoking interaction p-value was <0.05 for 57 SNP-phenotype pairs. Evidence of interaction was seen in several cancers, including lung, breast, and prostate cancer. Three cardiovascular phenotypes demonstrated interaction: Ischemic heart disease, hypertension, and aortic aneurysm.

High-Throughput Machine Learning from Electronic Health Records

Authors: Ross Kleiman1,2,†, Paul Bennett1,2,†, Peggy Peissig3, Zhaobin Kuang1, James Linneman3, Scott Hebbring3, Michael Caldwell3, David Page1,2

1Department of Computer Sciences, University of Wisconsin, Madison, 2Computation and Informatics in Biology and Medicine, 3Marshfield Clinic, Marshfield, WI

† Co-first author

Abstract: The use of Electronic Health Record (EHR) systems has increased dramatically in recent years. This vast digitization of medical data allows for new ways to predict diseases that were not possible with paper charts. While prior work has focused on predicting individual diseases, our research builds thousands of models to predict nearly every diagnosis (ICD-9 code) a patient could receive. This high-throughput machine learning approach yields inference on the health landscape of both individual patients and patient populations. Integral in our approach is the use of a dynamic control matching scheme that, for each diagnosis, automatically selects appropriate case and control patients using minimal hand tuning. Across the nearly 4,000 models, we observe a mean AUC of 0.8026±0.0619 predicting 1 month prior to diagnosis, and a mean AUC of 0.7585±0.0631 predicting 6 months prior to diagnosis. Furthermore, we break down our results across 15 major disease categories including pregnancy complications and diseases of the circulatory system. This work opens a potential pathway to pan-diagnostic decision support. Instead of only targeting a small number of well understood diseases, this research shows machine learning techniques can be used to help predict the broad spectrum of diagnoses a patient may receive.
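
For readers unfamiliar with the AUC figures quoted above, the sketch below computes AUC for a single model as the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as one half. It is a plain pairwise illustration with made-up scores, not the authors' evaluation code.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of cases and controls.

    labels: 1 for case, 0 for control.
    scores: predicted risk, higher meaning more likely to be a case.
    """
    case_scores = [s for y, s in zip(labels, scores) if y == 1]
    control_scores = [s for y, s in zip(labels, scores) if y == 0]
    if not case_scores or not control_scores:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5  # ties count as half a win
    return wins / (len(case_scores) * len(control_scores))

# Made-up example: two cases and two controls.
example_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

In a high-throughput setting, a function like this would be applied once per diagnosis model and the resulting values summarized by mean and standard deviation.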

Comparison of Variant Annotation Tool Terminology using the Sequence Ontology

Authors: Nicole Ruiz-Schultz, Barry Moore, Shawn Rynearson, Karen Eilbeck, University of Utah


Abstract: Analysis of next-generation sequencing (NGS) data involves multiple steps including base calling, quality assessment, read alignment, variant calling, variant annotation, and variant prioritization. Variant annotation is the step in sequence data analysis that determines the effect of a sequence variant with regard to the features of a reference sequence. Many open source and commercial tools are available to perform this step, with differing sets of effects annotated and differing terminology used. These differences can make comparing variant annotations from different tools challenging and, in some cases, a one-to-one comparison cannot be made. The goal of this project was to present a comparison of terms used by variant annotation tools, utilizing the Sequence Ontology to map between terms.

Terms from VAAST, VEP, ANNOVAR, Jannovar, SeattleSeq, SnpEff and VAT were mapped to the Sequence Ontology (SO) if the tool was not already using its terminology. Prior to the start of this project, VAAST and VEP used SO terms. SnpEff and Jannovar adopted SO usage during the project. We will present the scope of annotation for each tool and the concordance and discordance between the terms. SO is increasingly used to standardize terms from currently available variant annotation tools so that results can be easily compared.

Constructing a Biomedical Relationship Database from Literature using DeepDive

Authors: Emily K Mallory, Ce Zhang, Christopher Re, Russ B Altman, Stanford University

Abstract: A complete repository of biomedical relationships is key for understanding cellular processes, human disease and drug response. After decades of experimental research, the majority of the discovered biomedical relationships exist solely in textual form in the literature. While curated databases have experts manually annotate relevant relationships or interactions from text, these databases struggle to keep up with the exponential growth of the literature. DeepDive is a trained system for extracting information from a variety of sources, including text. In this work, we developed multiple entity and relationship application tasks to extract biomedical relationships from full text articles. Each relationship extractor identified candidate relations using co-occurring entities within an input sentence. Using a set of generic feature patterns, DeepDive computed a probability that an individual candidate relation was a true relationship based on the sentence. For extracting gene-gene relationships, our system achieved 76% precision and 49% recall in extracting direct and indirect interactions. For randomly curated extractions, the system achieved between 62% and 83% precision based on direct or indirect interactions, as well as sentence-level and document-level precision. In addition, we developed extractors for gene-disease and gene-drug relationships. This work represents the first application of DeepDive to the biomedical domain.
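
Two pieces of the pipeline described above are easy to sketch in isolation: generating candidate relations from entities that co-occur in a sentence, and scoring extractions with precision and recall. The code below is a toy illustration that does not use DeepDive or its API; the whitespace tokenization, gene lexicon, and gold set are hypothetical.

```python
from itertools import combinations

def candidate_pairs(sentence_tokens, gene_lexicon):
    """Return every unordered pair of distinct genes mentioned in one sentence."""
    genes = sorted({token for token in sentence_tokens if token in gene_lexicon})
    return list(combinations(genes, 2))

def precision_recall(predicted, gold):
    """Precision and recall of predicted relations against a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Made-up example sentence and lexicon.
tokens = "BRCA1 interacts with BARD1 and TP53".split()
lexicon = {"BRCA1", "BARD1", "TP53"}
candidates = candidate_pairs(tokens, lexicon)
precision, recall = precision_recall(candidates, {("BARD1", "BRCA1")})
```

Treating every co-occurring pair as a relation gives high recall but low precision, which is why the system described above assigns each candidate a probability before accepting it.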