Big data approaches for analyzing genomic intervals

Computational Biology; Genomics; Bioinformatics; Data Storage Systems; Data Science

Genomic interval data is frequently siloed, making cross-database analysis difficult. In this project, we will develop data models and distributed computing techniques to enable integrated analysis across data sources.

Over the past decade, advances in DNA sequencing technology have driven a dramatic increase in the amount of genomic data. A major part of this increase is the rapid expansion of epigenomic data, in which DNA sequencing is used to measure quantitative signals about different parts of the genome. These epigenomic signals vary widely from person to person and are influenced by many factors, including age, cell type, ethnicity, disease, and environment. This data is a key resource for understanding human disease.

Unfortunately, epigenome data is produced by hundreds of independent labs and consortia and is typically published in disconnected locations and formats. Methods developers and data analysts must therefore identify, curate, and integrate multiple data sources themselves, which leads to repeated effort and incomplete resources, and ultimately complicates and weakens genome-based analysis. A way to connect data from multiple labs and projects across data sources would clearly be valuable.

To solve these challenges, we will develop a way to integrate diverse sources of genomic data. We will first develop novel data models for genomic region data. By borrowing proven ideas from the semantic web, we will simplify and automate the process of collating many sources and types of data. Instead of modeling sets of regions as units, we will model each region independently, creating a graph of relationships and metadata among genomic regions. This will facilitate genomic region-based analysis, including defining highly specific region sets, discovering novel region sets, and connecting regions both within and between region sets.
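As a rough illustration of the region-centric model described above, the following Python sketch stores each region as its own node and records metadata and region-to-region relationships as subject-predicate-object triples, in the spirit of the semantic web. It is a minimal sketch under stated assumptions, not the project's data model: the `Region` and `RegionGraph` classes, the predicate names (`cellType`, `assay`, `overlaps`), and the example coordinates are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A single genomic region (hypothetical model; half-open coordinates)."""
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


class RegionGraph:
    """Toy triple store: (subject, predicate, object) facts about regions."""

    def __init__(self):
        # Index triples by predicate so lookups scan only relevant facts.
        self._by_predicate = defaultdict(set)

    def add(self, subject, predicate, obj):
        self._by_predicate[predicate].add((subject, predicate, obj))

    def subjects(self, predicate, obj):
        """All subjects that have the given predicate pointing at obj."""
        return {s for (s, _, o) in self._by_predicate[predicate] if o == obj}

    def objects(self, subject, predicate):
        """All objects the given subject points at through predicate."""
        return {o for (s, _, o) in self._by_predicate[predicate] if s == subject}


# Illustrative data only; predicate names and values are assumptions.
graph = RegionGraph()
r1 = Region("chr1", 1000, 1500)
r2 = Region("chr1", 1400, 2000)
r3 = Region("chr2", 500, 900)

graph.add(r1, "cellType", "T-cell")
graph.add(r1, "assay", "ATAC-seq")
graph.add(r2, "cellType", "T-cell")
graph.add(r2, "assay", "ChIP-seq")
graph.add(r3, "cellType", "hepatocyte")
graph.add(r3, "assay", "ATAC-seq")
graph.add(r1, "overlaps", r2)  # a region-to-region relationship

# Define a highly specific region set by intersecting metadata queries.
t_cell_atac = graph.subjects("cellType", "T-cell") & graph.subjects("assay", "ATAC-seq")
```

Because metadata attaches to individual regions rather than to whole files, a region set such as `t_cell_atac` is defined by a query instead of being fixed in advance by how the source data was published.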

We will then automate the identification and quality control of genomic region data. We will develop a system that mines the public literature for studies that annotate genomic regions and automatically collects, annotates, and stores these regions. We will test advanced unsupervised machine learning methods and novel similarity measures to better understand relationships among individual regions and region sets, and we will use these methods to ensure the quality of the data in the database. Based on the established data models and the data we curate, we will then explore the design of a distributed filesystem, formed by servers at geographically distributed labs, for efficient and easy data querying. We will apply cutting-edge distributed filesystems and cluster processing engines (such as Apache Hadoop and Spark) to implement the system and to support fast analysis of our genomic interval data.
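To make the idea of a similarity measure between region sets concrete, the sketch below computes a base-pair Jaccard index: the number of bases covered by both sets divided by the number covered by either. This is a standard baseline shown for illustration only, not one of the novel measures the project proposes. It assumes both sets lie on a single chromosome and use half-open (start, end) coordinates; the function names and example intervals are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals into a sorted list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered_bases(intervals):
    """Number of distinct bases covered by a set of intervals."""
    return sum(end - start for start, end in merge_intervals(intervals))


def intersection_bases(a, b):
    """Number of bases covered by both interval sets (two-pointer sweep)."""
    a, b = merge_intervals(a), merge_intervals(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def jaccard(a, b):
    """Base-pair Jaccard index; defined as 0.0 when both sets are empty."""
    inter = intersection_bases(a, b)
    union = covered_bases(a) + covered_bases(b) - inter
    return inter / union if union else 0.0


# Illustrative region sets on one chromosome (made-up coordinates).
set_a = [(100, 200), (300, 400)]
set_b = [(150, 250), (300, 350)]
score = jaccard(set_a, set_b)  # 100 shared bases / 250 covered bases = 0.4
```

A pairwise score like this could feed an unsupervised method such as clustering, so that a newly collected region set that resembles none of its expected neighbours is flagged for quality review.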

Ultimately, our project will enable biomedical data analysts to more efficiently make use of epigenome data, even if it is distributed across many physical locations and computing environments.

Desired outcomes

1. We will develop vocabularies and data models for genomic interval data.

2. We will employ novel computational approaches to automate data retrieval, exploration, and quality control.

3. We will develop methods for fast queries on this data even when distributed across remote databases.