Predicting genomic regulatory elements across species using domain adaptive neural networks (Faculty/Rising Researcher Collaboration Opportunity) - PSU Institute for Computational and Data Sciences

Predicting genomic regulatory elements across species using domain adaptive neural networks

PI: Shaun Mahony (Biochemistry & Molecular Biology)

Plan for funding tuition for graduate students, or the remainder of the researcher’s salary for postdoc and research faculty: Requesting two semesters at 50% RA. Tuition support can be provided from PI unrestricted funds.

Every cell in our body contains a copy of the same DNA genome, but different cell types achieve their own particular biological functions and behaviors by producing different sets of RNA molecules and proteins. The first step in gene regulation is controlled by proteins called transcription factors, which recognize specific DNA patterns on the genome and recruit molecular machinery to turn nearby genes on or off (eventually determining which RNAs and proteins are created). Thus, the function and identity of a cell are determined by which combination of transcription factors is active in that cell type. The architecture of gene regulatory networks – i.e., the transcription factors that are active in a given cell type and the DNA patterns that they recognize – is highly conserved across closely related species. In other words, the regulatory patterns that determine whether a gene is expressed in human livers are very similar to the regulatory patterns that determine whether a gene is expressed in mouse livers. This observation raises an interesting question: rather than trying to experimentally characterize gene regulatory codes in every species separately, could we train a computational model on experimental data collected in one species and then use that model to accurately predict what the same experiment would look like in other species? Such cross-species regulatory models would open the possibility of studying gene regulation in cell types that are difficult to study directly in humans, and they would allow the study of cell types in agricultural and other species of interest without the need for costly experiments.

We and others have demonstrated that convolutional and transformer-based neural networks are highly effective at learning gene regulatory codes from experimental data. However, current approaches are not fully effective at cross-species predictions; models trained on one species and tested on another consistently underperform models trained and tested in the same species. In previous work, we demonstrated that this performance gap is due to a domain shift that exists between genomes from different species (Cochran et al., Genome Res, 2022). We further showed that a simple domain adaptation approach closed some of the performance gap, but issues remained.

In this project, our goal is to implement additional domain adaptation strategies to enable accurate cross-species gene regulatory predictions. We are particularly interested in the multi-source training scenario, where we have labeled training data from multiple genomes/domains. We wish to test several recent approaches for training domain adaptive neural networks, including those based on moment matching and Wasserstein distance guided representations. Our ultimate goal is to train cross-species models that can accurately predict gene regulatory features across hundreds of vertebrate genomes, thereby enabling the study of how regulatory networks and cellular function evolve across species.
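As a purely illustrative sketch (not the project's implementation), the two families of alignment objectives named above can be reduced to small functions. The first measures how far apart the feature statistics of several source domains are, using raw first and second moments; the second computes the Wasserstein-1 distance between two equal-size one-dimensional samples. Published Wasserstein-guided methods typically estimate this distance on learned, high-dimensional representations with a trained critic network; the sorted-sample formula below only illustrates the quantity being minimized. All function names, species labels, and feature values here are hypothetical.

```python
from itertools import combinations
from statistics import fmean


def feature_moments(features, order):
    """Per-dimension raw moment E[x**order] over a batch of feature vectors."""
    dims = len(features[0])
    return [fmean(row[d] ** order for row in features) for d in range(dims)]


def moment_matching_loss(domains, max_order=2):
    """Average squared gap between raw moments over all pairs of domains.

    `domains` maps a (hypothetical) species name to a batch of feature
    vectors, e.g. the hidden-layer activations of a shared network.
    """
    pairs = list(combinations(domains, 2))
    total = 0.0
    for order in range(1, max_order + 1):
        moments = {
            name: feature_moments(feats, order)
            for name, feats in domains.items()
        }
        for a, b in pairs:
            total += sum((x - y) ** 2 for x, y in zip(moments[a], moments[b]))
    return total / len(pairs)


def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D samples.

    For equal-size samples, the optimal transport plan pairs the sorted values.
    """
    if len(xs) != len(ys):
        raise ValueError("samples must have the same size")
    return fmean(abs(x - y) for x, y in zip(sorted(xs), sorted(ys)))


# Toy feature batches for two hypothetical source species.
human = [[0.0, 1.0], [2.0, 3.0]]
mouse = [[1.0, 1.0], [3.0, 3.0]]

aligned_loss = moment_matching_loss({"human": human, "mouse": human})
shifted_loss = moment_matching_loss({"human": human, "mouse": mouse})
shift_distance = wasserstein_1d([0.0, 1.0, 2.0], [1.0, 2.0, 3.0])
```

In a training loop, a term of this kind would usually be added to the supervised prediction loss with a weighting coefficient, so that the network is rewarded both for predicting labels and for producing features whose distributions agree across species.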

Computational skills: Experience with implementing neural networks required (ideally PyTorch or TensorFlow experience). Experience in bioinformatics preferred.

Objectives and goals: The Rising Researcher will research and implement domain adaptive neural network approaches that would be appropriate for the gene regulatory prediction setting. They will integrate these approaches with an existing convolutional neural network approach for cross-species regulatory feature prediction and test accuracy across species where we have ground truth knowledge. The medium-term goal for the PI is to use preliminary data generated by the junior researcher as the basis for an NSF grant application (in response to Dear Colleague Letter NSF 24-131: https://www.nsf.gov/funding/opportunities/dcl-advancing-research-intersection-biology-artificial). The medium-term goal for the junior researcher is to write a manuscript describing the results of their investigations, in collaboration with the PI.

Connection of the project to ICDS’s mission: This project directly supports ICDS’s mission by catalyzing an interdisciplinary research team that will synthesize domain expertise with advanced machine learning approaches to address research questions of fundamental importance in molecular biology.

PI engagement with ICDS: The PI was an ICDS Affiliate Member from the inception of the program in 2017 until the program was disbanded this year, and served on the ICS Coordinating Committee (2018–2020). The PI welcomes additional future engagement with the ICDS in all roles.