Interpreting the biological concepts learned by neural networks in genomic predictive tasks (Faculty/Junior Researcher Collaboration Opportunity) - PSU Institute for Computational and Data Sciences

Interpreting the biological concepts learned by neural networks in genomic predictive tasks

PI: Shaun Mahony (Biochemistry & Molecular Biology)

Requesting two semesters at 50% RA. Tuition support can be provided from PI unrestricted funds.

Project Summary:

Neural networks have become widely adopted across a broad range of predictive tasks in genomics and computational biology. Convolutional neural networks and transformer-based language modeling approaches are highly effective at recognizing combinations of subtle DNA sequence features that play roles in particular biological processes, and they have an unparalleled ability to integrate diverse experimental measurements into the feature space alongside DNA sequences. These approaches are now preferred for many genomics tasks, including predicting gene regulatory sequences, predicting the effects of individual-specific genomic variants, predicting and generating sequences that control genes in specific cell types, and imputing experimental data across cell types and species. However, direct interpretation of how and why neural network models make specific predictions is limited by their extremely high complexity. Model interpretation is particularly important in many biological settings: we often care more about what features the model has learned than about the results of individual predictions, because our underlying motivation is understanding biological mechanistic principles. In addition, the lack of model interpretability limits the clinical adoption of neural network techniques for genomic predictions, because it is difficult to trust or verify the model predictions.

Some feature attribution techniques have been developed for genomic neural networks, but they have several limitations. For example, DeepLIFT is a widely used feature attribution approach that gives per-nucleotide attributions for a given DNA-based predictive task, and a technique called TF-MoDISco can summarize those attributions across the genome to provide interpretable “motif” features. However, these approaches cannot be applied to non-DNA experimental features, and they cannot be applied to the k-mer based language models that underlie a growing number of “foundation models” for biology.

In this project, we aim to develop an alternative approach for feature interpretation in genomic neural networks. Our goal is to implement and test the Testing with Concept Activation Vectors (TCAV) approach (Kim et al., 2018) to assess how genomic “concepts” are used by neural networks, as opposed to focusing on individual DNA element features. While concepts could be individual DNA motif features, they can also be more diffusely defined ideas such as “promoter elements”, “repetitive elements”, or “elements active in cell type X”. We hypothesize that concept activation vectors will provide biologists with a more intuitive explanation of how genomic neural networks are forming their predictions across a wide variety of tasks. We also note that this approach can be applied in many settings where other feature attribution methods cannot, including to k-mer based language models.
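As a rough illustration of the concept activation vector idea, the sketch below fits a linear boundary between hidden-layer activations for "concept" examples and for random examples, then measures how often a model output increases along that direction. It is a minimal sketch under stated assumptions, not the project's implementation: the activations are synthetic, the function names `fit_cav` and `tcav_score` are hypothetical, and a toy `tanh` output head with an analytic gradient stands in for a trained genomic network, where gradients would normally come from automatic differentiation in PyTorch or TensorFlow. The statistical testing against multiple random example sets described by Kim et al. is not shown.

```python
import numpy as np


def fit_cav(concept_acts, random_acts, steps=500, lr=0.1):
    """Fit a logistic-regression boundary between concept and random
    activations and return its unit-normalized weight vector (the CAV).
    Hypothetical helper for illustration only."""
    X = np.vstack([concept_acts, random_acts])
    y = np.concatenate([np.ones(len(concept_acts)), np.zeros(len(random_acts))])
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(concept)
        w -= lr * (X.T @ (p - y)) / len(y)       # gradient step on weights
        b -= lr * np.mean(p - y)                 # gradient step on bias
    return w / np.linalg.norm(w)


def tcav_score(output_grads, cav):
    """Fraction of inputs whose output gradient (taken with respect to the
    layer activations) has a positive directional derivative along the CAV."""
    return float(np.mean(output_grads @ cav > 0))


# Synthetic activations (assumption): the "concept" shifts dimension 0.
rng = np.random.default_rng(0)
dim = 8
concept_direction = np.zeros(dim)
concept_direction[0] = 1.0
concept_acts = rng.normal(size=(200, dim)) + 2.0 * concept_direction
random_acts = rng.normal(size=(200, dim))
cav = fit_cav(concept_acts, random_acts)

# Toy output head (assumption): output = tanh(acts @ head_w), so the gradient
# with respect to the activations is (1 - output**2) * head_w.
head_w = np.array([1.5, 0.2, -0.1, 0.0, 0.0, 0.0, 0.0, 0.0])
test_acts = rng.normal(size=(100, dim))
output = np.tanh(test_acts @ head_w)
output_grads = (1.0 - output**2)[:, None] * head_w[None, :]

score = tcav_score(output_grads, cav)
print(f"TCAV score for the synthetic concept: {score:.2f}")
```

In a genomic setting, the concept examples might be activations for sequences annotated as promoters or repetitive elements, and the random examples activations for background sequences; a score near 1 or near 0 would suggest the model output is consistently sensitive to that concept at the chosen layer.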

Computational skills: Experience with implementing neural networks required (ideally PyTorch or TensorFlow experience). Experience in bioinformatics preferred.

Objectives and goals: The junior researcher will research and implement a Testing with Concept Activation Vectors approach that will be appropriate for genomic neural networks. They will test these approaches to measure concept activations across a wide variety of existing trained neural networks. The medium-term goal for the PI is to use preliminary data generated by the junior researcher for inclusion in an NIH R35 grant renewal application. The medium-term goal for the junior researcher is to write a manuscript describing the results of their investigations, in collaboration with the PI.

Connection of the project to ICDS’s mission: This project directly supports ICDS’s mission by catalyzing an interdisciplinary research team that will synthesize domain expertise with advanced machine learning approaches to address research questions of fundamental importance in molecular biology.

PI engagement with ICDS: The PI was an ICDS Affiliate Member from the inception of the program in 2017 until the program was disbanded this year, and served on the ICS Coordinating Committee (2018-2020). The PI welcomes additional future engagement with the ICDS in all roles.