Machine Learning for Health Risk Prediction from Longitudinal Health Data (Faculty/Rising Researcher Collaboration Opportunity) - PSU Institute for Computational and Data Sciences

Machine Learning for Health Risk Prediction from Longitudinal Health Data

PI: Vasant Honavar (Informatics and Intelligent Systems)

Proposal Description:

The objective of this project is to develop and evaluate machine learning frameworks for health risk prediction that explicitly account for the temporal dynamics of electronic health records (EHRs). Although traditional models are simpler to understand, they often rely on linear combinations of a few known risk factors while ignoring the underlying temporal patterns, nonlinear interactions among variables, and variations in patient health trajectories over time. This raises the question

There are four primary challenges with temporal EHR data:

• Irregularity in time intervals between clinical events and measurements

• Sparsity due to missing or infrequent measurements

• Heterogeneity of data types (e.g., labs, medications, diagnoses)

• Opacity in model interpretability, especially in high-stakes settings
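To make the first two challenges concrete, the following minimal sketch shows one common way to encode an irregularly sampled, sparse patient record: a value matrix, an observation mask, and a per-variable time gap since the last measurement. The function name `encode_visits`, the variable names (`sbp`, `ldl`), and the example values are hypothetical illustrations, not part of the project's actual pipeline or data.

```python
from math import nan

def encode_visits(visits, variables):
    """Encode a time-sorted list of (day, {variable: value}) visits.

    Returns three parallel lists of rows (one row per visit):
      values: the measurement, or NaN when it was not taken at that visit
      mask:   1 if the variable was observed at that visit, else 0
      delta:  days since the variable was last observed (or since the
              first visit, if it has not been observed yet)
    """
    values, mask, delta = [], [], []
    if not visits:
        return values, mask, delta
    first_day = visits[0][0]
    last_seen = {v: first_day for v in variables}
    for day, obs in visits:
        row_v, row_m, row_d = [], [], []
        for v in variables:
            observed = v in obs
            row_v.append(obs[v] if observed else nan)
            row_m.append(1 if observed else 0)
            row_d.append(day - last_seen[v])
            if observed:
                last_seen[v] = day
        values.append(row_v)
        mask.append(row_m)
        delta.append(row_d)
    return values, mask, delta

# Hypothetical record: visits at days 0, 30, and 210 (irregular spacing),
# with each lab missing at one of the visits (sparsity).
example_visits = [
    (0, {"sbp": 140.0, "ldl": 130.0}),
    (30, {"sbp": 135.0}),
    (210, {"ldl": 110.0}),
]
values, mask, delta = encode_visits(example_visits, ["sbp", "ldl"])
```

The mask and time-gap rows let a downstream model distinguish "not measured" from "measured and normal", and let it see how stale each value is, rather than assuming evenly spaced and fully observed inputs.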

Although recent advances in deep learning (DL) can address key challenges in EHR data, no single DL model simultaneously overcomes all of the challenges listed above. This proposal aims to fill that gap by integrating recent advances in predictive modeling from sparsely and irregularly sampled longitudinal data into state-of-the-art risk models. The resulting models will be applied to real-world cardiovascular disease risk prediction in collaboration with clinical researchers.

The research aims to address the following questions:

RQ1: How do different modeling strategies (aggregation vs. longitudinal modeling of varying degrees of sophistication) compare in predicting health risks from longitudinal health data?
RQ2: Can we develop a unified model that addresses irregularity, sparsity, heterogeneity, and opacity?
RQ3: How do the resulting models perform across diverse subgroups, e.g., in multi-site studies?
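As an illustration of the aggregation strategy named in RQ1, the sketch below collapses a patient's visit history into a few per-variable summary statistics (count, mean, last value). The function name `aggregate_features`, the choice of summaries, and the example record are assumptions made for illustration; the baselines actually used in the project may differ.

```python
from statistics import mean

def aggregate_features(visits, variables):
    """Collapse time-sorted (day, {variable: value}) visits into a flat
    feature dictionary, discarding when each measurement was taken."""
    feats = {}
    for v in variables:
        observed = [obs[v] for _, obs in visits if v in obs]
        feats[f"{v}_count"] = len(observed)
        feats[f"{v}_mean"] = mean(observed) if observed else None
        feats[f"{v}_last"] = observed[-1] if observed else None
    return feats

# Hypothetical record with irregular visit times and missing labs.
record = [
    (0, {"sbp": 140.0, "ldl": 130.0}),
    (30, {"sbp": 135.0}),
    (210, {"ldl": 110.0}),
]
baseline_features = aggregate_features(record, ["sbp", "ldl"])
```

Features of this kind are easy to feed into a standard classifier, but they throw away the ordering and spacing of events, which is the information a longitudinal model is meant to exploit.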

Project Objectives:

1. Build and benchmark a family of increasingly sophisticated health risk prediction models using real EHR data

2. Quantify model gains over aggregation-based baselines

3. Evaluate generalizability across external datasets and patient subgroups
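One simple way to approach the third objective is to report a discrimination metric separately for each site or patient subgroup. The sketch below computes the area under the ROC curve (AUROC) per group using its pairwise-ranking definition. The metric choice, the function names, the group labels (`site_A`, `site_B`), and the toy labels and scores are all assumptions for illustration, not results or methods from the project.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs in which the
    positive case gets the higher score; ties count as half.
    Returns None when a group lacks positives or negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

def auroc_by_group(labels, scores, groups):
    """Compute AUROC separately for each distinct group label."""
    result = {}
    for g in sorted(set(groups)):
        idx = [i for i, x in enumerate(groups) if x == g]
        result[g] = auroc([labels[i] for i in idx], [scores[i] for i in idx])
    return result

# Toy data: outcome labels, predicted risks, and a hypothetical site label.
y_true = [1, 0, 1, 0, 1, 0]
y_score = [0.9, 0.2, 0.4, 0.6, 0.5, 0.1]
site = ["site_A", "site_A", "site_A", "site_B", "site_B", "site_B"]
per_site = auroc_by_group(y_true, y_score, site)
```

A large gap between groups in such a breakdown would be a signal to investigate further; with small subgroups, confidence intervals would be needed before drawing conclusions.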

Long-Term Goal:

To establish a unified, modular framework for temporal clinical modeling that is generalizable across datasets, interpretable for clinicians, and adaptable to other domains of risk prediction.

Connection to ICDS Mission:

This project directly supports the ICDS mission by advancing core computational methods in machine learning and applying them to improve real-world health outcomes and decision-making.

Ideal student background:

Deep knowledge of machine learning, especially for predictive modeling from high-dimensional longitudinal data, and familiarity with electronic health records data.