Inference on Multivariate Gaussian Processes via Deep Learning Neural Networks for Astronomical Time Series Data Analysis (Faculty/Junior Researcher Collaboration Opportunity)


Multivariate time series data are commonly encountered in astronomy, where the brightness of a particular object is monitored over time using several optical filters or bands. For example, the upcoming large-scale Vera C. Rubin Observatory survey will monitor billions of objects in the sky using six different filters, producing six time series data sets for each astronomical object. Figure 1 shows a realistic five-band time series data set of a quasar observed by the Sloan Digital Sky Survey. When hunting for exoplanets, astronomers collect time series of high-resolution spectroscopic observations, measuring the position and shape of thousands of spectral lines at each observation time. Traditionally, astronomers analyze the position and shape of the cross-correlation function (CCF) of the spectrum with a “mask” (which you can think of as an “average” spectral line). Recent work has begun considering how to make use of information from all lines (or groups of lines that behave similarly) before averaging or computing CCFs.

While astronomers have access to many advanced methods and software packages for analyzing single-band (univariate) time series data, relatively few tools are available for modeling the cross-correlations across the multiple time series collected for each object.

It is conceptually straightforward to model cross-correlations within a Gaussian process framework. For instance, multiple time series can be vectorized by stacking the band-wise series into a single vector. The corresponding multivariate Gaussian distribution is then defined by a mean vector and covariance matrix that capture cross-band correlations (see Figure 2). However, this leads to a large, dense covariance matrix, creating a significant computational burden: evaluating the Gaussian process likelihood requires matrix inversion, which incurs a computational cost of O(n³), where n is the length of the combined observation vector. As a result, exact likelihood evaluations become infeasible as the size of the data grows (e.g., as telescopes observe target stars/galaxies over time). Current exoplanet surveys have begun collecting hundreds of observations of each target star, and the ability to use information from many lines depends on developing scalable methods for multi-output time series. Likewise, the Legacy Survey of Space and Time (LSST) recently began all-sky testing and is poised to survey the entire southern sky every three nights for ten years; scalable methods are essential to model even a small fraction of the tens of billions of objects it will survey.
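As a concrete illustration of where this cost arises, here is a minimal NumPy/SciPy sketch (not project code; the separable cross-band kernel and all sizes are illustrative assumptions) of an exact stacked-vector GP log-likelihood, in which the Cholesky factorization of the dense n × n covariance matrix is the O(n³) bottleneck:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def multiband_gp_loglik(points, y, kernel):
    """Exact log-likelihood of a zero-mean GP for a stacked multi-band
    observation vector y; `points` holds (time, band) pairs and `kernel`
    is any cross-band covariance function (hypothetical interface)."""
    n = len(y)
    # Dense covariance over all epochs and bands: n = (epochs) x (bands).
    K = np.array([[kernel(p, q) for q in points] for p in points])
    K[np.diag_indices(n)] += 1e-8                  # jitter for stability
    L, low = cho_factor(K, lower=True)             # O(n^3) bottleneck
    alpha = cho_solve((L, low), y)
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))           # -0.5 * log|K|
            - 0.5 * n * np.log(2.0 * np.pi))

# Toy example: 2 bands x 50 epochs, separable kernel B[b,b'] * exp(-|t-t'|/5).
B = np.array([[1.0, 0.6], [0.6, 1.0]])             # hypothetical band covariance
kernel = lambda p, q: B[p[1], q[1]] * np.exp(-abs(p[0] - q[0]) / 5.0)
points = [(t, b) for b in range(2) for t in np.linspace(0.0, 10.0, 50)]
y = np.random.default_rng(0).standard_normal(len(points))
print(multiband_gp_loglik(points, y, kernel))
```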

Two notable scalable approaches to mitigating this cost are the Vecchia approximation and Kalman filtering. The Vecchia approximation reduces computational complexity by approximating the high-dimensional joint distribution with a product of conditional distributions [1], bringing the cost down to approximately O(n log n) or better. However, this approach assumes conditional independence, which may not hold for astronomical multi-band time series data. In contrast, the Kalman-filtering approach allows exact likelihood evaluation with linear complexity O(n) [2], but it is only applicable to certain classes of Gaussian processes, such as continuous-time autoregressive moving average processes based on stochastic differential equations.
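For reference, in a common formulation of the Vecchia approximation (our notation, conditioning each observation on a small set of earlier neighbors), the joint density is replaced by a product of low-dimensional conditionals:

$$
\log p(y_1, \ldots, y_n) \;\approx\; \sum_{i=1}^{n} \log p\!\left(y_i \mid y_{c(i)}\right), \qquad c(i) \subset \{1, \ldots, i-1\},
$$

so each term involves only a |c(i)| × |c(i)| covariance matrix. Taking c(i) to be the full history recovers the exact likelihood; the conditional-independence assumption enters precisely through truncating c(i).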

Given these limitations, we propose a project that leverages deep learning neural networks to approximate the likelihood function of Gaussian processes for multi-band time series data. This approach is promising because it does not require evaluating the likelihood function directly, provided that simulated datasets can be generated from the Gaussian process model [3, 4, 5]. The goal of this project is to evaluate and identify the strengths and weaknesses of the deep learning approach. Ideally, we would make comparisons to the Vecchia approximation and Kalman filtering; however, the time required to perform such comparisons will likely depend on the prior expertise of the junior researcher.
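One standard way a classifier-based scheme can recover the likelihood (a common "likelihood-ratio trick" from simulation-based inference; the notation here is ours, not taken from [3, 4, 5]): train a binary classifier d(y, θ) to distinguish series simulated from p(y | θ) (label 1) from series drawn from a reference distribution p_ref(y) (label 0) with equal class priors. The Bayes-optimal classifier satisfies

$$
d^{*}(y, \theta) = \frac{p(y \mid \theta)}{p(y \mid \theta) + p_{\mathrm{ref}}(y)}
\quad\Longrightarrow\quad
\frac{p(y \mid \theta)}{p_{\mathrm{ref}}(y)} = \frac{d^{*}(y, \theta)}{1 - d^{*}(y, \theta)},
$$

so a well-trained classifier yields the likelihood up to the reference density, which cancels in inference over θ, without ever forming or inverting the n × n covariance matrix.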

Expertise and Skill Sets of Interest: This project requires (1) proficiency in Python and/or Julia, (2) experience using PyTorch, TensorFlow, and/or JAX to implement recurrent neural networks (e.g., LSTM or GRU) or transformer-based models, and (3) knowledge of high-performance computing environments, such as ICDS Roar Collab.

Expectations and Tasks: We seek a post-comps graduate student or a postdoc with training in Statistics, Astronomy & Astrophysics, or a related field (e.g., Applied Math, Computer Science, IST, Physics). Responsibilities include: (1) generating simulated datasets from a given Gaussian process model, (2) fitting recurrent or transformer-based neural network models to each simulated time series dataset to extract feature vectors, and (3) using those feature vectors in a binary neural network classifier with labeled training data to approximate the likelihood function.
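To make tasks (2) and (3) concrete, below is a minimal PyTorch sketch of one plausible architecture (all names, layer sizes, and the θ-shuffling construction of label-0 examples are illustrative assumptions, not a prescribed design): a GRU encodes each simulated multi-band series into a feature vector, and a small classifier head scores (features, parameter) pairs so that its logit approximates a log-likelihood ratio.

```python
import torch
import torch.nn as nn

class GRULikelihoodClassifier(nn.Module):
    """Sketch of tasks (2)-(3): a GRU summarizes each simulated multi-band
    series into a feature vector, and an MLP head classifies (features, theta)
    pairs; the logit approximates a log likelihood-to-reference ratio."""
    def __init__(self, n_bands, theta_dim, hidden=64):
        super().__init__()
        self.encoder = nn.GRU(input_size=n_bands, hidden_size=hidden,
                              batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden + theta_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, y, theta):
        # y: (batch, time, n_bands) stacked band fluxes; theta: (batch, theta_dim)
        _, h = self.encoder(y)                 # h: (1, batch, hidden)
        feats = h.squeeze(0)                   # one feature vector per series
        return self.head(torch.cat([feats, theta], dim=-1)).squeeze(-1)

def training_step(model, y, theta, optimizer, loss_fn=nn.BCEWithLogitsLoss()):
    """One step of the binary-classification objective: jointly simulated
    (y, theta) pairs get label 1; pairs with theta shuffled across the batch
    serve as label-0 reference examples (a common construction)."""
    logits_joint = model(y, theta)
    logits_ref = model(y, theta[torch.randperm(theta.size(0))])
    logits = torch.cat([logits_joint, logits_ref])
    labels = torch.cat([torch.ones_like(logits_joint),
                        torch.zeros_like(logits_ref)])
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```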

Level of Effort: 50% RA-ship for one semester during the academic year 2025–26. This position does **not** cover tuition, so graduate students will need additional funding for tuition. Alternatively, the project can be pursued as a full-time summer project (100% summer RA-ship) during summer 2026, when tuition coverage is not required.

Principal Investigators: Dr. Hyungsuk Tak (Statistics, Astronomy & Astrophysics, ICDS) and Dr. Eric Ford (Astronomy & Astrophysics, ICDS)

Outcomes: Successful completion of the three core tasks will substantially advance the research goals of Dr. Tak’s and Dr. Ford’s groups. The results will form the basis of a scientific publication or a proposal to external funding programs such as NASA ROSES or NSF AAG.