Privacy-Preserving Linear Regression and Synthetic Data for Reproducible Social Science Research
PI: Aleksandra Slavkovic (Statistics)
The remaining 75% of the postdoc’s salary and fringe will be supported by the PI’s research funds. We assume a base postdoc stipend of $61,008 for 12 months plus 27.2% fringe.
Project Description
In the broader domain of the social sciences, where small- to medium-scale datasets are common, linear regression modeling and the associated statistical inference are prevalent tools for answering key scientific questions. Data in the social, economic, and behavioral sciences frequently involve sensitive information. The methodology for protecting confidentiality when sharing data and the results of statistical analyses has a long history drawing from many fields (e.g., Hundepool et al. (2012), Slavković & Seeman (2023)). Modern methods rely predominantly on differential privacy (DP) (Dwork et al. 2006) as the gold standard for rigorous privacy guarantees. Numerous methods for fitting linear regression under DP have been proposed, yet most focus solely on point estimation, offer limited support for statistical inference, and often rely on strong assumptions about the data.
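To make the point-estimation focus concrete, the minimal Python sketch below illustrates one common family of approaches, perturbation of the sufficient statistics, in which Gaussian noise is added to X^T X and X^T y before solving the normal equations. The function name, clipping bounds, and noise calibration are our own illustrative assumptions, not a description of any specific published method or of the method we propose.

```python
import numpy as np

def dp_ols_suff_stats(X, y, epsilon, delta, row_bound=1.0, y_bound=1.0, rng=None):
    """Illustrative sketch of DP OLS via Gaussian noise on sufficient statistics.

    Assumes each row of X is clipped to L2 norm <= row_bound and each response
    to |y| <= y_bound, so one record's influence on (X^T X, X^T y) is bounded.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = X.shape[1]

    # Clip each record so its contribution to the statistics is bounded.
    norms = np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    X = X * np.minimum(1.0, row_bound / norms)
    y = np.clip(y, -y_bound, y_bound)

    # Rough joint L2 sensitivity of (X^T X, X^T y) when one record is added
    # or removed, under the clipping bounds above.
    sens = np.sqrt(row_bound**4 + (row_bound * y_bound) ** 2)
    sigma = sens * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon  # Gaussian mechanism

    # Release noisy statistics; the noise on X^T X is symmetrized so the
    # perturbed matrix stays symmetric.
    E = rng.normal(0.0, sigma, size=(d, d))
    XtX_dp = X.T @ X + (E + E.T) / np.sqrt(2.0)
    Xty_dp = X.T @ y + rng.normal(0.0, sigma, size=d)

    # Post-processing: solving the noisy normal equations costs no extra privacy.
    beta_dp = np.linalg.solve(XtX_dp + 1e-6 * np.eye(d), Xty_dp)
    return beta_dp, XtX_dp, Xty_dp
```

Even this simple recipe returns only a point estimate; characterizing the added noise well enough to support confidence intervals and tests, especially for small samples, is exactly the gap the proposed work targets.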
Meanwhile, reproducibility and replicability are central to trustworthy social science research (National Academies of Sciences, Engineering, and Medicine 2019, Mukherjee et al. 2024). Researchers often want to conduct replication studies to verify or build upon prior analyses. In privacy-aware settings, however, typical DP methods return only model estimates, preventing others from revisiting or extending the analysis without access to the original data. Synthetic data generation (SDG) offers a possible solution. Although the basic idea was proposed in the early 1990s, broad adoption is still lacking (van Kesteren 2024). Furthermore, most SDG methods with DP guarantees (Jordon et al. 2018, Xie et al. 2018, Xin et al. 2020, 2022) rely on large datasets and complex models, typically based on deep learning, making them ill-suited for smaller-scale applications. At the same time, these deep-learning models require more data, and ideally trustworthy data, to improve their performance. Moreover, the theoretical implications of using such synthetic data for linear regression remain unexplored, even though theoretical and computational guarantees are necessary. Although our focus is on small- to medium-scale data, the computational demands of providing DP guarantees grow quickly.
This project aims to address these challenges by developing a novel method for DP linear regression that enables valid statistical inference and supports synthetic data generation. Specifically, we aim to: (1) design a new approach with formal DP guarantees tailored for low- to moderate-dimensional settings; (2) develop a synthetic data generation mechanism that enables follow-up analyses or replication studies at no additional privacy cost; and (3) conduct extensive comparison studies with mainstream deep learning–based algorithms commonly used in AI applications.
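As a rough illustration of aim (2), the sketch below shows how synthetic data could be drawn by post-processing the noisy sufficient statistics released above; because post-processing consumes no additional privacy budget, follow-up analyses or replication studies on such synthetic data incur no extra privacy cost. The Gaussian working model, the hypothetical helper synthetic_from_dp_stats, and its parameters are illustrative assumptions only, not the mechanism we will ultimately propose.

```python
import numpy as np

def synthetic_from_dp_stats(XtX_dp, Xty_dp, n_orig, n_synth, noise_var=1.0, rng=None):
    """Illustrative synthetic-data draw by post-processing DP releases.

    Assumes a zero-mean Gaussian working model for the covariates and a
    linear-Gaussian model for the response; n_orig (the original sample
    size) is treated here as public or as a separately released DP count.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = XtX_dp.shape[0]

    # Project the noisy second-moment matrix onto the PSD cone so it can
    # serve as a valid covariance estimate (standard post-processing).
    w, V = np.linalg.eigh((XtX_dp + XtX_dp.T) / 2.0)
    cov = (V * np.maximum(w / n_orig, 1e-6)) @ V.T

    # Regression coefficients implied by the noisy statistics.
    beta = np.linalg.solve(XtX_dp + 1e-6 * np.eye(d), Xty_dp)

    # Sample synthetic covariates and responses from the working model.
    X_synth = rng.multivariate_normal(np.zeros(d), cov, size=n_synth)
    y_synth = X_synth @ beta + rng.normal(0.0, np.sqrt(noise_var), size=n_synth)
    return X_synth, y_synth
```

In the proposed project, the noise calibration, the inference procedure, and the synthesis model would be designed and analyzed rigorously, with attention to the validity of downstream regression inference, rather than fixed as in this sketch.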
Desired Expertise and Expectations
We are seeking a Junior Researcher with a background in statistics, computer science, or data science. Preferred qualifications include familiarity with differential privacy, statistical inference, or synthetic data, as well as proficiency in Python or R. The ideal candidate is a postdoctoral researcher who can work independently and help supervise undergraduate researchers. We are also open to considering an undergraduate majoring in statistics or data science who has strong programming skills and has completed coursework at the STAT 400 level or higher.
A Junior Researcher will contribute to algorithm design and implementation, empirical evaluation, and collaborative writing for peer-reviewed publication. This project offers a valuable opportunity to engage with both foundational and applied aspects of privacy-aware statistical methods and aligns closely with ICDS’s mission to advance trustworthy and ethical data science. Undergraduate researchers may be involved in evaluating the numerical performance of machine learning and deep learning methods for comparative analysis.
Specific Objectives for the Funding Period
The Junior Researcher will (1) design and implement a new method for linear regression and synthetic data generation, (2) conduct benchmarking experiments against baseline methods, (3) co-author a scientific paper to be submitted to a statistics, data science or AI venue, (4) contribute to reproducible code and documentation for open-source release, and (5) support the organization of an ICDS-affiliated seminar or workshop on reproducible, privacy-aware data analysis.
Medium- to Long-Term Goal
A medium-term goal is to produce a high-quality scientific paper for submission to a statistics, data science, or AI venue. This work will also lay the foundation for submitting a larger collaborative grant to relevant funding agencies. We would engage with a reproducibility expert from Cornell University, Lars Vilhuber, and explore partnerships with colleagues at Penn State. The long-term goal is to build scalable, reproducible, privacy-preserving tools for statistical analysis in social science and policy settings.
Connection to ICDS Mission
This project directly contributes to ICDS’s mission of enhancing interdisciplinary education, research, and outreach in artificial intelligence, its applications, and its societal impacts. Our work promotes and practices socially responsible approaches to AI by enabling privacy-preserving data analysis and reproducibility in sensitive domains. It emphasizes the ethical deployment of statistical technologies that can generate social good—particularly in policy-relevant and human-facing research contexts. By developing algorithms that account for both privacy and statistical validity in small-sample settings, the project helps mitigate risks associated with the misuse of AI methods in research and supports the responsible design and application of computational tools. The most relevant centers for this work would be The Center for Artificial Intelligence Foundations and Scientific Applications (CENSAI) and The Center for Socially Responsible Artificial Intelligence (CSRAI).
Engagement with ICDS
PI Slavkovic has long-standing status as an ICDS Associate. In that role she has served as a member of the JEDI team, a member of the Steering Committee for the Center for Artificial Intelligence Foundations and Scientific Applications (CENSAI), and, most recently, of the Center for Socially Responsible Artificial Intelligence (CSRAI). We expect the Junior Researcher to engage closely with one of these centers and to support the organization of an ICDS-affiliated seminar or workshop on reproducible, privacy-aware data analysis.