University of Notre Dame

Methods and Applications of Differential Privacy in Statistical Problems

thesis
posted on 2023-05-26, 00:00 authored by Bingyue Su

Statistical analysis is vital for research and real-world applications across many fields and disciplines. When statistical analysis is conducted on real-world data, privacy concerns often arise because sensitive information can be inferred from released statistics. In the era of big data, ensuring data privacy during statistical analysis has become increasingly important, necessitating a rigorous definition of privacy. Differential privacy (DP) offers a quantifiable measure of privacy protection, and this dissertation explores its applications in diverse scenarios. First, I investigate the utility of differentially private hypothesis testing. I develop new methods and provide results for commonly used tests such as the z-test, t-test, and chi-squared test in the settings of one-sample mean tests, two-sample tests, variance tests, goodness-of-fit tests, and independence tests, and I evaluate the utility of these tests in terms of statistical power while maintaining the Type-I error rate. Second, I design a differentially private Metropolis-Hastings (MH) algorithm that outperforms existing privacy-preserving MH algorithms in simulation studies and a real case study in terms of parameter estimation and prediction accuracy. Third, I examine privacy-preserving data sharing in which multiple parties, or data owners, possess overlapping attributes but non-overlapping individuals. Each data owner privatizes their data in order to share privacy-preserving synthetic data. I demonstrate that the utility of the merged privatized data surpasses that of the individual small datasets without perturbation by showing that regression models fitted to the merged data yield better parameter estimates. Lastly, DP is applied to normalizing flows, a family of deep generative models, to generate privacy-preserving synthetic versions of an electronic health records dataset. I show that the accuracy of classification and regression models on the synthetic data can be close to that on the original data.
In conclusion, this dissertation provides an in-depth investigation of several applications of DP in hypothesis testing, MH sampling, and the sharing of synthetic data, showcasing the potential of DP to address pressing privacy concerns while preserving the utility of the data for statistical analysis.

History

Date Modified

2023-05-31

Defense Date

2023-05-22

CIP Code

  • 27.9999

Research Director(s)

Fang Liu

Degree

  • Doctor of Philosophy

Degree Level

  • Doctoral Dissertation

Alternate Identifier

1380758749

OCLC Number

1380758749

Program Name

  • Applied and Computational Mathematics and Statistics
