Methods and Applications of Differential Privacy in Statistical Problems
Statistical analysis is vital for research and real-world applications across many fields and disciplines. When statistical analysis is conducted on real-world data, privacy concerns often arise because sensitive information can be inferred from released statistics. In the era of big data, ensuring data privacy during statistical analysis has become increasingly important, necessitating a rigorous definition of privacy. Differential privacy (DP) offers a quantifiable measure of privacy protection, and this dissertation explores its applications in diverse scenarios. Firstly, I investigate the utility of differentially private hypothesis testing, developing new methods and providing results for commonly used tests such as the z-test, t-test, and chi-squared test in the settings of one-sample mean tests, two-sample tests, variance tests, goodness-of-fit tests, and independence tests. I evaluate the utility of these tests in terms of statistical power while maintaining the Type I error rate. Secondly, I design a differentially private Metropolis-Hastings (MH) algorithm that outperforms existing privacy-preserving MH algorithms in simulation studies and a real case study in terms of parameter estimation and prediction accuracy. Thirdly, I examine privacy-preserving data sharing in which multiple parties, or data owners, possess overlapping attributes but non-overlapping individuals. Each data owner privatizes their data and shares the resulting privacy-preserving synthetic data. I demonstrate that the utility of the merged privatized data surpasses that of the individual small datasets without perturbation by showing that regression models fitted to the merged data yield better parameter estimates. Lastly, DP is applied to normalizing flows, a family of deep generative models, to generate privacy-preserving synthetic versions of an electronic health records dataset. I show that the accuracy of classification and regression models on the synthetic data can be close to that on the original data.
In conclusion, this dissertation provides an in-depth investigation of several applications of DP in hypothesis testing, MH sampling, and the sharing of synthetic data, showcasing the potential of DP to address pressing privacy concerns while preserving the utility of statistical analysis.
History
Date Modified
2023-05-31
Defense Date
2023-05-22
CIP Code
- 27.9999
Research Director(s)
Fang Liu
Degree
- Doctor of Philosophy
Degree Level
- Doctoral Dissertation
Alternate Identifier
1380758749
OCLC Number
1380758749
Program Name
- Applied and Computational Mathematics and Statistics