Data privacy
With my students, I am involved in several data privacy projects.
- MITACS project Privacy Guarantees and Risk Identification: Statistical Framework and Methodology (January 2020 - December 2022, jointly with Privacy Analytics).
Personal information comprises the most sensitive and intimate details of one's life. It is therefore important to release data in such a way that the risk of re-identification is as low as possible, while the de-identification process preserves data quality. The goal of the project is to review and further develop statistical methods suitable for the so-called differential privacy framework. This framework was introduced in the computer science literature about 10-15 years ago, but the related statistical methodology is very limited; even such basic questions as how to estimate re-identification probabilities, or how to measure the risk of re-identification, have not been fully answered. We will therefore analyse the performance and applicability of classical statistical methods, such as point and interval estimation, as well as more advanced modern tools such as minimax estimation and machine learning. Conversely, the methods developed as part of this project will contribute to the general theory of mathematical statistics. The project will consist of both theoretical and computational components, and the tools it develops will have direct application to health data. (A toy sketch of one simple disclosure-risk measure follows the student entry below.)
- Devyani Biswal (PhD student) defended her PhD thesis Contributions to Probabilistic and Statistical Foundations of Differential Privacy in October 2024. The topics included: Differential Privacy from a data utility perspective; Differential Privacy in Time Series; Differential Privacy in Machine Learning algorithms.
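To make re-identification risk concrete, here is a minimal sketch (illustrative only, not project code) of one of the simplest disclosure-risk measures: the proportion of records that are unique on a set of quasi-identifiers, in the spirit of k-anonymity. The function name, column names, and toy data are ours.

```python
import pandas as pd

def sample_uniqueness(df: pd.DataFrame, quasi_identifiers: list[str]) -> float:
    """Proportion of records that are unique on the given quasi-identifiers.

    A record whose quasi-identifier combination appears only once is the
    natural candidate for re-identification, so this proportion serves as
    a crude disclosure-risk indicator for the sample.
    """
    counts = df.groupby(quasi_identifiers).size()
    n_unique_records = int((counts == 1).sum())  # groups of size 1
    return n_unique_records / len(df)

# Toy example: age band, postal prefix, and sex act as quasi-identifiers.
data = pd.DataFrame({
    "age_band": ["30-39", "30-39", "40-49", "40-49", "40-49"],
    "postal":   ["K1A",   "K1A",   "K2P",   "K2P",   "H3A"],
    "sex":      ["F",     "F",     "M",     "M",     "F"],
})
print(sample_uniqueness(data, ["age_band", "postal", "sex"]))  # 0.2
```

Note that sample uniqueness is only a proxy: a record unique in the sample need not be unique in the population, which is precisely why estimating population-level re-identification probabilities is a genuine statistical problem.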
- MITACS project Statistical framework and methodology for risk and privacy in complex and high-dimensional data (May 2023 - December 2026, jointly with Privacy Analytics).
Modern data collection and storage result in complex and high-dimensional databases: they include a large number of variables, exhibit spatial and temporal dependence, and contain inconsistent or missing data (e.g., due to non-response bias in surveys or complex data collection practices and linking challenges). At the same time, access to and release of information that is, or is derived from, personal information involves complex challenges in terms of the potential for inappropriate disclosure (e.g., identification, attribution, or inferential disclosure risks).
In this project we propose to develop statistical methodology that can inform the evaluation of privacy assurances while preserving the statistical utility of complex, high-dimensional health data. The important themes of this work include high-dimensionality, sparsity and complexity. The project will consist of both theoretical and computational components for statistical inference. The theoretical results will be of interest to researchers working in mathematical statistics, while the computational work will inform practice. The tools developed will have direct application to census and health data.
Our prior work in this area focused on developing a statistical approach to measuring disclosure for low-dimensional data, inspired by modern technical privacy metrics. In the proposed work we intend to focus more on the statistical properties of the proposed privacy metrics and their influence on data utility. This will allow us to develop a statistical framework and methodology for data utility that encompasses the technical concepts of disclosure risk.
- Yizhen Teng (PhD student)
- Chang Qu (MSc student) is defending his MSc thesis Contributions to Statistical Theory of Data Privacy in November 2024. The topics included: Disclosure Risk Measures and Anonymization; Synthetic Data Generation; Synthpop Package: Description, Challenges and Solutions.
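As a toy illustration of the sequential idea behind the synthpop R package (which synthesizes each variable conditionally on those already synthesized, by default using CART models), here is a deliberately simplified Python sketch; the function name and the within-group resampling scheme are ours, not synthpop's API.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

def synthesize(df: pd.DataFrame, n: int) -> pd.DataFrame:
    """Very simplified sequential synthesis for categorical data.

    Each column is drawn conditionally on the columns synthesized so
    far, by resampling observed values within the matching subgroup;
    real synthesizers fit a model (e.g. CART) per column instead.
    """
    cols = list(df.columns)
    out = pd.DataFrame({cols[0]: rng.choice(df[cols[0]].to_numpy(), size=n)})
    for j, col in enumerate(cols[1:], start=1):
        drawn = []
        for i in range(n):
            mask = np.ones(len(df), dtype=bool)
            for prev in cols[:j]:
                mask &= df[prev].to_numpy() == out.loc[i, prev]
            pool = df.loc[mask, col].to_numpy()
            # fall back to the marginal if the subgroup is empty
            drawn.append(rng.choice(pool if len(pool) else df[col].to_numpy()))
        out[col] = drawn
    return out
```

Even this toy version exhibits the central tension studied in the thesis: conditioning on all previously synthesized columns preserves joint structure (utility), but for rare combinations it can simply reproduce original records (disclosure risk).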
- Office of the Privacy Commissioner of Canada Contributions Program 2023-24: Benchmarking Differential Privacy and Existing Anonymization or Deidentification Guidance (July 2023 - March 2024, jointly with Teresa Scassa). See the program announcement.
Government and private industry, including official statistics organizations and health institutions, collect information from individuals and publish aggregate data to serve the public interest. Organizations have long collected information under a promise of confidentiality, on the understanding that it will be used for statistical purposes only and that released and shared information cannot be traced back to a specific individual. Differential privacy provides a means of limiting the information that is released, so that an individual's contribution remains hidden in a statistical release answering a single query (or a small number of queries).
Recently there has been a significant push to establish differential privacy as a standard in emerging AI technologies. Although the technique is starting to be widely used by tech companies and government agencies, challenges must be overcome before we see full adoption of this technology for de-identification and anonymization. This project will help develop the framework needed to implement differential privacy in practice, and will help form a decision-making protocol with respect to other privacy technologies and current guidance. (A minimal sketch of the underlying mechanism appears after the items below.)
- Final report prepared together with students Heidi Barriault and Patrick Fogaing Koumao.
- Presentation for the Government of Manitoba.
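As a minimal illustration of how differential privacy hides an individual's contribution to a released statistic, here is the standard Laplace mechanism applied to a count query (a textbook sketch, not code from the project): a count changes by at most 1 when one record is added or removed, so Laplace noise with scale 1/ε yields ε-differential privacy.

```python
import numpy as np

rng = np.random.default_rng()

def dp_count(values, predicate, epsilon: float) -> float:
    """Release a count under epsilon-differential privacy.

    A count query has sensitivity 1 (adding or removing one individual's
    record changes it by at most 1), so Laplace noise with scale
    1/epsilon suffices for the standard Laplace mechanism.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Example: how many ages in the data exceed 65?
ages = [34, 71, 68, 45, 80, 59]
print(dp_count(ages, lambda a: a > 65, epsilon=0.5))
```

Smaller ε means stronger privacy but noisier answers; calibrating this trade-off, and tracking the privacy budget across multiple queries, is exactly where practical guidance is still needed.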
- Office of the Privacy Commissioner of Canada Contributions Program 2024-25: Benchmarking large language models and privacy protection (July 2024 - March 2025, jointly with Teresa Scassa). See the program announcement.
In the current digital age, the accelerated growth of data generated by individuals has fuelled advances in artificial intelligence (AI), particularly the development and deployment of large language models (LLMs). These sophisticated AI systems, capable of understanding, generating and interacting with human language in ways that mimic human thought processes, are becoming integral to applications ranging from personalized content creation to drug discovery. As these models become more deeply embedded in the daily functions of society, protecting individual privacy within these systems becomes crucial.
The rapid development of LLMs and the pace at which these tools are evolving present a significant challenge in defining current and practical guidelines that can effectively address the use and deployment of these systems. Given the unique capabilities and risks associated with LLMs, there is a growing need to establish robust privacy standards specifically tailored to these technologies.
This project will provide a practical introduction to LLMs for legal and policy experts, exploring the associated privacy challenges and the role of privacy-enhancing technologies. Researchers will survey legal, policy and technical experts, as well as civil society groups, to explore the benefits and opportunities of these technologies. They will also provide recommendations and public education materials.
- Students: Heidi Barriault, Bartosz Glowacki, Chang Qu, Yuma Wu.