dpmm: a library for synthetic tabular data generation with rich functionality and end-to-end Differential Privacy guarantees
dpmm: Differentially Private Marginal Models, a Library for Synthetic Tabular Data Generation
Overview
dpmm is a Python library that implements state-of-the-art Differentially Private Marginal Models for generating synthetic tabular data. Marginal Models have consistently been shown to capture key statistical properties like marginal distributions from the original data and reproduce them in the synthetic data, while Differential Privacy (DP) ensures that individual privacy is rigorously protected.
Summary of main features:
- end-to-end DP pipelines including data preprocessing, generative models, and mechanisms:
  - DP data preprocessing -- 1) the data domain is either provided as input or extracted with DP [paper], and 2) continuous data is discretized with DP (Uniform and PrivTree [paper])
  - state-of-the-art DP generative models relying on the select-measure-generate paradigm [paper1, paper2] and Private-PGM [paper] -- PrivBayes [paper], MST [paper], and AIM [paper]
  - floating-point precision of DP mechanisms [paper]
- superior utility and performance
- rich functionality across all models/pipelines
- DP auditing of underlying mechanisms and models/pipelines [paper1, paper2]
NB: Intended Use -- dpmm is designed for research and exploratory use in privacy-preserving synthetic data generation (particularly in simple scenarios such as preserving high-quality 1/2-way marginals in datasets with up to 32 features [paper1, paper2]) and is not intended for production use in complex, real-world applications.
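To make the "measure" step of the select-measure-generate paradigm concrete, here is a minimal, generic sketch of releasing a noisy 1-way marginal with the Laplace mechanism. It is not taken from dpmm's source: the helper names (`laplace_noise`, `noisy_marginal`) are hypothetical, the sketch assumes add/remove-one-record neighbouring datasets (so a histogram has L1 sensitivity 1), and it does not include the floating-point protections that dpmm lists among its features.

```python
import random
from collections import Counter


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def noisy_marginal(values, domain, epsilon, rng):
    """Return a noisy count for every category in a publicly known domain.

    Assumption: neighbouring datasets differ by adding or removing one
    record, so the histogram has L1 sensitivity 1 and the noise scale
    is 1 / epsilon.
    """
    counts = Counter(values)
    scale = 1.0 / epsilon
    # Iterate over the public domain, not the observed values, so that
    # categories absent from the data still receive a (noisy) count.
    return {v: counts.get(v, 0) + laplace_noise(scale, rng) for v in domain}


# Toy data, purely illustrative.
rng = random.Random(0)
wine_types = ["white", "white", "red", "white", "red"]
print(noisy_marginal(wine_types, domain=["red", "white"], epsilon=1.0, rng=rng))
```

A smaller `epsilon` gives a larger noise scale, which is the privacy/utility trade-off that the budget parameters of a DP pipeline control.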
Installation
Prerequisites
- Python 3.10 or 3.11
PyPI install
You can install from PyPI by running:
pip install dpmm
Local Install
To install from a local clone of the GitHub repository, run the following commands:
git clone git@github.com:sassoftware/dpmm.git
cd dpmm
poetry install
Tests
To run the unit tests, go to the root of the repository (if installed locally), and use the following command:
pytest tests/
Functionality
We provide numerous examples demonstrating the features of dpmm across data preprocessing as well as the training and generation of generative models. The examples are available across all models and model settings, and are accessible from the repository (if installed locally).
Preprocessing
The provided generative pipelines combine automatic DP discretisation preprocessing with a generative model and allow for the following features:
| Feature | Description | Example |
|---|---|---|
| dtype support | the following pandas data types are supported natively: datetime, timedelta, float, int, category, bool. | Dtypes example |
| null-value support | missing values are supported and will be reproduced accordingly if present in any column within the real data. | |
| automatic discretisation | the default discretisation strategy used by dpmm is priv-tree, but a more typical uniform strategy is also available; both can be combined with an 'auto' mode, which attempts to identify the optimal number of bins for each numerical column. | |
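As a rough illustration of the uniform strategy mentioned in the table, the sketch below maps numeric values to equal-width bins over bounds that are assumed to be supplied up front (as with a provided domain). It is a generic example rather than dpmm's implementation: `uniform_bin_edges` and `discretise` are hypothetical names, PrivTree and the 'auto' mode are not shown, and keeping `None` as its own category is only one possible way to carry missing values through.

```python
def uniform_bin_edges(lower: float, upper: float, n_bins: int) -> list[float]:
    """Return the n_bins + 1 edges of equal-width bins on [lower, upper]."""
    width = (upper - lower) / n_bins
    return [lower + i * width for i in range(n_bins + 1)]


def discretise(value, lower: float, upper: float, n_bins: int):
    """Map a value to a bin index in 0..n_bins-1, keeping None as None."""
    if value is None:
        return None  # missing values stay missing
    # Clip to the supplied bounds so out-of-range values land in an edge bin.
    clipped = min(max(value, lower), upper)
    index = int((clipped - lower) / (upper - lower) * n_bins)
    # The upper bound itself would map to n_bins, so fold it into the last bin.
    return min(index, n_bins - 1)


# Illustrative values and bounds only.
alcohol = [8.0, 10.2, None, 14.9, 20.0]
print(uniform_bin_edges(8.0, 15.0, 7))
print([discretise(v, lower=8.0, upper=15.0, n_bins=7) for v in alcohol])
```

Because the bounds come from the domain rather than from the data, the bin edges themselves do not depend on any individual record in this sketch.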
Model Features
| Feature | Description | Example |
|---|---|---|
| domain compression | a compress flag can be set to True to ensure the discretised domain is compressed to improve the privacy budget / data quality trade-off. | |
| model size control | a max_model_size parameter that ensures the memory footprint of the selected marginals remains lower than the specified upper threshold. | Max Memory example |
| model serialisation | pipelines can be serialised to / deserialised from disk by providing a valid folder to store the model to. | Serialisation example |
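To give a feel for why a size threshold matters, the following sketch estimates the memory taken by a set of marginals from the number of bins per column. It is an assumption-laden illustration, not dpmm's accounting: the helper names are hypothetical, one 8-byte float per cell is assumed, and how dpmm actually measures or enforces `max_model_size` is not described here.

```python
from math import prod

BYTES_PER_CELL = 8  # assumption: one float64 per cell of a marginal table


def marginal_size_mb(columns, cardinalities) -> float:
    """Size in MiB of the contingency table over the given columns."""
    cells = prod(cardinalities[c] for c in columns)
    return cells * BYTES_PER_CELL / 2**20


def fits_budget(marginals, cardinalities, max_size_mb: float) -> bool:
    """Check whether all selected marginals together stay under the limit."""
    total = sum(marginal_size_mb(m, cardinalities) for m in marginals)
    return total <= max_size_mb


# Illustrative bin counts per column after discretisation.
cardinalities = {"type": 2, "alcohol": 32, "pH": 32, "density": 1024}
selected = [("type", "alcohol"), ("alcohol", "pH"), ("alcohol", "pH", "density")]
for marginal in selected:
    print(marginal, marginal_size_mb(marginal, cardinalities))
print(fits_budget(selected, cardinalities, max_size_mb=4.0))
```

The table size is the product of the column cardinalities, so reducing the number of bins per column (for instance through a compressed domain) shrinks every marginal that includes that column.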
Generation Features
| Feature | Description | Example |
|---|---|---|
| conditional generation | at generation time, it is also possible to provide a partial dataframe containing only some of the columns, in that case the generative pipeline will conditionally generate the remaining columns. | Conditional Generation example |
| deterministic generation | when a random_state value is provided at generation time, the generative process becomes deterministic assuming the same input parameters are provided. | Random State example |
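The two generation features can be illustrated together with a small, generic sketch: sampling one column conditionally on a provided value of another, with a seed that makes the draw repeatable. This is not dpmm's generation code. The function `sample_conditionally` and the toy counts are hypothetical, and the model here is a single 2-way table of non-negative counts rather than a fitted graphical model.

```python
import random


def sample_conditionally(joint_counts, given, n_records, random_state=None):
    """Sample the second attribute given a fixed value of the first.

    joint_counts maps (first_value, second_value) -> non-negative count.
    """
    rng = random.Random(random_state)  # a fixed seed makes the draw repeatable
    # Keep only the cells that are compatible with the provided value.
    candidates = {b: c for (a, b), c in joint_counts.items() if a == given and c > 0}
    if not candidates:
        raise ValueError(f"no mass for conditioning value {given!r}")
    values = sorted(candidates)  # fixed order so results do not depend on dict order
    weights = [candidates[v] for v in values]
    return rng.choices(values, weights=weights, k=n_records)


# Toy 2-way marginal over (type, quality).
joint = {("white", 0): 30, ("white", 1): 70, ("red", 0): 55, ("red", 1): 45}
first = sample_conditionally(joint, given="white", n_records=5, random_state=42)
second = sample_conditionally(joint, given="white", n_records=5, random_state=42)
print(first, first == second)
```

With the same seed and the same inputs the two calls return identical samples, which mirrors the deterministic-generation behaviour described in the table.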
Models
The implemented models include:
| Method | Description | Reference | Example |
|---|---|---|---|
| PrivBayes+PGM | Differentially Private Bayesian Network. | PrivBayes: Private Data Release via Bayesian Networks | PrivBayes example |
| MST | Maximum Spanning Tree. | Winning the NIST Contest: A scalable and general approach to differentially private synthetic data | MST example |
| AIM | Adaptive and Iterative Mechanism. | AIM: An Adaptive and Iterative Mechanism for Differentially Private Synthetic Data | AIM example |
NB: All models rely on the select-measure-generate paradigm [paper1, paper2] and Private-PGM [paper].
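As a pointer to what the "select" step can look like for a tree-structured model such as MST, the sketch below builds a maximum spanning tree over pairwise dependence scores with Kruskal's algorithm. It only shows the plain, non-private tree construction: in the published method the pairs are chosen under DP, and the scores, column names, and the function `maximum_spanning_tree` used here are made up for illustration.

```python
def maximum_spanning_tree(nodes, weights):
    """Kruskal's algorithm, taking the heaviest edges first.

    weights maps (u, v) -> score; returns the list of selected pairs.
    """
    parent = {n: n for n in nodes}

    def find(n):
        # Union-find root lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree = []
    for (u, v), _ in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # adding this edge does not create a cycle
            parent[root_u] = root_v
            tree.append((u, v))
    return tree


# Hypothetical dependence scores between column pairs.
columns = ["type", "alcohol", "density", "quality"]
scores = {
    ("alcohol", "density"): 0.9,
    ("alcohol", "quality"): 0.8,
    ("density", "quality"): 0.7,
    ("type", "density"): 0.2,
    ("type", "quality"): 0.1,
}
print(maximum_spanning_tree(columns, scores))
```

Each selected pair corresponds to a 2-way marginal to measure, and a tree over d columns always contains d - 1 such pairs.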
Getting Started
To get started with dpmm, follow the steps below:
- Import the necessary modules and load your data:

      import json
      from pathlib import Path

      import pandas as pd

      from dpmm.pipelines import MSTPipeline

      wine_dir = Path().parent / "wine"
      df = pd.read_pickle(wine_dir / "wine.pkl.gz")
      with (wine_dir / "wine_bounds.json").open("r") as f:
          domain = json.load(f)

- Initialize and fit a model:

      model = MSTPipeline(
          # Generator parameters
          epsilon=1.0,
          delta=1e-5,
          # Discretiser parameters
          proc_epsilon=0.1,
      )
      model.fit(df, domain)

- Generate synthetic data:

      synth_df = model.generate(n_records=100)
      print(synth_df)
      """
          type  fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide   density        pH  sulphates    alcohol  quality
      0  white       5.288142          0.190330     0.212473        1.402665   0.032305            37.097305             60.585301  0.990234  2.998241   0.658841  12.467682        1
      1  white       5.956364          0.225099     0.210124       15.968057   0.043620            70.073909            202.689578  0.995807  3.198247   0.318414  10.290390        0
      2  white       5.315535          0.341091     0.247268        0.628240   0.024938            52.468176            104.892353  0.990975  3.161218   0.971699  11.181373        1
      3  white       7.879125          0.234170     0.275704        3.711610   0.039565            68.977194            163.380550  1.005989  3.068622   0.798520   8.075999        0
      4  white       6.981342          0.358461     0.337705        3.600390   0.050450            51.567452            134.896467  0.996149  3.272745   0.599021  10.200400        0
      """
Troubleshooting
If you encounter any issues, please check the following:
- Ensure that all required packages are installed.
- Verify that your data does not contain missing values or non-integer columns if using certain models.
- Check the model parameters and ensure they are set correctly.
Contributing
Maintainers are accepting patches and contributions to this project. Please read CONTRIBUTING.md for details about submitting contributions to this project.
License
This project is licensed under the Apache 2.0 License. This project also uses code snippets from the following projects:
- private-pgm: Apache 2.0
- opendp: MIT License
- ektelo: Apache 2.0
Additional Resources
Citing
If you use this code, please cite the associated paper:
@inproceedings{mahiou2025dpmm,
title={{dpmm: Differentially Private Marginal Models, a Library for Synthetic Tabular Data Generation}},
author={Mahiou, Sofiane and Dizche, Amir and Nazari, Reza and Wu, Xinmin and Abbey, Ralph and Silva, Jorge and Ganev, Georgi},
booktitle={TPDP},
year={2025}
}