
BlockingPy meta package (GPU)



BlockingPy

BlockingPy is a Python package that implements efficient blocking methods for record linkage and data deduplication using Approximate Nearest Neighbor (ANN) algorithms. It is based on the R blocking package.

Additionally, GPU acceleration is available via the blockingpy-gpu package (FAISS-GPU).

Purpose

When performing record linkage or deduplication on large datasets, comparing all possible record pairs becomes computationally infeasible. Blocking helps reduce the comparison space by identifying candidate record pairs that are likely to match, using efficient approximate nearest neighbor search algorithms.
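The benefit of blocking is usually reported as a reduction ratio. As a back-of-the-envelope illustration (using the figures from the record-linkage example below, not BlockingPy's internal code):

```python
# Illustrative arithmetic: the reduction ratio is the fraction of all
# cross-dataset pairs eliminated by blocking. Figures match the
# record-linkage example below (8 x-records, 3 y-records, 3 surviving pairs).
n_x, n_y = 8, 3
total_pairs = n_x * n_y            # 24 comparisons without blocking
candidate_pairs = 3                # pairs left after blocking
reduction_ratio = 1 - candidate_pairs / total_pairs
print(f"{reduction_ratio:.4f}")    # 0.8750
```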

Installation

BlockingPy requires Python 3.10 or later and is installed via pip:

pip install blockingpy

Note

You may need to run the following beforehand:

sudo apt-get install -y libmlpack-dev # on Linux
brew install mlpack # on macOS

For the GPU version, see the GPU Support section below or the documentation.

Basic Usage

Record Linkage

from blockingpy import Blocker
import pandas as pd

# Example data for record linkage
x = pd.DataFrame({
    "txt": [
        "johnsmith",
        "smithjohn",
        "smiithhjohn",
        "smithjohnny",
        "montypython",
        "pythonmonty",
        "errmontypython",
        "monty",
    ]
})

y = pd.DataFrame({
    "txt": [
        "montypython",
        "smithjohn",
        "other",
    ]
})

# Initialize blocker instance
blocker = Blocker()

# Perform blocking with the default ANN: FAISS
block_result = blocker.block(x=x["txt"], y=y["txt"])

Printing block_result shows:

  • the method used (faiss refers to Facebook AI Similarity Search),
  • the number of blocks created (3 in this case),
  • the number of columns (features) used for blocking (n-grams common to both datasets; 17 in this example),
  • the reduction ratio, i.e. how much the comparison space shrinks (0.8750 here, meaning blocking eliminates 87.5% of all candidate pairs).
print(block_result)
# ========================================================
# Blocking based on the faiss method.
# Number of blocks: 3
# Number of columns created for blocking: 17
# Reduction ratio: 0.8750
# ========================================================
# Distribution of the size of the blocks:
# Block Size | Number of Blocks
#          2 | 3  

Printing block_result.result shows the results table containing:

  • row numbers from the original data,
  • block numbers (integers),
  • distances (from the ANN algorithm).
print(block_result.result)
#    x  y  block      dist
# 0  4  0      0  0.000000
# 1  1  1      1  0.000000
# 2  6  2      2  0.607768
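To inspect the candidate pairs, the indices in block_result.result can be mapped back to the original strings. A plain pandas sketch, using the data and pairs shown above (this lookup is not part of BlockingPy's API):

```python
import pandas as pd

# Candidate pairs as reported in block_result.result above.
pairs = pd.DataFrame({
    "x": [4, 1, 6], "y": [0, 1, 2],
    "block": [0, 1, 2], "dist": [0.000000, 0.000000, 0.607768],
})

x_txt = ["johnsmith", "smithjohn", "smiithhjohn", "smithjohnny",
         "montypython", "pythonmonty", "errmontypython", "monty"]
y_txt = ["montypython", "smithjohn", "other"]

# Attach the original strings so each candidate pair can be inspected.
pairs["x_txt"] = [x_txt[i] for i in pairs["x"]]
pairs["y_txt"] = [y_txt[i] for i in pairs["y"]]
print(pairs[["x_txt", "y_txt", "dist"]])
```

The exact matches ("montypython", "smithjohn") land at distance 0, while the fuzzier "errmontypython" / "other" pair has a larger distance.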

Deduplication

We can perform deduplication by passing only the previously created DataFrame (the x argument) to the block() method.

dedup_result = blocker.block(x=x["txt"])
print(dedup_result)
# ========================================================
# Blocking based on the faiss method.
# Number of blocks: 2
# Number of columns created for blocking: 25
# Reduction ratio: 0.571429
# ========================================================
# Distribution of the size of the blocks:
# Block Size | Number of Blocks
#          4 | 2 
print(dedup_result.result)
#    x  y  block      dist
# 0  1  0      0  0.125000
# 1  3  1      0  0.105573
# 2  1  2      0  0.105573
# 3  5  4      1  0.083333
# 4  4  6      1  0.105573
# 5  5  7      1  0.278312
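The x/y indices within each block can be collapsed into duplicate groups. A small stdlib-only sketch using the pairs printed above (outside BlockingPy's API):

```python
from collections import defaultdict

# Candidate pairs (x, y, block) from dedup_result.result above.
pairs = [(1, 0, 0), (3, 1, 0), (1, 2, 0),
         (5, 4, 1), (4, 6, 1), (5, 7, 1)]

# Collect all record indices that fall into each block.
blocks = defaultdict(set)
for x_i, y_i, b in pairs:
    blocks[b].update((x_i, y_i))

print({b: sorted(members) for b, members in sorted(blocks.items())})
# {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
```

This recovers the two blocks of size 4 reported in the block-size distribution.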

You can find more comprehensive examples in the examples section of the documentation.

Features

  • Multiple ANN implementations available:

    • FAISS (Facebook AI Similarity Search) (lsh, hnsw, flat)
    • Voyager (Spotify)
    • HNSW (Hierarchical Navigable Small World)
    • MLPACK (both LSH and k-d tree)
    • NND (Nearest Neighbor Descent)
    • Annoy (Spotify)
  • Multiple distance metrics such as:

    • Euclidean
    • Cosine
    • Inner Product

    and more...

  • Support for both shingle-based and embedding-based text representation

  • Comprehensive customization of algorithm parameters via control_ann and control_txt

  • Support for pre-computed document-term matrices (as np.ndarray or scipy.sparse.csr_matrix)

  • Support for both record linkage and deduplication

  • Evaluation metrics when true blocks are known

  • GPU support for fast blocking of large datasets using GPU-accelerated indexes from FAISS
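The shingle-based representation listed above boils down to character n-grams. A minimal, self-contained illustration (not BlockingPy's internal code) of why strings like "johnsmith" and "smithjohn" end up sharing blocking features:

```python
# Character 2-gram "shingles": overlapping substrings of length 2.
def shingles(s: str, n: int = 2) -> set[str]:
    return {s[i:i + n] for i in range(len(s) - n + 1)}

a = shingles("johnsmith")
b = shingles("smithjohn")
print(sorted(a & b))  # 7 shared 2-grams despite the reordered words
```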

You can find detailed information about BlockingPy in the documentation.

GPU Support

BlockingPy can process large datasets by utilizing the GPU with the faiss_gpu algorithm. The available GPU indexes are Flat, IVF, IVFPQ, and CAGRA. blockingpy-gpu also includes all CPU indexes except the mlpack backends.

Prerequisites

  • OS: Linux or Windows 11 with WSL2 (Ubuntu)
  • Python: 3.10
  • GPU: Nvidia with driver supporting CUDA ≥ 12.4
  • Tools: conda/mamba + pip

Install

PyPI wheels do not provide CUDA-enabled FAISS. You must install FAISS-GPU via conda/mamba, then install blockingpy-gpu with pip.

# 1) Env
mamba create -n blockingpy-gpu python=3.10 -y
conda activate blockingpy-gpu
conda config --env --set channel_priority flexible

# 2) Install FAISS GPU (nightly cuVS build) - this version was tested
mamba install -y \
  -c pytorch/label/nightly -c rapidsai -c conda-forge \
  "faiss-gpu-cuvs=1.11.0" "libcuvs=25.4.*"

# 3) Install BlockingPy and the remaining dependencies with pip (or poetry, uv, etc.)
pip install blockingpy-gpu
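After installation, you can check that the conda-provided FAISS build actually sees a GPU. faiss.get_num_gpus() is part of the FAISS Python API; the snippet below degrades gracefully if faiss is not installed:

```python
# Sanity check for the GPU environment set up above.
try:
    import faiss  # provided by the faiss-gpu-cuvs conda package
    num_gpus = faiss.get_num_gpus()
except ImportError:
    num_gpus = None  # faiss not installed in this environment
print("GPUs visible to FAISS:", num_gpus)
```

A value of 0 with faiss installed usually means a CPU-only build or a driver/CUDA mismatch.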

Example Datasets

BlockingPy comes with example datasets fetched via the Pooch library:

  • Census-Cis dataset created by Paula McLeod, Dick Heasman and Ian Forbes, ONS, for the ESSnet DI on-the-job training course, Southampton, 25-28 January 2011

  • Deduplication dataset taken from the RecordLinkage R package developed by Murat Sariyar and Andreas Borg, licensed under GPL-3. Also known as RLdata10000.

The files are hosted on GitHub Releases and can be downloaded via the provided links.

License

BlockingPy is released under the MIT license.

Third Party

BlockingPy benefits from many open-source packages such as FAISS or Annoy. For detailed information, see the third-party notice.

Contributing & Development

Please see CONTRIBUTING.md for more information.

Code of Conduct

You can find the Code of Conduct in the repository.

Acknowledgements

This package is based on the R blocking package developed by BERENZ.

Funding

Work on this package is supported by the National Science Centre, OPUS 20 grant no. 2020/39/B/HS4/00941 (Towards census-like statistics for foreign-born populations -- quality, data integration and estimation).
