
Valentine: find relationships between columns of different tabular datasets

Project description

Valentine

(Schema-) Matching DataFrames Made Easy



A Python package for capturing potential relationships among columns of different tabular datasets, given as pandas or Polars DataFrames. Valentine is based on the paper Valentine: Evaluating Matching Techniques for Dataset Discovery.

📚 Full documentation: https://delftdata.github.io/valentine/ — getting started, matcher guide, API reference, and migration notes.

Experimental suite version

The original experimental suite version of Valentine, as first published for the needs of the research paper, can still be found here.

Installation instructions

Requirements

  • Python >=3.10,<3.15

To install Valentine simply run:

pip install valentine

To enable Polars support, install the optional extra:

pip install valentine[polars]

To enable the sentence-transformer embedding distance for JaccardDistanceMatcher (see below), install:

pip install valentine[embeddings]

Usage

Valentine can be used to find matches among columns of a given pair of pandas or Polars DataFrames. You can even mix pandas and Polars frames in the same call — Valentine auto-detects the frame type.
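As a minimal sketch of a mixed call (with made-up column names; the Coma matcher is described below):

import pandas as pd
import polars as pl  # requires the optional "polars" extra

from valentine import valentine_match
from valentine.algorithms import Coma

# One pandas frame and one Polars frame; Valentine detects the type of each
df_pandas = pd.DataFrame({"emp_id": [1, 2], "fname": ["Ada", "Grace"]})
df_polars = pl.DataFrame({"employee_number": [1, 2], "first_name": ["Ada", "Grace"]})

matches = valentine_match([df_pandas, df_polars], Coma())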

Matching methods

To do so, choose one of the following matching methods (a short instantiation sketch follows the list):

  1. Coma(int: max_n, bool: use_instances, bool: use_schema, float: delta, float: threshold) is a pure Python implementation of the COMA 3.0 schema matching algorithm.

    • Parameters:
      • max_n(int) - Maximum number of matches to keep per column, 0 means unlimited (default: 0).
      • use_instances(bool) - Whether to use TF-IDF instance-based matching on data values (default: False).
      • use_schema(bool) - Whether to use schema-based matching on column names, paths, and structure (default: True).
      • delta(float) - Fraction from the best score within which matches are kept (default: 0.15).
      • threshold(float) - Absolute minimum similarity score to keep a match (default: 0.0).
  2. Cupid(float: w_struct, float: leaf_w_struct, float: th_accept) is the Python implementation of the paper Generic Schema Matching with Cupid.

    • Parameters:
      • w_struct(float) - Structural similarity threshold, default is 0.2.
      • leaf_w_struct(float) - Structural similarity threshold, leaf level, default is 0.2.
      • th_accept(float) - Accept similarity threshold, default is 0.7.
  3. DistributionBased(float: threshold1, float: threshold2) is the Python implementation of the paper Automatic Discovery of Attributes in Relational Databases.

    • Parameters:
      • threshold1(float) - The threshold for phase 1 of the method, default is 0.15.
      • threshold2(float) - The threshold for phase 2 of the method, default is 0.15.
  4. JaccardDistanceMatcher(...) is a baseline method that scores column pairs by Tversky similarity over their value sets (Jaccard by default). Element equality between values can be decided by a configurable string distance function, including a sentence-transformer embedding option for semantic matching.

    • Parameters:
      • threshold_dist(float) - Acceptance threshold above which two values are considered equal under the chosen distance_fun, default is 0.8. For embeddings, ~0.7 is a typical operating point.
      • distance_fun(StringDistanceFunction) - Per-value similarity function; the enumeration class StringDistanceFunction can be imported from valentine.algorithms.jaccard_distance. It covers string-distance comparisons of values as well as an Embedding option backed by sentence-transformers for semantic matching (see embedding_model below).
      • tversky_alpha(float) / tversky_beta(float) - Tversky penalty weights for unmatched values on each side (defaults 1.0, 1.0). Defaults give Jaccard; 0.5, 0.5 gives Sørensen-Dice; 1.0, 0.0 (or vice versa) gives set containment — useful when one column is expected to be a subset of the other.
      • embedding_model(str) - Sentence-transformers model name when distance_fun=Embedding (default "all-MiniLM-L6-v2").
      • embedding_device(str or None) - Device override ("cpu", "cuda", "mps"). None (default) auto-picks: cuda → mps → cpu.
      • embedding_batch_size(int or None) - Encode batch size; None uses the sentence-transformers default (32). Larger values amortise per-call overhead on capable hardware.
  5. SimilarityFlooding(Policy: coeff_policy, Formula: formula, StringMatcher: string_matcher) is the Python implementation of the paper Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching.

    • Parameters:
      • coeff_policy(Policy) - Policy for deciding the weight coefficients of the propagation graph. Policy.INVERSE_PRODUCT or Policy.INVERSE_AVERAGE (default).
      • formula(Formula) - Formula on which iterative fixpoint computation is based. Formula.BASIC, Formula.FORMULA_A, Formula.FORMULA_B, or Formula.FORMULA_C (default).
      • string_matcher(StringMatcher) - String matching function for the initial similarity mapping. StringMatcher.PREFIX_SUFFIX (default), StringMatcher.PREFIX_SUFFIX_TFIDF, or StringMatcher.LEVENSHTEIN.
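As a quick illustration, the matchers above could be instantiated roughly as follows. The parameter values are arbitrary examples rather than recommendations, and the sketch assumes all matchers are importable from valentine.algorithms in the same way as Coma in the example further below:

from valentine.algorithms import (
    Coma,
    Cupid,
    DistributionBased,
    JaccardDistanceMatcher,
    SimilarityFlooding,
)
from valentine.algorithms.jaccard_distance import StringDistanceFunction

# Schema- plus instance-based COMA with a stricter absolute threshold
coma = Coma(use_instances=True, threshold=0.3)

# Cupid with a slightly lower acceptance threshold
cupid = Cupid(th_accept=0.6)

# Distribution-based matching with default thresholds
distribution_based = DistributionBased()

# Jaccard/Tversky over value sets, comparing values with the sentence-transformer
# embedding distance (requires the "embeddings" extra); ~0.7 is a typical threshold
jaccard_semantic = JaccardDistanceMatcher(
    threshold_dist=0.7,
    distance_fun=StringDistanceFunction.Embedding,
)

# Similarity flooding with its default policy, formula, and string matcher
similarity_flooding = SimilarityFlooding()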

Matching DataFrames

Pass two or more DataFrames as a list (or any iterable) along with a matcher. Valentine will match columns across all unique pairs. Pandas and Polars frames can be freely mixed:

# Match a pair of DataFrames (pandas, Polars, or mixed)
matches = valentine_match([df1, df2], matcher)

# Match multiple DataFrames (computes all N×(N-1)/2 pairs)
matches = valentine_match([df1, df2, df3], matcher, df_names=["sales", "orders", "products"])

Optionally provide df_names to label each DataFrame (defaults to "aaa", "bbb", etc. — designed to have zero similarity so they don't influence schema-based matchers). Function valentine_match returns a MatcherResults object, an immutable mapping from ColumnPair to similarity scores with convenience methods for filtering, subsetting, and evaluation.

MatcherResults and ColumnPair

Results are keyed by ColumnPair namedtuples with named fields for easy access:

for pair, score in matches.items():
    print(f"{pair.source_column} <-> {pair.target_column}: {score:.3f}")
    # Also available: pair.source_table, pair.target_table
    # Shorthand tuples: pair.source, pair.target

MatcherResults provides convenience methods for filtering and subsetting:

top_n_matches = matches.take_top_n(5)
top_n_percent_matches = matches.take_top_percent(25)
high_confidence = matches.filter(min_score=0.7)

# One-to-one selectors — three flavours, pick the one that fits your task:
one_to_one_matches = matches.one_to_one_hungarian()           # globally optimal (default)
one_to_one_strict = matches.one_to_one_hungarian(threshold=0.5)
greedy_legacy = matches.one_to_one_greedy()                   # legacy greedy assignment
mutual_only = matches.one_to_one_mutual_top(n=1)              # mutual nearest neighbour

one_to_one_hungarian (Hungarian assignment via scipy.optimize.linear_sum_assignment) is the recommended default and is what Precision / Recall / F1Score apply when their one_to_one flag is set. one_to_one_greedy preserves the legacy greedy behaviour for backwards compatibility. one_to_one_mutual_top(n) keeps a pair only when each side ranks the other in its top-n — a high-precision filter that drops one-sided affinities.
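To see why the three selectors can disagree, consider a toy 2x2 score table with hypothetical numbers (not produced by any matcher); the snippet below reproduces the Hungarian choice directly with scipy:

import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical scores: rows are source columns (a, b), columns are targets (x, y)
scores = np.array([[0.90, 0.80],
                   [0.85, 0.10]])

# Greedy picks the single best pair first, (a, x) = 0.90, which forces (b, y) = 0.10.
# The Hungarian assignment maximises the total instead, choosing (a, y) and (b, x).
rows, cols = linear_sum_assignment(scores, maximize=True)
print(list(zip(rows.tolist(), cols.tolist())))  # [(0, 1), (1, 0)] -> (a, y), (b, x)

# mutual_top(n=1) would keep only (a, x): a and x each rank the other first,
# while b's top choice (x) prefers a, so b is dropped entirely.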

Match details (Coma)

When using the Coma matcher, per-sub-matcher score breakdowns are available via get_details():

for pair, score in matches.items():
    details = matches.get_details(pair)
    if details:
        print(f"{pair.source_column} <-> {pair.target_column}: {details}")
        # e.g. {'NameCM': 0.72, 'PathCM': 0.65, 'LeavesCM': 0.58, ...}

Measuring effectiveness

metrics = matches.get_metrics(ground_truth)

Computes Precision, Recall, F1-score and others as described in the original Valentine paper. The ground truth is a list of (source_column, target_column) tuples:

ground_truth = [("emp_id", "employee_number"), ("fname", "first_name"), ...]

Custom metrics can be specified, and predefined sets are available:

from valentine.metrics import F1Score, PrecisionTopNPercent, METRICS_PRECISION_INCREASING_N
metrics_custom = matches.get_metrics(ground_truth, metrics={F1Score(one_to_one=False), PrecisionTopNPercent(n=70)})
metrics_predefined_set = matches.get_metrics(ground_truth, metrics=METRICS_PRECISION_INCREASING_N)

The 1:1 selection algorithm used when a metric's one_to_one flag is True can be overridden per call (default "hungarian"):

metrics_strict   = matches.get_metrics(ground_truth, metrics={F1Score()}, one_to_one_method="mutual_top")
metrics_legacy   = matches.get_metrics(ground_truth, metrics={F1Score()}, one_to_one_method="greedy")

Valid values are "hungarian" (default), "greedy", and "mutual_top". Metrics whose one_to_one flag is False (e.g. MeanReciprocalRank, RecallAtSizeofGroundTruth) ignore the argument.
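For instance, assuming the ranking metrics named above are importable from valentine.metrics in the same way as the others, they can be mixed with a 1:1 metric; the method argument then only affects the latter:

from valentine.metrics import F1Score, MeanReciprocalRank, RecallAtSizeofGroundTruth

metrics_mixed = matches.get_metrics(
    ground_truth,
    metrics={F1Score(), MeanReciprocalRank(), RecallAtSizeofGroundTruth()},
    one_to_one_method="hungarian",  # applies to F1Score only; the ranking metrics ignore it
)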

Example

The following block of code shows: 1) how to run a matcher from Valentine on two DataFrames storing information about job candidates, and then 2) how to assess its effectiveness based on a given ground truth. More examples are available in the examples/ directory, including a pandas example, a Polars example, and a mixed pandas+Polars example.

import pandas as pd
from valentine import valentine_match
from valentine.algorithms import Coma

# Load data using pandas
df1 = pd.read_csv("source_candidates.csv")
df2 = pd.read_csv("target_candidates.csv")

# Instantiate matcher and run
matcher = Coma(use_instances=True)
matches = valentine_match([df1, df2], matcher)

# Iterate over results using ColumnPair named fields
for pair, score in matches.items():
    print(f"{pair.source_column} <-> {pair.target_column}: {score:.3f}")

# If ground truth is available, Valentine can calculate the metrics
ground_truth = [
    ("emp_id", "employee_number"),
    ("fname", "first_name"),
    ("lname", "last_name"),
    ("dept", "department"),
    ("annual_salary", "compensation"),
    ("hire_date", "start_date"),
    ("office_loc", "work_location"),
]

metrics = matches.get_metrics(ground_truth)
print(metrics)

Cite Valentine

Original Valentine paper:
@inproceedings{koutras2021valentine,
  title={Valentine: Evaluating Matching Techniques for Dataset Discovery},
  author={Koutras, Christos and Siachamis, George and Ionescu, Andra and Psarakis, Kyriakos and Brons, Jerry and Fragkoulis, Marios and Lofi, Christoph and Bonifati, Angela and Katsifodimos, Asterios},
  booktitle={2021 IEEE 37th International Conference on Data Engineering (ICDE)},
  pages={468--479},
  year={2021},
  organization={IEEE}
}
Demo paper:
@article{koutras2021demo,
  title={Valentine in Action: Matching Tabular Data at Scale},
  author={Koutras, Christos and Psarakis, Kyriakos and Siachamis, George and Ionescu, Andra and Fragkoulis, Marios and Bonifati, Angela and Katsifodimos, Asterios},
  journal={VLDB},
  volume={14},
  number={12},
  pages={2871--2874},
  year={2021},
  publisher={VLDB Endowment}
}
