
Valentine: find relationships between columns of different tabular datasets

Project description

Valentine

(Schema-) Matching DataFrames Made Easy



A Python package for capturing potential relationships among columns of different tabular datasets, given as pandas or Polars DataFrames. Valentine is based on the paper Valentine: Evaluating Matching Techniques for Dataset Discovery.

📚 Full documentation: https://delftdata.github.io/valentine/ — getting started, matcher guide, API reference, and migration notes.

Experimental suite version

The original experimental suite version of Valentine, as first published for the needs of the research paper, can still be found here.

Installation instructions

Requirements

  • Python >=3.10,<3.15

To install Valentine simply run:

pip install valentine

To enable Polars support, install the optional extra:

pip install valentine[polars]

To enable the sentence-transformer embedding distance for JaccardDistanceMatcher (see below), install:

pip install valentine[embeddings]

Usage

Valentine can be used to find matches among columns of a given pair of pandas or Polars DataFrames. You can even mix pandas and Polars frames in the same call — Valentine auto-detects the frame type.
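As a minimal sketch of a mixed call (with made-up column names; the Coma matcher is described below):

import pandas as pd
import polars as pl  # requires the optional "polars" extra

from valentine import valentine_match
from valentine.algorithms import Coma

# One pandas frame and one Polars frame; Valentine detects the type of each
df_pandas = pd.DataFrame({"emp_id": [1, 2], "fname": ["Ada", "Grace"]})
df_polars = pl.DataFrame({"employee_number": [1, 2], "first_name": ["Ada", "Grace"]})

matches = valentine_match([df_pandas, df_polars], Coma())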

Matching methods

To do so, choose one of the following matching methods (a short instantiation sketch follows the list):

  1. Coma(int: max_n, bool: use_instances, bool: use_schema, float: delta, float: threshold) is a pure Python implementation of the COMA 3.0 schema matching algorithm.

    • Parameters:
      • max_n(int) - Maximum number of matches to keep per column, 0 means unlimited (default: 0).
      • use_instances(bool) - Whether to use TF-IDF instance-based matching on data values (default: False).
      • use_schema(bool) - Whether to use schema-based matching on column names, paths, and structure (default: True).
      • delta(float) - Fraction from the best score within which matches are kept (default: 0.15).
      • threshold(float) - Absolute minimum similarity score to keep a match (default: 0.0).
  2. Cupid(float: w_struct, float: leaf_w_struct, float: th_accept) is the Python implementation of the paper Generic Schema Matching with Cupid.

    • Parameters:
      • w_struct(float) - Structural similarity threshold, default is 0.2.
      • leaf_w_struct(float) - Structural similarity threshold, leaf level, default is 0.2.
      • th_accept(float) - Accept similarity threshold, default is 0.7.
  3. DistributionBased(float: threshold1, float: threshold2) is the Python implementation of the paper Automatic Discovery of Attributes in Relational Databases.

    • Parameters:
      • threshold1(float) - The threshold for phase 1 of the method, default is 0.15.
      • threshold2(float) - The threshold for phase 2 of the method, default is 0.15.
  4. JaccardDistanceMatcher(...) is a baseline method that scores column pairs by Tversky similarity over their value sets (Jaccard by default). Element equality between values can be decided by a configurable string distance function, including a sentence-transformer embedding option for semantic matching.

    • Parameters:
      • threshold_dist(float) - Acceptance threshold above which two values are considered equal under the chosen distance_fun, default is 0.8. For embeddings, ~0.7 is a typical operating point.
      • distance_fun(StringDistanceFunction) - Per-value similarity function; the enumeration class StringDistanceFunction can be imported from valentine.algorithms.jaccard_distance. It covers string-distance comparisons of values as well as an Embedding option backed by sentence-transformers for semantic matching (see embedding_model below).
      • tversky_alpha(float) / tversky_beta(float) - Tversky penalty weights for unmatched values on each side (defaults 1.0, 1.0). Defaults give Jaccard; 0.5, 0.5 gives Sørensen-Dice; 1.0, 0.0 (or vice versa) gives set containment — useful when one column is expected to be a subset of the other.
      • embedding_model(str) - Sentence-transformers model name when distance_fun=Embedding (default "all-MiniLM-L6-v2").
      • embedding_device(str or None) - Device override ("cpu", "cuda", "mps"). None (default) auto-picks: cuda → mps → cpu.
      • embedding_batch_size(int or None) - Encode batch size; None uses the sentence-transformers default (32). Larger values amortise per-call overhead on capable hardware.
  5. SimilarityFlooding(Policy: coeff_policy, Formula: formula, StringMatcher: string_matcher) is the Python implementation of the paper Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching.

    • Parameters:
      • coeff_policy(Policy) - Policy for deciding the weight coefficients of the propagation graph. Policy.INVERSE_PRODUCT or Policy.INVERSE_AVERAGE (default).
      • formula(Formula) - Formula on which iterative fixpoint computation is based. Formula.BASIC, Formula.FORMULA_A, Formula.FORMULA_B, or Formula.FORMULA_C (default).
      • string_matcher(StringMatcher) - String matching function for the initial similarity mapping. StringMatcher.PREFIX_SUFFIX (default), StringMatcher.PREFIX_SUFFIX_TFIDF, or StringMatcher.LEVENSHTEIN.
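As a quick illustration, the matchers above could be instantiated roughly as follows. The parameter values are arbitrary examples rather than recommendations, and the sketch assumes all matchers are importable from valentine.algorithms in the same way as Coma in the example further below:

from valentine.algorithms import (
    Coma,
    Cupid,
    DistributionBased,
    JaccardDistanceMatcher,
    SimilarityFlooding,
)
from valentine.algorithms.jaccard_distance import StringDistanceFunction

# Schema- plus instance-based COMA with a stricter absolute threshold
coma = Coma(use_instances=True, threshold=0.3)

# Cupid with a slightly lower acceptance threshold
cupid = Cupid(th_accept=0.6)

# Distribution-based matching with default thresholds
distribution_based = DistributionBased()

# Jaccard/Tversky over value sets, comparing values with the sentence-transformer
# embedding distance (requires the "embeddings" extra); ~0.7 is a typical threshold
jaccard_semantic = JaccardDistanceMatcher(
    threshold_dist=0.7,
    distance_fun=StringDistanceFunction.Embedding,
)

# Similarity flooding with its default policy, formula, and string matcher
similarity_flooding = SimilarityFlooding()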

Matching DataFrames

Pass two or more DataFrames as a list (or any iterable) along with a matcher. Valentine will match columns across all unique pairs. Pandas and Polars frames can be freely mixed:

# Match a pair of DataFrames (pandas, Polars, or mixed)
matches = valentine_match([df1, df2], matcher)

# Match multiple DataFrames (computes all N×(N-1)/2 pairs)
matches = valentine_match([df1, df2, df3], matcher, df_names=["sales", "orders", "products"])

Optionally provide df_names to label each DataFrame (defaults to "aaa", "bbb", etc. — designed to have zero similarity so they don't influence schema-based matchers). Function valentine_match returns a MatcherResults object, an immutable mapping from ColumnPair to similarity scores with convenience methods for filtering, subsetting, and evaluation.

MatcherResults and ColumnPair

Results are keyed by ColumnPair namedtuples with named fields for easy access:

for pair, score in matches.items():
    print(f"{pair.source_column} <-> {pair.target_column}: {score:.3f}")
    # Also available: pair.source_table, pair.target_table
    # Shorthand tuples: pair.source, pair.target

MatcherResults provides convenience methods for filtering and subsetting:

top_n_matches = matches.take_top_n(5)
top_n_percent_matches = matches.take_top_percent(25)
high_confidence = matches.filter(min_score=0.7)

# One-to-one selectors — three flavours, pick the one that fits your task:
one_to_one_matches = matches.one_to_one_hungarian()           # globally optimal (default)
one_to_one_strict = matches.one_to_one_hungarian(threshold=0.5)
greedy_legacy = matches.one_to_one_greedy()                   # legacy greedy assignment
mutual_only = matches.one_to_one_mutual_top(n=1)              # mutual nearest neighbour

one_to_one_hungarian (Hungarian assignment via scipy.optimize.linear_sum_assignment) is the recommended default and is what Precision / Recall / F1Score apply when their one_to_one flag is set. one_to_one_greedy preserves the legacy greedy behaviour for backwards compatibility. one_to_one_mutual_top(n) keeps a pair only when each side ranks the other in its top-n — a high-precision filter that drops one-sided affinities.
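To see why the three selectors can disagree, consider a toy 2x2 score table with hypothetical numbers (not produced by any matcher); the snippet below reproduces the Hungarian choice directly with scipy:

import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical scores: rows are source columns (a, b), columns are targets (x, y)
scores = np.array([[0.90, 0.80],
                   [0.85, 0.10]])

# Greedy picks the single best pair first, (a, x) = 0.90, which forces (b, y) = 0.10.
# The Hungarian assignment maximises the total instead, choosing (a, y) and (b, x).
rows, cols = linear_sum_assignment(scores, maximize=True)
print(list(zip(rows.tolist(), cols.tolist())))  # [(0, 1), (1, 0)] -> (a, y), (b, x)

# mutual_top(n=1) would keep only (a, x): a and x each rank the other first,
# while b's top choice (x) prefers a, so b is dropped entirely.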

Match details (Coma)

When using the Coma matcher, per-sub-matcher score breakdowns are available via get_details():

for pair, score in matches.items():
    details = matches.get_details(pair)
    if details:
        print(f"{pair.source_column} <-> {pair.target_column}: {details}")
        # e.g. {'NameCM': 0.72, 'PathCM': 0.65, 'LeavesCM': 0.58, ...}

Measuring effectiveness

metrics = matches.get_metrics(ground_truth)

Computes Precision, Recall, F1-score and others as described in the original Valentine paper. The ground truth is a list of (source_column, target_column) tuples:

ground_truth = [("emp_id", "employee_number"), ("fname", "first_name"), ...]

Custom metrics can be specified, and predefined sets are available:

from valentine.metrics import F1Score, PrecisionTopNPercent, METRICS_PRECISION_INCREASING_N
metrics_custom = matches.get_metrics(ground_truth, metrics={F1Score(one_to_one=False), PrecisionTopNPercent(n=70)})
metrics_predefined_set = matches.get_metrics(ground_truth, metrics=METRICS_PRECISION_INCREASING_N)

The 1:1 selection algorithm used when a metric's one_to_one flag is True can be overridden per call (default "hungarian"):

metrics_strict   = matches.get_metrics(ground_truth, metrics={F1Score()}, one_to_one_method="mutual_top")
metrics_legacy   = matches.get_metrics(ground_truth, metrics={F1Score()}, one_to_one_method="greedy")

Valid values are "hungarian" (default), "greedy", and "mutual_top". Metrics whose one_to_one flag is False (e.g. MeanReciprocalRank, RecallAtSizeofGroundTruth) ignore the argument.
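For instance, assuming the ranking metrics named above are importable from valentine.metrics in the same way as the others, they can be mixed with a 1:1 metric; the method argument then only affects the latter:

from valentine.metrics import F1Score, MeanReciprocalRank, RecallAtSizeofGroundTruth

metrics_mixed = matches.get_metrics(
    ground_truth,
    metrics={F1Score(), MeanReciprocalRank(), RecallAtSizeofGroundTruth()},
    one_to_one_method="hungarian",  # applies to F1Score only; the ranking metrics ignore it
)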

Example

The following block of code shows: 1) how to run a matcher from Valentine on two DataFrames storing information about job candidates, and then 2) how to assess its effectiveness based on a given ground truth. More examples are available in the examples/ directory, including a pandas example, a Polars example, and a mixed pandas+Polars example.

import pandas as pd
from valentine import valentine_match
from valentine.algorithms import Coma

# Load data using pandas
df1 = pd.read_csv("source_candidates.csv")
df2 = pd.read_csv("target_candidates.csv")

# Instantiate matcher and run
matcher = Coma(use_instances=True)
matches = valentine_match([df1, df2], matcher)

# Iterate over results using ColumnPair named fields
for pair, score in matches.items():
    print(f"{pair.source_column} <-> {pair.target_column}: {score:.3f}")

# If ground truth is available, Valentine can calculate the metrics
ground_truth = [
    ("emp_id", "employee_number"),
    ("fname", "first_name"),
    ("lname", "last_name"),
    ("dept", "department"),
    ("annual_salary", "compensation"),
    ("hire_date", "start_date"),
    ("office_loc", "work_location"),
]

metrics = matches.get_metrics(ground_truth)
print(metrics)

Cite Valentine

Original Valentine paper:
@inproceedings{koutras2021valentine,
  title={Valentine: Evaluating Matching Techniques for Dataset Discovery},
  author={Koutras, Christos and Siachamis, George and Ionescu, Andra and Psarakis, Kyriakos and Brons, Jerry and Fragkoulis, Marios and Lofi, Christoph and Bonifati, Angela and Katsifodimos, Asterios},
  booktitle={2021 IEEE 37th International Conference on Data Engineering (ICDE)},
  pages={468--479},
  year={2021},
  organization={IEEE}
}
Demo paper:
@article{koutras2021demo,
  title={Valentine in Action: Matching Tabular Data at Scale},
  author={Koutras, Christos and Psarakis, Kyriakos and Siachamis, George and Ionescu, Andra and Fragkoulis, Marios and Bonifati, Angela and Katsifodimos, Asterios},
  journal={VLDB},
  volume={14},
  number={12},
  pages={2871--2874},
  year={2021},
  publisher={VLDB Endowment}
}
