Valentine: find relationships between columns of different tabular datasets
Project description
Valentine
(Schema-) Matching DataFrames Made Easy
A Python package for capturing potential relationships among columns of different tabular datasets, given as pandas or Polars DataFrames. Valentine is based on the paper Valentine: Evaluating Matching Techniques for Dataset Discovery.
📚 Full documentation: https://delftdata.github.io/valentine/ — getting started, matcher guide, API reference, and migration notes.
Experimental suite version
The original experimental suite version of Valentine, as first published for the needs of the research paper, can still be found here.
Installation instructions
Requirements
- Python >=3.10,<3.15
To install Valentine simply run:
pip install valentine
To enable Polars support, install the optional extra:
pip install valentine[polars]
To enable the sentence-transformer embedding distance for JaccardDistanceMatcher (see below), install:
pip install valentine[embeddings]
Usage
Valentine can be used to find matches among columns of a given pair of pandas or Polars DataFrames. You can even mix pandas and Polars frames in the same call — Valentine auto-detects the frame type.
Matching methods
In order to do so, the user can choose one of the following matching methods:

- `Coma(int: max_n, bool: use_instances, bool: use_schema, float: delta, float: threshold)` is a pure Python implementation of the COMA 3.0 schema matching algorithm.
  - Parameters:
    - `max_n` (int) - Maximum number of matches to keep per column; 0 means unlimited (default: 0).
    - `use_instances` (bool) - Whether to use TF-IDF instance-based matching on data values (default: False).
    - `use_schema` (bool) - Whether to use schema-based matching on column names, paths, and structure (default: True).
    - `delta` (float) - Fraction from the best score within which matches are kept (default: 0.15).
    - `threshold` (float) - Absolute minimum similarity score to keep a match (default: 0.0).
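The interplay of `delta`, `threshold`, and `max_n` can be pictured as a two-stage filter over candidate scores. Below is a minimal pure-Python sketch of that pruning idea — `prune_matches` is a hypothetical helper, not Valentine's internal API, and reading `delta` as a relative fraction of the best score is an assumption:

```python
# Hypothetical sketch of Coma-style pruning; NOT Valentine's implementation.
# Assumption: "fraction from the best score" means scores >= best * (1 - delta).
def prune_matches(scores, delta=0.15, threshold=0.0, max_n=0):
    """Keep scores close to the best one and above an absolute threshold."""
    if not scores:
        return []
    cutoff = max(scores) * (1 - delta)
    kept = sorted(
        (s for s in scores if s >= cutoff and s >= threshold),
        reverse=True,
    )
    return kept if max_n == 0 else kept[:max_n]

print(prune_matches([0.9, 0.8, 0.5], delta=0.15))                 # [0.9, 0.8]
print(prune_matches([0.9, 0.8, 0.5], delta=0.15, threshold=0.85)) # [0.9]
```

With `max_n=0` every surviving candidate is kept, which matches the "0 means unlimited" convention above.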
- `Cupid(float: w_struct, float: leaf_w_struct, float: th_accept)` is the Python implementation of the paper Generic Schema Matching with Cupid.
  - Parameters:
    - `w_struct` (float) - Structural similarity threshold (default: 0.2).
    - `leaf_w_struct` (float) - Structural similarity threshold at the leaf level (default: 0.2).
    - `th_accept` (float) - Acceptance similarity threshold (default: 0.7).
- `DistributionBased(float: threshold1, float: threshold2)` is the Python implementation of the paper Automatic Discovery of Attributes in Relational Databases.
  - Parameters:
    - `threshold1` (float) - The threshold for phase 1 of the method (default: 0.15).
    - `threshold2` (float) - The threshold for phase 2 of the method (default: 0.15).
- `JaccardDistanceMatcher(...)` is a baseline method that scores column pairs by Tversky similarity over their value sets (Jaccard by default). Element equality between values can be decided by a configurable string distance function, including a sentence-transformer embedding option for semantic matching.
  - Parameters:
    - `threshold_dist` (float) - Acceptance threshold above which two values are considered equal under the chosen `distance_fun` (default: 0.8). For embeddings, ~0.7 is a typical operating point.
    - `distance_fun` (StringDistanceFunction) - Per-value similarity function. The enumeration class `StringDistanceFunction` can be imported from `valentine.algorithms.jaccard_distance`. Functions currently supported are:
      - `StringDistanceFunction.Levenshtein`: Levenshtein distance (default)
      - `StringDistanceFunction.DamerauLevenshtein`: Damerau-Levenshtein distance
      - `StringDistanceFunction.Hamming`: Hamming distance
      - `StringDistanceFunction.Jaro`: Jaro distance
      - `StringDistanceFunction.JaroWinkler`: Jaro-Winkler distance
      - `StringDistanceFunction.Exact`: string equality (`==`)
      - `StringDistanceFunction.Embedding`: cosine similarity on sentence-transformer embeddings (requires the `valentine[embeddings]` extra)
    - `tversky_alpha` (float) / `tversky_beta` (float) - Tversky penalty weights for unmatched values on each side (defaults: 1.0, 1.0). The defaults give Jaccard; `0.5, 0.5` gives Sørensen-Dice; `1.0, 0.0` (or vice versa) gives set containment — useful when one column is expected to be a subset of the other.
    - `embedding_model` (str) - Sentence-transformers model name when `distance_fun=Embedding` (default: `"all-MiniLM-L6-v2"`).
    - `embedding_device` (str or None) - Device override (`"cpu"`, `"cuda"`, `"mps"`). `None` (default) auto-picks: cuda → mps → cpu.
    - `embedding_batch_size` (int or None) - Encoding batch size; `None` uses the sentence-transformers default (32). Larger values amortise per-call overhead on capable hardware.
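The Tversky scoring rule that `tversky_alpha` and `tversky_beta` control is easy to state in plain Python. The sketch below uses exact set equality only (no string distance function) and is an illustration of the formula, not the matcher's implementation:

```python
# Tversky similarity over two value sets: |A∩B| / (|A∩B| + α|A−B| + β|B−A|).
# Illustrative sketch only; JaccardDistanceMatcher additionally applies a
# string distance function when comparing individual values.
def tversky(a, b, alpha=1.0, beta=1.0):
    a, b = set(a), set(b)
    inter = len(a & b)
    denom = inter + alpha * len(a - b) + beta * len(b - a)
    return inter / denom if denom else 0.0

left, right = {"NL", "DE", "FR"}, {"NL", "DE", "IT", "ES"}
print(tversky(left, right))            # Jaccard: 2/5 = 0.4
print(tversky(left, right, 0.5, 0.5))  # Sørensen-Dice: 2/3.5
print(tversky(left, right, 1.0, 0.0))  # containment of left in right: 2/3
```

Setting both weights to 1.0 recovers Jaccard, 0.5/0.5 recovers Sørensen-Dice, and zeroing one weight ignores unmatched values on that side, which is the containment behaviour described above.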
- `SimilarityFlooding(Policy: coeff_policy, Formula: formula, StringMatcher: string_matcher)` is the Python implementation of the paper Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching.
  - Parameters:
    - `coeff_policy` (Policy) - Policy for deciding the weight coefficients of the propagation graph: `Policy.INVERSE_PRODUCT` or `Policy.INVERSE_AVERAGE` (default).
    - `formula` (Formula) - Formula on which the iterative fixpoint computation is based: `Formula.BASIC`, `Formula.FORMULA_A`, `Formula.FORMULA_B`, or `Formula.FORMULA_C` (default).
    - `string_matcher` (StringMatcher) - String matching function for the initial similarity mapping: `StringMatcher.PREFIX_SUFFIX` (default), `StringMatcher.PREFIX_SUFFIX_TFIDF`, or `StringMatcher.LEVENSHTEIN`.
Matching DataFrames
Pass two or more DataFrames as a list (or any iterable) along with a matcher. Valentine will match columns across all unique pairs. Pandas and Polars frames can be freely mixed:
# Match a pair of DataFrames (pandas, Polars, or mixed)
matches = valentine_match([df1, df2], matcher)
# Match multiple DataFrames (computes all N×(N-1)/2 pairs)
matches = valentine_match([df1, df2, df3], matcher, df_names=["sales", "orders", "products"])
Optionally provide df_names to label each DataFrame (defaults to "aaa", "bbb", etc. — designed to have zero similarity so they don't influence schema-based matchers). The valentine_match function returns a MatcherResults object, an immutable mapping from ColumnPair to similarity scores with convenience methods for filtering, subsetting, and evaluation.
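The "all unique pairs" behaviour is the standard unordered-pairs enumeration; a quick sketch with `itertools.combinations` shows where the N×(N-1)/2 count comes from (this illustrates the pairing logic only, not Valentine's internals):

```python
from itertools import combinations

# Three DataFrame names yield 3 * 2 / 2 = 3 unique unordered pairs.
names = ["sales", "orders", "products"]
pairs = list(combinations(names, 2))
print(pairs)
# [('sales', 'orders'), ('sales', 'products'), ('orders', 'products')]
assert len(pairs) == len(names) * (len(names) - 1) // 2
```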
MatcherResults and ColumnPair
Results are keyed by ColumnPair namedtuples with named fields for easy access:
for pair, score in matches.items():
print(f"{pair.source_column} <-> {pair.target_column}: {score:.3f}")
# Also available: pair.source_table, pair.target_table
# Shorthand tuples: pair.source, pair.target
MatcherResults provides convenience methods for filtering and subsetting:
top_n_matches = matches.take_top_n(5)
top_n_percent_matches = matches.take_top_percent(25)
high_confidence = matches.filter(min_score=0.7)
# One-to-one selectors — three flavours, pick the one that fits your task:
one_to_one_matches = matches.one_to_one_hungarian() # globally optimal (default)
one_to_one_strict = matches.one_to_one_hungarian(threshold=0.5)
greedy_legacy = matches.one_to_one_greedy() # legacy greedy assignment
mutual_only = matches.one_to_one_mutual_top(n=1) # mutual nearest neighbour
one_to_one_hungarian (Hungarian assignment via scipy.optimize.linear_sum_assignment) is the recommended default and is what Precision / Recall / F1Score apply when their one_to_one flag is set. one_to_one_greedy preserves the legacy greedy behaviour for backwards compatibility. one_to_one_mutual_top(n) keeps a pair only when each side ranks the other in its top-n — a high-precision filter that drops one-sided affinities.
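The mutual nearest-neighbour idea behind one_to_one_mutual_top can be sketched on a toy score mapping in plain Python. This is a conceptual illustration with n=1, not the library's implementation:

```python
# Mutual top-1 filter sketch: keep (s, t) only when t is s's best target
# AND s is t's best source. Toy scores; not Valentine's code.
scores = {
    ("fname", "first_name"): 0.9,
    ("fname", "full_name"): 0.6,
    ("lname", "first_name"): 0.7,
    ("lname", "last_name"): 0.8,
}

def mutual_top1(scores):
    kept = {}
    for (s, t), sc in scores.items():
        best_t = max(((tt, v) for (ss, tt), v in scores.items() if ss == s),
                     key=lambda kv: kv[1])[0]
        best_s = max(((ss, v) for (ss, tt), v in scores.items() if tt == t),
                     key=lambda kv: kv[1])[0]
        if best_t == t and best_s == s:
            kept[(s, t)] = sc
    return kept

print(mutual_top1(scores))
# {('fname', 'first_name'): 0.9, ('lname', 'last_name'): 0.8}
```

Note how ("lname", "first_name") is dropped even though its score is decent: first_name prefers fname, so the affinity is one-sided — exactly the high-precision behaviour described above.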
Match details (Coma)
When using the Coma matcher, per-sub-matcher score breakdowns are available via .details:
for pair, score in matches.items():
details = matches.get_details(pair)
if details:
print(f"{pair.source_column} <-> {pair.target_column}: {details}")
# e.g. {'NameCM': 0.72, 'PathCM': 0.65, 'LeavesCM': 0.58, ...}
Measuring effectiveness
metrics = matches.get_metrics(ground_truth)
Computes Precision, Recall, F1-score and others as described in the original Valentine paper. The ground truth is a list of (source_column, target_column) tuples:
ground_truth = [("emp_id", "employee_number"), ("fname", "first_name"), ...]
Custom metrics can be specified, and predefined sets are available:
from valentine.metrics import F1Score, PrecisionTopNPercent, METRICS_PRECISION_INCREASING_N
metrics_custom = matches.get_metrics(ground_truth, metrics={F1Score(one_to_one=False), PrecisionTopNPercent(n=70)})
metrics_predefined_set = matches.get_metrics(ground_truth, metrics=METRICS_PRECISION_INCREASING_N)
The 1:1 selection algorithm used when a metric's one_to_one flag is True can be overridden per call (default "hungarian"):
metrics_strict = matches.get_metrics(ground_truth, metrics={F1Score()}, one_to_one_method="mutual_top")
metrics_legacy = matches.get_metrics(ground_truth, metrics={F1Score()}, one_to_one_method="greedy")
Valid values are "hungarian" (default), "greedy", and "mutual_top". Metrics whose one_to_one flag is False (e.g. MeanReciprocalRank, RecallAtSizeofGroundTruth) ignore the argument.
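At their core, Precision, Recall, and F1 reduce to set comparisons between the (possibly 1:1-filtered) predicted pairs and the ground truth. A minimal sketch of that computation — not the library's metric classes:

```python
# Precision / Recall / F1 over column-pair sets; illustrative only.
def prf1(predicted, ground_truth):
    pred, truth = set(predicted), set(ground_truth)
    tp = len(pred & truth)                      # true positives
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

predicted = [("fname", "first_name"), ("lname", "last_name"), ("dept", "division")]
truth = [("fname", "first_name"), ("lname", "last_name"), ("dept", "department")]
print(prf1(predicted, truth))  # each value is 2/3 here
```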
Example
The following block of code shows: 1) how to run a matcher from Valentine on two DataFrames storing information about job candidates, and then 2) how to assess its effectiveness based on a given ground truth. More examples are available in the examples/ directory, including a pandas example, a Polars example, and a mixed pandas+Polars example.
import pandas as pd
from valentine import valentine_match
from valentine.algorithms import Coma
# Load data using pandas
df1 = pd.read_csv("source_candidates.csv")
df2 = pd.read_csv("target_candidates.csv")
# Instantiate matcher and run
matcher = Coma(use_instances=True)
matches = valentine_match([df1, df2], matcher)
# Iterate over results using ColumnPair named fields
for pair, score in matches.items():
print(f"{pair.source_column} <-> {pair.target_column}: {score:.3f}")
# If ground truth is available, Valentine can calculate the metrics
ground_truth = [
("emp_id", "employee_number"),
("fname", "first_name"),
("lname", "last_name"),
("dept", "department"),
("annual_salary", "compensation"),
("hire_date", "start_date"),
("office_loc", "work_location"),
]
metrics = matches.get_metrics(ground_truth)
print(metrics)
Cite Valentine
Original Valentine paper:
@inproceedings{koutras2021valentine,
title={Valentine: Evaluating Matching Techniques for Dataset Discovery},
author={Koutras, Christos and Siachamis, George and Ionescu, Andra and Psarakis, Kyriakos and Brons, Jerry and Fragkoulis, Marios and Lofi, Christoph and Bonifati, Angela and Katsifodimos, Asterios},
booktitle={2021 IEEE 37th International Conference on Data Engineering (ICDE)},
pages={468--479},
year={2021},
organization={IEEE}
}
Demo Paper:
@article{koutras2021demo,
title={Valentine in Action: Matching Tabular Data at Scale},
author={Koutras, Christos and Psarakis, Kyriakos and Siachamis, George and Ionescu, Andra and Fragkoulis, Marios and Bonifati, Angela and Katsifodimos, Asterios},
journal={VLDB},
volume={14},
number={12},
pages={2871--2874},
year={2021},
publisher={VLDB Endowment}
}
File details

Details for the file valentine-1.0.0.tar.gz.

File metadata
- Download URL: valentine-1.0.0.tar.gz
- Size: 117.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 9734bfaa5a7c33048ba0b074a2e783645f615610dfd2a5fabd74d007b4eee18d |
| MD5 | 9acb402f6ff876babfc9e6eac9131445 |
| BLAKE2b-256 | 264d6373e58ee1dd9218a526a9c64e2e66e3bd21539d5e0fb72e965b818b0dfb |
Provenance

The following attestation bundle was made for valentine-1.0.0.tar.gz:

Publisher: ci-build-test-publish.yml on delftdata/valentine

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: valentine-1.0.0.tar.gz
- Subject digest: 9734bfaa5a7c33048ba0b074a2e783645f615610dfd2a5fabd74d007b4eee18d
- Sigstore transparency entry: 1537431539
- Permalink: delftdata/valentine@f0f738927455063841a4ebdda2f1420abc26922b
- Branch / Tag: refs/tags/v1.0.0
- Owner: https://github.com/delftdata
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci-build-test-publish.yml@f0f738927455063841a4ebdda2f1420abc26922b
- Trigger Event: release
File details

Details for the file valentine-1.0.0-py3-none-any.whl.

File metadata
- Download URL: valentine-1.0.0-py3-none-any.whl
- Size: 93.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | d20047ec623f481204f1d2fda9bc67fb716e98f4beafd41a7739954645d78a43 |
| MD5 | 2f4b1a4b24844e4225d63f458dea522f |
| BLAKE2b-256 | e48501910625f5cf63218754c11e0cca56fabb999c0479651542930778d9d0e6 |
Provenance

The following attestation bundle was made for valentine-1.0.0-py3-none-any.whl:

Publisher: ci-build-test-publish.yml on delftdata/valentine

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: valentine-1.0.0-py3-none-any.whl
- Subject digest: d20047ec623f481204f1d2fda9bc67fb716e98f4beafd41a7739954645d78a43
- Sigstore transparency entry: 1537431604
- Permalink: delftdata/valentine@f0f738927455063841a4ebdda2f1420abc26922b
- Branch / Tag: refs/tags/v1.0.0
- Owner: https://github.com/delftdata
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci-build-test-publish.yml@f0f738927455063841a4ebdda2f1420abc26922b
- Trigger Event: release