Sentence similarity (0-1) with SBERT vectors - Python 3.10+
sentence2simvec
Vector-based sentence similarity (0.0 - 1.0) for Japanese & multilingual texts.
- 3-gram Jaccard surface similarity
- SBERT (MiniLM) semantic similarity
- Python API + CLI (sentence2simvec)
- Each sentence's embedding vector (NumPy, 384-dim) can be retrieved and saved as a NumPy array
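To make the surface-similarity feature concrete, here is a minimal sketch of a character 3-gram Jaccard score. It is an illustration under stated assumptions, not the package's actual implementation: the helper names `char_ngrams` and `jaccard_similarity` are hypothetical, and the handling of very short or empty strings is a guess. With character-level 3-grams and no text normalization, it does reproduce the n-gram value shown in the CLI example (0.2727).

```python
def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams in `text` (hypothetical helper)."""
    if len(text) < n:
        # Assumption: a string shorter than n counts as a single gram.
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| over character n-gram sets."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        # Assumption: two empty strings are treated as identical.
        return 1.0
    return len(grams_a & grams_b) / len(union)


# "Hello!" has 4 grams, "Hello world!" has 10, and they share 3,
# so the score is 3 / (4 + 10 - 3) = 3 / 11.
print(round(jaccard_similarity("Hello!", "Hello world!"), 4))  # 0.2727
```

Character n-grams need no tokenizer, which is one plausible reason to use them for Japanese and other languages written without spaces.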
Install
# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
# Windows
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
# Any Python 3.10+ works (3.10, 3.11, 3.12, 3.13, ...)
uv venv -p 3.10 .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
uv pip install -U sentence2simvec
or, for an editable install from a local checkout:
uv pip install -e .
Usage
- CLI
sentence2simvec "Hello!" "Hello world!" --save-vecs ./vecs
# Similarity: 0.7135
# • n-gram = 0.2727
# • cosine = 0.9024
# vecs/vec1.npy, vecs/vec2.npy written
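The three numbers in the CLI output are consistent with a weighted average of the two component scores, with weight 0.3 on the n-gram score and 0.7 on the cosine score. This is inferred from the printed values only; the weights and the `blend` function below are assumptions, not a documented part of the package.

```python
def blend(jaccard: float, cosine: float, w_jaccard: float = 0.3) -> float:
    """Weighted average of surface and semantic similarity.

    The 0.3 / 0.7 split is an assumption inferred from the CLI example,
    not a value confirmed by the package.
    """
    return w_jaccard * jaccard + (1.0 - w_jaccard) * cosine


# 0.3 * 0.2727 + 0.7 * 0.9024 = 0.08181 + 0.63168 = 0.71349
print(round(blend(0.2727, 0.9024), 4))  # 0.7135
```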
- Python API
from sentence2simvec import similarity_score, sentence_vector

# --- Similarity (with vectors) ---------------------------
score, details, v1, v2 = similarity_score(
    "Hello!", "Hello world!", return_vectors=True
)
print(score)     # 0.87
print(details)   # {'jaccard': 0.764…, 'cosine': 0.910…}
print(v1.shape)  # (384,)

# --- Get a single sentence vector ------------------------
vec = sentence_vector("Hello")
# Note: vec is L2-normalized (||vec||₂ = 1)
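Because the returned vectors are L2-normalized, their cosine similarity reduces to a plain dot product, and they round-trip through `.npy` files unchanged. The sketch below shows both points with NumPy only. It uses random 384-dim stand-in vectors instead of real embeddings so that it runs without downloading a model; the helper names are hypothetical.

```python
import tempfile
from pathlib import Path

import numpy as np


def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale `v` to unit Euclidean length."""
    return v / np.linalg.norm(v)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """General cosine similarity, valid for vectors of any length."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Stand-ins for two sentence embeddings (assumption: real vectors would
# come from sentence_vector(...) or similarity_score(..., return_vectors=True)).
rng = np.random.default_rng(0)
v1 = l2_normalize(rng.standard_normal(384)).astype(np.float32)
v2 = l2_normalize(rng.standard_normal(384)).astype(np.float32)

# For unit vectors the dot product already equals the cosine similarity.
dot = float(np.dot(v1, v2))
full = cosine_similarity(v1, v2)
print(abs(dot - full) < 1e-5)

# Save and reload a vector the same way a .npy file would be reused later.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "vec1.npy"
    np.save(path, v1)
    loaded = np.load(path)

print(loaded.shape, loaded.dtype)
```

Files written by the CLI's `--save-vecs` option (`vecs/vec1.npy`, `vecs/vec2.npy`) should be loadable with `np.load` in the same way.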
Development
uv venv -p 3.10 .venv && source .venv/bin/activate
uv pip install build twine pytest ipdb sentence-transformers numpy
# debug run
python sentence2simvec/core.py "Hello!" "Hello world!"
# tests
pytest
# build
python -m build
# upload to PyPI
export TWINE_USERNAME="__token__"
export TWINE_PASSWORD="pypi-..."
twine upload dist/*
unset TWINE_USERNAME TWINE_PASSWORD
Download files
Source Distribution
sentence2simvec-0.0.1.tar.gz (8.9 kB)
Built Distribution
sentence2simvec-0.0.1-py3-none-any.whl (9.9 kB)
File details
Details for the file sentence2simvec-0.0.1.tar.gz.
File metadata
- Download URL: sentence2simvec-0.0.1.tar.gz
- Upload date:
- Size: 8.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | acb5b37f59b57ef3474b108f602bc348e2ba15f93817cfb78d87909bcc242ec7 |
| MD5 | 720b4cfc87d94b8b3548f2543e353d5e |
| BLAKE2b-256 | 044ab095bf283e5a63738a5283769e1d08039031be0a83320accad3105a4fe8f |
File details
Details for the file sentence2simvec-0.0.1-py3-none-any.whl.
File metadata
- Download URL: sentence2simvec-0.0.1-py3-none-any.whl
- Upload date:
- Size: 9.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5ab06dd54e63b6b4ae098d2646238e1057eb7f074a01fb8495f73c2cbe2fd222 |
| MD5 | eb88c757229541d6d1e9b3fbbbaf657e |
| BLAKE2b-256 | dddef65d3e74199933fd45f5fb169392127b90fc35d5bdd1be1cadd8486d67b7 |