Sentence similarity (0-1) with SBERT vectors - Python 3.10+

sentence2simvec

Vector-based sentence similarity (0.0 - 1.0) for Japanese & multilingual texts.

  • 3-gram Jaccard surface similarity
  • SBERT (MiniLM) semantic similarity
  • Python API + CLI (sentence2simvec)
  • Returns each sentence's embedding vector as a 384-dimensional NumPy array, which can also be saved to disk (.npy)
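
The surface-similarity feature can be illustrated with a minimal sketch of character 3-gram Jaccard similarity. This is not the package's own code: the helper names `char_ngrams` and `jaccard` are hypothetical, and the choice of character-level n-grams with no padding or lowercasing is an assumption. Under that assumption, the result happens to agree with the n-gram figure shown in the CLI example under Usage.

```python
def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams in text (assumed: no padding, no lowercasing)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity |A & B| / |A | B| of the two n-gram sets."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0  # both strings are shorter than n characters
    return len(grams_a & grams_b) / len(union)


# "Hello!" has 4 trigrams, "Hello world!" has 10; they share 3, so 3 / 11.
print(round(jaccard("Hello!", "Hello world!"), 4))  # 0.2727
```
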

Install

# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
# Windows
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

# Python 3.10 or later (3.10, 3.11, 3.12, 3.13, ...)
uv venv -p 3.10 .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate

uv pip install -U sentence2simvec
# or, from a local clone of the repository:
uv pip install -e .

Usage

  • CLI
    sentence2simvec "Hello!" "Hello world!" --save-vecs ./vecs
    
    # Similarity: 0.7135
    #    • n-gram  = 0.2727
    #    • cosine  = 0.9024
    # vecs/vec1.npy, vecs/vec2.npy written
    
  • Python API
    from sentence2simvec import similarity_score, sentence_vector
    
    # --- Similarity (with vectors) ---------------------------
    score, details, v1, v2 = similarity_score(
        "Hello!",
        "Hello world!",
        return_vectors=True
    )
    print(score)          # 0.7135 (same inputs as the CLI run above)
    print(details)        # {'jaccard': 0.2727…, 'cosine': 0.9024…}
    print(v1.shape)       # (384,)
    
    # --- Get a single sentence vector ------------------------
    vec = sentence_vector("Hello")
    # Note: vec is L2-normalized (||vec||₂ = 1)
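
Two points from the usage examples can be sketched in plain Python. First, when both vectors have unit L2 norm, their cosine similarity reduces to a dot product. Second, the figures in the CLI output are consistent with a 0.3 / 0.7 weighted blend of the n-gram and cosine scores. That weighting is inferred from the sample output only and is not documented by the package; the helper names `l2_normalize`, `dot`, and `combine` are hypothetical.

```python
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale vec so that its Euclidean (L2) norm is 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(u: list[float], v: list[float]) -> float:
    """Dot product; equals cosine similarity when u and v are unit vectors."""
    return sum(a * b for a, b in zip(u, v))


def combine(ngram: float, cosine: float, w_ngram: float = 0.3) -> float:
    """Weighted blend of the two scores (the 0.3 weight is an assumption)."""
    return w_ngram * ngram + (1.0 - w_ngram) * cosine


u = l2_normalize([3.0, 4.0])  # [0.6, 0.8]
v = l2_normalize([4.0, 3.0])  # [0.8, 0.6]
print(round(dot(u, v), 4))    # 0.96

# Reproduces the overall score from the CLI example's two component scores.
print(round(combine(0.2727, 0.9024), 4))  # 0.7135
```

The same dot-product shortcut should apply to the vectors returned by `sentence_vector`, given that they are stated to be unit length.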
    

Development

uv venv -p 3.10 .venv && source .venv/bin/activate
uv pip install build twine pytest ipdb sentence-transformers numpy

# debug run
python sentence2simvec/core.py "Hello!" "Hello world!"

# tests
pytest

# build
python -m build

# upload to PyPI
export TWINE_USERNAME="__token__"
export TWINE_PASSWORD="pypi-..."
twine upload dist/*
unset TWINE_USERNAME TWINE_PASSWORD
