
Language Model Unit Testing Framework


LMUnit: Fine-grained Evaluation with Natural Language Unit Tests


*(Figure: LMUnit workflow)*

This repository provides code for evaluating and reproducing the results in [LMUnit: Fine-grained Evaluation with Natural Language Unit Tests](https://arxiv.org/abs/2412.13091).

🚀 Recent Updates

  • 2025-07: Open-sourced LMUnit checkpoints based on Llama 3.1-70B and Qwen 2.5-72B, along with an evaluation framework for reproducing our direct-scoring and preference-task results.


🔧 LMUnit Inference Example

Install LMUnit

pip install lmunit

Run Inference Example Script

from lmunit import LMUnit
from vllm import SamplingParams

# Initialize LMUnit with your model
model = LMUnit(
    model_path="ContextualAI/LMUnit-llama3.1-70b", 
    tp_size=4
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=10, logprobs=20)
# Define the unit test, query, and response to evaluate
unit_test = "Does the response specify minimum system logs required for compliance?"

query = "What security validation protocols ensure HIPAA compliance when connecting research databases to clinical systems?"



response = """Customer Concentration Risk Comparison:
Snowflake's Platform segment shows higher concentration risk than Product segment:

Product: 15% (Customer X), 8% (Customer Y), 77% others
Platform: 20% (Customer Z), 5% (Customer W), 75% others

HIPAA Technical Controls:
Data Encryption Requirements:

Regulation: 45 CFR § 164.312(a)(2)(iv)
Validation: All data encrypted at rest/transit using FIPS 140-2 algorithms
Testing: Vulnerability scanning and penetration testing for encryption weaknesses

Additional Compliance Measures:

Risk analysis for security threats
Access controls for PHI authorization
Incident response planning
Required logs: encryption key management, data access, security incidents"""

prompt = f"Query: {query}\n\nResponse: {response}\n\nUnit Test: {unit_test}"

output = model.generate(prompt, sampling_params)
print(output)
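The example requests `logprobs=20` in `SamplingParams`, so alternatives for each generated token come back alongside the text. A minimal sketch of one common way to use them, assuming the judge model emits a single integer rating token (the helper below and the 1–5 scale are our illustration, not part of the `lmunit` API): take a softmax over the rating tokens among the first generated token's top-k logprobs and compute the expected score.

```python
import math

def expected_score(logprobs: dict[str, float]) -> float:
    """Probability-weighted average over rating tokens "1".."5".

    `logprobs` maps candidate tokens to log-probabilities, e.g. the
    top-k logprobs returned for the first generated token.
    """
    # Keep only the tokens that look like integer ratings.
    ratings = {
        int(tok): lp
        for tok, lp in logprobs.items()
        if tok in {"1", "2", "3", "4", "5"}
    }
    # Renormalize over the surviving candidates, then take the expectation.
    z = sum(math.exp(lp) for lp in ratings.values())
    return sum(r * math.exp(lp) / z for r, lp in ratings.items())

# Illustrative logprobs, not real model output:
print(round(expected_score({"5": -0.2, "4": -1.8, "1": -6.0}), 2))  # → 4.82
```

This yields a continuous score even when the argmax token is a hard rating, which is useful for ranking responses that receive the same integer grade.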

๐Ÿ“ Repository Structure

lmunit/
├── assets/                 # Documentation assets and figures
├── eval/                   # Evaluation scripts and benchmarks
│   ├── eval.py             # Main evaluation script
│   └── reward_bench2.py    # Reward benchmarking utilities
├── lmunit/                 # Core LMUnit package
│   ├── __init__.py         # Package initialization
│   ├── constants.py        # Framework constants
│   ├── lmunit.py           # Main LMUnit class implementation
│   ├── metrics.py          # Evaluation metrics
│   └── tasks.py            # Task definitions and utilities
└── requirements/           # Dependencies
    ├── requirements.txt    # Main dependencies
    └── dev.txt             # Development dependencies

📄 Artifacts

📋 Paper

🤗 HuggingFace Collection

💾 Checkpoints

| Model                | Flask | BiGGen-Bench | Human-Internal | InfoBench | RB    | LFQA  | RB2  |
|----------------------|-------|--------------|----------------|-----------|-------|-------|------|
| LMUnit-LLaMA-3.1-70B | 72.03 | 67.69        | 93.63          | 89.00     | 91.56 | 76.15 | 80.5 |
| LMUnit-Qwen2.5-72B   | 73.85 | 69.56        | 94.44          | 88.67     | 91.13 | 73.85 | 82.1 |

🚀 Quick Start

Installation

pip install lmunit

Run Evaluation Task

To run a specific task with an LMUnit model:

python eval/eval.py --task <task> --model-path <lmunit-model> --tensor-parallel-size <tp-size>

To reproduce the RewardBench 2 results:

python eval/reward_bench2.py --model-path <lmunit-model> --tensor-parallel-size <tp-size>

Reproduce the full LMUnit evaluation suite:

./scripts/run_all_evaluations.sh <model_path> <tensor_parallel_size> [output_dir]
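To run several tasks back-to-back without the shell script, a small Python driver can assemble the same `eval/eval.py` invocations shown above. A sketch under stated assumptions: the task names in `TASKS` are placeholders (substitute the tasks `eval/eval.py` actually accepts), and the CLI flags are the ones documented in this README.

```python
import shlex
import subprocess

MODEL = "ContextualAI/LMUnit-llama3.1-70b"
TP_SIZE = 4
TASKS = ["flask", "biggen_bench", "infobench"]  # placeholder task names

def build_command(task: str) -> list[str]:
    """Assemble the eval.py invocation for one task."""
    return [
        "python", "eval/eval.py",
        "--task", task,
        "--model-path", MODEL,
        "--tensor-parallel-size", str(TP_SIZE),
    ]

for task in TASKS:
    cmd = build_command(task)
    print(shlex.join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to actually launch
```

Keeping the command construction in one function makes it easy to log the exact invocation before launching it, which helps when debugging tensor-parallel settings across machines.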

Citation

@misc{saadfalcon2024lmunitfinegrainedevaluationnatural,
      title={LMUnit: Fine-grained Evaluation with Natural Language Unit Tests}, 
      author={Jon Saad-Falcon* and Rajan Vivek* and William Berrios* and Nandita Shankar Naik and Matija Franklin and Bertie Vidgen and Amanpreet Singh and Douwe Kiela and Shikib Mehri},
      year={2024},
      eprint={2412.13091},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.13091},
      note={*Equal contribution}
}
