
Language Model Unit Testing Framework


LMUnit: Fine-grained Evaluation with Natural Language Unit Tests


*(Figure: LMUnit workflow)*

This repository provides code for evaluating and reproducing the results in [LMUnit: Fine-grained Evaluation with Natural Language Unit Tests](https://arxiv.org/abs/2412.13091).

🚀 Recent Updates

  • 2025-07: Open-sourced LMUnit checkpoints based on Llama 3.1-70B and Qwen 2.5-72B, along with an evaluation framework for reproducing our direct-scoring and preference-task results.


🔧 LMUnit Inference Example

Install LMUnit

pip install lmunit

Run Inference Example Script

from lmunit import LMUnit
from vllm import SamplingParams

# Initialize LMUnit with your model
model = LMUnit(
    model_path="ContextualAI/LMUnit-llama3.1-70b", 
    tp_size=4
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=10, logprobs=20)
# Define the unit test, query, and response to evaluate
unit_test = "Does the response specify minimum system logs required for compliance?"

query = "What security validation protocols ensure HIPAA compliance when connecting research databases to clinical systems?"



response = """Customer Concentration Risk Comparison:
Snowflake's Platform segment shows higher concentration risk than Product segment:

Product: 15% (Customer X), 8% (Customer Y), 77% others
Platform: 20% (Customer Z), 5% (Customer W), 75% others

HIPAA Technical Controls:
Data Encryption Requirements:

Regulation: 45 CFR § 164.312(a)(2)(iv)
Validation: All data encrypted at rest/transit using FIPS 140-2 algorithms
Testing: Vulnerability scanning and penetration testing for encryption weaknesses

Additional Compliance Measures:

Risk analysis for security threats
Access controls for PHI authorization
Incident response planning
Required logs: encryption key management, data access, security incidents"""

prompt = f"Query: {query}\n\nResponse: {response}\n\nUnit Test: {unit_test}"

output = model.generate(prompt, sampling_params)
print(output)
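The example requests `logprobs=20` in `SamplingParams`, so alternatives for each generated token come back alongside the text. A minimal sketch of one common way to use them, assuming the judge model emits a single integer rating token (the helper below and the 1–5 scale are our illustration, not part of the `lmunit` API): take a softmax over the rating tokens among the first generated token's top-k logprobs and compute the expected score.

```python
import math

def expected_score(logprobs: dict[str, float]) -> float:
    """Probability-weighted average over rating tokens "1".."5".

    `logprobs` maps candidate tokens to log-probabilities, e.g. the
    top-k logprobs returned for the first generated token.
    """
    # Keep only the tokens that look like integer ratings.
    ratings = {
        int(tok): lp
        for tok, lp in logprobs.items()
        if tok in {"1", "2", "3", "4", "5"}
    }
    # Renormalize over the surviving candidates, then take the expectation.
    z = sum(math.exp(lp) for lp in ratings.values())
    return sum(r * math.exp(lp) / z for r, lp in ratings.items())

# Illustrative logprobs, not real model output:
print(round(expected_score({"5": -0.2, "4": -1.8, "1": -6.0}), 2))  # → 4.82
```

This yields a continuous score even when the argmax token is a hard rating, which is useful for ranking responses that receive the same integer grade.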

๐Ÿ“ Repository Structure

lmunit/
├── assets/                 # Documentation assets and figures
├── eval/                   # Evaluation scripts and benchmarks
│   ├── eval.py             # Main evaluation script
│   └── reward_bench2.py    # Reward benchmarking utilities
├── lmunit/                 # Core LMUnit package
│   ├── __init__.py         # Package initialization
│   ├── constants.py        # Framework constants
│   ├── lmunit.py           # Main LMUnit class implementation
│   ├── metrics.py          # Evaluation metrics
│   └── tasks.py            # Task definitions and utilities
└── requirements/           # Dependencies
    ├── requirements.txt    # Main dependencies
    └── dev.txt             # Development dependencies

📄 Artifacts

📋 Paper

🤗 HuggingFace Collection

💾 Checkpoints

| Model                | Flask | BiGGen-Bench | Human-Internal | InfoBench | RB    | LFQA  | RB2  |
|----------------------|-------|--------------|----------------|-----------|-------|-------|------|
| LMUnit-LLaMA-3.1-70B | 72.03 | 67.69        | 93.63          | 89.00     | 91.56 | 76.15 | 80.5 |
| LMUnit-Qwen2.5-72B   | 73.85 | 69.56        | 94.44          | 88.67     | 91.13 | 73.85 | 82.1 |

🚀 Quick Start

Installation

pip install lmunit

Run Evaluation Task

To run a specific task with an LMUnit model:

python eval/eval.py --task <task> --model-path <lmunit-model> --tensor-parallel-size <tp-size>

To reproduce the RewardBench 2 results:

python eval/reward_bench2.py --model-path <lmunit-model> --tensor-parallel-size <tp-size>

Reproduce the full LMUnit evaluation suite:

./scripts/run_all_evaluations.sh <model_path> <tensor_parallel_size> [output_dir]
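To run several tasks back-to-back without the shell script, a small Python driver can assemble the same `eval/eval.py` invocations shown above. A sketch under stated assumptions: the task names in `TASKS` are placeholders (substitute the tasks `eval/eval.py` actually accepts), and the CLI flags are the ones documented in this README.

```python
import shlex
import subprocess

MODEL = "ContextualAI/LMUnit-llama3.1-70b"
TP_SIZE = 4
TASKS = ["flask", "biggen_bench", "infobench"]  # placeholder task names

def build_command(task: str) -> list[str]:
    """Assemble the eval.py invocation for one task."""
    return [
        "python", "eval/eval.py",
        "--task", task,
        "--model-path", MODEL,
        "--tensor-parallel-size", str(TP_SIZE),
    ]

for task in TASKS:
    cmd = build_command(task)
    print(shlex.join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to actually launch
```

Keeping the command construction in one function makes it easy to log the exact invocation before launching it, which helps when debugging tensor-parallel settings across machines.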

Citation

@misc{saadfalcon2024lmunitfinegrainedevaluationnatural,
      title={LMUnit: Fine-grained Evaluation with Natural Language Unit Tests}, 
      author={Jon Saad-Falcon* and Rajan Vivek* and William Berrios* and Nandita Shankar Naik and Matija Franklin and Bertie Vidgen and Amanpreet Singh and Douwe Kiela and Shikib Mehri},
      year={2024},
      eprint={2412.13091},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.13091},
      note={*Equal contribution}
}
