# AutoRubric

A Python library encapsulating best practices for rubric-based evaluation of LLM/VLM outputs against weighted criteria, using LLM-as-a-judge.
## Citation

If you use AutoRubric in your work, please cite:

```bibtex
@misc{rao2026autorubric,
  title={Autorubric: A Unified Framework for Rubric-Based LLM Evaluation},
  author={Delip Rao and Chris Callison-Burch},
  year={2026},
  eprint={2603.00077},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.00077},
}
```
## Installation

```shell
pip install autorubric
```
## Quick Example

```python
import asyncio

from autorubric import Rubric, LLMConfig
from autorubric.graders import CriterionGrader


async def main():
    grader = CriterionGrader(llm_config=LLMConfig(model="openai/gpt-5.1-mini"))
    rubric = Rubric.from_dict([
        {"weight": 10.0, "requirement": "States NMC cell-level energy density in the 250-300 Wh/kg range"},
        {"weight": 8.0, "requirement": "Identifies LFP thermal runaway threshold (~270°C) as higher than NMC (~210°C)"},
        {"weight": 6.0, "requirement": "States LFP cycle life advantage (2000-5000 cycles vs 1000-2000 for NMC)"},
        {"weight": -15.0, "requirement": "Incorrectly claims LFP has higher gravimetric energy density than NMC"},
    ])
    result = await rubric.grade(
        to_grade="""NMC cathodes (LiNixMnyCozO2) achieve 250-280 Wh/kg at the cell level,
        while LFP (LiFePO4) typically reaches 150-205 Wh/kg. However, LFP offers superior
        thermal stability with decomposition onset at ~270°C compared to ~210°C for NMC,
        and delivers 2000-5000 charge cycles versus 1000-2000 for NMC.""",
        grader=grader,
        query="Compare NMC and LFP cathode materials for EV battery applications.",
    )
    print(f"Score: {result.score:.2f}")
    for criterion in result.report:
        print(f"  [{criterion.final_verdict}] {criterion.criterion.requirement}")


asyncio.run(main())
```
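As a rough mental model of how weighted criteria could combine into a score, here is an illustrative sketch (an assumption for intuition only, not AutoRubric's actual formula): met criteria contribute their weight, negative-weight criteria act as penalties when triggered, and the raw sum is normalized by the total positive weight.

```python
# Hypothetical weighted-rubric scoring sketch -- NOT AutoRubric's exact formula.
# Each criterion is a (weight, met) pair; negative weights penalize bad claims.
def weighted_score(criteria: list[tuple[float, bool]]) -> float:
    raw = sum(weight for weight, met in criteria if met)          # sum of triggered weights
    max_positive = sum(weight for weight, _ in criteria if weight > 0)  # best possible total
    return max(0.0, min(1.0, raw / max_positive))                 # clamp to [0, 1]

# All three positive criteria met, penalty criterion not triggered:
print(weighted_score([(10.0, True), (8.0, True), (6.0, True), (-15.0, False)]))  # 1.0
# Same, but the response also makes the penalized false claim: (24 - 15) / 24
print(weighted_score([(10.0, True), (8.0, True), (6.0, True), (-15.0, True)]))   # 0.375
```

Under this toy scheme, the -15.0 criterion can erase most of the credit earned elsewhere, which is the point of negative weights: they make confidently wrong claims expensive.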
## Documentation

Full documentation, an API reference, and a cookbook with several dozen recipes are available at [autorubric.org](https://autorubric.org).

| Resource | Link |
|---|---|
| Project site | [autorubric.org](https://autorubric.org) |
| API reference | [autorubric.org/docs/api](https://autorubric.org/docs/api) |
| Cookbook | [autorubric.org/docs/cookbook](https://autorubric.org/docs/cookbook) |
## Features
| Feature | Description |
|---|---|
| Weighted criteria | Positive and negative weights with explicit requirements |
| Per-criterion explanations | Every verdict includes the judge's reasoning |
| 100+ LLM providers | OpenAI, Anthropic, Google, Azure, Groq, Ollama, and more via LiteLLM |
| Ensemble judging | Combine multiple LLM judges with configurable aggregation strategies |
| Few-shot calibration | Provide labeled examples to improve grading consistency |
| Multi-choice criteria | Ordinal and nominal scales beyond binary met/unmet verdicts |
| Batch evaluation | High-throughput EvalRunner with checkpointing and resumption |
| Metrics & validation | Agreement metrics, bootstrap confidence intervals, distribution analysis |
| Length penalty | Configurable penalty for overly long responses |
| Thinking/reasoning support | Budget-controlled extended thinking for supported models |
| Response caching | Disk-based caching to avoid redundant LLM calls |
| Dataset support | Structured datasets with per-item rubrics, prompts, and ground truth |
| YAML configuration | Define rubrics, LLM configs, and datasets in YAML |
| Meta-rubric evaluation | Evaluate and automatically improve rubric quality |
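The YAML configuration feature suggests rubrics can also be defined declaratively rather than in Python. The fragment below is a hypothetical sketch mirroring the Quick Example; the field names are assumptions, so consult autorubric.org for the actual schema.

```yaml
# Hypothetical rubric definition -- field names assumed, see autorubric.org
# for the real YAML schema supported by the library.
criteria:
  - weight: 10.0
    requirement: States NMC cell-level energy density in the 250-300 Wh/kg range
  - weight: -15.0
    requirement: Incorrectly claims LFP has higher gravimetric energy density than NMC
```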
## License

MIT License - see the LICENSE file for details.
## Acknowledgments

This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA) SciFy program (Agreement No. HR00112520300). The views expressed are those of the authors and do not reflect the official policy or position of the Department of Defense or the U.S. Government.