AutoRubric

A Python library encapsulating best practices for rubric-based evaluation of LLM/VLM outputs against weighted criteria, using LLM-as-a-judge.

If you use AutoRubric in your research, please cite:

  @misc{rao2026autorubric,
        title={Autorubric: A Unified Framework for Rubric-Based LLM Evaluation},
        author={Delip Rao and Chris Callison-Burch},
        year={2026},
        eprint={2603.00077},
        archivePrefix={arXiv},
        primaryClass={cs.CL},
        url={https://arxiv.org/abs/2603.00077},
  }

Installation

pip install autorubric

Quick Example

import asyncio
from autorubric import Rubric, LLMConfig
from autorubric.graders import CriterionGrader

async def main():
    grader = CriterionGrader(llm_config=LLMConfig(model="openai/gpt-5.1-mini"))

    rubric = Rubric.from_dict([
        {"weight": 10.0, "requirement": "States NMC cell-level energy density in the 250-300 Wh/kg range"},
        {"weight": 8.0, "requirement": "Identifies LFP thermal runaway threshold (~270°C) as higher than NMC (~210°C)"},
        {"weight": 6.0, "requirement": "States LFP cycle life advantage (2000-5000 cycles vs 1000-2000 for NMC)"},
        {"weight": -15.0, "requirement": "Incorrectly claims LFP has higher gravimetric energy density than NMC"}
    ])

    result = await rubric.grade(
        to_grade="""NMC cathodes (LiNixMnyCozO2) achieve 250-280 Wh/kg at the cell level,
        while LFP (LiFePO4) typically reaches 150-205 Wh/kg. However, LFP offers superior
        thermal stability with decomposition onset at ~270°C compared to ~210°C for NMC,
        and delivers 2000-5000 charge cycles versus 1000-2000 for NMC.""",
        grader=grader,
        query="Compare NMC and LFP cathode materials for EV battery applications.",
    )

    print(f"Score: {result.score:.2f}")
    for criterion in result.report:
        print(f"  [{criterion.final_verdict}] {criterion.criterion.requirement}")

asyncio.run(main())
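The README does not spell out how the final score is aggregated from per-criterion verdicts. As a rough illustration only (an assumption, not AutoRubric's actual implementation), a common scheme sums the weights of met criteria and normalizes by the total positive weight, so that negative-weight criteria act as penalties:

```python
# Illustrative sketch of weighted rubric scoring -- an assumption,
# not AutoRubric's actual aggregation code.

def weighted_score(criteria, verdicts):
    """criteria: list of (weight, requirement) pairs;
    verdicts: list of booleans (True = criterion met)."""
    max_positive = sum(w for w, _ in criteria if w > 0)
    earned = sum(w for (w, _), met in zip(criteria, verdicts) if met)
    # Clamp so penalty criteria cannot push the score below zero.
    return max(0.0, earned / max_positive)

criteria = [
    (10.0, "States NMC energy density range"),
    (8.0, "Identifies thermal runaway thresholds"),
    (6.0, "States LFP cycle life advantage"),
    (-15.0, "Incorrectly claims LFP has higher energy density"),
]

# All positive criteria met, penalty criterion not triggered:
print(weighted_score(criteria, [True, True, True, False]))  # → 1.0
# Penalty triggered: (10 + 8 + 6 - 15) / 24 = 0.375
print(weighted_score(criteria, [True, True, True, True]))   # → 0.375
```

Under this scheme the example rubric above caps at 1.0, and tripping the -15.0 misconception criterion costs more than any single positive criterion is worth.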

Documentation

Full documentation, API reference, and a cookbook with several dozen recipes are available at autorubric.org.

| Resource | Link |
| --- | --- |
| Project site | autorubric.org |
| API reference | autorubric.org/docs/api |
| Cookbook | autorubric.org/docs/cookbook |

Features

| Feature | Description |
| --- | --- |
| Weighted criteria | Positive and negative weights with explicit requirements |
| Per-criterion explanations | Every verdict includes the judge's reasoning |
| 100+ LLM providers | OpenAI, Anthropic, Google, Azure, Groq, Ollama, and more via LiteLLM |
| Ensemble judging | Combine multiple LLM judges with configurable aggregation strategies |
| Few-shot calibration | Provide labeled examples to improve grading consistency |
| Multi-choice criteria | Ordinal and nominal scales beyond binary met/unmet verdicts |
| Batch evaluation | High-throughput EvalRunner with checkpointing and resumption |
| Metrics & validation | Agreement metrics, bootstrap confidence intervals, distribution analysis |
| Length penalty | Configurable penalty for overly long responses |
| Thinking/reasoning support | Budget-controlled extended thinking for supported models |
| Response caching | Disk-based caching to avoid redundant LLM calls |
| Dataset support | Structured datasets with per-item rubrics, prompts, and ground truth |
| YAML configuration | Define rubrics, LLM configs, and datasets in YAML |
| Meta-rubric evaluation | Evaluate and automatically improve rubric quality |
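The YAML-configuration feature suggests rubrics can also be defined declaratively instead of via `Rubric.from_dict`. A hypothetical file layout, mirroring the `weight`/`requirement` keys from the quick example (the surrounding structure is an assumption, not the library's documented schema):

```yaml
# Hypothetical rubric file: only the `weight` and `requirement` keys are
# taken from the quick example above; the rest is illustrative.
criteria:
  - weight: 10.0
    requirement: States NMC cell-level energy density in the 250-300 Wh/kg range
  - weight: -15.0
    requirement: Incorrectly claims LFP has higher gravimetric energy density than NMC
```

See the cookbook at autorubric.org for the actual YAML schema used by the library.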

License

MIT License - see LICENSE file for details.

Acknowledgments

This research was developed with funding from the Defense Advanced Research Projects Agency's (DARPA) SciFy program (Agreement No. HR00112520300). The views expressed are those of the authors and do not reflect the official policy or position of the Department of Defense or the U.S. Government.
