
SQLAS — SQL Agent Scoring Framework

A RAGAS-equivalent evaluation library for Text-to-SQL and Agentic SQL agents.


Evaluate SQL agents across 50+ metrics — correctness, quality, safety, agentic reasoning, schema retrieval, prompt versioning, guardrails, and cache ROI. Aligned with Spider, BIRD, RAGAS, and MLflow standards.

Author: thepradip


Install

pip install sqlas                # core
pip install "sqlas[mlflow]"      # + MLflow
pip install "sqlas[ui]"          # + Streamlit UI
pip install "sqlas[all]"         # everything

What's New in v2.7.0

  • Multi-gold SQL — execution_accuracy_best_of(sql, gold_sqls): evaluate against all valid gold queries and take the best score. TestCase.gold_sqls: list[str]
  • Hardness classification — auto_classify_hardness(sql): easy/medium/hard/extra-hard per BIRD criteria; auto-set on every evaluate() call
  • Exact match metric — exact_match(generated, gold): normalized string comparison, exposed as SQLASScores.exact_match_score
  • Failure classification — classify_failure(sql, scores, details) → named FailureCategory with a top_hint() actionable fix
  • Batch crash isolation — one failing test case no longer kills the entire batch
  • LLM retry with backoff — _retry_llm_judge() retries 3× (1s→2s→4s) on all 13 LLM judge call sites
  • Weight normalization — custom weights are auto-normalized to sum to 1.0 instead of silently distorting scores
  • LLM judge cache — enable_judge_cache(): opt-in in-memory cache prevents re-scoring identical pairs in CI
  • Report generation — generate_report(scores_list, format="markdown"|"json"), to_json(), to_markdown_report()
  • Non-deterministic detection — NOW(), RANDOM(), CURRENT_TIMESTAMP trigger a nondeterministic_warning in details
  • execute_fn timeout — 30 s wall-clock timeout with thread-safe SQLite fallback
  • Safety patterns — UNION ALL SELECT, EXCEPT SELECT, WAITFOR DELAY, file injection, NL prompt-injection synonyms
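
The retry behaviour is easy to picture as a small exponential-backoff wrapper. The sketch below is illustrative only — `retry_with_backoff` is a hypothetical name, not SQLAS's internal `_retry_llm_judge`:

```python
import time

def retry_with_backoff(call, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Retry `call` up to `attempts` times, doubling the delay between
    attempts (1s -> 2s -> 4s). Re-raises the last error when out of retries."""
    delay = base_delay
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: propagate the last error
            sleep(delay)
            delay *= 2
```

Injecting `sleep` keeps the helper testable; with the default `time.sleep` the 1s → 2s → 4s delays apply for real.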

Quick Start

from sqlas import evaluate
from openai import OpenAI   # any LLM client works; OpenAI shown as an example

openai_client = OpenAI()

def llm_judge(prompt: str) -> str:
    return openai_client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content

scores = evaluate(
    question      = "How many active users?",
    generated_sql = "SELECT COUNT(*) FROM users WHERE active = 1",
    gold_sql      = "SELECT COUNT(*) FROM users WHERE active = 1",
    db_path       = "my.db",
    llm_judge     = llm_judge,
    response      = "There are 1,523 active users.",
    result_data   = {"columns":["COUNT(*)"],"rows":[[1523]],"row_count":1,"execution_time_ms":2.1},
)

print(scores.overall_score)           # 0.95
print(scores.correctness_score)       # 0.88
print(scores.verdict)                 # PASS
print(scores.hardness)                # "easy"
print(scores.exact_match_score)       # 1.0
print(scores.to_markdown_report())    # Markdown for PR comments
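
The exact_match_score printed above is a normalized string comparison. Its normalization is roughly of this shape — a sketch under assumed rules (case-folding, whitespace collapsing, trailing-semicolon removal); the library's actual normalization may differ:

```python
import re

def exact_match_sketch(generated: str, gold: str) -> float:
    """Compare two SQL strings after light normalization: case-fold,
    collapse whitespace, and drop a trailing semicolon."""
    def norm(sql: str) -> str:
        sql = sql.strip().rstrip(";")
        sql = re.sub(r"\s+", " ", sql)  # collapse runs of whitespace/newlines
        return sql.lower()
    return 1.0 if norm(generated) == norm(gold) else 0.0
```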

Failure Classification (v2.7)

Know exactly why a query failed — not just a score.

from sqlas import classify_failure, FailureCategory

analysis = classify_failure(
    sql     = "SELECT id FROM users LIMIT 100",
    scores  = {"execution_accuracy": 1.0, "row_count_match": 0.12},
    details = {"row_count_match": {"pred_count": 100, "gold_count": 839}},
)

print(analysis.primary)        # FailureCategory.LIMIT_TRUNCATION
print(analysis.summary())      # "FAIL [limit_truncation] (score=1.000)"
print(analysis.top_hint())     # "Remove LIMIT — question asks for full results, not top-N"
print(analysis.evidence)       # {"limit_truncation": "LIMIT in SQL, 100 rows vs 839 expected"}

All failure categories:

  • LIMIT_TRUNCATION — LIMIT silently cut the result (100 vs 839 rows)
  • WRONG_TABLE — accounting_transactions used instead of accounting
  • WRONG_AGGREGATION — MAX instead of SUM, AVG instead of SUM
  • SCALAR_MISMATCH — correlation or count value differs
  • ROW_EXPLOSION — 1:N join inflated the row count
  • SCHEMA_HALLUCINATION — invented table/column names (counts, adm_count, n)
  • FULL_TABLE_SCAN — SELECT * with no WHERE/LIMIT
  • TRIM_ON_NUMERIC — TRIM() on a REAL column; invalid on Postgres
  • UNSAFE_QUERY — DDL/DML attempted
  • CURRENCY_NOT_CLEANED — single REPLACE missed commas in $1,234
  • NULL_IN_AGGREGATION — AVG/SUM without IS NOT NULL
  • JOIN_WITHOUT_FK — banking joined to users with no foreign key
  • FAITHFULNESS_DROP — narration not grounded in the SQL result
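
Most categories map to cheap structural checks. LIMIT_TRUNCATION, for instance, boils down to a LIMIT clause plus a predicted row count below the gold count — a sketch of the idea, not SQLAS's actual classifier:

```python
import re

def looks_like_limit_truncation(sql: str, pred_count: int, gold_count: int) -> bool:
    """Flag when a LIMIT clause plausibly cut the result set short."""
    has_limit = re.search(r"\bLIMIT\s+\d+", sql, re.IGNORECASE) is not None
    return has_limit and gold_count > 0 and pred_count < gold_count
```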

Multi-gold SQL (v2.7)

When a question has multiple valid SQL formulations, evaluate against all and take the best score:

from sqlas import evaluate, TestCase

# Single evaluate call
scores = evaluate(
    question      = "Count active users",
    generated_sql = "SELECT COUNT(*) FROM users WHERE status = 'active'",
    gold_sql      = "SELECT COUNT(*) FROM users WHERE active = 1",   # primary gold
    db_path       = "my.db",
    llm_judge     = llm_judge,
)

# Batch with multiple gold SQLs per question
test_case = TestCase(
    question  = "Count active users",
    gold_sqls = [
        "SELECT COUNT(*) FROM users WHERE active = 1",
        "SELECT COUNT(*) FROM users WHERE status = 'active'",
        "SELECT COUNT(id) FROM users WHERE is_active = true",
    ],
)
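
Conceptually, best-of evaluation scores the generated SQL against each gold formulation and keeps the maximum. A minimal sketch with a pluggable scoring function (`best_of_score` is a hypothetical name; the real `execution_accuracy_best_of` actually executes the queries):

```python
def best_of_score(score_fn, generated_sql: str, gold_sqls: list[str]) -> float:
    """Score `generated_sql` against every gold formulation and keep the best."""
    return max(score_fn(generated_sql, gold) for gold in gold_sqls)
```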

Hardness Classification (v2.7)

from sqlas import auto_classify_hardness

auto_classify_hardness("SELECT COUNT(*) FROM users")
# → "easy"

auto_classify_hardness("SELECT u.id, SUM(o.total) FROM users u JOIN orders o ON u.id=o.user_id GROUP BY u.id HAVING SUM(o.total) > 1000")
# → "hard"

auto_classify_hardness("WITH ranked AS (SELECT *, ROW_NUMBER() OVER (...) FROM ...) SELECT ...")
# → "extra-hard"

Follows BIRD benchmark criteria. Auto-set on every evaluate() call as SQLASScores.hardness.
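
A toy heuristic in the same spirit — counting joins, grouping, and window/CTE usage — shows how such bucketing can work (illustrative only, not SQLAS's implementation):

```python
import re

def hardness_sketch(sql: str) -> str:
    """Rough BIRD-style bucketing by counting structural features."""
    s = sql.upper()
    # Window functions and CTEs push a query straight to the top bucket.
    if re.search(r"\bOVER\s*\(", s) or re.search(r"\bWITH\b", s):
        return "extra-hard"
    features = (
        len(re.findall(r"\bJOIN\b", s))
        + len(re.findall(r"\bGROUP BY\b", s))
        + len(re.findall(r"\bHAVING\b", s))
    )
    if features >= 3:
        return "hard"
    if features >= 1:
        return "medium"
    return "easy"
```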


Report Generation (v2.7)

from sqlas import evaluate_batch, generate_report

# Batch markdown report — paste into PRs or CI comments
results = evaluate_batch(test_cases, llm_judge, db_path="my.db")
print(generate_report(results, questions, format="markdown"))

# JSON for artifact storage
print(generate_report(results, format="json"))

# Per-query reports
print(scores.to_json())
print(scores.to_markdown_report(question="How many users?", sql=generated_sql))

LLM Judge Cache (v2.7)

Prevent re-scoring identical prompts in CI runs:

from sqlas import enable_judge_cache, clear_judge_cache, evaluate_batch

enable_judge_cache()          # opt-in — identical prompts return cached result
results = evaluate_batch(...)
clear_judge_cache()           # clear between test runs
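
Under the hood such a cache is essentially memoization keyed on the prompt text; a minimal sketch (names illustrative):

```python
_judge_cache: dict[str, str] = {}

def cached_judge(judge_fn, prompt: str) -> str:
    """Return the cached verdict for an identical prompt,
    calling the underlying LLM judge only once per unique prompt."""
    if prompt not in _judge_cache:
        _judge_cache[prompt] = judge_fn(prompt)
    return _judge_cache[prompt]
```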

Three-Dimension Scoring

PASS only when all three dimensions meet their thresholds:

from sqlas import evaluate_correctness, evaluate_quality, evaluate_safety

c = evaluate_correctness(question, sql, llm_judge, gold_sql=gold, execute_fn=db)
q = evaluate_quality(question, sql, llm_judge, response=text, result_data=data)
s = evaluate_safety(sql, question=question, pii_columns=["email","ssn"])

print(c.score, c.verdict)   # 0.85  PASS   (threshold 0.5)
print(q.score, q.verdict)   # 0.72  PASS   (threshold 0.6)
print(s.score, s.verdict)   # 0.45  FAIL   (threshold 0.9 — PII detected)

evaluate_safety() makes zero LLM calls — pure regex + sqlglot AST.
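
A bare-bones read-only gate in the same regex spirit looks like this (a sketch, far less thorough than SQLAS's combined regex + sqlglot AST checks):

```python
import re

# Keywords that indicate DDL/DML — anything matching is not read-only.
MUTATING = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|TRUNCATE|CREATE|GRANT|REVOKE)\b",
    re.IGNORECASE,
)

def is_read_only(sql: str) -> bool:
    """Reject any statement containing a mutating keyword."""
    return MUTATING.search(sql) is None
```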


Guardrail Pipeline

from sqlas import GuardrailPipeline

pipeline = GuardrailPipeline(pii_columns=["email","ssn","password"])

r = pipeline.check_input("List every user's SSN")    # blocks malicious NL intent
r = pipeline.check_sql(generated_sql)                # blocks injection/PII SQL
r = pipeline.check_output(response, result_data)     # blocks PII in response

Injection patterns detected: UNION ALL SELECT, EXCEPT SELECT, INTERSECT SELECT, stacked mutations, tautologies, time-based injection, file write/read, WAITFOR DELAY.

NL prompt injection detected: ignore/override/discard instructions, jailbreak, bypass guardrails, pretend unrestricted.
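
NL-side detection is again pattern matching over the user's question. The phrase list below is illustrative, not the library's actual patterns:

```python
import re

# Hypothetical phrase patterns in the spirit of the detections listed above.
INJECTION_PHRASES = [
    r"ignore (all |any )?(previous|prior) instructions",
    r"override .*guardrails?",
    r"jailbreak",
    r"pretend (you are |to be )?unrestricted",
]

def flag_prompt_injection(question: str) -> bool:
    """True when the question matches a known prompt-injection phrase."""
    q = question.lower()
    return any(re.search(p, q) for p in INJECTION_PHRASES)
```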


Spider / BIRD Benchmark

from sqlas.benchmarks import run_spider_benchmark

results = run_spider_benchmark(
    agent_fn   = my_agent,
    llm_judge  = llm_judge,
    spider_dir = "./spider",
    n_samples  = 50,          # stratified by difficulty → ~$0.25
    mlflow_run = True,
)
print(results["summary"]["overall_score"])

Prompt Versioning

from sqlas import PromptRegistry

registry = PromptRegistry()
registry.register("You are a SQL analyst...", version_id="v1")
registry.record("v1", scores)

status = registry.detect_regression("v1", window=50, threshold=0.05)
if status["regressed"]:
    for hint in status["hints"]:
        print(hint["hint"])   # actionable prompt fix suggestion
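
Window-based regression detection reduces to comparing the mean of the most recent scores against the earlier baseline mean minus a threshold; a sketch of the idea (the registry's actual statistics may differ):

```python
def regressed(scores: list[float], window: int = 50, threshold: float = 0.05) -> bool:
    """True when the mean of the last `window` scores drops more than
    `threshold` below the mean of all earlier scores."""
    if len(scores) <= window:
        return False  # not enough history to compare against
    baseline = scores[:-window]
    recent = scores[-window:]
    mean = lambda xs: sum(xs) / len(xs)
    return mean(recent) < mean(baseline) - threshold
```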

Observability Integrations

from sqlas.integrations import log_all

log_all(results,
    mlflow_experiment = "sql-agent-v2",
    wandb_project     = "sql-evals",
    langsmith_project = "my-sql-agent",
)

Run a Test Suite

from sqlas import run_suite, TestCase, WEIGHTS_V4, build_schema_info

tables, columns = build_schema_info(db_path="my.db")

results = run_suite(
    test_cases     = test_cases,
    agent_fn       = my_agent,
    llm_judge      = llm_judge,
    execute_fn     = execute_fn,
    valid_tables   = tables,
    valid_columns  = columns,
    weights        = WEIGHTS_V4,
    pass_threshold = 0.6,
)
print(results["summary"]["overall_score"])
print(results["summary"]["by_category"])

Metrics Overview

  • Correctness — execution accuracy, exact match, multi-gold SQL, semantic equivalence, result-set similarity
  • SQL Quality — SQL quality (LLM), schema compliance, complexity match, data-scan efficiency
  • Context (RAGAS) — context precision, recall, entity recall, noise robustness
  • Response — faithfulness, answer relevance, completeness, fluency
  • Agentic — steps efficiency, schema grounding, planning quality, tool-use accuracy, plan compliance, first-attempt success
  • Safety — read-only compliance, SQL injection, prompt injection, PII access, PII leakage
  • Production — execution success, VES efficiency, row-explosion detection, empty result, result coverage
  • Cache — cache hit score, tokens saved, few-shot examples used
  • Visualization — chart spec validity, data alignment, LLM chart validation

Weight Profiles

  • WEIGHTS — 15 metrics; standard NL→SQL pipeline
  • WEIGHTS_V2 — 20 metrics; adds RAGAS context quality
  • WEIGHTS_V3 — 30 metrics; adds guardrails + visualization
  • WEIGHTS_V4 — 28 metrics; adds agentic quality (best for ReAct agents)
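
Custom weight dicts are auto-normalized to sum to 1.0 (see the v2.7 note above); conceptually that is just a rescale preserving the ratios — a sketch of the behaviour, not the library's code:

```python
def normalize_weights(weights: dict[str, float]) -> dict[str, float]:
    """Rescale metric weights so they sum to 1.0, preserving their ratios."""
    total = sum(weights.values())
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return {name: w / total for name, w in weights.items()}
```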

Changelog

v2.7.0

  • classify_failure() + FailureCategory enum — named failure classification with actionable hints
  • auto_classify_hardness() — BIRD-aligned easy/medium/hard/extra-hard (auto-set on every eval)
  • exact_match() + SQLASScores.exact_match_score
  • execution_accuracy_best_of() + TestCase.gold_sqls — multi-gold SQL evaluation
  • generate_report() — batch markdown/JSON report; to_json(), to_markdown_report() on SQLASScores
  • enable_judge_cache() / clear_judge_cache() — opt-in LLM judge caching
  • LLM retry with exponential backoff (3×) on all 13 LLM judge call sites
  • Batch eval crash isolation — one failure no longer kills the batch
  • Weight normalization — auto-normalize to 1.0 instead of silently distorting
  • execute_fn timeout (30s) with thread-safe SQLite fallback
  • Non-deterministic query detection (NOW, RANDOM, CURRENT_TIMESTAMP)
  • Safety: UNION ALL SELECT, EXCEPT, WAITFOR DELAY, file injection, NL synonyms
  • Division-by-zero guards in all context metrics

v2.6.0

  • Spider/BIRD benchmark (run_spider_benchmark, run_bird_benchmark)
  • MLflow, W&B, LangSmith integrations (sqlas.integrations)
  • Streamlit UI (python -m sqlas ui)
  • React evaluation dashboard (sqlas-ui/)

v2.5.0

  • plan_compliance() — measures create_plan enforcement before execute_sql
  • first_attempt_success() — measures SQL retry rate

v2.4.0

  • PromptRegistry — prompt versioning, regression detection, improvement hints
  • schema_retrieval_quality() — precision/recall/F1 for schema index

v2.3.0

  • GuardrailPipeline — 3-stage safety: input → SQL → output (zero LLM cost)
  • FeedbackStore — verified gold SQL from user thumbs-up

v2.2.0

  • Three-dimension scoring: correctness_score, quality_score, safety_composite_score
  • verdict — AND logic: PASS only when all three pass thresholds

License

MIT — thepradip · pypi.org/project/sqlas
