
SQLAS — SQL Agent Scoring Framework. Production-grade evaluation for Text-to-SQL and Agentic SQL agents with guardrail, visualization, agentic quality, and cache performance metrics.

Project description

SQLAS — SQL Agent Scoring Framework

A RAGAS-equivalent evaluation library for Text-to-SQL and Agentic SQL agents.


Evaluate SQL agents across 45 metrics — correctness, quality, safety, agentic reasoning, schema retrieval, prompt versioning, and guardrails. Aligned with Spider, BIRD, RAGAS, and MLflow standards.

Author: thepradip


Install

pip install sqlas                # core
pip install "sqlas[mlflow]"      # + MLflow integration

What's New in v2.4.0

  • PromptRegistry — version prompts, compare A/B, detect regressions, get data-driven improvement hints
  • schema_retrieval_quality — measure precision/recall of the schema index: did it return the right tables?
  • evaluate_correctness / evaluate_quality / evaluate_safety — three standalone evaluators: run only what you need
  • GuardrailPipeline — three-stage safety: input → SQL → output (zero LLM cost)
  • FeedbackStore — thumbs-up stores verified gold SQL, auto-improves execution_accuracy
  • Three-dimension verdict — PASS only when correctness + quality + safety ALL pass their thresholds
  • result_coverage — penalises truncated GROUP BY (score 0.3), catching big-dataset evaluation blind spots

Quick Start

from openai import OpenAI
from sqlas import evaluate

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_judge(prompt: str) -> str:
    return openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

scores = evaluate(
    question      = "How many active users are there?",
    generated_sql = "SELECT COUNT(*) FROM users WHERE active = 1",
    gold_sql      = "SELECT COUNT(*) FROM users WHERE active = 1",
    db_path       = "my_database.db",
    llm_judge     = llm_judge,
    response      = "There are 1,523 active users.",
    result_data   = {"columns": ["COUNT(*)"], "rows": [[1523]],
                     "row_count": 1, "execution_time_ms": 2.1},
)

print(scores.overall_score)          # 0.95
print(scores.correctness_score)      # 0.88  (v2.2)
print(scores.quality_score)          # 0.93  (v2.2)
print(scores.safety_composite_score) # 1.00  (v2.2)
print(scores.verdict)                # PASS  (v2.2 — AND logic)
print(scores.summary())

Three-Dimension Scoring (v2.2)

PASS requires all three dimensions to exceed their thresholds. A safe-but-wrong query can no longer slip through as a PASS.

from sqlas import evaluate_correctness, evaluate_quality, evaluate_safety

# Run only the metrics you need — each is fully independent
c = evaluate_correctness(question, sql, llm_judge, gold_sql=gold, execute_fn=db)
q = evaluate_quality(question, sql, llm_judge, response=text, result_data=data)
s = evaluate_safety(sql, question=question, pii_columns=["email","ssn"])

print(c.score, c.verdict)   # 0.85  PASS   (threshold 0.5)
print(q.score, q.verdict)   # 0.72  PASS   (threshold 0.6)
print(s.score, s.verdict)   # 0.45  FAIL   (threshold 0.9 — PII detected)
print(s.issues)             # ["PII_ACCESS: 'email'", "PII_ACCESS: 'ssn'"]
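
The combined verdict from evaluate() is, conceptually, just a conjunction of these per-dimension thresholds:

# Conceptual sketch of the v2.2 AND verdict, not SQLAS internals.
passed = c.score >= 0.5 and q.score >= 0.6 and s.score >= 0.9
print("PASS" if passed else "FAIL")   # FAIL: safety missed its 0.9 bar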

evaluate_safety() makes zero LLM calls — pure regex + sqlglot AST.
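
For intuition, here is the flavor of check an AST pass enables, assuming sqlglot is installed. This is an illustrative sketch, not SQLAS's actual implementation:

import sqlglot
from sqlglot import exp

def naive_safety_check(sql: str, pii_columns: set[str]) -> list[str]:
    # Illustrative sketch only: parse once, then inspect the AST.
    issues = []
    tree = sqlglot.parse_one(sql)
    if not isinstance(tree, exp.Select):
        # Read-only flavour: flag any statement that is not a SELECT.
        issues.append(f"NON_READ_ONLY: {tree.key.upper()}")
    for col in tree.find_all(exp.Column):
        # PII flavour: flag every column reference on the deny-list.
        if col.name.lower() in pii_columns:
            issues.append(f"PII_ACCESS: '{col.name}'")
    return issues

print(naive_safety_check("SELECT email, ssn FROM users", {"email", "ssn"}))
# flags PII_ACCESS for both 'email' and 'ssn'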


Three-Stage Guardrail Pipeline (v2.3)

from sqlas import GuardrailPipeline

pipeline = GuardrailPipeline(pii_columns=["email", "ssn", "password"])

# Stage 1 — before sending to LLM
r = pipeline.check_input("List every user's SSN and password")
if r.blocked: return {"error": r.block_reason}
# → BLOCK: DANGEROUS_INPUT: pii_bulk_request

# Stage 2 — after SQL generation, before execution
r = pipeline.check_sql("SELECT email, password FROM users")
if r.blocked: return {"error": r.block_reason}
# → score=0.80, issues=["PII_ACCESS: 'email'", "PII_ACCESS: 'password'"]

# Stage 3 — before returning response to user
r = pipeline.check_output(response, result_data)
if r.blocked: return {"error": r.block_reason}
# → scans result rows for PII patterns, blocks if found
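
Strung together, the three stages bracket the whole agent call. A sketch, where generate_sql, run_sql, and narrate stand in for your own pipeline:

def guarded_agent(question: str) -> dict:
    # Stage 1: screen the raw question before spending any LLM tokens.
    r = pipeline.check_input(question)
    if r.blocked:
        return {"error": r.block_reason}

    sql = generate_sql(question)

    # Stage 2: vet the generated SQL before it reaches the database.
    r = pipeline.check_sql(sql)
    if r.blocked:
        return {"error": r.block_reason}

    result_data = run_sql(sql)
    response = narrate(question, result_data)

    # Stage 3: scan the narrated answer and raw rows for leaked PII.
    r = pipeline.check_output(response, result_data)
    if r.blocked:
        return {"error": r.block_reason}

    return {"sql": sql, "response": response, "data": result_data}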

Prompt Versioning & Regression Detection (v2.4)

from sqlas import PromptRegistry

registry = PromptRegistry()

# Register versions
registry.register("You are a SQL analyst...", version_id="v1", description="baseline")
registry.register("...Only cite exact numbers from the SQL result.", version_id="v2", description="grounding fix")

# Record scores after each evaluation
scores = evaluate(...)
registry.record("v2", scores)

# Compare versions
comp = registry.compare("v1", "v2")
print(comp["winner"])           # "v2"
print(comp["delta_overall"])    # +0.09
print(comp["improvements"])     # [{"metric": "faithfulness", "delta": "+0.27", ...}]

# Auto-detect regressions
status = registry.detect_regression("v2", window=50, threshold=0.05)
if status["regressed"]:
    for hint in status["hints"]:
        print(f"[{hint['severity']}] {hint['metric']} = {hint['score']}")
        print(f"  Fix: {hint['hint']}")
# [WARNING] faithfulness = 0.61
#   Fix: Add to prompt: 'Only cite exact numbers from the SQL result...'
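
record() accumulates a score history per version, which is what compare() and detect_regression() read from. A sketch of an A/B run over a test set (build_agent is a placeholder for however your app injects the prompt; test_cases as in the suite example below):

PROMPTS = {
    "v1": "You are a SQL analyst...",
    "v2": "...Only cite exact numbers from the SQL result.",
}

for version_id, prompt in PROMPTS.items():
    agent = build_agent(prompt)   # placeholder: your agent, built on this prompt
    for case in test_cases:
        out = agent(case.question)
        registry.record(version_id, evaluate(
            question      = case.question,
            generated_sql = out["sql"],
            gold_sql      = case.gold_sql,
            llm_judge     = llm_judge,
            response      = out["response"],
            result_data   = out["data"],
        ))

print(registry.compare("v1", "v2")["winner"])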

Schema Retrieval Quality (v2.4)

Measures whether the schema index returned the right tables for a query — not just whether the SQL used valid tables.

from sqlas import schema_retrieval_quality

score, details = schema_retrieval_quality(
    retrieved_tables = schema_index.retrieve(question),   # what index returned
    generated_sql    = agent_sql,
    gold_tables      = test_case.expected_tables,         # ground truth
)

print(details["precision"])   # 0.50 — 2 of 4 retrieved tables were needed
print(details["recall"])      # 1.00 — both needed tables were retrieved
print(details["irrelevant"])  # ["lab_results", "medications"]
print(details["missing"])     # [] — no JOIN table was dropped

Feedback Loop (v2.3)

Thumbs-up feedback stores verified gold SQL — future evaluations of the same question use it automatically.

from sqlas import FeedbackStore, FeedbackEntry

store = FeedbackStore()

# User gives thumbs up → store as gold SQL
store.store(FeedbackEntry(
    question   = "How many active users?",
    sql        = "SELECT COUNT(*) FROM users WHERE status = 'active'",
    is_correct = True,
    score      = scores.overall_score,
))

# Next evaluation auto-uses stored gold SQL
c = evaluate_correctness(question, agent_sql, llm_judge, feedback_store=store)
# execution_accuracy is now verified (1.0) instead of unverified (0.5)
print(c.details["gold_sql_source"])  # "feedback_store"

Any Database (v2.1)

from sqlas import build_schema_info, run_suite

# Auto-extract schema from any database
tables, columns = build_schema_info(db_path="my.db")               # SQLite
tables, columns = build_schema_info(execute_fn=pg_execute_fn)       # PostgreSQL / Snowflake / BigQuery

results = run_suite(
    test_cases    = test_cases,
    agent_fn      = my_agent,
    llm_judge     = llm_judge,
    execute_fn    = execute_fn,
    valid_tables  = tables,      # 100+ tables — no problem
    valid_columns = columns,
)
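
execute_fn is just a callable that runs SQL and returns rows. A minimal psycopg2-style sketch; the DSN and the rows-as-tuples return shape are assumptions, adapt to your driver:

import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # placeholder DSN

def pg_execute_fn(sql: str):
    # Assumed contract: execute the statement and hand back plain rows.
    with conn.cursor() as cur:
        cur.execute(sql)
        return cur.fetchall()

tables, columns = build_schema_info(execute_fn=pg_execute_fn)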

Run a Test Suite

from sqlas import run_suite, TestCase

test_cases = [
    TestCase(question="How many users signed up this month?",
             gold_sql="SELECT COUNT(*) FROM users WHERE created_at >= '2026-03-01'",
             expected_tables=["users"], category="easy"),
    TestCase(question="Average order value by country",
             gold_sql="SELECT country, AVG(total) FROM orders GROUP BY country",
             expected_tables=["orders"], category="medium"),
]

def my_agent(question: str) -> dict:
    sql = generate_sql(question)
    return {"sql": sql, "response": narrate(sql), "data": execute(sql)}

results = run_suite(
    test_cases     = test_cases,
    agent_fn       = my_agent,
    llm_judge      = llm_judge,
    execute_fn     = execute_fn,
    pass_threshold = 0.6,
    verbose        = True,
)
print(results["summary"]["overall_score"])
print(results["summary"]["by_category"])

Weight Profiles

Profile     Metrics  Best for
WEIGHTS     15       Standard NL→SQL pipeline
WEIGHTS_V2  20       + RAGAS context quality
WEIGHTS_V3  30       + Guardrails + visualization
WEIGHTS_V4  28       + Agentic quality — ReAct agents

RAGAS Mapping

RAGAS               SQLAS                     Notes
Faithfulness        faithfulness              Claims grounded in SQL result
Answer Relevance    answer_relevance          Answers the question
Answer Correctness  execution_accuracy        SQL returns correct results
Context Precision   context_precision         Right schema elements used
Context Recall      context_recall            All required schema elements present
Noise Sensitivity   noise_robustness          Irrelevant schema ignored
(SQLAS-only)        schema_retrieval_quality  Did the index return the right tables?
(SQLAS-only)        result_coverage           Truncated GROUP BY detection
(SQLAS-only)        agentic_score             ReAct planning quality

LLM-Agnostic Judge

# OpenAI
from openai import OpenAI
oai = OpenAI()
def judge(p): return oai.chat.completions.create(model="gpt-4o",
    messages=[{"role":"user","content":p}]).choices[0].message.content

# Anthropic
import anthropic
claude = anthropic.Anthropic()
def judge(p): return claude.messages.create(model="claude-opus-4-7",
    max_tokens=500, messages=[{"role":"user","content":p}]).content[0].text

# Ollama (local, free)
import requests
def judge(p): return requests.post("http://localhost:11434/api/generate",
    json={"model":"llama3","prompt":p,"stream":False}).json()["response"]

Changelog

v2.4.0

  • PromptRegistry — version prompts, compare A/B, detect regressions, get improvement hints
  • schema_retrieval_quality — precision/recall/F1 for schema index evaluation
  • prompt_id + schema_retrieval_* fields on SQLASScores

v2.3.0

  • GuardrailPipeline — 3-stage safety: check_input, check_sql, check_output
  • FeedbackStore + FeedbackEntry — verified gold SQL from user thumbs-up
  • evaluate_correctness/quality/safety — standalone metric evaluators

v2.2.0

  • Three-dimension scoring: correctness_score, quality_score, safety_composite_score
  • verdict — AND logic: PASS only when all three pass thresholds
  • CorrectnessResult, QualityResult, SafetyResult dataclasses

v2.1.0

  • build_schema_info() — auto-extract schema from any DB
  • result_coverage — truncation-aware GROUP BY penalty
  • execution_accuracy capped at 0.5 without gold SQL (was incorrectly 1.0)
  • 100+ table support with focused schema context

v2.0.0

  • Agentic quality: steps_efficiency, schema_grounding, planning_quality, agentic_score
  • Cache metrics: cache_hit_score, tokens_saved_score, few_shot_score
  • WEIGHTS_V4 — 28-metric profile with 10% agentic dimension
  • read_only_compliance upgraded to sqlglot AST

License

MIT — thepradip · pypi.org/project/sqlas

Download files

Download the file for your platform.

Source Distribution

sqlas-2.5.0.tar.gz (67.0 kB)

Uploaded Source

Built Distribution


sqlas-2.5.0-py3-none-any.whl (59.9 kB)

Uploaded Python 3

File details

Details for the file sqlas-2.5.0.tar.gz.

File metadata

  • Download URL: sqlas-2.5.0.tar.gz
  • Size: 67.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.4

File hashes

Hashes for sqlas-2.5.0.tar.gz
Algorithm    Hash digest
SHA256       dac0886143b637c0dfe9ff182bb5856b2b75f8eac58f7a84b7884f61467a070a
MD5          c4b91f886346f4493017b6f8fe312b71
BLAKE2b-256  9ba317c12f0459fd12d62604f5768e5f5a537296d9406048912d6d30f1e03236


File details

Details for the file sqlas-2.5.0-py3-none-any.whl.

File metadata

  • Download URL: sqlas-2.5.0-py3-none-any.whl
  • Size: 59.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.4

File hashes

Hashes for sqlas-2.5.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       ad20543b90c71dd4fa2107f34f0b395fb30655e3cd3010e9b377e02798720bd2
MD5          6e1e55fccf4c4aeb95f0b3e2cfa263a1
BLAKE2b-256  2065e122c6e4a1c7fbebe739852b1dc9e6da104ba17d207e1f95fb0ea3a917ea

