SQLAS — SQL Agent Scoring Framework
A RAGAS-equivalent, production-grade evaluation library for Text-to-SQL and agentic SQL agents.
Evaluate SQL agents across 45 metrics — correctness, quality, safety, agentic reasoning, schema retrieval, prompt versioning, guardrails, visualization, and cache performance. Aligned with Spider, BIRD, RAGAS, and MLflow standards.
Author: thepradip
Install
```bash
pip install sqlas            # core
pip install "sqlas[mlflow]"  # + MLflow integration
```
What's New in v2.4.0
| Feature | Description |
|---|---|
| `PromptRegistry` | Version prompts, compare A/B, detect regressions, get data-driven improvement hints |
| `schema_retrieval_quality` | Measure precision/recall of the schema index — did it return the right tables? |
| `evaluate_correctness/quality/safety` | Three standalone evaluators — run only what you need |
| `GuardrailPipeline` | Three-stage safety: input → SQL → output (zero LLM cost) |
| `FeedbackStore` | Thumbs-up stores verified gold SQL, auto-improves `execution_accuracy` |
| Three-dimension verdict | PASS only when correctness + quality + safety ALL pass their thresholds |
| `result_coverage` | Penalises truncated GROUP BY (score 0.3) — catches big-dataset evaluation blind spots |
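The `result_coverage` row is worth making concrete: a GROUP BY over hundreds of groups that comes back truncated to one page of rows should not count as full coverage. Below is a sketch of that kind of check, built directly on sqlglot; it is an illustration, not SQLAS's internal implementation, and `expected_groups` is a hypothetical input you would derive from a gold query:

```python
import sqlglot
from sqlglot import exp

def truncated_group_by(sql: str, row_count: int, expected_groups: int) -> float:
    """Return 1.0 if a GROUP BY result covers all expected groups, else a penalty."""
    tree = sqlglot.parse_one(sql, dialect="sqlite")
    has_group_by = tree.find(exp.Group) is not None
    if has_group_by and row_count < expected_groups:
        return 0.3  # the truncation penalty described above
    return 1.0

# 200 countries expected, only 50 rows returned, so the query is flagged
print(truncated_group_by(
    "SELECT country, AVG(total) FROM orders GROUP BY country LIMIT 50",
    row_count=50, expected_groups=200,
))  # 0.3
```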
Quick Start
```python
from openai import OpenAI
from sqlas import evaluate

openai_client = OpenAI()

def llm_judge(prompt: str) -> str:
    return openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

scores = evaluate(
    question="How many active users are there?",
    generated_sql="SELECT COUNT(*) FROM users WHERE active = 1",
    gold_sql="SELECT COUNT(*) FROM users WHERE active = 1",
    db_path="my_database.db",
    llm_judge=llm_judge,
    response="There are 1,523 active users.",
    result_data={"columns": ["COUNT(*)"], "rows": [[1523]],
                 "row_count": 1, "execution_time_ms": 2.1},
)

print(scores.overall_score)           # 0.95
print(scores.correctness_score)       # 0.88 (v2.2)
print(scores.quality_score)           # 0.93 (v2.2)
print(scores.safety_composite_score)  # 1.00 (v2.2)
print(scores.verdict)                 # PASS (v2.2 — AND logic)
print(scores.summary())
```
Three-Dimension Scoring (v2.2)
PASS requires all three dimensions to exceed their thresholds. A safe-but-wrong query no longer masquerades as a PASS.
```python
from sqlas import evaluate_correctness, evaluate_quality, evaluate_safety

# Run only the metrics you need — each is fully independent
c = evaluate_correctness(question, sql, llm_judge, gold_sql=gold, execute_fn=db)
q = evaluate_quality(question, sql, llm_judge, response=text, result_data=data)
s = evaluate_safety(sql, question=question, pii_columns=["email", "ssn"])

print(c.score, c.verdict)  # 0.85 PASS (threshold 0.5)
print(q.score, q.verdict)  # 0.72 PASS (threshold 0.6)
print(s.score, s.verdict)  # 0.45 FAIL (threshold 0.9 — PII detected)
print(s.issues)            # ["PII_ACCESS: 'email'", "PII_ACCESS: 'ssn'"]
```
`evaluate_safety()` makes zero LLM calls — pure regex + sqlglot AST.
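For intuition, the sketch below shows the kind of zero-LLM, AST-level checks this implies, written directly against sqlglot. It is illustrative only, not SQLAS's internal code:

```python
import sqlglot
from sqlglot import exp

def is_read_only(sql: str) -> bool:
    """True if the statement contains no write operations (pure AST check)."""
    tree = sqlglot.parse_one(sql)
    writes = list(tree.find_all(exp.Insert, exp.Update, exp.Delete, exp.Drop))
    return not writes

def touches_pii(sql: str, pii_columns: set[str]) -> list[str]:
    """Column names in the query that match a configured PII list."""
    tree = sqlglot.parse_one(sql)
    return [c.name for c in tree.find_all(exp.Column)
            if c.name.lower() in pii_columns]

print(is_read_only("DELETE FROM users"))                  # False
print(touches_pii("SELECT email FROM users", {"email"}))  # ['email']
```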
Three-Stage Guardrail Pipeline (v2.3)
```python
from sqlas import GuardrailPipeline

pipeline = GuardrailPipeline(pii_columns=["email", "ssn", "password"])

# The one-line returns below assume this code runs inside a request
# handler; see the full handler sketch after this block.

# Stage 1 — before sending to LLM
r = pipeline.check_input("List every user's SSN and password")
if r.blocked: return {"error": r.block_reason}
# → BLOCK: DANGEROUS_INPUT: pii_bulk_request

# Stage 2 — after SQL generation, before execution
r = pipeline.check_sql("SELECT email, password FROM users")
if r.blocked: return {"error": r.block_reason}
# → score=0.80, issues=["PII_ACCESS: 'email'", "PII_ACCESS: 'password'"]

# Stage 3 — before returning response to user
r = pipeline.check_output(response, result_data)
if r.blocked: return {"error": r.block_reason}
# → scans result rows for PII patterns, blocks if found
```
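Wired together, the three stages bracket the agent call. A minimal handler sketch, reusing the `pipeline` above; `run_agent` and `execute_and_narrate` are hypothetical stand-ins for your own SQL generation and execution steps:

```python
def guarded_query(question: str) -> dict:
    r = pipeline.check_input(question)                # Stage 1: raw user input
    if r.blocked:
        return {"error": r.block_reason}

    sql = run_agent(question)                         # hypothetical SQL generator
    r = pipeline.check_sql(sql)                       # Stage 2: generated SQL
    if r.blocked:
        return {"error": r.block_reason}

    response, result_data = execute_and_narrate(sql)  # hypothetical execute + narrate
    r = pipeline.check_output(response, result_data)  # Stage 3: final output
    if r.blocked:
        return {"error": r.block_reason}

    return {"response": response, "data": result_data}
```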
Prompt Versioning & Regression Detection (v2.4)
```python
from sqlas import PromptRegistry

registry = PromptRegistry()

# Register versions
registry.register("You are a SQL analyst...", version_id="v1", description="baseline")
registry.register("...Only cite exact numbers from the SQL result.",
                  version_id="v2", description="grounding fix")

# Record scores after each evaluation
scores = evaluate(...)
registry.record("v2", scores)

# Compare versions
comp = registry.compare("v1", "v2")
print(comp["winner"])         # "v2"
print(comp["delta_overall"])  # +0.09
print(comp["improvements"])   # [{"metric": "faithfulness", "delta": "+0.27", ...}]

# Auto-detect regressions
status = registry.detect_regression("v2", window=50, threshold=0.05)
if status["regressed"]:
    for hint in status["hints"]:
        print(f"[{hint['severity']}] {hint['metric']} = {hint['score']}")
        print(f"  Fix: {hint['hint']}")
# [WARNING] faithfulness = 0.61
#   Fix: Add to prompt: 'Only cite exact numbers from the SQL result...'
```
Schema Retrieval Quality (v2.4)
Measures whether the schema index returned the right tables for a query — not just whether the SQL used valid tables.
```python
from sqlas import schema_retrieval_quality

score, details = schema_retrieval_quality(
    retrieved_tables=schema_index.retrieve(question),  # what the index returned
    generated_sql=agent_sql,
    gold_tables=test_case.expected_tables,             # ground truth
)

print(details["precision"])   # 0.50 — 2 of 4 retrieved tables were needed
print(details["recall"])      # 1.00 — both needed tables were retrieved
print(details["irrelevant"])  # ["lab_results", "medications"]
print(details["missing"])     # [] — no JOIN table was dropped
```
Feedback Loop (v2.3)
Thumbs-up feedback stores verified gold SQL — future evaluations of the same question use it automatically.
```python
from sqlas import FeedbackStore, FeedbackEntry

store = FeedbackStore()

# User gives thumbs up → store as gold SQL
store.store(FeedbackEntry(
    question="How many active users?",
    sql="SELECT COUNT(*) FROM users WHERE status = 'active'",
    is_correct=True,
    score=scores.overall_score,
))

# Next evaluation auto-uses stored gold SQL
c = evaluate_correctness(question, agent_sql, llm_judge, feedback_store=store)
# execution_accuracy is now verified (1.0) instead of unverified (0.5)
print(c.details["gold_sql_source"])  # "feedback_store"
```
Any Database (v2.1)
```python
from sqlas import build_schema_info, run_suite

# Auto-extract schema from any database
tables, columns = build_schema_info(db_path="my.db")           # SQLite
tables, columns = build_schema_info(execute_fn=pg_execute_fn)  # PostgreSQL / Snowflake / BigQuery

results = run_suite(
    test_cases=test_cases,
    agent_fn=my_agent,
    llm_judge=llm_judge,
    execute_fn=execute_fn,
    valid_tables=tables,    # 100+ tables — no problem
    valid_columns=columns,
)
```
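For the PostgreSQL case, here is a minimal `pg_execute_fn` sketch using psycopg2. The returned dict mirrors the `result_data` shape from Quick Start; treating that as the contract SQLAS expects is an assumption to verify against your installed version:

```python
import psycopg2

conn = psycopg2.connect("dbname=analytics user=readonly")

def pg_execute_fn(sql: str) -> dict:
    """Execute read-only SQL and return columns/rows in a result_data-style dict."""
    with conn.cursor() as cur:
        cur.execute(sql)
        rows = cur.fetchall()
        return {
            "columns": [d[0] for d in cur.description],
            "rows": [list(r) for r in rows],
            "row_count": len(rows),
        }
```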
Run a Test Suite
```python
from sqlas import run_suite, TestCase

test_cases = [
    TestCase(question="How many users signed up this month?",
             gold_sql="SELECT COUNT(*) FROM users WHERE created_at >= '2026-03-01'",
             expected_tables=["users"], category="easy"),
    TestCase(question="Average order value by country",
             gold_sql="SELECT country, AVG(total) FROM orders GROUP BY country",
             expected_tables=["orders"], category="medium"),
]

def my_agent(question: str) -> dict:
    sql = generate_sql(question)
    return {"sql": sql, "response": narrate(sql), "data": execute(sql)}

results = run_suite(
    test_cases=test_cases,
    agent_fn=my_agent,
    llm_judge=llm_judge,
    execute_fn=execute_fn,
    pass_threshold=0.6,
    verbose=True,
)

print(results["summary"]["overall_score"])
print(results["summary"]["by_category"])
```
Weight Profiles
| Profile | Metrics | Best for |
|---|---|---|
| `WEIGHTS` | 15 | Standard NL→SQL pipeline |
| `WEIGHTS_V2` | 20 | + RAGAS context quality |
| `WEIGHTS_V3` | 30 | + Guardrails + visualization |
| `WEIGHTS_V4` | 28 | + Agentic quality — ReAct agents |
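A profile is presumably selected when evaluating. In the sketch below, both the import location of `WEIGHTS_V4` and the `weights=` parameter name are assumptions rather than documented signatures, so verify against your installed version:

```python
from sqlas import evaluate, WEIGHTS_V4

# Hypothetical call shape; `weights=` is assumed, not confirmed above.
scores = evaluate(
    question=question,
    generated_sql=sql,
    llm_judge=llm_judge,
    weights=WEIGHTS_V4,  # 28-metric profile for ReAct agents
)
```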
RAGAS Mapping
| RAGAS | SQLAS | Notes |
|---|---|---|
| Faithfulness | `faithfulness` | Claims grounded in SQL result |
| Answer Relevance | `answer_relevance` | Answers the question |
| Answer Correctness | `execution_accuracy` | SQL returns correct results |
| Context Precision | `context_precision` | Right schema elements used |
| Context Recall | `context_recall` | All required schema elements present |
| Noise Sensitivity | `noise_robustness` | Irrelevant schema ignored |
| — | `schema_retrieval_quality` | Did the index return the right tables? |
| — | `result_coverage` | Truncated GROUP BY detection |
| — | `agentic_score` | ReAct planning quality |
LLM-Agnostic Judge
```python
# OpenAI
def judge(p):
    return openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": p}],
    ).choices[0].message.content

# Anthropic
def judge(p):
    return anthropic.messages.create(
        model="claude-opus-4-7",
        max_tokens=500,
        messages=[{"role": "user", "content": p}],
    ).content[0].text

# Ollama (local, free)
def judge(p):
    return requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3", "prompt": p, "stream": False},
    ).json()["response"]
```
Changelog
v2.4.0
- `PromptRegistry` — version prompts, compare A/B, detect regressions, get improvement hints
- `schema_retrieval_quality` — precision/recall/F1 for schema index evaluation
- `prompt_id` + `schema_retrieval_*` fields on `SQLASScores`
v2.3.0
- `GuardrailPipeline` — 3-stage safety: `check_input`, `check_sql`, `check_output`
- `FeedbackStore` + `FeedbackEntry` — verified gold SQL from user thumbs-up
- `evaluate_correctness/quality/safety` — standalone metric evaluators
v2.2.0
- Three-dimension scoring: `correctness_score`, `quality_score`, `safety_composite_score`
- `verdict` — AND logic: PASS only when all three pass thresholds
- `CorrectnessResult`, `QualityResult`, `SafetyResult` dataclasses
v2.1.0
- `build_schema_info()` — auto-extract schema from any DB
- `result_coverage` — truncation-aware GROUP BY penalty
- `execution_accuracy` capped at 0.5 without gold SQL (was incorrectly 1.0)
- 100+ table support with focused schema context
v2.0.0
- Agentic quality: `steps_efficiency`, `schema_grounding`, `planning_quality`, `agentic_score`
- Cache metrics: `cache_hit_score`, `tokens_saved_score`, `few_shot_score`
- `WEIGHTS_V4` — 28-metric profile with 10% agentic dimension
- `read_only_compliance` upgraded to sqlglot AST
License
MIT — thepradip · pypi.org/project/sqlas