# SQLAS — SQL Agent Scoring Framework

Production-grade evaluation for Text-to-SQL and agentic SQL agents. A RAGAS-equivalent evaluation library: score SQL agents across 50+ metrics — correctness, quality, safety, agentic reasoning, schema retrieval, prompt versioning, guardrails, visualization, and cache ROI — aligned with Spider, BIRD, RAGAS, and MLflow standards.

Author: thepradip
## Install

```bash
pip install sqlas            # core
pip install "sqlas[mlflow]"  # + MLflow
pip install "sqlas[ui]"      # + Streamlit UI
pip install "sqlas[all]"     # everything
```
## What's New in v2.7.0
| Feature | Description |
|---|---|
| Multi-gold SQL | execution_accuracy_best_of(sql, gold_sqls) — evaluate against all valid gold queries, take best score. TestCase.gold_sqls: list[str] |
| Hardness classification | auto_classify_hardness(sql) → easy/medium/hard/extra-hard per BIRD criteria. Auto-set on every evaluate() call |
| Exact match metric | exact_match(generated, gold) — normalized string comparison. Exposed as SQLASScores.exact_match_score |
| Failure classification | classify_failure(sql, scores, details) → named FailureCategory with top_hint() actionable fix |
| Batch crash isolation | One failing test case no longer kills the entire batch |
| LLM retry with backoff | _retry_llm_judge() retries 3× (1s→2s→4s) on all 13 LLM judge call sites |
| Weight normalization | Custom weights auto-normalized to 1.0 instead of silently distorting scores |
| LLM judge cache | enable_judge_cache() — opt-in in-memory cache prevents re-scoring identical pairs in CI |
| Report generation | generate_report(scores_list, format="markdown" or "json"), to_json(), to_markdown_report() |
| Non-deterministic detection | NOW(), RANDOM(), CURRENT_TIMESTAMP trigger nondeterministic_warning in details |
| execute_fn timeout | 30s wall-clock timeout with thread-safe SQLite fallback |
| Safety patterns | UNION ALL SELECT, EXCEPT SELECT, WAITFOR DELAY, file injection, NL prompt injection synonyms |
## Quick Start

```python
from openai import OpenAI
from sqlas import evaluate

openai_client = OpenAI()

def llm_judge(prompt: str) -> str:
    return openai_client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content

scores = evaluate(
    question = "How many active users?",
    generated_sql = "SELECT COUNT(*) FROM users WHERE active = 1",
    gold_sql = "SELECT COUNT(*) FROM users WHERE active = 1",
    db_path = "my.db",
    llm_judge = llm_judge,
    response = "There are 1,523 active users.",
    result_data = {"columns": ["COUNT(*)"], "rows": [[1523]], "row_count": 1, "execution_time_ms": 2.1},
)

print(scores.overall_score)         # 0.95
print(scores.correctness_score)     # 0.88
print(scores.verdict)               # PASS
print(scores.hardness)              # "easy"
print(scores.exact_match_score)     # 1.0
print(scores.to_markdown_report())  # Markdown for PR comments
```
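If you don't already have a result_data payload, one can be built with sqlite3. A minimal sketch; the run_query helper is ours, not a sqlas API, and it produces the dict shape used above:

```python
import sqlite3
import time

def run_query(db_path: str, sql: str) -> dict:
    # Build a result_data dict in the shape the Quick Start uses.
    conn = sqlite3.connect(db_path)
    try:
        start = time.perf_counter()
        cursor = conn.execute(sql)
        rows = [list(r) for r in cursor.fetchall()]
        elapsed_ms = (time.perf_counter() - start) * 1000
        columns = [c[0] for c in cursor.description] if cursor.description else []
        return {
            "columns": columns,
            "rows": rows,
            "row_count": len(rows),
            "execution_time_ms": round(elapsed_ms, 2),
        }
    finally:
        conn.close()
```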
## Failure Classification (v2.7)

Know exactly why a query failed — not just a score.

```python
from sqlas import classify_failure, FailureCategory

analysis = classify_failure(
    sql = "SELECT id FROM users LIMIT 100",
    scores = {"execution_accuracy": 1.0, "row_count_match": 0.12},
    details = {"row_count_match": {"pred_count": 100, "gold_count": 839}},
)

print(analysis.primary)     # FailureCategory.LIMIT_TRUNCATION
print(analysis.summary())   # "FAIL [limit_truncation] (score=1.000)"
print(analysis.top_hint())  # "Remove LIMIT — question asks for full results, not top-N"
print(analysis.evidence)    # {"limit_truncation": "LIMIT in SQL, 100 rows vs 839 expected"}
```
All failure categories:
| Category | Example trigger |
|---|---|
| `LIMIT_TRUNCATION` | LIMIT silently cut the result (100 vs 839 rows) |
| `WRONG_TABLE` | accounting_transactions used instead of accounting |
| `WRONG_AGGREGATION` | MAX instead of SUM, AVG instead of SUM |
| `SCALAR_MISMATCH` | Correlation or count value differs |
| `ROW_EXPLOSION` | 1:N join inflated the row count |
| `SCHEMA_HALLUCINATION` | Invented table/column names (counts, adm_count, n) |
| `FULL_TABLE_SCAN` | SELECT * with no WHERE/LIMIT |
| `TRIM_ON_NUMERIC` | TRIM() on a REAL column — invalid on Postgres |
| `UNSAFE_QUERY` | DDL/DML attempted |
| `CURRENCY_NOT_CLEANED` | Single REPLACE missed commas in $1,234 |
| `NULL_IN_AGGREGATION` | AVG/SUM without IS NOT NULL |
| `JOIN_WITHOUT_FK` | Banking joined to users with no foreign key |
| `FAITHFULNESS_DROP` | Narration not grounded in the SQL result |
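On batch runs, aggregating categories shows where an agent fails most often; a sketch, assuming you collect the per-query score and detail dicts shown above:

```python
from collections import Counter

from sqlas import classify_failure

# `cases` is a list of (sql, scores_dict, details_dict) tuples gathered
# from your own evaluation run; dict keys follow the example above.
def failure_histogram(cases):
    counts = Counter()
    for sql, scores, details in cases:
        analysis = classify_failure(sql=sql, scores=scores, details=details)
        counts[analysis.primary] += 1
    return counts
```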
## Multi-gold SQL (v2.7)

When a question has multiple valid SQL formulations, evaluate against all of them and take the best score:

```python
from sqlas import evaluate, TestCase

# Single evaluate() call — one primary gold SQL
scores = evaluate(
    question = "Count active users",
    generated_sql = "SELECT COUNT(*) FROM users WHERE status = 'active'",
    gold_sql = "SELECT COUNT(*) FROM users WHERE active = 1",  # primary gold
    db_path = "my.db",
    llm_judge = llm_judge,
)

# Batch with multiple gold SQLs per question
test_case = TestCase(
    question = "Count active users",
    gold_sqls = [
        "SELECT COUNT(*) FROM users WHERE active = 1",
        "SELECT COUNT(*) FROM users WHERE status = 'active'",
        "SELECT COUNT(id) FROM users WHERE is_active = true",
    ],
)
```
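Such test cases feed straight into the batch runner (call shape as in the report section below); with gold_sqls set, each generated query is scored against every gold and the best score is kept:

```python
from sqlas import evaluate_batch

# Sketch: assumes the TestCase list carries whatever other fields your
# pipeline needs; only the call shape shown later in this README is used.
results = evaluate_batch([test_case], llm_judge, db_path="my.db")
```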
## Hardness Classification (v2.7)

```python
from sqlas import auto_classify_hardness

auto_classify_hardness("SELECT COUNT(*) FROM users")
# → "easy"

auto_classify_hardness("SELECT u.id, SUM(o.total) FROM users u JOIN orders o ON u.id=o.user_id GROUP BY u.id HAVING SUM(o.total) > 1000")
# → "hard"

auto_classify_hardness("WITH ranked AS (SELECT *, ROW_NUMBER() OVER (...) FROM ...) SELECT ...")
# → "extra-hard"
```

Follows BIRD benchmark criteria. Auto-set on every evaluate() call as SQLASScores.hardness.
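Since hardness is stamped on every result, per-difficulty breakdowns are one loop away; a sketch assuming results is a list of SQLASScores:

```python
from collections import defaultdict

# Average overall_score per auto-set hardness label.
def score_by_hardness(results):
    buckets = defaultdict(list)
    for scores in results:
        buckets[scores.hardness].append(scores.overall_score)
    return {label: sum(v) / len(v) for label, v in buckets.items()}
```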
## Report Generation (v2.7)

```python
from sqlas import evaluate_batch, generate_report

# Batch markdown report — paste into PRs or CI comments
results = evaluate_batch(test_cases, llm_judge, db_path="my.db")
print(generate_report(results, questions, format="markdown"))  # questions: the NL questions, aligned with results

# JSON for artifact storage
print(generate_report(results, format="json"))

# Per-query reports
print(scores.to_json())
print(scores.to_markdown_report(question="How many users?", sql=generated_sql))
```
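For CI artifacts, plain file writes suffice; a sketch (file names are arbitrary):

```python
from pathlib import Path

# Persist both report formats: markdown for the PR comment, JSON for storage.
Path("eval-report.md").write_text(generate_report(results, format="markdown"))
Path("eval-report.json").write_text(generate_report(results, format="json"))
```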
## LLM Judge Cache (v2.7)

Prevent re-scoring identical prompts in CI runs:

```python
from sqlas import enable_judge_cache, clear_judge_cache

enable_judge_cache()   # opt-in — identical prompts return the cached result
results = evaluate_batch(...)
clear_judge_cache()    # clear between test runs
```
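In a pytest suite this pairs well with an autouse fixture so cached judgments never leak across runs; a sketch, not a sqlas API:

```python
import pytest

from sqlas import clear_judge_cache, enable_judge_cache

@pytest.fixture(autouse=True)
def judge_cache():
    # Cache judge calls within one test, then reset for the next.
    enable_judge_cache()
    yield
    clear_judge_cache()
```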
## Three-Dimension Scoring

PASS only when all three dimensions meet their thresholds:

```python
from sqlas import evaluate_correctness, evaluate_quality, evaluate_safety

c = evaluate_correctness(question, sql, llm_judge, gold_sql=gold, execute_fn=db)
q = evaluate_quality(question, sql, llm_judge, response=text, result_data=data)
s = evaluate_safety(sql, question=question, pii_columns=["email", "ssn"])

print(c.score, c.verdict)  # 0.85 PASS (threshold 0.5)
print(q.score, q.verdict)  # 0.72 PASS (threshold 0.6)
print(s.score, s.verdict)  # 0.45 FAIL (threshold 0.9 — PII detected)
```

evaluate_safety() makes zero LLM calls — pure regex + sqlglot AST.
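The combined verdict is a plain AND over the three dimensions. Spelled out (assuming verdict is the "PASS"/"FAIL" string printed above):

```python
# PASS only when correctness, quality, and safety all pass their thresholds.
overall = "PASS" if all(d.verdict == "PASS" for d in (c, q, s)) else "FAIL"
print(overall)  # FAIL in the example above, since safety failed
```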
## Guardrail Pipeline

```python
from sqlas import GuardrailPipeline

pipeline = GuardrailPipeline(pii_columns=["email", "ssn", "password"])

r = pipeline.check_input("List every user's SSN")  # blocks malicious NL intent
r = pipeline.check_sql(generated_sql)              # blocks injection/PII SQL
r = pipeline.check_output(response, result_data)   # blocks PII in the response
```

Injection patterns detected: UNION ALL SELECT, EXCEPT SELECT, INTERSECT SELECT, stacked mutations, tautologies, time-based injection, file write/read, WAITFOR DELAY.

NL prompt injection detected: ignore/override/discard instructions, jailbreak, bypass guardrails, pretend unrestricted.
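Wired into an agent loop, each stage gates the next. A sketch: my_agent, run_and_narrate, and the .blocked attribute are all assumptions to adapt to your stack and to the actual check result type:

```python
# Hypothetical wrapper around a NL→SQL agent. The `.blocked` attribute is
# an assumed field on the guardrail check result; substitute the real one.
def answer(question: str) -> str:
    if pipeline.check_input(question).blocked:
        return "Request refused by input guardrail."
    sql = my_agent(question)                      # your NL→SQL agent
    if pipeline.check_sql(sql).blocked:
        return "Generated SQL refused by guardrail."
    response, result_data = run_and_narrate(sql)  # your execution + narration
    if pipeline.check_output(response, result_data).blocked:
        return "Response withheld: possible PII leakage."
    return response
```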
## Spider / BIRD Benchmark

```python
from sqlas.benchmarks import run_spider_benchmark

results = run_spider_benchmark(
    agent_fn = my_agent,
    llm_judge = llm_judge,
    spider_dir = "./spider",
    n_samples = 50,  # stratified by difficulty → ~$0.25
    mlflow_run = True,
)
print(results["summary"]["overall_score"])
```
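agent_fn is your own Text-to-SQL callable; a throwaway stub for smoke-testing the harness. The (question, schema) signature is an assumption, not confirmed by this README:

```python
# Hypothetical agent stub; returns SQL text for each benchmark question.
def my_agent(question: str, schema: str = "") -> str:
    return "SELECT 1"  # replace with a real NL-to-SQL model call
```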
## Prompt Versioning

```python
from sqlas import PromptRegistry

registry = PromptRegistry()
registry.register("You are a SQL analyst...", version_id="v1")
registry.record("v1", scores)

status = registry.detect_regression("v1", window=50, threshold=0.05)
if status["regressed"]:
    for hint in status["hints"]:
        print(hint["hint"])  # actionable prompt fix suggestion
```
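Comparing prompt versions is just a second registration against the same registry, reusing the calls above:

```python
# Register a candidate prompt and record scores under its own version id;
# regression detection then runs per version over its own score window.
registry.register("You are a meticulous SQL analyst...", version_id="v2")
registry.record("v2", scores)

for version in ("v1", "v2"):
    status = registry.detect_regression(version, window=50, threshold=0.05)
    print(version, "regressed" if status["regressed"] else "healthy")
```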
## Observability Integrations

```python
from sqlas.integrations import log_all

log_all(
    results,
    mlflow_experiment = "sql-agent-v2",
    wandb_project = "sql-evals",
    langsmith_project = "my-sql-agent",
)
```
## Run a Test Suite

```python
from sqlas import run_suite, TestCase, WEIGHTS_V4, build_schema_info

tables, columns = build_schema_info(db_path="my.db")

results = run_suite(
    test_cases = test_cases,
    agent_fn = my_agent,
    llm_judge = llm_judge,
    execute_fn = execute_fn,
    valid_tables = tables,
    valid_columns = columns,
    weights = WEIGHTS_V4,
    pass_threshold = 0.6,
)
print(results["summary"]["overall_score"])
print(results["summary"]["by_category"])
```
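In CI, the summary score can gate the merge; a small sketch (the 0.6 bar mirrors pass_threshold above, the exit-code convention is ours):

```python
import sys

# Fail the job when the suite average drops below the pass bar.
if results["summary"]["overall_score"] < 0.6:
    print("SQL agent eval below threshold; failing the build.")
    sys.exit(1)
```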
## Metrics Overview
| Dimension | Key Metrics |
|---|---|
| Correctness | Execution accuracy, exact match, multi-gold SQL, semantic equivalence, result set similarity |
| SQL Quality | SQL quality (LLM), schema compliance, complexity match, data scan efficiency |
| Context (RAGAS) | Context precision, recall, entity recall, noise robustness |
| Response | Faithfulness, answer relevance, completeness, fluency |
| Agentic | Steps efficiency, schema grounding, planning quality, tool use accuracy, plan compliance, first attempt success |
| Safety | Read-only compliance, SQL injection, prompt injection, PII access, PII leakage |
| Production | Execution success, VES efficiency, row explosion detection, empty result, result coverage |
| Cache | Cache hit score, tokens saved, few-shot examples used |
| Visualization | Chart spec validity, data alignment, LLM chart validation |
## Weight Profiles

| Profile | Metrics | Best for |
|---|---|---|
| `WEIGHTS` | 15 | Standard NL→SQL pipeline |
| `WEIGHTS_V2` | 20 | + RAGAS context quality |
| `WEIGHTS_V3` | 30 | + Guardrails + visualization |
| `WEIGHTS_V4` | 28 | + Agentic quality (ReAct agents) |
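Custom weight dicts are also accepted and, per the v2.7 notes above, auto-normalized to sum to 1.0. A sketch with hypothetical metric keys; raw values need not sum to 1.0:

```python
# Hypothetical custom profile. Key names here are illustrative; only
# "execution_accuracy" appears verbatim elsewhere in this README.
custom_weights = {
    "execution_accuracy": 3.0,  # weight correctness heaviest
    "faithfulness": 1.0,
    "sql_quality": 1.0,
}
results = run_suite(
    test_cases = test_cases,
    agent_fn = my_agent,
    llm_judge = llm_judge,
    execute_fn = execute_fn,
    weights = custom_weights,   # normalized to 1.0 internally
)
```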
## Changelog

### v2.7.0

- `classify_failure()` + `FailureCategory` enum — named failure classification with actionable hints
- `auto_classify_hardness()` — BIRD-aligned easy/medium/hard/extra-hard (auto-set on every eval)
- `exact_match()` + `SQLASScores.exact_match_score`
- `execution_accuracy_best_of()` + `TestCase.gold_sqls` — multi-gold SQL evaluation
- `generate_report()` — batch markdown/JSON report; `to_json()`, `to_markdown_report()` on SQLASScores
- `enable_judge_cache()` / `clear_judge_cache()` — opt-in LLM judge caching
- LLM retry with exponential backoff (3×) on all 13 LLM judge call sites
- Batch eval crash isolation — one failure no longer kills the batch
- Weight normalization — auto-normalize to 1.0 instead of silently distorting scores
- execute_fn timeout (30s) with thread-safe SQLite fallback
- Non-deterministic query detection (NOW, RANDOM, CURRENT_TIMESTAMP)
- Safety: UNION ALL SELECT, EXCEPT, WAITFOR DELAY, file injection, NL synonyms
- Division-by-zero guards in all context metrics

### v2.6.0

- Spider/BIRD benchmarks (`run_spider_benchmark`, `run_bird_benchmark`)
- MLflow, W&B, LangSmith integrations (`sqlas.integrations`)
- Streamlit UI (`python -m sqlas ui`)
- React evaluation dashboard (`sqlas-ui/`)

### v2.5.0

- `plan_compliance()` — measures `create_plan` enforcement before `execute_sql`
- `first_attempt_success()` — measures SQL retry rate

### v2.4.0

- `PromptRegistry` — prompt versioning, regression detection, improvement hints
- `schema_retrieval_quality()` — precision/recall/F1 for schema index

### v2.3.0

- `GuardrailPipeline` — 3-stage safety: input → SQL → output (zero LLM cost)
- `FeedbackStore` — verified gold SQL from user thumbs-up

### v2.2.0

- Three-dimension scoring: `correctness_score`, `quality_score`, `safety_composite_score`
- `verdict` — AND logic: PASS only when all three pass thresholds
## License

MIT — thepradip · pypi.org/project/sqlas