# trajscore

Production-grade agentic trajectory evaluation for multi-step AI agents.

Score any AI agent run on 6 built-in metrics, detect regressions, stream results, and integrate into CI/CD — with zero vendor lock-in.

```shell
pip install trajscore
```
## Why trajscore?

In 2026, every team building agentic AI faces the same problem: you can't improve what you can't measure. Agents fail in subtle ways — they loop, misuse tools, hallucinate answers unsupported by observations, or take twice as many steps as needed. Yet no single library evaluates full multi-step trajectories with structured, auditable metrics.

trajscore fixes this.
## Quickstart

```python
from trajscore import (
    Trajectory, TrajectoryStep, StepType,
    TrajectoryEvaluator,
)

trajectory = Trajectory(
    trajectory_id="run-001",
    task="What is the capital of France?",
    steps=[
        TrajectoryStep(step_index=0, step_type=StepType.THOUGHT,
                       content="I should look this up."),
        TrajectoryStep(step_index=1, step_type=StepType.TOOL_CALL,
                       content="search", tool_name="search",
                       tool_args={"query": "capital of France"}),
        TrajectoryStep(step_index=2, step_type=StepType.OBSERVATION,
                       content="Paris is the capital of France."),
        TrajectoryStep(step_index=3, step_type=StepType.FINAL_ANSWER,
                       content="The capital of France is Paris."),
    ],
    final_answer="The capital of France is Paris.",
    expected_tools=["search"],
)

evaluator = TrajectoryEvaluator()
score = evaluator.evaluate(trajectory)
print(f"Overall: {score.overall_score:.3f}  Passed: {score.passed}")
print(score.metric_scores)
```
## Built-in Metrics

| Metric | Description |
|---|---|
| `goal_completion` | Did the agent produce a relevant final answer? |
| `tool_accuracy` | Did it use the right tools? (F1 vs `expected_tools`) |
| `step_efficiency` | Did it reach the goal without unnecessary steps? |
| `reasoning_coherence` | Do thoughts lead logically to actions? |
| `loop_detection` | Did the agent repeat actions or thoughts? |
| `answer_faithfulness` | Is the final answer grounded in observations? |
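To make the `tool_accuracy` metric concrete: it scores the overlap between the tools the agent actually called and `expected_tools`, F1-style. A minimal stdlib sketch of that kind of computation (an illustration of the idea, not trajscore's actual implementation — `tool_f1` is a hypothetical helper):

```python
from collections import Counter

def tool_f1(called, expected):
    """F1-style score between called and expected tool names (multiset overlap)."""
    if not called and not expected:
        return 1.0  # nothing expected, nothing called
    if not called or not expected:
        return 0.0
    overlap = sum((Counter(called) & Counter(expected)).values())
    precision = overlap / len(called)
    recall = overlap / len(expected)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For the quickstart trajectory, `tool_f1(["search"], ["search"])` is `1.0`; calling one extra tool, as in `tool_f1(["search", "calc"], ["search"])`, drops precision to 0.5 and the F1 to 2/3.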
## Batch & Async Evaluation

```python
from trajscore import TrajectoryEvaluator

evaluator = TrajectoryEvaluator()

# Synchronous batch
result = evaluator.evaluate_batch(trajectories, max_workers=8)

# Async batch
import asyncio
result = asyncio.run(evaluator.aevaluate_batch(trajectories))

print(f"Pass rate: {result.pass_rate:.1%}")
print(f"Mean score: {result.mean_overall:.3f}")
```
## Advanced Features

### Caching (LRU + TTL + SHA-256)

```python
from trajscore.advanced import TrajectoryCache

cache = TrajectoryCache(max_size=512, ttl=600)
memoized_eval = cache.memoize(evaluator.evaluate)
score = memoized_eval(trajectory)  # cached on second call
print(cache.stats())
```
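The combination named in the heading — an LRU map, per-entry TTL, and SHA-256 cache keys — is a standard pattern. A stdlib sketch of how such a cache can work (this `TTLCache` is a hypothetical illustration, not `TrajectoryCache`'s source):

```python
import hashlib
import json
import time
from collections import OrderedDict

class TTLCache:
    """LRU cache with a time-to-live, keyed by SHA-256 of the JSON-encoded args."""

    def __init__(self, max_size=512, ttl=600):
        self.max_size, self.ttl = max_size, ttl
        self._data = OrderedDict()  # key -> (insert_time, value)
        self.hits = self.misses = 0

    @staticmethod
    def _key(*args, **kwargs):
        blob = json.dumps([args, kwargs], sort_keys=True, default=str)
        return hashlib.sha256(blob.encode()).hexdigest()

    def memoize(self, fn):
        def wrapper(*args, **kwargs):
            key = self._key(*args, **kwargs)
            entry = self._data.get(key)
            if entry is not None and time.monotonic() - entry[0] < self.ttl:
                self.hits += 1
                self._data.move_to_end(key)  # refresh LRU position
                return entry[1]
            self.misses += 1
            value = fn(*args, **kwargs)
            self._data[key] = (time.monotonic(), value)
            self._data.move_to_end(key)
            while len(self._data) > self.max_size:
                self._data.popitem(last=False)  # evict least recently used
            return value
        return wrapper
```

On the second call with identical arguments, the wrapped function is skipped and the cached value is returned until the TTL expires.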
### Evaluation Pipeline

```python
from trajscore.advanced import EvalPipeline

pipeline = (
    EvalPipeline()
    .filter("non_empty", lambda t: len(t.steps) > 0)
    .map("tag_metadata", lambda t: t)
    .with_retry("tag_metadata", retries=2)
)
cleaned = pipeline.run(trajectories)
print(pipeline.audit_log)

# Async
import asyncio
cleaned = asyncio.run(pipeline.arun(trajectories))
```
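The chainable filter/map style with an audit trail is a small functional-pipeline pattern. A stdlib sketch of the same idea (a hypothetical `MiniPipeline`, not `EvalPipeline` itself):

```python
class MiniPipeline:
    """Chainable named filter/map stages with an audit log of item counts."""

    def __init__(self):
        self.stages = []    # (name, kind, fn)
        self.audit_log = []

    def filter(self, name, pred):
        self.stages.append((name, "filter", pred))
        return self  # returning self enables method chaining

    def map(self, name, fn):
        self.stages.append((name, "map", fn))
        return self

    def run(self, items):
        for name, kind, fn in self.stages:
            before = len(items)
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
            self.audit_log.append(f"{name}: {before} -> {len(items)}")
        return items
```

Each stage records how many items it received and passed along, which is what makes the run auditable.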
### Declarative Validation

```python
from trajscore.advanced import TrajectoryValidator, TrajectoryRule

validator = (
    TrajectoryValidator()
    .add_rule(TrajectoryRule("has_steps", lambda t: len(t.steps) > 0, "Need steps"))
    .add_rule(TrajectoryRule("has_task", lambda t: bool(t.task), "Need task"))
)
violations = validator.validate(trajectory)
```
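The rule objects above are essentially (name, predicate, message) triples, and validation is a pass over them collecting failures. A compact stdlib sketch of the pattern (hypothetical names, not the library's code):

```python
from collections import namedtuple

# A rule: a name, a boolean check, and the message reported when the check fails.
Rule = namedtuple("Rule", "name check message")

def validate(obj, rules):
    """Return the messages of every rule that obj fails (empty list = valid)."""
    return [r.message for r in rules if not r.check(obj)]
```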
### Rate Limiter (sync + async)

```python
from trajscore.advanced import RateLimiter

limiter = RateLimiter(rate=10, capacity=10)  # 10 evals/s
if limiter.acquire():
    score = evaluator.evaluate(trajectory)
```
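A `rate`/`capacity` limiter like this is typically a token bucket: bursts up to `capacity` are allowed, and tokens refill at `rate` per second. A stdlib sketch of the mechanism (assumed behavior, not the library's source):

```python
import time

class TokenBucket:
    """Token-bucket limiter: bursts up to `capacity`, refills at `rate` tokens/s."""

    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def acquire(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

`acquire()` is non-blocking: it returns `False` when the bucket is empty, leaving the caller free to skip, queue, or sleep and retry.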
### Budget-Controlled Evaluation

```python
from trajscore.advanced import evaluate_with_budget

scores = evaluate_with_budget(trajectories, evaluator.evaluate, budget_seconds=5.0)
```
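Budget-controlled evaluation boils down to a deadline check before each item, so a slow batch degrades gracefully instead of overrunning. A stdlib sketch of that loop (hypothetical `run_with_budget`, not the library function):

```python
import time

def run_with_budget(items, fn, budget_seconds):
    """Apply fn to items in order, stopping once the wall-clock budget is spent."""
    deadline = time.monotonic() + budget_seconds
    results = []
    for item in items:
        if time.monotonic() >= deadline:
            break  # budget exhausted: return partial results
        results.append(fn(item))
    return results
```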
### Streaming Results

```python
from trajscore.advanced import stream_scores, scores_to_ndjson

for score in stream_scores(trajectories, evaluator.evaluate):
    print(score.trajectory_id, score.overall_score)

# NDJSON stream
for line in scores_to_ndjson(trajectories, evaluator.evaluate):
    print(line)
```
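NDJSON means one compact JSON object per line, so each result can be appended to a file or piped downstream the moment it is produced. A generator sketch of the format (illustrative only; `to_ndjson` is a hypothetical name):

```python
import json

def to_ndjson(records):
    """Yield one compact JSON line per record, suitable for streaming or appending."""
    for rec in records:
        yield json.dumps(rec, sort_keys=True)
```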
### Diff & Regression Tracking

```python
from trajscore.advanced import diff_results, RegressionTracker

tracker = RegressionTracker(window=10)
tracker.record(result_v1)
tracker.record(result_v2)

print(tracker.trend())  # "improving" / "declining" / "stable"

diff = tracker.latest_regression()
print(diff.summary())
print(diff.to_json())
```
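Conceptually, a regression diff compares per-metric means between a baseline run and a candidate run and flags drops beyond a tolerance. A minimal sketch of that comparison (hypothetical `diff_metrics`, not `diff_results` itself):

```python
def diff_metrics(baseline, candidate, tolerance=0.01):
    """Return {metric: delta} for metrics that dropped by more than tolerance.

    Deltas are candidate minus baseline, so regressions come out negative.
    """
    return {
        name: round(candidate[name] - baseline[name], 6)
        for name in baseline
        if name in candidate and baseline[name] - candidate[name] > tolerance
    }
```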
### Observability

```python
from trajscore.advanced import EvaluationProfiler, DriftDetector, EvaluationReport

profiler = EvaluationProfiler()
scored = profiler.profile(evaluator.evaluate)(trajectory)
print(profiler.report())

detector = DriftDetector(threshold=0.05)
detector.set_baseline(result_v1)
print(detector.detect(result_v2))

report = EvaluationReport(result)
print(report.to_json())
print(report.to_csv())
print(report.to_markdown())
```
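The `profile(...)` call above is a decorator pattern: it wraps the evaluate callable and records wall-clock timings around each invocation. A stdlib sketch of that wrapper (a hypothetical `Profiler`, not `EvaluationProfiler`'s code):

```python
import time

class Profiler:
    """Wrap a callable and accumulate per-call wall-clock timings."""

    def __init__(self):
        self.timings = []

    def profile(self, fn):
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                # Record the duration even if fn raises.
                self.timings.append(time.perf_counter() - start)
        return wrapper

    def report(self):
        n = len(self.timings)
        return {
            "calls": n,
            "total_s": round(sum(self.timings), 6),
            "mean_s": round(sum(self.timings) / n, 6) if n else 0.0,
        }
```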
### Audit Log & Cost Ledger

```python
from trajscore.advanced import AuditLog, CostLedger

log = AuditLog()
log.log("eval_start", {"run_id": "ci-42"})

ledger = CostLedger()
ledger.record("t1", tokens=1200, cost_usd=0.024)
print(ledger.summary())
```
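A cost ledger is just an append-only list of per-trajectory entries plus a totals rollup. A stdlib sketch of that shape (hypothetical `Ledger`, not `CostLedger`'s source):

```python
class Ledger:
    """Record per-trajectory token/cost entries and summarize totals."""

    def __init__(self):
        self.entries = []

    def record(self, traj_id, tokens, cost_usd):
        self.entries.append({"id": traj_id, "tokens": tokens, "cost_usd": cost_usd})

    def summary(self):
        return {
            "entries": len(self.entries),
            "total_tokens": sum(e["tokens"] for e in self.entries),
            "total_cost_usd": round(sum(e["cost_usd"] for e in self.entries), 6),
        }
```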
### Live Trajectory Watcher

```python
from trajscore import TrajectoryWatcher, TrajectoryStep, StepType

watcher = TrajectoryWatcher(
    trajectory_id="live-001",
    task="Summarize the paper",
    on_step=lambda step, idx: print(f"Step {idx}: {step.step_type}"),
)
watcher.add_step(TrajectoryStep(step_index=0, step_type=StepType.THOUGHT, content="Reading..."))
trajectory = watcher.finish("Summary complete.")
score = evaluator.evaluate(trajectory)
```
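The watcher is an observer pattern: steps accumulate as they arrive, a callback fires per step, and `finish` freezes the run into a scorable trajectory. A stdlib sketch of the mechanism (hypothetical `Watcher`, returning a plain dict instead of a `Trajectory`):

```python
class Watcher:
    """Collect steps as they arrive and fire an on_step callback for each."""

    def __init__(self, on_step=None):
        self.steps = []
        self.on_step = on_step

    def add_step(self, step):
        self.steps.append(step)
        if self.on_step:
            self.on_step(step, len(self.steps) - 1)  # callback gets step + index

    def finish(self, final_answer):
        # Freeze the accumulated steps into a completed record.
        return {"steps": self.steps, "final_answer": final_answer}
```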
## Installation

```shell
pip install trajscore
```

Python 3.8+ · The only dependency beyond the standard library is pydantic.

## License

MIT