⏱️ TempoEval
A Comprehensive Framework for Evaluating Temporal Reasoning in RAG Systems
Features • Installation • Quick Start • Metrics • Examples • Docs • Citation
🎯 Overview
TempoEval is a state-of-the-art evaluation framework designed specifically for assessing temporal reasoning capabilities in Retrieval-Augmented Generation (RAG) systems. Unlike traditional metrics that only measure relevance, TempoEval provides 16 specialized metrics that evaluate how well your RAG system understands, retrieves, and generates temporally accurate content.
🤔 Why TempoEval?
Traditional RAG evaluation metrics fail to capture temporal nuances:
| Scenario | Traditional Metrics | TempoEval |
|---|---|---|
| Query: "What happened in 2020?" → Retrieved doc about 2019 | ✅ High similarity | ❌ Low temporal precision |
| Answer mentions dates not in context | ✅ Fluent text | ❌ Temporal hallucination detected |
| Cross-period query needs docs from multiple eras | ❌ Partial coverage | ✅ Full temporal coverage measured |
✨ Key Features
📊 Three-Layer Evaluation Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│ Layer 3: REASONING METRICS                                      │
│ └─ Event Ordering • Duration Accuracy • Cross-Period Reasoning  │
├─────────────────────────────────────────────────────────────────┤
│ Layer 2: GENERATION METRICS                                     │
│ └─ Faithfulness • Hallucination • Coherence • Alignment         │
├─────────────────────────────────────────────────────────────────┤
│ Layer 1: RETRIEVAL METRICS                                      │
│ └─ Precision • Recall • NDCG • Coverage • Diversity • MRR       │
└─────────────────────────────────────────────────────────────────┘
```
🔑 Core Capabilities
| Feature | Description |
|---|---|
| 🎯 Focus Time Extraction | Automatically extract temporal focus from queries and documents |
| 📈 16 Specialized Metrics | Comprehensive temporal evaluation across retrieval, generation, and reasoning |
| 🤖 LLM-as-Judge | Use GPT-4, Claude, or other LLMs for nuanced temporal assessment |
| ⚡ Dual-Mode Evaluation | Rule-based (fast) or LLM-based (accurate) metric computation |
| 📊 TempoScore | Unified composite score combining all temporal dimensions |
| 💰 Cost Tracking | Built-in efficiency monitoring for latency and API costs |
| 📦 TEMPO Benchmark | Integrated support for the TEMPO temporal QA benchmark |
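Metrics marked *Optional* in the tables below support both modes: instantiate them plainly for fast rule-based scoring, or hand them an LLM provider for nuanced judging. A minimal sketch of the idea; passing `llm=` to a retrieval metric is an assumption here, by analogy with how the generation metrics accept a provider in the Quick Start:
```python
from tempoeval.metrics import TemporalPrecision
from tempoeval.llm import AzureOpenAIProvider

# Rule-based mode: fast, deterministic, no API calls
fast_precision = TemporalPrecision(use_focus_time=True)

# LLM-based mode: slower but more nuanced; the llm= keyword on a retrieval
# metric is an assumption, mirroring the generation metrics shown below
accurate_precision = TemporalPrecision(llm=AzureOpenAIProvider())
```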
📦 Installation
Via pip (Recommended)
```bash
pip install tempoeval
```
From Source
```bash
git clone https://github.com/DataScienceUIBK/tempoeval.git
cd tempoeval
pip install -e .
```
Optional Dependencies
```bash
# For LLM-based evaluation (recommended)
pip install openai anthropic

# For BM25 retrieval in examples
pip install gensim pyserini

# For TEMPO benchmark loading
pip install datasets huggingface_hub pyarrow
```
🚀 Quick Start
Basic Retrieval Evaluation (No LLM Required)
```python
from tempoeval.metrics import TemporalRecall, TemporalNDCG

# Your retrieval results
retrieved_ids = ["doc_2020", "doc_2019", "doc_2021"]
gold_ids = ["doc_2020", "doc_2021"]

# Compute metrics
recall = TemporalRecall().compute(retrieved_ids=retrieved_ids, gold_ids=gold_ids, k=5)
ndcg = TemporalNDCG().compute(retrieved_ids=retrieved_ids, gold_ids=gold_ids, k=5)

print(f"Temporal Recall@5: {recall:.3f}")
print(f"Temporal NDCG@5: {ndcg:.3f}")
```
Focus Time-Based Evaluation
```python
from tempoeval.core import FocusTime, extract_qft, extract_dft
from tempoeval.metrics import TemporalPrecision

# Extract Focus Time from the query
query = "What happened to Bitcoin in 2017?"
qft = extract_qft(query)  # FocusTime(years={2017})

# Extract Focus Time from documents
documents = [
    "Bitcoin reached $20,000 in December 2017.",
    "Ethereum launched in 2015.",
    "The SegWit upgrade activated in August 2017.",
]
dfts = [extract_dft(doc) for doc in documents]

# Evaluate temporal precision
precision = TemporalPrecision(use_focus_time=True)
score = precision.compute(qft=qft, dfts=dfts, k=3)
print(f"Temporal Precision@3: {score:.3f}")
```
LLM-Based Generation Evaluation
```python
import os

from tempoeval.llm import AzureOpenAIProvider
from tempoeval.metrics import (
    TemporalFaithfulness,
    TemporalHallucination,
    TemporalCoherence,
    TempoScore,
)

# Configure LLM
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://your-endpoint.openai.azure.com/"
os.environ["AZURE_OPENAI_API_KEY"] = "your-api-key"
os.environ["AZURE_DEPLOYMENT_NAME"] = "gpt-4o"
llm = AzureOpenAIProvider()

# Your RAG output
query = "When was Bitcoin pruning introduced?"
contexts = ["Bitcoin Core 0.11.0 was released on July 12, 2015 with pruning support."]
answer = "Bitcoin pruning was introduced in version 0.11.0, released on July 12, 2015."

# Evaluate generation quality
faithfulness = TemporalFaithfulness(llm=llm)
hallucination = TemporalHallucination(llm=llm)
coherence = TemporalCoherence(llm=llm)

print(f"Faithfulness: {faithfulness.compute(answer=answer, contexts=contexts):.3f}")
print(f"Hallucination: {hallucination.compute(answer=answer, contexts=contexts):.3f}")
print(f"Coherence: {coherence.compute(answer=answer):.3f}")

# Compute unified TempoScore
tempo_scorer = TempoScore()
result = tempo_scorer.compute(
    temporal_precision=0.9,
    temporal_recall=0.85,
    temporal_faithfulness=1.0,
    temporal_coherence=1.0,
)
print(f"\n🎯 TempoScore: {result['tempo_weighted']:.3f}")
```
📊 Metrics
Layer 1: Retrieval Metrics
| Metric | Description | LLM Required |
|---|---|---|
| `TemporalPrecision` | % of retrieved docs matching query's temporal focus | Optional |
| `TemporalRecall` | % of relevant temporal docs retrieved | Optional |
| `TemporalNDCG` | Ranking quality with temporal relevance grading | No |
| `TemporalMRR` | Reciprocal rank of first temporally relevant doc | No |
| `TemporalCoverage` | Coverage of required time periods (cross-period) | Yes |
| `TemporalDiversity` | Variety of time periods in retrieved docs | Optional |
| `AnchorCoverage` | Coverage of key temporal anchors | Optional |
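The rule-based retrieval metrics share the `compute(retrieved_ids=..., gold_ids=..., k=...)` interface shown in the Quick Start. A short sketch for `TemporalMRR`, assuming it follows the same signature:
```python
from tempoeval.metrics import TemporalMRR

retrieved_ids = ["doc_2019", "doc_2020", "doc_2021"]
gold_ids = ["doc_2020", "doc_2021"]

# The first temporally relevant doc sits at rank 2, so a standard
# reciprocal-rank computation would yield 0.5 (assumed signature)
mrr = TemporalMRR().compute(retrieved_ids=retrieved_ids, gold_ids=gold_ids, k=5)
print(f"Temporal MRR@5: {mrr:.3f}")
```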
Layer 2: Generation Metrics
| Metric | Description | LLM Required |
|---|---|---|
| `TemporalFaithfulness` | Are temporal claims supported by context? | Yes |
| `TemporalHallucination` | % of fabricated temporal information | Yes |
| `TemporalCoherence` | Internal consistency of temporal statements | Yes |
| `AnswerTemporalAlignment` | Does answer focus on the right time period? | Yes |
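The first three metrics are demonstrated in the Quick Start. For `AnswerTemporalAlignment`, a hedged sketch; the `query=`/`answer=` keywords are assumptions by analogy with the other generation metrics:
```python
from tempoeval.llm import AzureOpenAIProvider
from tempoeval.metrics import AnswerTemporalAlignment

llm = AzureOpenAIProvider()  # reads Azure credentials from the environment

query = "What happened to Bitcoin in 2017?"
answer = "Bitcoin rose from about $1,000 to nearly $20,000 over the course of 2017."

# Assumed signature, mirroring the other LLM-judged generation metrics
alignment = AnswerTemporalAlignment(llm=llm)
print(f"Temporal alignment: {alignment.compute(query=query, answer=answer):.3f}")
```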
Layer 3: Reasoning Metrics
| Metric | Description | LLM Required |
|---|---|---|
| `EventOrdering` | Correctness of event sequence | Yes |
| `DurationAccuracy` | Accuracy of duration/interval claims | Yes |
| `CrossPeriodReasoning` | Quality of comparison across time periods | Yes |
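Reasoning metrics judge the answer's temporal logic rather than individual facts. A sketch for `EventOrdering`, assuming an `answer=`/`contexts=` interface like the Layer 2 metrics:
```python
from tempoeval.llm import AzureOpenAIProvider
from tempoeval.metrics import EventOrdering

llm = AzureOpenAIProvider()

contexts = [
    "Ethereum launched in 2015.",
    "The SegWit upgrade activated in August 2017.",
]
answer = "Ethereum launched first, in 2015; the SegWit upgrade followed in August 2017."

# Assumed signature, mirroring TemporalFaithfulness.compute(answer=..., contexts=...)
ordering = EventOrdering(llm=llm)
print(f"Event ordering: {ordering.compute(answer=answer, contexts=contexts):.3f}")
```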
Composite Metrics
| Metric | Description |
|---|---|
| `TempoScore` | Unified score combining all temporal dimensions |
📁 Project Structure
```
tempoeval/
├── 📦 core/                  # Core components
│   ├── focus_time.py         # Focus Time extraction
│   ├── evaluator.py          # Main evaluation orchestrator
│   ├── config.py             # Configuration management
│   └── result.py             # Result containers
├── 📊 metrics/               # All 16 metrics
│   ├── retrieval/            # Layer 1 metrics
│   ├── generation/           # Layer 2 metrics
│   ├── reasoning/            # Layer 3 metrics
│   └── composite/            # TempoScore
├── 🤖 llm/                   # LLM provider integrations
│   ├── openai_provider.py
│   ├── azure_provider.py
│   └── anthropic_provider.py
├── 📈 datasets/              # Dataset loaders
│   ├── tempo.py              # TEMPO benchmark
│   └── timebench.py          # TimeBench
├── 🔧 guidance/              # Temporal guidance generation
├── ⚡ efficiency/            # Cost & latency tracking
└── 🛠️ utils/                 # Utility functions
```
📚 Examples
We provide comprehensive examples in the `examples/` directory:

| Example | Description | LLM Required |
|---|---|---|
| `01_retrieval_bm25.py` | Basic retrieval evaluation | ❌ |
| `02_rag_generation.py` | RAG generation evaluation | ✅ |
| `03_full_pipeline.py` | Complete RAG pipeline | ✅ |
| `04_tempo_dataset.py` | Using TEMPO benchmark | ❌ |
| `05_cross_period.py` | Cross-period queries | ✅ |
| `06_tempo_hsm_complete.py` | Full HSM evaluation | ✅ |
| `07_generate_guidance.py` | Generate temporal guidance | ✅ |
| `08_pipeline_with_generated_guidance.py` | End-to-end pipeline | ✅ |
Running Examples
```bash
cd examples

# Copy and configure credentials (for LLM examples)
cp .env.example .env
# Edit .env with your API keys

# Run examples
python 01_retrieval_bm25.py   # No LLM needed
python 02_rag_generation.py   # Requires .env
```
🔧 Configuration
Environment Variables (for LLM-based evaluation)
Create a .env file or set environment variables:
```bash
# Azure OpenAI (Recommended)
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_API_KEY=your-api-key
AZURE_DEPLOYMENT_NAME=gpt-4o
AZURE_OPENAI_API_VERSION=2024-05-01-preview

# Or OpenAI
OPENAI_API_KEY=your-openai-key

# Or Anthropic
ANTHROPIC_API_KEY=your-anthropic-key
```
Programmatic Configuration
```python
from tempoeval.core import TempoEvalConfig

config = TempoEvalConfig(
    k_values=[5, 10, 20],      # Evaluation depths
    use_focus_time=True,       # Enable Focus Time extraction
    llm_provider="azure",      # LLM provider
    parallel_requests=10,      # Concurrent LLM calls
)
```
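A config like this would typically be handed to the evaluation orchestrator (`core/evaluator.py` in the project structure above). The class and method names in the sketch below are illustrative assumptions, not a documented API:
```python
from tempoeval.core import TempoEvalConfig

config = TempoEvalConfig(k_values=[5, 10], use_focus_time=True, llm_provider="azure")

# Hypothetical orchestrator usage; check core/evaluator.py for the real names:
# evaluator = TempoEvaluator(config=config)
# report = evaluator.evaluate(queries=queries, retrieved=retrieved, answers=answers)
```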
📈 TEMPO Benchmark
TempoEval includes built-in support for the TEMPO benchmark - a comprehensive temporal QA dataset:
```python
from tempoeval.datasets import load_tempo, load_tempo_documents

# Load queries with temporal annotations
queries = load_tempo(domain="bitcoin", max_samples=100)

# Load corpus documents
documents = load_tempo_documents(domain="bitcoin")

# Available domains: bitcoin, cardano, economics, hsm (History of Science & Medicine)
```
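Loaded queries can feed the retrieval metrics directly. A sketch under the assumption that each query record exposes its text and gold document IDs; the `"question"` and `"gold_ids"` field names are hypothetical, so check the actual schema returned by `load_tempo`:
```python
from tempoeval.datasets import load_tempo
from tempoeval.metrics import TemporalRecall

def my_retriever(question: str) -> list[str]:
    """Placeholder for your own BM25 or dense retriever."""
    return ["doc_2017", "doc_2015"]

queries = load_tempo(domain="bitcoin", max_samples=10)
recall = TemporalRecall()

for q in queries:
    # "question" and "gold_ids" are hypothetical field names
    retrieved_ids = my_retriever(q["question"])
    score = recall.compute(retrieved_ids=retrieved_ids, gold_ids=q["gold_ids"], k=10)
    print(f"Recall@10 = {score:.3f}")
```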
⚡ Efficiency Tracking
Track latency and costs for LLM-based evaluation:
```python
from tempoeval.efficiency import EfficiencyTracker

tracker = EfficiencyTracker(model_name="gpt-4o")

# ... run your evaluation ...

# Get summary
summary = tracker.summary()
print(f"Total Cost: ${summary['total_cost_usd']:.4f}")
print(f"Avg Latency: {summary['avg_latency_ms']:.1f}ms")
```
🧪 Testing
```bash
# Run all tests
pytest tests/ -v

# Run a specific test file
pytest tests/test_core.py -v

# Run with coverage
pytest tests/ --cov=tempoeval --cov-report=html
```
📖 Documentation
Full documentation is available at: https://tempoeval.readthedocs.io/en/latest/
📄 Citation
If you use TempoEval in your research, please cite our paper:
Citation coming soon.
🤝 Contributing
We welcome contributions! Please see our Contributing Guide for details.
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
📜 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
- Built on top of the TEMPO Benchmark
- LLM integrations via OpenAI, Azure OpenAI, and Anthropic
Made with ❤️ for the Temporal IR Community