A comprehensive evaluation framework for temporal IR, QA, and RAG systems

Project description

⏱️ TempoEval

A Comprehensive Framework for Evaluating Temporal Reasoning in RAG Systems

PyPI Version · GitHub Stars · License · Python Version · arXiv

Features · Installation · Quick Start · Metrics · Examples · Docs · Citation


🎯 Overview

TempoEval is a state-of-the-art evaluation framework designed specifically for assessing temporal reasoning capabilities in Retrieval-Augmented Generation (RAG) systems. Unlike traditional metrics that only measure relevance, TempoEval provides 16 specialized metrics that evaluate how well your RAG system understands, retrieves, and generates temporally accurate content.

16 Metrics · 3 Layers · Focus Time

🤔 Why TempoEval?

Traditional RAG evaluation metrics fail to capture temporal nuances:

| Scenario | Traditional Metrics | TempoEval |
|---|---|---|
| Query: "What happened in 2020?" → retrieved doc about 2019 | ✅ High similarity | ❌ Low temporal precision |
| Answer mentions dates not in context | ✅ Fluent text | ❌ Temporal hallucination detected |
| Cross-period query needs docs from multiple eras | ❌ Partial coverage | ✅ Full temporal coverage measured |
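
As a concrete sketch of the first scenario, the Focus Time utilities introduced in the Quick Start below can surface the mismatch that a similarity score hides. The overlap computation here is hand-rolled for illustration and assumes FocusTime exposes its years as a set (as the Quick Start comments suggest):

from tempoeval.core import extract_qft, extract_dft

query = "What happened in 2020?"
document = "A summary of the major events of 2019."

qft = extract_qft(query)      # expected: FocusTime(years={2020})
dft = extract_dft(document)   # expected: FocusTime(years={2019})

# Hand-rolled Jaccard overlap of year sets, for illustration only
overlap = len(qft.years & dft.years) / len(qft.years | dft.years)
print(f"Temporal overlap: {overlap:.2f}")  # 0.00, despite high lexical similarity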

✨ Key Features

📊 Three-Layer Evaluation Architecture

┌─────────────────────────────────────────────────────────────────┐
│  Layer 3: REASONING METRICS                                     │
│  └─ Event Ordering • Duration Accuracy • Cross-Period Reasoning │
├─────────────────────────────────────────────────────────────────┤
│  Layer 2: GENERATION METRICS                                    │
│  └─ Faithfulness • Hallucination • Coherence • Alignment        │
├─────────────────────────────────────────────────────────────────┤
│  Layer 1: RETRIEVAL METRICS                                     │
│  └─ Precision • Recall • NDCG • Coverage • Diversity • MRR      │
└─────────────────────────────────────────────────────────────────┘
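
All three layers are served from the same metrics package, so one pipeline can mix rule-based and LLM-judged metrics freely. A minimal sketch, assuming the reasoning metrics take the same llm= constructor argument as the generation metrics shown in the Quick Start:

from tempoeval.llm import AzureOpenAIProvider
from tempoeval.metrics import TemporalNDCG, TemporalFaithfulness, EventOrdering

llm = AzureOpenAIProvider()  # reads credentials from environment variables

ndcg = TemporalNDCG()                         # Layer 1: rule-based, no LLM needed
faithfulness = TemporalFaithfulness(llm=llm)  # Layer 2: LLM-as-judge
ordering = EventOrdering(llm=llm)             # Layer 3: LLM-as-judge (assumed constructor)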

🔑 Core Capabilities

| Feature | Description |
|---|---|
| 🎯 Focus Time Extraction | Automatically extract temporal focus from queries and documents |
| 📈 16 Specialized Metrics | Comprehensive temporal evaluation across retrieval, generation, and reasoning |
| 🤖 LLM-as-Judge | Use GPT-4, Claude, or other LLMs for nuanced temporal assessment |
| Dual-Mode Evaluation | Rule-based (fast) or LLM-based (accurate) metric computation |
| 📊 TempoScore | Unified composite score combining all temporal dimensions |
| 💰 Cost Tracking | Built-in efficiency monitoring for latency and API costs |
| 📦 TEMPO Benchmark | Integrated support for the TEMPO temporal QA benchmark |
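
Dual-mode evaluation means the same metric can be scored cheaply with rules or more carefully with an LLM judge. A sketch of the idea, under the assumption that metrics marked "Optional" in the tables below accept an llm= argument on top of their rule-based default:

from tempoeval.llm import AzureOpenAIProvider
from tempoeval.metrics import TemporalPrecision

fast = TemporalPrecision()                              # rule-based: no API calls
careful = TemporalPrecision(llm=AzureOpenAIProvider())  # LLM-judged (assumed signature)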

📦 Installation

Via pip (Recommended)

pip install tempoeval

From Source

git clone https://github.com/DataScienceUIBK/tempoeval.git
cd tempoeval
pip install -e .

Optional Dependencies

# For LLM-based evaluation (recommended)
pip install openai anthropic

# For BM25 retrieval in examples
pip install gensim pyserini

# For TEMPO benchmark loading
pip install datasets huggingface_hub pyarrow

🚀 Quick Start

Basic Retrieval Evaluation (No LLM Required)

from tempoeval.metrics import TemporalRecall, TemporalNDCG

# Your retrieval results
retrieved_ids = ["doc_2020", "doc_2019", "doc_2021"]
gold_ids = ["doc_2020", "doc_2021"]

# Compute metrics
recall = TemporalRecall().compute(retrieved_ids=retrieved_ids, gold_ids=gold_ids, k=5)
ndcg = TemporalNDCG().compute(retrieved_ids=retrieved_ids, gold_ids=gold_ids, k=5)

print(f"Temporal Recall@5: {recall:.3f}")
print(f"Temporal NDCG@5: {ndcg:.3f}")

Focus Time-Based Evaluation

from tempoeval.core import FocusTime, extract_qft, extract_dft
from tempoeval.metrics import TemporalPrecision

# Extract Focus Time from query
query = "What happened to Bitcoin in 2017?"
qft = extract_qft(query)  # FocusTime(years={2017})

# Extract Focus Time from documents
documents = [
    "Bitcoin reached $20,000 in December 2017.",
    "Ethereum launched in 2015.",
    "The SegWit upgrade activated in August 2017.",
]
dfts = [extract_dft(doc) for doc in documents]

# Evaluate temporal precision
precision = TemporalPrecision(use_focus_time=True)
score = precision.compute(qft=qft, dfts=dfts, k=3)
print(f"Temporal Precision@3: {score:.3f}")

LLM-Based Generation Evaluation

import os
from tempoeval.llm import AzureOpenAIProvider
from tempoeval.metrics import (
    TemporalFaithfulness,
    TemporalHallucination,
    TemporalCoherence,
    TempoScore
)

# Configure LLM
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://your-endpoint.openai.azure.com/"
os.environ["AZURE_OPENAI_API_KEY"] = "your-api-key"
os.environ["AZURE_DEPLOYMENT_NAME"] = "gpt-4o"

llm = AzureOpenAIProvider()

# Your RAG output
query = "When was Bitcoin pruning introduced?"
contexts = ["Bitcoin Core 0.11.0 was released on July 12, 2015 with pruning support."]
answer = "Bitcoin pruning was introduced in version 0.11.0, released on July 12, 2015."

# Evaluate generation quality
faithfulness = TemporalFaithfulness(llm=llm)
hallucination = TemporalHallucination(llm=llm)
coherence = TemporalCoherence(llm=llm)

print(f"Faithfulness: {faithfulness.compute(answer=answer, contexts=contexts):.3f}")
print(f"Hallucination: {hallucination.compute(answer=answer, contexts=contexts):.3f}")
print(f"Coherence: {coherence.compute(answer=answer):.3f}")

# Compute unified TempoScore
tempo_scorer = TempoScore()
result = tempo_scorer.compute(
    temporal_precision=0.9,
    temporal_recall=0.85,
    temporal_faithfulness=1.0,
    temporal_coherence=1.0
)
print(f"\n🎯 TempoScore: {result['tempo_weighted']:.3f}")

📊 Metrics

Layer 1: Retrieval Metrics

| Metric | Description | LLM Required |
|---|---|---|
| TemporalPrecision | % of retrieved docs matching query's temporal focus | Optional |
| TemporalRecall | % of relevant temporal docs retrieved | Optional |
| TemporalNDCG | Ranking quality with temporal relevance grading | No |
| TemporalMRR | Reciprocal rank of first temporally relevant doc | No |
| TemporalCoverage | Coverage of required time periods (cross-period) | Yes |
| TemporalDiversity | Variety of time periods in retrieved docs | Optional |
| AnchorCoverage | Coverage of key temporal anchors | Optional |

Layer 2: Generation Metrics

| Metric | Description | LLM Required |
|---|---|---|
| TemporalFaithfulness | Are temporal claims supported by context? | Yes |
| TemporalHallucination | % of fabricated temporal information | Yes |
| TemporalCoherence | Internal consistency of temporal statements | Yes |
| AnswerTemporalAlignment | Does answer focus on the right time period? | Yes |

Layer 3: Reasoning Metrics

| Metric | Description | LLM Required |
|---|---|---|
| EventOrdering | Correctness of event sequence | Yes |
| DurationAccuracy | Accuracy of duration/interval claims | Yes |
| CrossPeriodReasoning | Quality of comparison across time periods | Yes |
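
The reasoning metrics' compute signatures are not shown in this README; the sketch below assumes they mirror the compute(answer=..., contexts=...) pattern of the Layer 2 metrics:

from tempoeval.llm import AzureOpenAIProvider
from tempoeval.metrics import EventOrdering

llm = AzureOpenAIProvider()
contexts = ["Bitcoin launched in 2009. The first block reward halving occurred in 2012."]
answer = "Bitcoin launched in 2009, and its first halving followed in 2012."

# Hypothetical call, assuming a generation-style signature
ordering = EventOrdering(llm=llm)
print(f"Event ordering: {ordering.compute(answer=answer, contexts=contexts):.3f}")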

Composite Metrics

| Metric | Description |
|---|---|
| TempoScore | Unified score combining all temporal dimensions |

📁 Project Structure

tempoeval/
├── 📦 core/                    # Core components
│   ├── focus_time.py          # Focus Time extraction
│   ├── evaluator.py           # Main evaluation orchestrator
│   ├── config.py              # Configuration management
│   └── result.py              # Result containers
├── 📊 metrics/                 # All 16 metrics
│   ├── retrieval/             # Layer 1 metrics
│   ├── generation/            # Layer 2 metrics
│   ├── reasoning/             # Layer 3 metrics
│   └── composite/             # TempoScore
├── 🤖 llm/                     # LLM provider integrations
│   ├── openai_provider.py
│   ├── azure_provider.py
│   └── anthropic_provider.py
├── 📈 datasets/                # Dataset loaders
│   ├── tempo.py               # TEMPO benchmark
│   └── timebench.py           # TimeBench
├── 🔧 guidance/                # Temporal guidance generation
├── ⚡ efficiency/              # Cost & latency tracking
└── 🛠️ utils/                   # Utility functions

📚 Examples

We provide comprehensive examples in the examples/ directory:

| Example | Description |
|---|---|
| 01_retrieval_bm25.py | Basic retrieval evaluation |
| 02_rag_generation.py | RAG generation evaluation |
| 03_full_pipeline.py | Complete RAG pipeline |
| 04_tempo_dataset.py | Using TEMPO benchmark |
| 05_cross_period.py | Cross-period queries |
| 06_tempo_hsm_complete.py | Full HSM evaluation |
| 07_generate_guidance.py | Generate temporal guidance |
| 08_pipeline_with_generated_guidance.py | End-to-end pipeline |

Running Examples

cd examples

# Copy and configure credentials (for LLM examples)
cp .env.example .env
# Edit .env with your API keys

# Run examples
python 01_retrieval_bm25.py      # No LLM needed
python 02_rag_generation.py      # Requires .env

🔧 Configuration

Environment Variables (for LLM-based evaluation)

Create a .env file or set environment variables:

# Azure OpenAI (Recommended)
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_API_KEY=your-api-key
AZURE_DEPLOYMENT_NAME=gpt-4o
AZURE_OPENAI_API_VERSION=2024-05-01-preview

# Or OpenAI
OPENAI_API_KEY=your-openai-key

# Or Anthropic
ANTHROPIC_API_KEY=your-anthropic-key
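
If you keep these values in a .env file rather than exporting them in the shell, a loader such as python-dotenv (not a TempoEval dependency; shown here as one option) can populate the environment before the providers read it:

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # copies .env entries from the working directory into os.environ

from tempoeval.llm import AzureOpenAIProvider
llm = AzureOpenAIProvider()  # now sees the AZURE_OPENAI_* variables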

Programmatic Configuration

from tempoeval.core import TempoEvalConfig

config = TempoEvalConfig(
    k_values=[5, 10, 20],           # Evaluation depths
    use_focus_time=True,            # Enable Focus Time extraction
    llm_provider="azure",           # LLM provider
    parallel_requests=10,           # Concurrent LLM calls
)
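
The k_values here presumably set the @k cutoffs used by the retrieval metrics, so a run with this configuration would report scores such as Recall@5, Recall@10, and Recall@20.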

📈 TEMPO Benchmark

TempoEval includes built-in support for the TEMPO benchmark, a comprehensive temporal QA dataset:

from tempoeval.datasets import load_tempo, load_tempo_documents

# Load queries with temporal annotations
queries = load_tempo(domain="bitcoin", max_samples=100)

# Load corpus documents
documents = load_tempo_documents(domain="bitcoin")

# Available domains: bitcoin, cardano, economics, hsm (History of Science & Medicine)
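
The structure of the query records is not documented in this README; the loop below is a sketch that assumes each record exposes the query text and its gold document IDs under hypothetical field names ("query", "gold_ids"), with your own retriever plugged in:

from tempoeval.datasets import load_tempo
from tempoeval.metrics import TemporalRecall

recall_metric = TemporalRecall()
queries = load_tempo(domain="bitcoin", max_samples=100)

scores = []
for q in queries:
    retrieved_ids = my_retriever.search(q["query"], k=10)  # hypothetical retriever
    scores.append(recall_metric.compute(retrieved_ids=retrieved_ids,
                                        gold_ids=q["gold_ids"], k=10))

print(f"Mean Temporal Recall@10: {sum(scores) / len(scores):.3f}")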

⚡ Efficiency Tracking

Track latency and costs for LLM-based evaluation:

from tempoeval.efficiency import EfficiencyTracker

tracker = EfficiencyTracker(model_name="gpt-4o")

# ... run your evaluation ...

# Get summary
summary = tracker.summary()
print(f"Total Cost: ${summary['total_cost_usd']:.4f}")
print(f"Avg Latency: {summary['avg_latency_ms']:.1f}ms")

🧪 Testing

# Run all tests
pytest tests/ -v

# Run specific test file
pytest tests/test_core.py -v

# Run with coverage
pytest tests/ --cov=tempoeval --cov-report=html

📖 Documentation

Full documentation is available at: https://tempoeval.readthedocs.io/en/latest/


📄 Citation

If you use TempoEval in your research, please cite our paper:

Citation details coming soon.

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.


🙏 Acknowledgments


Made with ❤️ for the Temporal IR Community


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tempoeval-0.1.0.tar.gz (95.0 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

tempoeval-0.1.0-py3-none-any.whl (119.8 kB)

Uploaded Python 3

File details

Details for the file tempoeval-0.1.0.tar.gz.

File metadata

  • Download URL: tempoeval-0.1.0.tar.gz
  • Upload date:
  • Size: 95.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for tempoeval-0.1.0.tar.gz

| Algorithm | Hash digest |
|---|---|
| SHA256 | 5d032577b46f6ba6f84f33d5191b847be0732b03a9a5bc9853c06133e481c46e |
| MD5 | 063cfdfa8dc76d55af23e7aa7c9e836f |
| BLAKE2b-256 | aa238f62f87403e4bfc57da5c758237aef6ecd0009317ed45b2c068bdc57dc9c |


Provenance

The following attestation bundles were made for tempoeval-0.1.0.tar.gz:

Publisher: publish.yml on DataScienceUIBK/tempoeval

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file tempoeval-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: tempoeval-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 119.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for tempoeval-0.1.0-py3-none-any.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | 36a542f8f3b8df4dfa1f0c0912ba7a50ad39206533f7f6c26cb900cda4cf1657 |
| MD5 | 28089ecf6a004ad415c68ed4b62446fb |
| BLAKE2b-256 | 0342ad08df8f59090c666ee95ef88b32dda8ab43d3f002ebb0e7bb31261e7b8a |


Provenance

The following attestation bundles were made for tempoeval-0.1.0-py3-none-any.whl:

Publisher: publish.yml on DataScienceUIBK/tempoeval

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
