Daily Research Digest
AI-powered research paper digest with LLM-based ranking, quality scoring, and automatic scheduling. Fetches papers from Semantic Scholar and HuggingFace Daily Papers with author credibility signals.
Features
- Multi-source paper fetching: Semantic Scholar (with author h-index) and HuggingFace Daily Papers (with upvotes)
- Quality-aware ranking: Combines LLM relevance scores with author h-index and community upvotes
- Smart deduplication: Merges quality signals when papers appear in multiple sources
- LLM-powered relevance: Rank papers using Anthropic, OpenAI, or Google models
- Automatic scheduling: Background scheduler for daily digest generation
- Email delivery: SMTP-based digest emails (GitHub Actions compatible)
- JSON storage: Date-based organization for easy retrieval
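Smart deduplication merges quality signals when the same paper surfaces in both sources. A minimal sketch of that idea (illustrative only; `merge_papers` is a hypothetical helper, not the package API):

```python
def merge_papers(a: dict, b: dict) -> dict:
    """Merge two records for the same paper, keeping quality signals from both.

    Illustrative sketch only; the package's actual dedup logic may differ.
    """
    merged = {**a}
    # Take each quality signal from whichever source supplied it
    merged["author_h_indices"] = a.get("author_h_indices") or b.get("author_h_indices")
    merged["huggingface_upvotes"] = a.get("huggingface_upvotes") or b.get("huggingface_upvotes")
    return merged

# A Semantic Scholar record and a HuggingFace record for the same paper:
ss = {"title": "Paper", "author_h_indices": [45, 32]}
hf = {"title": "Paper", "huggingface_upvotes": 156}
merged = merge_papers(ss, hf)
```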
What's New in v0.2.0
- Quality Scoring Pipeline: Papers are now ranked using a combined quality score that factors in:
  - LLM relevance score (1-10)
  - Author h-index (from Semantic Scholar)
  - Community upvotes (from HuggingFace)
- Simplified Sources: Streamlined to Semantic Scholar + HuggingFace (Semantic Scholar indexes arXiv papers)
- New Paper Fields: author_h_indices, huggingface_upvotes, quality_score
Quality Scoring Formula
final_score = llm_relevance × (1 + quality_boost)
quality_boost = 0.2 × h_factor + 0.2 × upvotes_factor
- h_factor: Normalized average author h-index (0-1)
- upvotes_factor: Normalized HuggingFace upvotes (0-1)
- Maximum 40% boost from quality signals
- Missing data = no boost (not a penalty)
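The formula can be sketched directly in Python (illustrative helper functions, not the package's internals; normalizing h_factor and upvotes_factor to the 0-1 range happens upstream):

```python
def quality_boost(h_factor: float, upvotes_factor: float) -> float:
    """Boost from quality signals; each factor is normalized to [0, 1]."""
    return 0.2 * h_factor + 0.2 * upvotes_factor

def final_score(llm_relevance: float, h_factor: float = 0.0,
                upvotes_factor: float = 0.0) -> float:
    """final_score = llm_relevance * (1 + quality_boost).

    Missing signals default to 0, so absent data means no boost,
    never a penalty.
    """
    return llm_relevance * (1 + quality_boost(h_factor, upvotes_factor))

score = final_score(8.0, h_factor=1.0, upvotes_factor=1.0)  # ~11.2, the full 40% boost
baseline = final_score(8.0)  # 8.0: no signals, no boost
```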
Installation
# Basic installation
pip install daily-research-digest
# With specific LLM provider
pip install daily-research-digest[anthropic] # For Claude
pip install daily-research-digest[openai] # For GPT
pip install daily-research-digest[google] # For Gemini
# With all providers
pip install daily-research-digest[all]
# Development installation
pip install daily-research-digest[dev]
Quick Start
import asyncio
from pathlib import Path

from daily_research_digest import (
    DigestConfig,
    DigestGenerator,
    DigestStorage,
)

# Configure digest
config = DigestConfig(
    categories=["cs.AI", "cs.CL", "cs.LG"],  # For reference/filtering
    interests="AI agents, large language models, natural language processing",
    max_papers=50,
    top_n=10,
    llm_provider="anthropic",
    anthropic_api_key="your-api-key-here",
)

# Set up storage
storage = DigestStorage(Path("./digests"))

# Create generator
generator = DigestGenerator(storage)

# Generate digest
async def main():
    result = await generator.generate(config)
    print(f"Status: {result['status']}")

    if result['status'] == 'completed':
        digest = result['digest']
        print(f"Generated digest with {len(digest['papers'])} papers")

        for paper in digest['papers']:
            score = paper.get('quality_score') or paper['relevance_score']
            print(f"\n{score:.1f} - {paper['title']}")
            print(f"  {paper['link']}")

            # Show quality signals
            if paper.get('author_h_indices'):
                avg_h = sum(paper['author_h_indices']) / len(paper['author_h_indices'])
                print(f"  Avg h-index: {avg_h:.0f}")
            if paper.get('huggingface_upvotes'):
                print(f"  HF upvotes: {paper['huggingface_upvotes']}")
            print(f"  Reason: {paper['relevance_reason']}")

asyncio.run(main())
Scheduled Digests
import asyncio
from pathlib import Path

from daily_research_digest import ArxivScheduler, DigestConfig, DigestGenerator, DigestStorage

config = DigestConfig(
    categories=["cs.AI", "cs.LG"],
    interests="machine learning research",
    llm_provider="anthropic",
    anthropic_api_key="your-key",
)

storage = DigestStorage(Path("./digests"))
generator = DigestGenerator(storage)
scheduler = ArxivScheduler(generator, schedule_hour=6)  # 6 AM UTC

async def run_scheduler():
    # Start scheduler (runs daily at 6 AM UTC)
    scheduler.start(config)

    # Keep running
    try:
        while True:
            await asyncio.sleep(3600)
    except KeyboardInterrupt:
        scheduler.stop()

asyncio.run(run_scheduler())
GitHub Actions Cron Usage
Send daily digest emails using GitHub Actions. The digest runner supports:
- Configurable time windows (24h, 48h, 7d)
- Idempotent execution (won't re-send on workflow reruns)
- Multiple LLM providers
- SMTP email delivery
- Structured JSON logging
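Idempotent execution could work along these lines (a hypothetical sketch; the actual runner may track sent state differently, for example via the stored digest JSON):

```python
from datetime import date
from pathlib import Path

def already_sent(marker_dir: Path, day: date) -> bool:
    """Check-and-set a per-day marker so workflow reruns skip re-sending.

    Hypothetical illustration of idempotency; not the package's
    actual mechanism.
    """
    marker = marker_dir / f"sent-{day.isoformat()}"
    if marker.exists():
        return True  # digest for this day was already sent
    marker_dir.mkdir(parents=True, exist_ok=True)
    marker.touch()  # record that we are sending now
    return False
```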
Quick Setup
1. Add repository secrets (Settings > Secrets and variables > Actions):

   | Secret | Required | Description |
   |---|---|---|
   | DIGEST_RECIPIENTS | Yes | Comma-separated email addresses |
   | SMTP_HOST | Yes | SMTP server hostname |
   | SMTP_USER | No | SMTP username |
   | SMTP_PASS | No | SMTP password |
   | ANTHROPIC_API_KEY | Yes* | Anthropic API key |
   | OPENAI_API_KEY | Alt | OpenAI API key |
   | GOOGLE_API_KEY | Alt | Google API key |

   *Required if using Anthropic (default). Use the OpenAI or Google key with the corresponding LLM_PROVIDER.

2. Add repository variables (optional, for customization):

   | Variable | Default | Description |
   |---|---|---|
   | DIGEST_INTERESTS | machine learning... | Research interests for search & ranking |
   | DIGEST_SUBJECT | Daily Research Digest - {date} | Email subject |
   | DIGEST_TZ | UTC | Timezone |
   | DIGEST_WINDOW | 24h | Time window |
   | LLM_PROVIDER | anthropic | LLM provider |

3. Enable the workflow: the .github/workflows/digest.yml file runs daily at 6 AM UTC.
Manual Trigger
You can manually trigger the digest from the Actions tab using "Run workflow".
CLI Usage
Run the digest sender locally:
# Set required environment variables
export DIGEST_RECIPIENTS="you@example.com"
export DIGEST_INTERESTS="machine learning, AI agents, transformers"
export SMTP_HOST="smtp.gmail.com"
export SMTP_USER="your-email@gmail.com"
export SMTP_PASS="your-app-password"
export ANTHROPIC_API_KEY="your-api-key"
# Run the digest sender
python -m daily_research_digest.digest_send
Environment Variables Reference
| Variable | Required | Default | Description |
|---|---|---|---|
| DIGEST_RECIPIENTS | Yes | - | Comma-separated email addresses |
| DIGEST_INTERESTS | Yes | - | Research interests for search & ranking |
| DIGEST_SUBJECT | No | Daily Research Digest - {date} | Email subject (supports {date}) |
| DIGEST_FROM | No | noreply@example.com | Sender address |
| DIGEST_TZ | No | UTC | Timezone for window calculation |
| DIGEST_WINDOW | No | 24h | Time window (24h, 1d, 48h, 7d) |
| DIGEST_MAX_PAPERS | No | 50 | Max papers to fetch per source |
| DIGEST_TOP_N | No | 10 | Top papers in digest |
| SMTP_HOST | Yes | - | SMTP server hostname |
| SMTP_PORT | No | 587 | SMTP server port |
| SMTP_USER | No | - | SMTP username |
| SMTP_PASS | No | - | SMTP password |
| SMTP_TLS | No | true | Use TLS (true/false) |
| LLM_PROVIDER | No | anthropic | anthropic, openai, or google |
| ANTHROPIC_API_KEY | * | - | Required for anthropic provider |
| OPENAI_API_KEY | * | - | Required for openai provider |
| GOOGLE_API_KEY | * | - | Required for google provider |
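As a rough illustration of how a DIGEST_WINDOW value such as 24h or 7d maps to a time span (parse_window is a hypothetical helper, not part of the package):

```python
import re
from datetime import timedelta

def parse_window(window: str) -> timedelta:
    """Parse a DIGEST_WINDOW value like '24h', '1d', '48h', or '7d'.

    Hypothetical helper for illustration; the package's parser may differ.
    """
    match = re.fullmatch(r"(\d+)([hd])", window.strip().lower())
    if not match:
        raise ValueError(f"Unrecognized window: {window!r}")
    value, unit = int(match.group(1)), match.group(2)
    return timedelta(hours=value) if unit == "h" else timedelta(days=value)

print(parse_window("24h"))  # 1 day, 0:00:00
```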
Configuration
DigestConfig
- categories: List of category codes for reference (e.g., ["cs.AI", "cs.LG"])
- interests: Research interests description for search and ranking
- max_papers: Maximum papers to fetch per source (default: 50)
- top_n: Number of top papers to include in digest (default: 10)
- llm_provider: One of "anthropic", "openai", or "google"
- priority_authors: List of author names to boost in ranking
- author_boost: Multiplier for priority author papers (default: 1.5)
- API keys for your chosen provider
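For example, a configuration that boosts papers from specific authors (all values here are placeholders):

```python
from daily_research_digest import DigestConfig

config = DigestConfig(
    categories=["cs.AI", "cs.LG"],
    interests="retrieval-augmented generation, evaluation",
    max_papers=50,
    top_n=10,
    llm_provider="anthropic",
    anthropic_api_key="your-api-key",
    priority_authors=["Jane Researcher"],  # papers by these authors are boosted
    author_boost=1.5,                      # multiplier applied to their scores
)
```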
LLM Providers
The package supports multiple LLM providers for paper ranking:
| Provider | Model | Package Required |
|---|---|---|
| anthropic | claude-3-haiku-20240307 | langchain-anthropic |
| openai | gpt-3.5-turbo | langchain-openai |
| google | gemini-1.5-flash | langchain-google-genai |
Each uses fast, cost-effective models optimized for ranking tasks.
Digest Format
Digests are saved as JSON files with the following structure:
{
  "date": "2024-01-15",
  "generated_at": "2024-01-15T06:00:00Z",
  "categories": ["cs.AI", "cs.CL"],
  "interests": "AI agents, LLMs",
  "total_papers_fetched": 50,
  "papers": [
    {
      "arxiv_id": "2401.12345",
      "title": "Paper Title",
      "abstract": "Abstract text...",
      "authors": ["Author One", "Author Two"],
      "categories": ["cs.AI", "cs.CL"],
      "published": "2024-01-14T00:00:00Z",
      "updated": "2024-01-14T00:00:00Z",
      "link": "https://arxiv.org/abs/2401.12345",
      "relevance_score": 9.0,
      "relevance_reason": "Directly addresses AI agent architectures",
      "author_h_indices": [45, 32, 28],
      "huggingface_upvotes": 156,
      "quality_score": 11.2
    }
  ]
}
Paper Fields
| Field | Type | Description |
|---|---|---|
| arxiv_id | string | Paper identifier (arXiv ID or source-specific) |
| title | string | Paper title |
| abstract | string | Paper abstract |
| authors | string[] | List of author names |
| categories | string[] | Paper categories/fields |
| published | string | Publication date (ISO format) |
| link | string | URL to paper |
| relevance_score | float | LLM-assigned relevance (1-10) |
| relevance_reason | string | LLM explanation for score |
| author_h_indices | int[] | h-index for each author (from Semantic Scholar) |
| huggingface_upvotes | int | Community upvotes (from HuggingFace) |
| quality_score | float | Final combined score |
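Because digests are plain JSON, they are easy to post-process. A small sketch that re-ranks a loaded digest by its scores (top_papers and load_digest are illustrative helpers, not package functions):

```python
import json
from pathlib import Path

def load_digest(path: Path) -> dict:
    """Load a digest JSON file from disk."""
    return json.loads(path.read_text())

def top_papers(digest: dict, n: int = 3) -> list[str]:
    """Return the n best paper titles, using quality_score when present
    and falling back to relevance_score otherwise."""
    ranked = sorted(
        digest["papers"],
        key=lambda p: p.get("quality_score") or p["relevance_score"],
        reverse=True,
    )
    return [p["title"] for p in ranked[:n]]
```

Usage might look like `top_papers(load_digest(Path("digests/2024-01-15.json")))`, assuming the date-based file layout used by DigestStorage.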
API Reference
DigestGenerator
Generates paper digests using the quality scoring pipeline.
storage = DigestStorage(Path("./digests"))
generator = DigestGenerator(storage)
result = await generator.generate(config)
Quality Scoring
Compute quality scores manually:
from daily_research_digest import compute_quality_scores
# papers is a list of Paper objects with relevance_score set
compute_quality_scores(papers)
# Each paper now has quality_score populated
for paper in papers:
print(f"{paper.quality_score:.1f} - {paper.title}")
SemanticScholarClient
Fetches papers from Semantic Scholar with author h-index.
from daily_research_digest import SemanticScholarClient
client = SemanticScholarClient(api_key="optional-api-key")
papers = await client.fetch_papers(
query="large language models",
limit=50,
fields_of_study=["Computer Science"]
)
for paper in papers:
if paper.author_h_indices:
avg_h = sum(paper.author_h_indices) / len(paper.author_h_indices)
print(f"{paper.title} - avg h-index: {avg_h:.0f}")
HuggingFaceClient
Fetches trending papers from HuggingFace Daily Papers with upvotes.
from daily_research_digest import HuggingFaceClient
client = HuggingFaceClient()
papers = await client.fetch_papers(limit=50)
for paper in papers:
if paper.huggingface_upvotes:
print(f"{paper.title} - {paper.huggingface_upvotes} upvotes")
DigestStorage
Manages digest persistence.
storage = DigestStorage(Path("./digests"))
storage.save_digest(digest)
digest = storage.get_digest("2024-01-15")
dates = storage.list_digests(limit=30)
ArxivScheduler
Schedules automated digest generation.
scheduler = ArxivScheduler(generator, schedule_hour=6)
scheduler.start(config)
scheduler.stop()
Development
# Clone repository
git clone https://github.com/LevRoz630/daily-research-digest.git
cd daily-research-digest
# Install with dev dependencies
pip install -e ".[dev,all]"
# Run tests
pytest
# Format code
black daily_research_digest tests
# Lint
ruff daily_research_digest tests
# Type check
mypy daily_research_digest
License
MIT License - see LICENSE file for details.
Download files
Download the file for your platform.
Source Distribution
Built Distribution
File details
Details for the file daily_research_digest-0.2.0.tar.gz.
File metadata
- Download URL: daily_research_digest-0.2.0.tar.gz
- Upload date:
- Size: 44.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3a467774c50aa98bf7a7a69c45f7ce68be541a2afe7c7191c08640316c30e4e6 |
| MD5 | ceb44129369c73e77419a85b1f68f2aa |
| BLAKE2b-256 | e37996e0f57131e8cfa04fff331fc8faae0e441ab1121c9c1c3302fc2664c594 |
Provenance
The following attestation bundles were made for daily_research_digest-0.2.0.tar.gz:
Publisher: publish.yml on LevRoz630/daily-research-digest

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: daily_research_digest-0.2.0.tar.gz
- Subject digest: 3a467774c50aa98bf7a7a69c45f7ce68be541a2afe7c7191c08640316c30e4e6
- Sigstore transparency entry: 895338064
- Sigstore integration time:
- Permalink: LevRoz630/daily-research-digest@a85aba8621a0c88910aabcaa384f3359ea4ab0d9
- Branch / Tag: refs/heads/main
- Owner: https://github.com/LevRoz630
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@a85aba8621a0c88910aabcaa384f3359ea4ab0d9
- Trigger Event: workflow_dispatch
File details
Details for the file daily_research_digest-0.2.0-py3-none-any.whl.
File metadata
- Download URL: daily_research_digest-0.2.0-py3-none-any.whl
- Upload date:
- Size: 32.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bd76480435324f5f8ce3c2af4f2afb43bb79dab4bc76df1e6f292ff67c9c79c2 |
| MD5 | 7fd6545d1c3bb0dd95b202c8b71a6685 |
| BLAKE2b-256 | 275cbd1018539867216a89a95579cb3899a837af0eddb046ab58a111ef7e7286 |
Provenance
The following attestation bundles were made for daily_research_digest-0.2.0-py3-none-any.whl:
Publisher: publish.yml on LevRoz630/daily-research-digest

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: daily_research_digest-0.2.0-py3-none-any.whl
- Subject digest: bd76480435324f5f8ce3c2af4f2afb43bb79dab4bc76df1e6f292ff67c9c79c2
- Sigstore transparency entry: 895338126
- Sigstore integration time:
- Permalink: LevRoz630/daily-research-digest@a85aba8621a0c88910aabcaa384f3359ea4ab0d9
- Branch / Tag: refs/heads/main
- Owner: https://github.com/LevRoz630
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@a85aba8621a0c88910aabcaa384f3359ea4ab0d9
- Trigger Event: workflow_dispatch