
Daily Research Digest

AI-powered research paper digest with LLM-based ranking, quality scoring, and automatic scheduling. It fetches papers from Semantic Scholar and HuggingFace Daily Papers, attaching author-credibility signals along the way.

Features

  • Multi-source paper fetching: Semantic Scholar (with author h-index) and HuggingFace Daily Papers (with upvotes)
  • Quality-aware ranking: Combines LLM relevance scores with author h-index and community upvotes
  • Smart deduplication: Merges quality signals when papers appear in multiple sources (see the sketch after this list)
  • LLM-powered relevance: Rank papers using Anthropic, OpenAI, or Google models
  • Automatic scheduling: Background scheduler for daily digest generation
  • Email delivery: SMTP-based digest emails (GitHub Actions compatible)
  • JSON storage: Date-based organization for easy retrieval
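
To make the deduplication concrete, here is a minimal sketch of title-keyed merging. It is illustrative only: merge_papers is a hypothetical helper operating on plain dicts, and the package's actual matching logic may differ.

def merge_papers(paper_lists):
    # Hypothetical helper: papers are plain dicts keyed by normalized
    # title; the first-seen record wins, and missing quality signals
    # (h-indices, upvotes) are copied over from duplicates.
    merged = {}
    for papers in paper_lists:
        for paper in papers:
            key = paper["title"].strip().lower()
            if key not in merged:
                merged[key] = dict(paper)
                continue
            existing = merged[key]
            for field in ("author_h_indices", "huggingface_upvotes"):
                if existing.get(field) is None and paper.get(field) is not None:
                    existing[field] = paper[field]
    return list(merged.values())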

What's New in v0.2.0

  • Quality Scoring Pipeline: Papers are now ranked using a combined quality score that factors in:
    • LLM relevance score (1-10)
    • Author h-index (from Semantic Scholar)
    • Community upvotes (from HuggingFace)
  • Simplified Sources: Streamlined to Semantic Scholar + HuggingFace (Semantic Scholar indexes arXiv papers)
  • New Paper Fields: author_h_indices, huggingface_upvotes, quality_score

Quality Scoring Formula

final_score = llm_relevance × (1 + quality_boost)

quality_boost = 0.2 × h_factor + 0.2 × upvotes_factor
  • h_factor: Normalized average author h-index (0-1)
  • upvotes_factor: Normalized HuggingFace upvotes (0-1)
  • Maximum 40% boost from quality signals
  • Missing data = no boost (not a penalty)
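
A worked example of the formula in plain Python (the input values are made up; the package applies the formula via compute_quality_scores, described under API Reference):

# Worked example of the scoring formula above (values are illustrative).
llm_relevance = 8.0   # LLM relevance score (1-10)
h_factor = 0.75       # normalized average author h-index (0-1)
upvotes_factor = 0.5  # normalized HuggingFace upvotes (0-1)

quality_boost = 0.2 * h_factor + 0.2 * upvotes_factor  # 0.25 (capped at 0.4)
final_score = llm_relevance * (1 + quality_boost)      # 8.0 * 1.25 = 10.0
print(final_score)  # 10.0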

Installation

# Basic installation
pip install daily-research-digest

# With specific LLM provider
pip install "daily-research-digest[anthropic]"  # For Claude
pip install "daily-research-digest[openai]"     # For GPT
pip install "daily-research-digest[google]"     # For Gemini

# With all providers
pip install "daily-research-digest[all]"

# Development installation
pip install "daily-research-digest[dev]"

Quick Start

import asyncio
from pathlib import Path
from daily_research_digest import (
    DigestConfig,
    DigestGenerator,
    DigestStorage,
)

# Configure digest
config = DigestConfig(
    categories=["cs.AI", "cs.CL", "cs.LG"],  # For reference/filtering
    interests="AI agents, large language models, natural language processing",
    max_papers=50,
    top_n=10,
    llm_provider="anthropic",
    anthropic_api_key="your-api-key-here",
)

# Set up storage
storage = DigestStorage(Path("./digests"))

# Create generator
generator = DigestGenerator(storage)

# Generate digest
async def main():
    result = await generator.generate(config)
    print(f"Status: {result['status']}")

    if result['status'] == 'completed':
        digest = result['digest']
        print(f"Generated digest with {len(digest['papers'])} papers")

        for paper in digest['papers']:
            score = paper.get('quality_score') or paper['relevance_score']
            print(f"\n{score:.1f} - {paper['title']}")
            print(f"  {paper['link']}")

            # Show quality signals
            if paper.get('author_h_indices'):
                avg_h = sum(paper['author_h_indices']) / len(paper['author_h_indices'])
                print(f"  Avg h-index: {avg_h:.0f}")
            if paper.get('huggingface_upvotes'):
                print(f"  HF upvotes: {paper['huggingface_upvotes']}")

            print(f"  Reason: {paper['relevance_reason']}")

asyncio.run(main())

Scheduled Digests

import asyncio
from daily_research_digest import ArxivScheduler, DigestGenerator, DigestStorage, DigestConfig
from pathlib import Path

config = DigestConfig(
    categories=["cs.AI", "cs.LG"],
    interests="machine learning research",
    llm_provider="anthropic",
    anthropic_api_key="your-key",
)

storage = DigestStorage(Path("./digests"))
generator = DigestGenerator(storage)
scheduler = ArxivScheduler(generator, schedule_hour=6)  # 6 AM UTC

async def run_scheduler():
    # Start scheduler (runs daily at 6 AM UTC)
    scheduler.start(config)

    # Keep running
    try:
        while True:
            await asyncio.sleep(3600)
    except KeyboardInterrupt:
        scheduler.stop()

asyncio.run(run_scheduler())

GitHub Actions Cron Usage

Send daily digest emails using GitHub Actions. The digest runner supports:

  • Configurable time windows (24h, 48h, 7d)
  • Idempotent execution (won't re-send on workflow reruns; sketched below)
  • Multiple LLM providers
  • SMTP email delivery
  • Structured JSON logging
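
The runner's idempotency mechanism is internal, but a guard of the following shape illustrates the idea. This is a sketch under assumptions (a per-date marker file next to the stored digests), not the package's actual implementation:

from datetime import date
from pathlib import Path

def claim_todays_send(digest_dir: Path) -> bool:
    """Return True if this run should send (and record the claim)."""
    digest_dir.mkdir(parents=True, exist_ok=True)
    marker = digest_dir / f"{date.today().isoformat()}.sent"
    if marker.exists():
        return False  # a previous run already sent today's digest
    marker.touch()  # record this run before sending
    return True

if claim_todays_send(Path("./digests")):
    print("sending digest...")
else:
    print("already sent today; skipping rerun")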

Quick Setup

  1. Add repository secrets (Settings > Secrets and variables > Actions):

    Secret             Required  Description
    DIGEST_RECIPIENTS  Yes       Comma-separated email addresses
    SMTP_HOST          Yes       SMTP server hostname
    SMTP_USER          No        SMTP username
    SMTP_PASS          No        SMTP password
    ANTHROPIC_API_KEY  Yes*      Anthropic API key
    OPENAI_API_KEY     Alt       OpenAI API key
    GOOGLE_API_KEY     Alt       Google API key

    *Required when using the Anthropic provider (the default). Supply the OpenAI or Google key instead, together with the matching LLM_PROVIDER.

  2. Add repository variables (optional, for customization):

    Variable          Default                         Description
    DIGEST_INTERESTS  machine learning...             Research interests for search & ranking
    DIGEST_SUBJECT    Daily Research Digest - {date}  Email subject
    DIGEST_TZ         UTC                             Timezone
    DIGEST_WINDOW     24h                             Time window
    LLM_PROVIDER      anthropic                       LLM provider
  3. Enable the workflow: The .github/workflows/digest.yml file runs daily at 6 AM UTC.

Manual Trigger

You can manually trigger the digest from the Actions tab using "Run workflow".

CLI Usage

Run the digest sender locally:

# Set required environment variables
export DIGEST_RECIPIENTS="you@example.com"
export DIGEST_INTERESTS="machine learning, AI agents, transformers"
export SMTP_HOST="smtp.gmail.com"
export SMTP_USER="your-email@gmail.com"
export SMTP_PASS="your-app-password"
export ANTHROPIC_API_KEY="your-api-key"

# Run the digest sender
python -m daily_research_digest.digest_send

Environment Variables Reference

Variable           Required  Default                         Description
DIGEST_RECIPIENTS  Yes       -                               Comma-separated email addresses
DIGEST_INTERESTS   Yes       -                               Research interests for search & ranking
DIGEST_SUBJECT     No        Daily Research Digest - {date}  Email subject (supports {date})
DIGEST_FROM        No        noreply@example.com             Sender address
DIGEST_TZ          No        UTC                             Timezone for window calculation
DIGEST_WINDOW      No        24h                             Time window (24h, 1d, 48h, 7d)
DIGEST_MAX_PAPERS  No        50                              Max papers to fetch per source
DIGEST_TOP_N       No        10                              Top papers in digest
SMTP_HOST          Yes       -                               SMTP server hostname
SMTP_PORT          No        587                             SMTP server port
SMTP_USER          No        -                               SMTP username
SMTP_PASS          No        -                               SMTP password
SMTP_TLS           No        true                            Use TLS (true/false)
LLM_PROVIDER       No        anthropic                       anthropic, openai, or google
ANTHROPIC_API_KEY  *         -                               Required for anthropic provider
OPENAI_API_KEY     *         -                               Required for openai provider
GOOGLE_API_KEY     *         -                               Required for google provider
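
A DIGEST_WINDOW value is a number plus a unit (h for hours, d for days). A sketch of turning one into a cutoff timestamp (parse_window is a hypothetical helper; the runner's actual parser may differ):

import re
from datetime import datetime, timedelta, timezone

def parse_window(value: str) -> timedelta:
    # Accepts strings like "24h", "1d", "48h", "7d".
    match = re.fullmatch(r"(\d+)([hd])", value.strip().lower())
    if not match:
        raise ValueError(f"invalid window: {value!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return timedelta(hours=amount) if unit == "h" else timedelta(days=amount)

cutoff = datetime.now(timezone.utc) - parse_window("24h")
print(f"keep papers published after {cutoff.isoformat()}")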

Configuration

DigestConfig

  • categories: List of category codes for reference (e.g., ["cs.AI", "cs.LG"])
  • interests: Research interests description for search and ranking
  • max_papers: Maximum papers to fetch per source (default: 50)
  • top_n: Number of top papers to include in digest (default: 10)
  • llm_provider: One of "anthropic", "openai", or "google"
  • priority_authors: List of author names to boost in ranking
  • author_boost: Multiplier for priority author papers (default: 1.5)
  • API keys for your chosen provider
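
Putting these options together (the field names come from the list above; the values are illustrative):

from daily_research_digest import DigestConfig

config = DigestConfig(
    categories=["cs.AI", "cs.LG"],
    interests="agentic systems, retrieval-augmented generation",
    max_papers=50,
    top_n=10,
    llm_provider="anthropic",
    anthropic_api_key="your-api-key-here",
    priority_authors=["Jane Doe"],  # names to boost in ranking
    author_boost=1.5,               # multiplier for their papers
)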

LLM Providers

The package supports multiple LLM providers for paper ranking:

Provider   Model                    Package Required
anthropic  claude-3-haiku-20240307  langchain-anthropic
openai     gpt-3.5-turbo            langchain-openai
google     gemini-1.5-flash         langchain-google-genai

Each provider defaults to a fast, low-cost model, which keeps per-paper ranking inexpensive.
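
Switching providers is just a configuration change. For example, with the openai extra installed (the openai_api_key field is assumed here by symmetry with the anthropic_api_key shown earlier):

from daily_research_digest import DigestConfig

config = DigestConfig(
    categories=["cs.CL"],
    interests="speech and language models",
    llm_provider="openai",          # or "anthropic", "google"
    openai_api_key="your-api-key",  # key for the chosen provider
)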

Digest Format

Digests are saved as JSON files with the following structure:

{
  "date": "2024-01-15",
  "generated_at": "2024-01-15T06:00:00Z",
  "categories": ["cs.AI", "cs.CL"],
  "interests": "AI agents, LLMs",
  "total_papers_fetched": 50,
  "papers": [
    {
      "arxiv_id": "2401.12345",
      "title": "Paper Title",
      "abstract": "Abstract text...",
      "authors": ["Author One", "Author Two"],
      "categories": ["cs.AI", "cs.CL"],
      "published": "2024-01-14T00:00:00Z",
      "updated": "2024-01-14T00:00:00Z",
      "link": "https://arxiv.org/abs/2401.12345",
      "relevance_score": 9.0,
      "relevance_reason": "Directly addresses AI agent architectures",
      "author_h_indices": [45, 32, 28],
      "huggingface_upvotes": 156,
      "quality_score": 11.2
    }
  ]
}
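
Because digests are plain JSON, they can also be inspected without the library (the digests/2024-01-15.json path below assumes the date-based layout; DigestStorage.get_digest is the supported interface):

import json
from pathlib import Path

# Path is an assumption based on the date-based layout described above.
digest = json.loads(Path("digests/2024-01-15.json").read_text())

print(digest["date"], "-", digest["total_papers_fetched"], "papers fetched")
for paper in digest["papers"]:
    score = paper.get("quality_score") or paper["relevance_score"]
    print(f"{score:.1f}  {paper['title']}")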

Paper Fields

Field                Type      Description
arxiv_id             string    Paper identifier (arXiv ID or source-specific)
title                string    Paper title
abstract             string    Paper abstract
authors              string[]  List of author names
categories           string[]  Paper categories/fields
published            string    Publication date (ISO format)
updated              string    Last-updated date (ISO format)
link                 string    URL to paper
relevance_score      float     LLM-assigned relevance (1-10)
relevance_reason     string    LLM explanation for score
author_h_indices     int[]     h-index for each author (from Semantic Scholar)
huggingface_upvotes  int       Community upvotes (from HuggingFace)
quality_score        float     Final combined score

API Reference

DigestGenerator

Generates paper digests using the quality scoring pipeline.

storage = DigestStorage(Path("./digests"))
generator = DigestGenerator(storage)
result = await generator.generate(config)

Quality Scoring

Compute quality scores manually:

from daily_research_digest import compute_quality_scores

# papers is a list of Paper objects with relevance_score set
compute_quality_scores(papers)

# Each paper now has quality_score populated
for paper in papers:
    print(f"{paper.quality_score:.1f} - {paper.title}")

SemanticScholarClient

Fetches papers from Semantic Scholar with author h-index.

from daily_research_digest import SemanticScholarClient

client = SemanticScholarClient(api_key="optional-api-key")
papers = await client.fetch_papers(
    query="large language models",
    limit=50,
    fields_of_study=["Computer Science"]
)

for paper in papers:
    if paper.author_h_indices:
        avg_h = sum(paper.author_h_indices) / len(paper.author_h_indices)
        print(f"{paper.title} - avg h-index: {avg_h:.0f}")

HuggingFaceClient

Fetches trending papers from HuggingFace Daily Papers with upvotes.

from daily_research_digest import HuggingFaceClient

client = HuggingFaceClient()
papers = await client.fetch_papers(limit=50)

for paper in papers:
    if paper.huggingface_upvotes:
        print(f"{paper.title} - {paper.huggingface_upvotes} upvotes")

DigestStorage

Manages digest persistence.

storage = DigestStorage(Path("./digests"))
storage.save_digest(digest)
digest = storage.get_digest("2024-01-15")
dates = storage.list_digests(limit=30)

ArxivScheduler

Schedules automated digest generation.

scheduler = ArxivScheduler(generator, schedule_hour=6)
scheduler.start(config)
scheduler.stop()

Development

# Clone repository
git clone https://github.com/LevRoz630/daily-research-digest.git
cd daily-research-digest

# Install with dev dependencies
pip install -e ".[dev,all]"

# Run tests
pytest

# Format code
black daily_research_digest tests

# Lint
ruff check daily_research_digest tests

# Type check
mypy daily_research_digest

License

MIT License - see LICENSE file for details.
