The Archivist - Documentation enforcement, validation, and synchronization CLI

The Archivist

"Code is ephemeral; Documentation is the contract. If the contract is broken, the system is broken."

The Archivist is a documentation enforcement, validation, and synchronization CLI tool. It does not suggest; it enforces.

Features

  • Universal Header Validation - Enforces mandatory YAML frontmatter on all markdown documents
  • Link Checking - Detects broken internal links and anchors
  • Staleness Detection - Flags documentation that has drifted behind source code
  • Document Indexing - Pushes document vectors to The Cortex for semantic search
  • Document Scaffolding - Generates properly formatted documents with headers
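
As a rough illustration of what the link checking feature above involves, the sketch below scans Markdown files for relative links, then verifies that each target file exists and that any `#anchor` matches a heading in that file. This is not the Archivist's implementation: `find_broken_links`, `anchors_in`, and `slugify` are hypothetical names, and the slug rule (lowercase, punctuation dropped, spaces turned into hyphens) is an assumed GitHub-style convention.

```python
import re
from pathlib import Path

LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")          # [text](target)
HEADING_RE = re.compile(r"^#{1,6}\s+(.*?)\s*$", re.MULTILINE)
SCHEME_RE = re.compile(r"^[a-z][a-z0-9+.-]*:", re.IGNORECASE)


def slugify(heading: str) -> str:
    """Assumed GitHub-style slug: lowercase, drop punctuation, spaces -> hyphens."""
    cleaned = re.sub(r"[^\w\s-]", "", heading.strip().lower())
    return re.sub(r"\s+", "-", cleaned)


def anchors_in(path: Path) -> set[str]:
    """Collect the slugs of every heading in a Markdown file."""
    return {slugify(h) for h in HEADING_RE.findall(path.read_text(encoding="utf-8"))}


def find_broken_links(doc_root: Path) -> list[tuple[Path, str]]:
    """Return (document, link target) pairs whose file or anchor cannot be resolved."""
    broken = []
    for doc in sorted(doc_root.rglob("*.md")):
        for target in LINK_RE.findall(doc.read_text(encoding="utf-8")):
            if SCHEME_RE.match(target):
                continue  # external links (http:, mailto:, ...) are out of scope here
            file_part, _, anchor = target.partition("#")
            target_file = (doc.parent / file_part) if file_part else doc
            if not target_file.is_file():
                broken.append((doc, target))
            elif anchor and anchor not in anchors_in(target_file):
                broken.append((doc, target))
    return broken
```

The sketch ignores edge cases a real checker has to handle, such as links inside code blocks and duplicate heading slugs.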

Installation

pip install aperion-archivist

Or install from source:

git clone https://github.com/aperion/archivist.git
cd archivist
pip install -e ".[dev]"

Quick Start

Validate Documentation

# Check all docs in ./docs directory
archivist check --doc-root docs

# Strict mode (warnings = errors)
archivist check --doc-root docs --strict

# JSON output for CI
archivist check --doc-root docs --format json

Detect Stale Documentation

# Compare source (src/) against docs (docs/)
archivist stale --source-root src --doc-root docs

# Use content hashing for accurate detection (avoids false positives)
archivist stale --source-root src --doc-root docs --use-hashing
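
To see why content hashing can avoid false positives, the sketch below contrasts two generic staleness checks: one compares modification times, the other compares the source file's current SHA-256 digest with a digest recorded when the doc was last reviewed. The function names, and the idea of storing a recorded hash, are assumptions for illustration. How the Archivist actually stores or compares hashes is not described here.

```python
import hashlib
from pathlib import Path


def is_stale_by_mtime(source: Path, doc: Path) -> bool:
    """Stale when the source file was modified after the doc."""
    return source.stat().st_mtime > doc.stat().st_mtime


def content_hash(path: Path) -> str:
    """SHA-256 hex digest of the file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def is_stale_by_hash(source: Path, recorded_hash: str) -> bool:
    """Stale when the source content differs from the hash recorded at the last review."""
    return content_hash(source) != recorded_hash
```

A fresh checkout or a `touch` changes a file's modification time without changing its bytes, so the first check reports drift while the second does not.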

Sync Documentation

# Dry run - see what would be updated
archivist sync --source-root src --doc-root docs --dry-run

# Actually touch stale docs (marks for review)
archivist sync --source-root src --doc-root docs

Index to The Cortex

# Parse docs and push to vector store
archivist index --doc-root docs --cortex-url http://localhost:4949

# Incremental indexing (only changed docs)
archivist index --doc-root docs --incremental

# Dry run
archivist index --doc-root docs --dry-run

Fix Missing Headers

# Auto-add Universal Headers to all docs missing them
archivist fix --doc-root docs --owner "team-alpha"

# Dry run - see what would be fixed
archivist fix --doc-root docs --dry-run

Global Options

# Verbose output (debug logging)
archivist --verbose check --doc-root docs

# Quiet mode (errors only)
archivist --quiet check --doc-root docs

# JSON log format (for log aggregators)
archivist --log-format json check --doc-root docs

Scaffold New Documents

# Create a new document with proper header
archivist scaffold docs/new-feature.md \
  --title "New Feature Guide" \
  --owner "team-alpha" \
  --category "guides" \
  --tags "feature,tutorial"

Universal Header Format

Every Markdown document MUST begin with a YAML frontmatter block that contains the required fields. Fields marked optional in the example may be omitted:

---
title: Document Title
last_updated: 2026-01-15
owner: team-alpha
category: guides        # optional
tags:                   # optional
  - api
  - authentication
status: published       # optional
---

# Document Title

Content starts here...

Required Fields (Default)

Field          Description
title          Document title
last_updated   Last update date (YYYY-MM-DD format)
owner          Team or individual responsible
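
The check implied by the required fields above can be sketched with the standard library alone. This is a simplified illustration, not the Archivist's validator: `parse_frontmatter`, `is_iso_date`, and `validate_header` are hypothetical helpers, and the parser only reads top-level `key: value` lines rather than full YAML.

```python
import re
from datetime import datetime
from typing import Optional

REQUIRED_FIELDS = ("title", "last_updated", "owner")


def parse_frontmatter(text: str) -> Optional[dict]:
    """Return top-level `key: value` pairs from a leading `---` block, or None if absent.

    Simplified on purpose: nested values such as tag lists are skipped.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        match = re.match(r"^([A-Za-z_][\w-]*):(.*)$", line)
        if match:
            # Drop a trailing `# comment` (a hash preceded by whitespace), then trim.
            value = re.sub(r"(?:^|\s)#.*$", "", match.group(2)).strip()
            fields[match.group(1)] = value
    return None  # opening delimiter without a closing one


def is_iso_date(value: str) -> bool:
    """True for a real calendar date written as YYYY-MM-DD."""
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def validate_header(text: str, required=REQUIRED_FIELDS) -> list:
    """Return a list of problems; an empty list means the header passes."""
    fields = parse_frontmatter(text)
    if fields is None:
        return ["missing Universal Header"]
    problems = [f"missing required field: {name}" for name in required if not fields.get(name)]
    if fields.get("last_updated") and not is_iso_date(fields["last_updated"]):
        problems.append("last_updated is not a valid YYYY-MM-DD date")
    return problems
```

Passing a different `required` tuple mirrors the effect of overriding the required fields in configuration.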

Configuration

Create archivist.toml in your project root:

[header]
# Required fields for Universal Header (this example adds "category" to the three defaults)
required_fields = ["title", "last_updated", "owner", "category"]

[exclude]
# Directories to skip
patterns = [".venv", "node_modules", ".git", "__pycache__"]

[[mapping]]
# Map source files to their documentation
source_pattern = "src/**/*.py"
doc_pattern = "docs/api/{name}.md"

[[mapping]]
source_pattern = "lib/**/*.ts"
doc_pattern = "docs/lib/{name}.md"

[cortex]
url = "http://localhost:4949"
collection = "documentation"

CI/CD Integration

GitHub Actions

name: Documentation Check

on:
  pull_request:
    paths:
      - 'docs/**'
      - 'src/**'

jobs:
  doc-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install Archivist
        run: pip install aperion-archivist

      - name: Validate Documentation
        run: archivist check --doc-root docs --strict --format json

      - name: Check for Stale Docs
        run: archivist stale --source-root src --doc-root docs

Pre-commit Hook

# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: archivist-check
        name: Validate Documentation
        entry: archivist check --doc-root docs
        language: system
        types: [markdown]
        pass_filenames: false

Makefile

.PHONY: docs-check docs-stale docs-index

docs-check:
	archivist check --doc-root docs --strict

docs-stale:
	archivist stale --source-root src --doc-root docs

docs-index:
	archivist index --doc-root docs --cortex-url $(CORTEX_URL)

docs: docs-check docs-stale

Exit Codes

The Archivist uses specific exit codes for CI/CD scripting:

Code   Name                Meaning
0      SUCCESS             All checks passed
1      VALIDATION_FAILED   One or more documents failed validation
2      HEADER_MISSING      Universal Header missing from documents
3      LINK_BROKEN         Broken links detected
4      SCHEMA_INVALID      Frontmatter fails JSON Schema validation
5      STALE_DETECTED      Documentation is stale (source newer)
10     CONFIG_ERROR        Configuration file invalid or missing
11     PATH_NOT_FOUND      Specified path does not exist
20     CORTEX_ERROR        Failed to communicate with Cortex

Using Exit Codes in CI

archivist check --doc-root docs
exit_code=$?

case $exit_code in
  0) echo "All checks passed" ;;
  2) echo "Missing headers - run 'archivist fix'" ;;
  3) echo "Broken links found" ;;
  4) echo "Schema validation failed" ;;
  *) echo "Validation failed with code $exit_code" ;;
esac
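
The same dispatch can be done from a Python script. The sketch below mirrors the exit code table in an `IntEnum` and maps a finished command's return code to its symbolic name. The enum values come from the table. `ExitCode` and `run_and_classify` are names made up for this example and are not part of the package's API.

```python
import subprocess
from enum import IntEnum


class ExitCode(IntEnum):
    """Exit codes as listed in the table above."""
    SUCCESS = 0
    VALIDATION_FAILED = 1
    HEADER_MISSING = 2
    LINK_BROKEN = 3
    SCHEMA_INVALID = 4
    STALE_DETECTED = 5
    CONFIG_ERROR = 10
    PATH_NOT_FOUND = 11
    CORTEX_ERROR = 20


def run_and_classify(command: list) -> str:
    """Run a command and return the symbolic name of its exit code."""
    returncode = subprocess.run(command, check=False).returncode
    try:
        return ExitCode(returncode).name
    except ValueError:
        return f"UNKNOWN({returncode})"
```

A CI script could call `run_and_classify(["archivist", "check", "--doc-root", "docs"])` and branch on the returned name.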

Python Library API

The Archivist can also be used as a Python library, not only as a CLI:

High-Level Functions

from pathlib import Path
from aperion_archivist import validate_docs, check_links, detect_staleness

# Validate all documentation
result = validate_docs(Path("docs"))
if not result.passed:
    for issue in result.issues:
        print(f"{issue.file}: {issue.message}")

# Check for broken links
link_result = check_links(Path("docs"))
print(f"Broken links: {link_result.broken_link_count}")

# Detect stale documentation
stale_result = detect_staleness(
    source_root=Path("src"),
    doc_root=Path("docs"),
)
for doc in stale_result.stale_docs:
    print(f"STALE: {doc.doc_path} ({doc.drift_days:.1f} days)")

Cortex Integration (Async-First Architecture)

The Cortex client uses an async-first design to prevent blocking the event loop when used in async contexts (FastAPI, agents, pipelines).
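
The blocking concern can be shown in general terms without the Cortex client. In the sketch below, a heartbeat task counts how often the event loop schedules it while some work runs. A synchronous `time.sleep` stands in for a blocking HTTP call and `asyncio.sleep` for an awaited one. All names here are illustrative.

```python
import asyncio
import time


async def count_ticks(work) -> int:
    """Run `work()` while a heartbeat task counts how often the loop schedules it."""
    ticks = 0

    async def heartbeat():
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    task = asyncio.create_task(heartbeat())
    await work()
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return ticks


async def blocking_call():
    time.sleep(0.2)  # stands in for a synchronous HTTP request; the loop cannot run


async def non_blocking_call():
    await asyncio.sleep(0.2)  # stands in for an awaited request; the loop keeps running
```

Running `asyncio.run(count_ticks(blocking_call))` leaves the heartbeat starved for the whole call, while the non-blocking variant lets it tick throughout.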

Async Usage (Recommended for Integration)

import asyncio
from aperion_archivist.integrations import AsyncCortexClient

async def index_docs():
    async with AsyncCortexClient(
        base_url="http://localhost:4949",
        api_key="your-key",
    ) as client:
        # Health check
        if not await client.health_check():
            raise RuntimeError("Cortex unavailable")

        # Push chunks (`chunks` is assumed to be a list of document chunks prepared earlier)
        result = await client.push_chunks(chunks)
        print(f"Indexed {result['indexed']} chunks")

        # Search
        results = await client.search("authentication", top_k=5)

asyncio.run(index_docs())

Shared Client for Connection Pooling

For high-throughput applications, set a shared client at startup:

import httpx
from aperion_archivist.integrations import AsyncCortexClient

# At application startup
client = httpx.AsyncClient(timeout=30.0)
AsyncCortexClient.set_shared_client(client)

# All AsyncCortexClient instances now share connections
async def handler():
    cortex = AsyncCortexClient()  # Uses shared client
    await cortex.push_chunks(chunks)

Sync Usage (CLI/Scripts)

A synchronous wrapper is provided for non-async contexts:

from aperion_archivist.integrations import CortexClient

# Sync context manager
with CortexClient() as client:
    result = client.push_chunks(chunks)
    print(f"Indexed: {result['indexed']}")
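
For readers curious how a synchronous wrapper over an async client can be built in general, here is a minimal sketch. It drives the async methods on a private event loop with `run_until_complete`. `AsyncEchoClient` and `SyncEchoClient` are hypothetical stand-ins, and nothing here claims to reflect how `CortexClient` is implemented.

```python
import asyncio


class AsyncEchoClient:
    """Hypothetical async client standing in for a real network client."""

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc_info):
        return None

    async def push_chunks(self, chunks):
        await asyncio.sleep(0)  # pretend to do network I/O
        return {"indexed": len(chunks)}


class SyncEchoClient:
    """Synchronous facade that runs the async client on its own event loop."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        self._client = AsyncEchoClient()

    def __enter__(self):
        self._loop.run_until_complete(self._client.__aenter__())
        return self

    def __exit__(self, *exc_info):
        self._loop.run_until_complete(self._client.__aexit__(*exc_info))
        self._loop.close()

    def push_chunks(self, chunks):
        return self._loop.run_until_complete(self._client.push_chunks(chunks))
```

A wrapper of this shape must not be called from inside a running event loop, which is one reason to prefer the async client in async applications.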

Project Structure

aperion-doc-index/
├── src/aperion_archivist/
│   ├── core/
│   │   ├── validator.py   # Universal Header enforcement
│   │   ├── linker.py      # Broken link detection
│   │   └── scanner.py     # Staleness detection
│   ├── generation/
│   │   ├── templates.py   # Jinja2 templates
│   │   └── parser.py      # Markdown chunking
│   ├── cli/
│   │   └── main.py        # CLI entry point
│   └── integrations/
│       └── cortex.py      # Vector store client
├── tests/
├── pyproject.toml
├── Dockerfile
└── README.md

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Type check
mypy src

# Lint
ruff check src tests

License

MIT
