The Archivist - Documentation enforcement, validation, and synchronization CLI
The Archivist
"Code is ephemeral; Documentation is the contract. If the contract is broken, the system is broken."
The Archivist is a documentation enforcement, validation, and synchronization CLI tool. It does not suggest; it enforces.
Features
- Universal Header Validation - Enforces mandatory YAML frontmatter on all markdown documents
- Link Checking - Detects broken internal links and anchors
- Staleness Detection - Flags documentation that has drifted behind source code
- Document Indexing - Pushes document vectors to The Cortex for semantic search
- Document Scaffolding - Generates properly formatted documents with headers
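To make the link-checking feature concrete, here is a minimal sketch of how broken internal links and anchors could be detected. It is not The Archivist's implementation: the function names, the regular expressions, and the GitHub-style slug rule are all assumptions chosen for illustration.

```python
import re
from pathlib import Path

# Inline markdown links such as [text](target). Image links match as well.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
HEADING_RE = re.compile(r"^#{1,6}\s+(.*)$", re.MULTILINE)


def slugify(heading: str) -> str:
    """Approximate GitHub-style anchor slugs (an assumption, not the tool's rule)."""
    slug = heading.strip().lower()
    slug = re.sub(r"[^\w\s-]", "", slug)
    return re.sub(r"\s+", "-", slug)


def find_broken_links(doc: Path) -> list[str]:
    """Return link targets in `doc` whose file or anchor cannot be found."""
    broken = []
    for target in LINK_RE.findall(doc.read_text(encoding="utf-8")):
        if re.match(r"^[a-z][a-z0-9+.-]*:", target):
            continue  # external URL (https:, mailto:, ...); out of scope here
        path_part, _, anchor = target.partition("#")
        linked = (doc.parent / path_part) if path_part else doc
        if not linked.is_file():
            broken.append(target)
            continue
        if anchor:
            headings = HEADING_RE.findall(linked.read_text(encoding="utf-8"))
            if anchor not in {slugify(h) for h in headings}:
                broken.append(target)
    return broken
```

Calling `find_broken_links(Path("docs/index.md"))` returns the raw targets that failed, in document order. Reference-style links and HTML anchors are not handled in this sketch.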
Installation
pip install aperion-archivist
Or install from source:
git clone https://github.com/aperion/archivist.git
cd archivist
pip install -e ".[dev]"
Quick Start
Validate Documentation
# Check all docs in ./docs directory
archivist check --doc-root docs
# Strict mode (warnings = errors)
archivist check --doc-root docs --strict
# JSON output for CI
archivist check --doc-root docs --format json
Detect Stale Documentation
# Compare source (src/) against docs (docs/)
archivist stale --source-root src --doc-root docs
# Use content hashing for accurate detection (avoids false positives)
archivist stale --source-root src --doc-root docs --use-hashing
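The difference between timestamp-based and hash-based detection can be sketched as follows. This is an illustration of the general idea only: the helper names are hypothetical, and where The Archivist stores a previously recorded hash is not described here, so `recorded_hash` is simply passed in.

```python
import hashlib
from pathlib import Path


def stale_by_mtime(source: Path, doc: Path) -> bool:
    """Timestamp check: the doc looks stale if the source was modified later."""
    return source.stat().st_mtime > doc.stat().st_mtime


def content_hash(path: Path) -> str:
    """SHA-256 hex digest of the file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def stale_by_hash(source: Path, recorded_hash: str) -> bool:
    """Hash check: stale only if the source content differs from the recorded hash."""
    return content_hash(source) != recorded_hash
```

A fresh checkout or a `touch` changes a file's modification time without changing its bytes, so `stale_by_mtime` can report drift that `stale_by_hash` would not. That is the kind of false positive the hashing option is described as avoiding.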
Sync Documentation
# Dry run - see what would be updated
archivist sync --source-root src --doc-root docs --dry-run
# Actually touch stale docs (marks for review)
archivist sync --source-root src --doc-root docs
Index to The Cortex
# Parse docs and push to vector store
archivist index --doc-root docs --cortex-url http://localhost:4949
# Incremental indexing (only changed docs)
archivist index --doc-root docs --incremental
# Dry run
archivist index --doc-root docs --dry-run
Fix Missing Headers
# Auto-add Universal Headers to all docs missing them
archivist fix --doc-root docs --owner "team-alpha"
# Dry run - see what would be fixed
archivist fix --doc-root docs --dry-run
Global Options
# Verbose output (debug logging)
archivist --verbose check --doc-root docs
# Quiet mode (errors only)
archivist --quiet check --doc-root docs
# JSON log format (for log aggregators)
archivist --log-format json check --doc-root docs
Scaffold New Documents
# Create a new document with proper header
archivist scaffold docs/new-feature.md \
  --title "New Feature Guide" \
  --owner "team-alpha" \
  --category "guides" \
  --tags "feature,tutorial"
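As a rough picture of what a scaffolded document might start with, the sketch below renders a header from the same inputs the command takes. The function `render_header` is hypothetical, and the output layout is modeled on the Universal Header format in this README rather than taken from the tool's templates.

```python
from datetime import date


def render_header(title, owner, category=None, tags=(), today=None):
    """Render a YAML frontmatter block followed by a top-level heading."""
    today = today or date.today()
    lines = [
        "---",
        f"title: {title}",
        f"last_updated: {today.isoformat()}",
        f"owner: {owner}",
    ]
    if category:
        lines.append(f"category: {category}")
    if tags:
        lines.append("tags:")
        lines.extend(f"  - {tag}" for tag in tags)
    lines += ["---", f"# {title}", ""]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_header("New Feature Guide", "team-alpha", "guides", ["feature", "tutorial"]))
```

The sketch does no YAML quoting, so a title containing a colon would need escaping in real use.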
Universal Header Format
Every markdown document MUST have a YAML frontmatter block with required fields:
---
title: Document Title
last_updated: 2026-01-15
owner: team-alpha
category: guides # optional
tags: # optional
- api
- authentication
status: published # optional
---
# Document Title
Content starts here...
Required Fields (Default)
| Field | Description |
|---|---|
| title | Document title |
| last_updated | Last update date (YYYY-MM-DD format) |
| owner | Team or individual responsible |
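The required-field rule can be illustrated with a small standalone check. This is a simplified sketch, not the validator shipped with The Archivist: it uses a line-based parser instead of a real YAML library, and `check_header` and its messages are hypothetical names.

```python
import re
from datetime import datetime

REQUIRED_FIELDS = ("title", "last_updated", "owner")
FRONTMATTER_RE = re.compile(r"\A---\n(.*?)\n---\n", re.DOTALL)


def check_header(text, required=REQUIRED_FIELDS):
    """Return a list of problems found in a document's frontmatter."""
    match = FRONTMATTER_RE.match(text)
    if match is None:
        return ["missing frontmatter block"]
    fields = {}
    for line in match.group(1).splitlines():
        # Only top-level "key: value" lines; list items and comments are skipped.
        if line and not line.startswith((" ", "-", "#")) and ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.split(" #")[0].strip()
    problems = [f"missing required field: {name}" for name in required if not fields.get(name)]
    if fields.get("last_updated"):
        try:
            datetime.strptime(fields["last_updated"], "%Y-%m-%d")
        except ValueError:
            problems.append("last_updated is not in YYYY-MM-DD format")
    return problems
```

An empty list means the header satisfies the default rule. Passing a different `required` tuple mimics a customized set of required fields.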
Configuration
Create archivist.toml in your project root:
[header]
# Required fields for Universal Header
required_fields = ["title", "last_updated", "owner", "category"]
[exclude]
# Directories to skip
patterns = [".venv", "node_modules", ".git", "__pycache__"]
[[mapping]]
# Map source files to their documentation
source_pattern = "src/**/*.py"
doc_pattern = "docs/api/{name}.md"
[[mapping]]
source_pattern = "lib/**/*.ts"
doc_pattern = "docs/lib/{name}.md"
[cortex]
url = "http://localhost:4949"
collection = "documentation"
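To show how a `[[mapping]]` entry might pair a source file with its document, here is a sketch under two loudly stated assumptions: that `{name}` stands for the source file's stem, and that patterns behave like simple shell-style globs. Neither is confirmed by this README, and `expected_doc` is a hypothetical helper.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Mirrors the two [[mapping]] tables in the configuration above.
MAPPINGS = [
    ("src/**/*.py", "docs/api/{name}.md"),
    ("lib/**/*.ts", "docs/lib/{name}.md"),
]


def expected_doc(source_path, mappings=MAPPINGS):
    """Return the doc path for a source file, or None if no mapping matches."""
    for source_pattern, doc_pattern in mappings:
        # fnmatch lets "*" cross "/" boundaries, so "**/" in these patterns
        # effectively requires at least one subdirectory.
        if fnmatchcase(source_path, source_pattern):
            name = PurePosixPath(source_path).stem  # assumption: {name} is the file stem
            return doc_pattern.format(name=name)
    return None
```

With these assumptions, `expected_doc("src/auth/tokens.py")` yields `docs/api/tokens.md`, and a path that matches no pattern yields `None`.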
CI/CD Integration
GitHub Actions
name: Documentation Check
on:
  pull_request:
    paths:
      - 'docs/**'
      - 'src/**'
jobs:
  doc-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install Archivist
        run: pip install aperion-archivist
      - name: Validate Documentation
        run: archivist check --doc-root docs --strict --format json
      - name: Check for Stale Docs
        run: archivist stale --source-root src --doc-root docs
Pre-commit Hook
# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: archivist-check
        name: Validate Documentation
        entry: archivist check --doc-root docs
        language: system
        types: [markdown]
        pass_filenames: false
Makefile
.PHONY: docs docs-check docs-stale docs-index
docs-check:
	archivist check --doc-root docs --strict
docs-stale:
	archivist stale --source-root src --doc-root docs
docs-index:
	archivist index --doc-root docs --cortex-url $(CORTEX_URL)
docs: docs-check docs-stale
Exit Codes
The Archivist uses specific exit codes for CI/CD scripting:
| Code | Name | Meaning |
|---|---|---|
| 0 | SUCCESS | All checks passed |
| 1 | VALIDATION_FAILED | One or more documents failed validation |
| 2 | HEADER_MISSING | Universal Header missing from documents |
| 3 | LINK_BROKEN | Broken links detected |
| 4 | SCHEMA_INVALID | Frontmatter fails JSON Schema validation |
| 5 | STALE_DETECTED | Documentation is stale (source newer) |
| 10 | CONFIG_ERROR | Configuration file invalid or missing |
| 11 | PATH_NOT_FOUND | Specified path does not exist |
| 20 | CORTEX_ERROR | Failed to communicate with Cortex |
Using Exit Codes in CI
archivist check --doc-root docs
exit_code=$?
case $exit_code in
  0) echo "All checks passed" ;;
  2) echo "Missing headers - run 'archivist fix'" ;;
  3) echo "Broken links found" ;;
  4) echo "Schema validation failed" ;;
  *) echo "Validation failed with code $exit_code" ;;
esac
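The same branching can be done from a Python script. The sketch below mirrors the exit-code table in an `IntEnum`; the enum and the helpers `describe` and `run_check` are illustrative names, not part of the package's documented API, and `run_check` assumes the `archivist` executable is on `PATH`.

```python
import subprocess
from enum import IntEnum


class ExitCode(IntEnum):
    """Exit codes copied from the table in this README."""
    SUCCESS = 0
    VALIDATION_FAILED = 1
    HEADER_MISSING = 2
    LINK_BROKEN = 3
    SCHEMA_INVALID = 4
    STALE_DETECTED = 5
    CONFIG_ERROR = 10
    PATH_NOT_FOUND = 11
    CORTEX_ERROR = 20


def describe(returncode: int) -> str:
    """Map a process return code to its symbolic name."""
    try:
        return ExitCode(returncode).name
    except ValueError:
        return f"UNKNOWN({returncode})"


def run_check(doc_root: str = "docs") -> str:
    """Run `archivist check` and return the symbolic name of its exit code."""
    completed = subprocess.run(["archivist", "check", "--doc-root", doc_root])
    return describe(completed.returncode)
```

Keeping the table in one enum lets a wrapper script branch on names such as `ExitCode.LINK_BROKEN` instead of bare integers.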
Python Library API
The Archivist can also be used as a Python library, in addition to the CLI:
High-Level Functions
from pathlib import Path
from aperion_archivist import validate_docs, check_links, detect_staleness

# Validate all documentation
result = validate_docs(Path("docs"))
if not result.passed:
    for issue in result.issues:
        print(f"{issue.file}: {issue.message}")

# Check for broken links
link_result = check_links(Path("docs"))
print(f"Broken links: {link_result.broken_link_count}")

# Detect stale documentation
stale_result = detect_staleness(
    source_root=Path("src"),
    doc_root=Path("docs"),
)
for doc in stale_result.stale_docs:
    print(f"STALE: {doc.doc_path} ({doc.drift_days:.1f} days)")
Cortex Integration (Async-First Architecture)
The Cortex client uses an async-first design to prevent blocking the event loop when used in async contexts (FastAPI, agents, pipelines).
Async Usage (Recommended for Integration)
import asyncio
from aperion_archivist.integrations import AsyncCortexClient

async def index_docs(chunks):
    # `chunks` is the list of parsed document chunks, prepared elsewhere
    async with AsyncCortexClient(
        base_url="http://localhost:4949",
        api_key="your-key",
    ) as client:
        # Health check
        if not await client.health_check():
            raise RuntimeError("Cortex unavailable")
        # Push chunks
        result = await client.push_chunks(chunks)
        print(f"Indexed {result['indexed']} chunks")
        # Search
        results = await client.search("authentication", top_k=5)

asyncio.run(index_docs(chunks))
Shared Client for Connection Pooling
For high-throughput applications, set a shared client at startup:
import httpx
from aperion_archivist.integrations import AsyncCortexClient
# At application startup
client = httpx.AsyncClient(timeout=30.0)
AsyncCortexClient.set_shared_client(client)
# All CortexClient instances now share connections
async def handler():
    cortex = AsyncCortexClient()  # Uses shared client
    await cortex.push_chunks(chunks)
Sync Usage (CLI/Scripts)
A synchronous wrapper is provided for non-async contexts:
from aperion_archivist.integrations import CortexClient
# Sync context manager
with CortexClient() as client:
    result = client.push_chunks(chunks)
    print(f"Indexed: {result['indexed']}")
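One common way to build a synchronous wrapper over an async client is to run each coroutine to completion with `asyncio.run`. The sketch below shows that pattern with stand-in classes; `DemoAsyncClient` and `DemoSyncClient` are hypothetical and say nothing about how `CortexClient` is actually implemented.

```python
import asyncio


class DemoAsyncClient:
    """Hypothetical stand-in for an async client; not the library's class."""

    async def push_chunks(self, chunks):
        await asyncio.sleep(0)  # placeholder for a network round trip
        return {"indexed": len(chunks)}


class DemoSyncClient:
    """Blocking facade that runs each async call to completion."""

    def __init__(self):
        self._inner = DemoAsyncClient()

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        return False  # do not suppress exceptions

    def push_chunks(self, chunks):
        # asyncio.run starts a fresh event loop, so this must not be called
        # from code that is already running inside one.
        return asyncio.run(self._inner.push_chunks(chunks))


if __name__ == "__main__":
    with DemoSyncClient() as client:
        print(client.push_chunks(["intro", "setup"]))
```

The trade-off of this pattern is that a wrapper built this way cannot be used from inside a running event loop, which is where the async client belongs.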
Project Structure
aperion-doc-index/
├── src/aperion_archivist/
│ ├── core/
│ │ ├── validator.py # Universal Header enforcement
│ │ ├── linker.py # Broken link detection
│ │ └── scanner.py # Staleness detection
│ ├── generation/
│ │ ├── templates.py # Jinja2 templates
│ │ └── parser.py # Markdown chunking
│ ├── cli/
│ │ └── main.py # CLI entry point
│ └── integrations/
│ └── cortex.py # Vector store client
├── tests/
├── pyproject.toml
├── Dockerfile
└── README.md
Development
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Type check
mypy src
# Lint
ruff check src tests
License
MIT