
scraping-spec-kit

A site-agnostic specification and testing toolkit for scraping automation engineers

What This Is

A governance and testing framework that helps automation engineers:

  • Standardize scraper configurations
  • Manage baselines and detect regressions
  • Enforce best practices via constitutional rules
  • Track versions and maintain audit trails

This is NOT a scraping library. It's a spec kit that works with any scraper you build.

Installation

Option 1: Install with pip

pip install scraper-spec

Option 2: Install with uv (Recommended)

# Install globally as a tool
uv tool install git+https://github.com/rainbowgore/scraping-spec-kit.git

# Or run directly without installing
uvx --from git+https://github.com/rainbowgore/scraping-spec-kit.git scraper-spec --help

Quick Start

# Initialize framework in your project
cd your-scraping-project
scraper-spec setup

# Create a site spec
scraper-spec init example-site

# Edit specs/example-site.yaml with your selectors

# Create baseline
scraper-spec baseline example-site "test query"

# Test for regressions
scraper-spec test example-site

# Release stable version
scraper-spec release example-site

Available Commands

Setup & Initialization

  • scraper-spec setup - Initialize framework structure in current directory
  • scraper-spec check - Check environment and framework setup
  • scraper-spec init <site-name> - Create a new site specification file

Development & Testing

  • scraper-spec discover <site-name> - Start site discovery (creates specs/<site>.discover.yaml with candidates)
  • scraper-spec baseline <site-name> "<query>" - Create baseline artifacts
  • scraper-spec test <site-name> - Run regression test against baseline
  • scraper-spec debug <site-name> "<query>" - Generate debug artifacts

Version Control

  • scraper-spec release <site-name> - Tag stable version and update changelog
  • scraper-spec rollback <site-name> - Restore last stable spec and baseline
  • scraper-spec rebaseline <site-name> "<query>" - Promote debug run to baseline

MCP Integration (Stubs)

  • scraper-spec push <site-name> - Push spec to MCP context
  • scraper-spec pull <site-name> - Pull spec from MCP context

How It Works

The framework enforces a strict spec lifecycle through several mechanisms:

1. Setup Gate

  • Only setup and check commands work before .scraper-spec/ exists
  • All other commands hard-fail with a clear error: "Run: scraper-spec setup"
  • This ensures proper initialization before any work begins

2. Path Rules (Constitution Compliance)

  • All file writes must be in allowed directories: specs/, baselines/, logs/, docs/, framework/
  • Commands abort with specific error if path violates rules
  • Example: "ERROR: Output path X violates constitution (top-level 'bad-dir' not allowed)"
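A minimal sketch of what such a path check could look like. The function name and exact behavior here are assumptions for illustration, not the framework's actual internals; only the list of allowed directories comes from the README.

```python
from pathlib import PurePosixPath

# Allowed top-level directories per the constitution
ALLOWED_TOP_LEVEL = {"specs", "baselines", "logs", "docs", "framework"}

def check_path_allowed(path):
    """Return True if the path's top-level directory is permitted.

    Hypothetical helper illustrating the constitution check; the real
    framework's implementation may differ.
    """
    parts = PurePosixPath(path).parts
    return bool(parts) and parts[0] in ALLOWED_TOP_LEVEL

print(check_path_allowed("specs/example-site.yaml"))  # True
print(check_path_allowed("bad-dir/output.json"))      # False
```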

3. Schema Validation

  • All JSON/YAML artifacts are validated against templates on write:
    • specs/*.yaml → validated against selectors-template.yaml
    • baselines/*.json → validated against baseline-template.json
    • logs/*.json (runtime and debug) → validated against log-template.json / debug-log-template.json
    • Diffs → validated against diff-template.json
  • HTML snapshots are advisory and not schema-validated
  • If validation fails, command aborts with error
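Conceptually, the validate-on-write step works like the toy check below. The real framework validates against the JSON/YAML templates via jsonschema; the required field names here are invented for illustration and the real template contents are not shown.

```python
# Assumed required fields standing in for baseline-template.json
BASELINE_REQUIRED_KEYS = {"site", "query", "results"}

def validate_baseline(artifact):
    """Raise ValueError if required keys are missing, mirroring the
    abort-on-invalid-artifact behavior described above (sketch only)."""
    missing = BASELINE_REQUIRED_KEYS - artifact.keys()
    if missing:
        raise ValueError(f"baseline invalid, missing keys: {sorted(missing)}")

validate_baseline({"site": "example-site", "query": "test query", "results": []})
print("artifact valid")
```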

4. Critical Path Enforcement

  • Plans must include all phases defined in .scraper-spec/config.yaml
  • Default abstract phases: Acquire → Identify → Collect → Extract
  • Both presence AND order are validated
  • Configurable for different scraping approaches (web, API, feed, file)
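Validating both presence and order of phases can be sketched as follows. This is an illustrative stand-in: the real framework reads framework/plan.md and the phases configured in .scraper-spec/config.yaml, and its parsing is certainly more robust than a substring search.

```python
DEFAULT_PHASES = ["Acquire", "Identify", "Collect", "Extract"]

def validate_critical_path(plan_text, phases=DEFAULT_PHASES):
    """Check that every phase name appears in the plan, in order (sketch)."""
    positions = [plan_text.find(p) for p in phases]
    missing = [p for p, pos in zip(phases, positions) if pos == -1]
    if missing:
        return False, "missing phase(s): " + ", ".join(missing)
    if positions != sorted(positions):
        return False, "phases present but out of order"
    return True, "ok"

plan = "## Acquire\n## Identify\n## Collect\n## Extract\n"
print(validate_critical_path(plan))  # (True, 'ok')
```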

5. Versioning & Release Guards

  • release command bumps spec version and updates changelog
  • baseline command blocks if baseline exists (requires SCRAPER_SPEC_CONFIRM=1 or use rebaseline)
  • This prevents accidental overwrites of golden data
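The overwrite guard behaves roughly like this sketch. The SCRAPER_SPEC_CONFIRM variable and the "Baseline exists" error come from the README; the function itself and its exact wording are assumptions.

```python
import os
from pathlib import Path

def guard_baseline(baseline_path):
    """Refuse to overwrite an existing baseline unless confirmed.

    Hypothetical sketch of the guard; the real CLI's internals may differ.
    """
    confirmed = os.environ.get("SCRAPER_SPEC_CONFIRM") == "1"
    if Path(baseline_path).exists() and not confirmed:
        raise SystemExit(
            f"ERROR: Baseline exists at {baseline_path} "
            "(use rebaseline, or set SCRAPER_SPEC_CONFIRM=1 to overwrite)"
        )
```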

6. Regression Testing Loop

  • baseline creates golden artifacts (expected.json, snapshot.html, screenshots, log.json)
  • test compares current outputs to baseline and logs diffs
  • Structured diff files show exactly what changed
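The comparison step can be pictured as a toy key-by-key diff like the one below. The framework's actual diffs follow diff-template.json, whose format is not reproduced here, so this output shape is an assumption.

```python
def diff_outputs(expected, current):
    """Build a structured diff between baseline and current output (sketch)."""
    return {
        "changed": {
            k: {"expected": expected[k], "current": current[k]}
            for k in expected
            if k in current and expected[k] != current[k]
        },
        "missing": sorted(k for k in expected if k not in current),
        "added": sorted(k for k in current if k not in expected),
    }

baseline = {"title": "Example Domain", "result_count": 10}
current = {"title": "Example Domain!", "result_count": 10, "banner": "new"}
print(diff_outputs(baseline, current))
```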

7. Artifact Completeness Checks

  • After baseline: framework verifies expected.json, snapshot.html, screenshot.png, and log.json exist; aborts if any missing
  • After test: framework verifies the diff JSON exists; aborts if missing
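The completeness check amounts to verifying that each expected file exists, roughly as below. The required filenames are taken from the artifact list above, but their exact paths and the helper itself are assumptions for illustration.

```python
import tempfile
from pathlib import Path

def missing_artifacts(base_dir, names):
    """Return the expected artifact names that do not exist under base_dir
    (sketch of the post-baseline completeness check)."""
    return [n for n in names if not (Path(base_dir) / n).exists()]

required = ["expected.json", "snapshot.html", "screenshot.png", "log.json"]

with tempfile.TemporaryDirectory() as d:
    Path(d, "expected.json").write_text("{}")
    # Only expected.json was written, so the other three are reported missing
    print(missing_artifacts(d, required))
```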

Result: Every project follows the same reproducible loop:

setup → check → init → edit spec → baseline → test → release

This is why we call it a spec lifecycle enforcer, not a scraper library.

Framework Structure

After running scraper-spec setup, you'll have:

your-project/
├── .scraper-spec/           # Framework templates and constitution
│   ├── commands/            # Command documentation
│   ├── memory/              # Constitutional rules
│   └── templates/           # File templates
├── specs/                   # Site specifications (.yaml files)
│   └── SCRAPER_SPEC.md     # Framework charter
├── baselines/               # Baseline artifacts
│   └── screenshots/         # Visual baselines
├── logs/                    # Execution logs
│   ├── debug/              # Debug artifacts
│   └── regressions/        # Test results
├── docs/                    # Documentation
│   └── CHANGELOG.md        # Version history
└── framework/               # Your scraper implementation
    └── plan.md             # Development plan

Workflow

1. Discover Phase

Explore the target site and identify selectors:

scraper-spec discover example-site

2. Specify Phase

Edit the generated spec file with your selectors:

# specs/example-site.yaml
site_config:
  base_url: "https://example.com"

selectors:
  search_box: "#search"
  results_list: ".results"
  result_item: ".result-item"

3. Baseline Phase

Create baseline artifacts with a test query:

scraper-spec baseline example-site "test query"

4. Test Phase

Run regression tests to detect changes:

scraper-spec test example-site

5. Release Phase

Lock stable versions:

scraper-spec release example-site

Development

Prerequisites

  • Python 3.8+
  • pip

Installing from Source

git clone https://github.com/rainbowgore/scraping-spec-kit.git
cd scraping-spec-kit
pip install -e .

Running Tests

python -m pytest tests/

Requirements

  • jsonschema>=4.0.0 - For validating JSON artifacts
  • PyYAML>=6.0 - For parsing YAML specifications

Use Cases

Automation Engineers

  • Standardize scraper configurations across projects
  • Detect breaking changes in target sites
  • Maintain audit trails for compliance
  • Enforce coding standards via constitutional rules

Quality Assurance

  • Visual regression testing with screenshots
  • Automated baseline comparison
  • Historical tracking of site changes

Team Collaboration

  • Shared specification format
  • Version-controlled baselines
  • Clear documentation of selectors and expectations

Constitutional Rules

The framework enforces rules defined in .scraper-spec/memory/constitution.md:

  • Allowed top-level directories: specs/, baselines/, logs/, docs/, framework/
  • All artifacts must comply with templates
  • Version control is enforced for releases
  • Critical path phases must be preserved in all plans (configurable in .scraper-spec/config.yaml)

Critical Path Enforcement

The framework validates that all implementation plans follow a defined critical path. By default, the abstract phases are:

  1. Acquire - Connect to data source (web page, API, feed, file)
  2. Identify - Locate target elements or endpoints
  3. Collect - Gather the raw data
  4. Extract - Transform raw data into structured output

These phases are tool-agnostic and work for any scraping approach (browser automation, API crawling, feed parsing, etc.).

Customization: Edit .scraper-spec/config.yaml to define your own phases:

critical_path:
  phases:
    - Connect
    - Discover
    - Retrieve
    - Transform
    - Validate

Advanced Features

Debug Mode

Generate detailed debug artifacts:

scraper-spec debug example-site "test query"

Rollback

Restore previous stable versions:

scraper-spec rollback example-site

Rebaseline

Promote a successful debug run to baseline:

SCRAPER_SPEC_CONFIRM=1 scraper-spec rebaseline example-site "test query"

Troubleshooting

Framework not initialized errors

Problem: ERROR: Framework not initialized in this directory

Solution: Run scraper-spec setup to initialize the framework structure.

Template loading warnings during setup

Problem: "Error loading template" warnings appear when running setup

Solution: These warnings are harmless: the validator attempts to load templates before they have been copied into place. Setup still completes successfully.

Path violation errors

Problem: ERROR: Output path X violates constitution (top-level 'Y' not allowed)

Solution: Ensure your output path starts with one of the allowed directories:

  • specs/ - for site specifications
  • baselines/ - for golden artifacts
  • logs/ - for execution logs
  • docs/ - for documentation
  • framework/ - for implementation

Baseline already exists

Problem: ERROR: Baseline exists at baselines/<site>.expected.json

Solution:

  • Use scraper-spec rebaseline <site> "<query>" to promote a debug run, OR
  • Set SCRAPER_SPEC_CONFIRM=1 environment variable to force overwrite

Missing templates after installation

Problem: Commands fail with template not found errors

Solution:

  1. Reinstall the package: pip install --force-reinstall scraper-spec
  2. Re-run scraper-spec setup in your project

Critical path validation failures

Problem: ERROR: Plan violates constitution (phase 'X' missing from critical path)

Solution: Ensure your framework/plan.md includes all phases from .scraper-spec/config.yaml in order. Default phases: Acquire → Identify → Collect → Extract

Empty logs/ directory

Note: The logs/ directory is empty until you run commands. Logs populate after:

  • baseline → creates logs/<site>_<timestamp>.log.json
  • test → creates logs/regressions/<site>_<query>.diff.json
  • debug → creates files in logs/debug/

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Submit a pull request

License

MIT License - See LICENSE file for details

Acknowledgments

Built for automation engineers who need standardized, testable scraper specifications.
