
scraping-spec-kit

A site-agnostic specification and testing toolkit for scraping automation engineers

What This Is

A governance and testing framework that helps automation engineers:

  • Standardize scraper configurations
  • Manage baselines and detect regressions
  • Enforce best practices via constitutional rules
  • Track versions and maintain audit trails

This is NOT a scraping library. It's a spec kit that works with any scraper you build.

Installation

Option 1: Install with pip

pip install scraper-spec

Option 2: Install with uv (Recommended)

# Install globally as a tool
uv tool install git+https://github.com/rainbowgore/scraping-spec-kit.git

# Or run directly without installing
uvx --from git+https://github.com/rainbowgore/scraping-spec-kit.git scraper-spec --help

Quick Start

# Initialize framework in your project
cd your-scraping-project
scraper-spec setup

# Create a site spec
scraper-spec init example-site

# Edit specs/example-site.yaml with your selectors

# Create baseline
scraper-spec baseline example-site "test query"

# Test for regressions
scraper-spec test example-site

# Release stable version
scraper-spec release example-site

Available Commands

Setup & Initialization

  • scraper-spec setup - Initialize framework structure in current directory
  • scraper-spec check - Check environment and framework setup
  • scraper-spec init <site-name> - Create a new site specification file

Development & Testing

  • scraper-spec discover <site-name> - Start site discovery (creates specs/<site>.discover.yaml with candidates)
  • scraper-spec baseline <site-name> "<query>" - Create baseline artifacts
  • scraper-spec test <site-name> - Run regression test against baseline
  • scraper-spec debug <site-name> "<query>" - Generate debug artifacts

Version Control

  • scraper-spec release <site-name> - Tag stable version and update changelog
  • scraper-spec rollback <site-name> - Restore last stable spec and baseline
  • scraper-spec rebaseline <site-name> "<query>" - Promote debug run to baseline

MCP Integration (Stubs)

  • scraper-spec push <site-name> - Push spec to MCP context
  • scraper-spec pull <site-name> - Pull spec from MCP context

How It Works

The framework enforces a strict spec lifecycle through several mechanisms:

1. Setup Gate

  • Only setup and check commands work before .scraper-spec/ exists
  • All other commands hard-fail with a clear error: "Run: scraper-spec setup"
  • This ensures proper initialization before any work begins

2. Path Rules (Constitution Compliance)

  • All file writes must be in allowed directories: specs/, baselines/, logs/, docs/, framework/
  • Commands abort with specific error if path violates rules
  • Example: "ERROR: Output path X violates constitution (top-level 'bad-dir' not allowed)"
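A minimal sketch of what such a path check could look like. The function name and exact behavior here are assumptions for illustration, not the framework's actual internals; only the list of allowed directories comes from the README.

```python
from pathlib import PurePosixPath

# Allowed top-level directories per the constitution
ALLOWED_TOP_LEVEL = {"specs", "baselines", "logs", "docs", "framework"}

def check_path_allowed(path):
    """Return True if the path's top-level directory is permitted.

    Hypothetical helper illustrating the constitution check; the real
    framework's implementation may differ.
    """
    parts = PurePosixPath(path).parts
    return bool(parts) and parts[0] in ALLOWED_TOP_LEVEL

print(check_path_allowed("specs/example-site.yaml"))  # True
print(check_path_allowed("bad-dir/output.json"))      # False
```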

3. Schema Validation

  • All JSON/YAML artifacts are validated against templates on write:
    • specs/*.yaml → validated against selectors-template.yaml
    • baselines/*.json → validated against baseline-template.json
    • logs/*.json (runtime and debug) → validated against log-template.json / debug-log-template.json
    • Diffs → validated against diff-template.json
  • HTML snapshots are advisory and not schema-validated
  • If validation fails, command aborts with error
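Conceptually, the validate-on-write step works like the toy check below. The real framework validates against the JSON/YAML templates via jsonschema; the required field names here are invented for illustration and the real template contents are not shown.

```python
# Assumed required fields standing in for baseline-template.json
BASELINE_REQUIRED_KEYS = {"site", "query", "results"}

def validate_baseline(artifact):
    """Raise ValueError if required keys are missing, mirroring the
    abort-on-invalid-artifact behavior described above (sketch only)."""
    missing = BASELINE_REQUIRED_KEYS - artifact.keys()
    if missing:
        raise ValueError(f"baseline invalid, missing keys: {sorted(missing)}")

validate_baseline({"site": "example-site", "query": "test query", "results": []})
print("artifact valid")
```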

4. Critical Path Enforcement

  • Plans must include all phases defined in .scraper-spec/config.yaml
  • Default abstract phases: Acquire → Identify → Collect → Extract
  • Both presence AND order are validated
  • Configurable for different scraping approaches (web, API, feed, file)
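Validating both presence and order of phases can be sketched as follows. This is an illustrative stand-in: the real framework reads framework/plan.md and the phases configured in .scraper-spec/config.yaml, and its parsing is certainly more robust than a substring search.

```python
DEFAULT_PHASES = ["Acquire", "Identify", "Collect", "Extract"]

def validate_critical_path(plan_text, phases=DEFAULT_PHASES):
    """Check that every phase name appears in the plan, in order (sketch)."""
    positions = [plan_text.find(p) for p in phases]
    missing = [p for p, pos in zip(phases, positions) if pos == -1]
    if missing:
        return False, "missing phase(s): " + ", ".join(missing)
    if positions != sorted(positions):
        return False, "phases present but out of order"
    return True, "ok"

plan = "## Acquire\n## Identify\n## Collect\n## Extract\n"
print(validate_critical_path(plan))  # (True, 'ok')
```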

5. Versioning & Release Guards

  • release command bumps spec version and updates changelog
  • baseline command blocks if baseline exists (requires SCRAPER_SPEC_CONFIRM=1 or use rebaseline)
  • This prevents accidental overwrites of golden data
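The overwrite guard behaves roughly like this sketch. The SCRAPER_SPEC_CONFIRM variable and the "Baseline exists" error come from the README; the function itself and its exact wording are assumptions.

```python
import os
from pathlib import Path

def guard_baseline(baseline_path):
    """Refuse to overwrite an existing baseline unless confirmed.

    Hypothetical sketch of the guard; the real CLI's internals may differ.
    """
    confirmed = os.environ.get("SCRAPER_SPEC_CONFIRM") == "1"
    if Path(baseline_path).exists() and not confirmed:
        raise SystemExit(
            f"ERROR: Baseline exists at {baseline_path} "
            "(use rebaseline, or set SCRAPER_SPEC_CONFIRM=1 to overwrite)"
        )
```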

6. Regression Testing Loop

  • baseline creates golden artifacts (expected.json, snapshot.html, screenshots, log.json)
  • test compares current outputs to baseline and logs diffs
  • Structured diff files show exactly what changed
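The comparison step can be pictured as a toy key-by-key diff like the one below. The framework's actual diffs follow diff-template.json, whose format is not reproduced here, so this output shape is an assumption.

```python
def diff_outputs(expected, current):
    """Build a structured diff between baseline and current output (sketch)."""
    return {
        "changed": {
            k: {"expected": expected[k], "current": current[k]}
            for k in expected
            if k in current and expected[k] != current[k]
        },
        "missing": sorted(k for k in expected if k not in current),
        "added": sorted(k for k in current if k not in expected),
    }

baseline = {"title": "Example Domain", "result_count": 10}
current = {"title": "Example Domain!", "result_count": 10, "banner": "new"}
print(diff_outputs(baseline, current))
```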

7. Artifact Completeness Checks

  • After baseline: framework verifies expected.json, snapshot.html, screenshot.png, and log.json exist; aborts if any missing
  • After test: framework verifies the diff JSON exists; aborts if missing
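The completeness check amounts to verifying that each expected file exists, roughly as below. The required filenames are taken from the artifact list above, but their exact paths and the helper itself are assumptions for illustration.

```python
import tempfile
from pathlib import Path

def missing_artifacts(base_dir, names):
    """Return the expected artifact names that do not exist under base_dir
    (sketch of the post-baseline completeness check)."""
    return [n for n in names if not (Path(base_dir) / n).exists()]

required = ["expected.json", "snapshot.html", "screenshot.png", "log.json"]

with tempfile.TemporaryDirectory() as d:
    Path(d, "expected.json").write_text("{}")
    # Only expected.json was written, so the other three are reported missing
    print(missing_artifacts(d, required))
```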

Result: Every project follows the same reproducible loop:

setup → check → init → edit spec → baseline → test → release

This is why we call it a spec lifecycle enforcer, not a scraper library.

Framework Structure

After running scraper-spec setup, you'll have:

your-project/
├── .scraper-spec/           # Framework templates and constitution
│   ├── commands/            # Command documentation
│   ├── memory/              # Constitutional rules
│   └── templates/           # File templates
├── specs/                   # Site specifications (.yaml files)
│   └── SCRAPER_SPEC.md     # Framework charter
├── baselines/               # Baseline artifacts
│   └── screenshots/         # Visual baselines
├── logs/                    # Execution logs
│   ├── debug/              # Debug artifacts
│   └── regressions/        # Test results
├── docs/                    # Documentation
│   └── CHANGELOG.md        # Version history
└── framework/               # Your scraper implementation
    └── plan.md             # Development plan

Workflow

1. Discover Phase

Explore the target site and identify selectors:

scraper-spec discover example-site

2. Specify Phase

Edit the generated spec file with your selectors:

# specs/example-site.yaml
site_config:
  base_url: "https://example.com"

selectors:
  search_box: "#search"
  results_list: ".results"
  result_item: ".result-item"

3. Baseline Phase

Create baseline artifacts with a test query:

scraper-spec baseline example-site "test query"

4. Test Phase

Run regression tests to detect changes:

scraper-spec test example-site

5. Release Phase

Lock stable versions:

scraper-spec release example-site

Development

Prerequisites

  • Python 3.8+
  • pip

Installing from Source

git clone https://github.com/rainbowgore/scraping-spec-kit.git
cd scraping-spec-kit
pip install -e .

Running Tests

python -m pytest tests/

Requirements

  • jsonschema>=4.0.0 - For validating JSON artifacts
  • PyYAML>=6.0 - For parsing YAML specifications

Use Cases

Automation Engineers

  • Standardize scraper configurations across projects
  • Detect breaking changes in target sites
  • Maintain audit trails for compliance
  • Enforce coding standards via constitutional rules

Quality Assurance

  • Visual regression testing with screenshots
  • Automated baseline comparison
  • Historical tracking of site changes

Team Collaboration

  • Shared specification format
  • Version-controlled baselines
  • Clear documentation of selectors and expectations

Constitutional Rules

The framework enforces rules defined in .scraper-spec/memory/constitution.md:

  • Allowed top-level directories: specs/, baselines/, logs/, docs/, framework/
  • All artifacts must comply with templates
  • Version control is enforced for releases
  • Critical path phases must be preserved in all plans (configurable in .scraper-spec/config.yaml)

Critical Path Enforcement

The framework validates that all implementation plans follow a defined critical path. By default, the abstract phases are:

  1. Acquire - Connect to data source (web page, API, feed, file)
  2. Identify - Locate target elements or endpoints
  3. Collect - Gather the raw data
  4. Extract - Transform raw data into structured output

These phases are tool-agnostic and work for any scraping approach (browser automation, API crawling, feed parsing, etc.).

Customization: Edit .scraper-spec/config.yaml to define your own phases:

critical_path:
  phases:
    - Connect
    - Discover
    - Retrieve
    - Transform
    - Validate

Advanced Features

Debug Mode

Generate detailed debug artifacts:

scraper-spec debug example-site "test query"

Rollback

Restore previous stable versions:

scraper-spec rollback example-site

Rebaseline

Promote a successful debug run to baseline:

SCRAPER_SPEC_CONFIRM=1 scraper-spec rebaseline example-site "test query"

Troubleshooting

Framework not initialized errors

Problem: ERROR: Framework not initialized in this directory

Solution: Run scraper-spec setup to initialize the framework structure.

Template loading warnings during setup

Problem: "Error loading template" warnings appear when running setup

Solution: These warnings are harmless: the validator attempts to load templates before they have been copied into place. Setup still completes successfully.

Path violation errors

Problem: ERROR: Output path X violates constitution (top-level 'Y' not allowed)

Solution: Ensure your output path starts with one of the allowed directories:

  • specs/ - for site specifications
  • baselines/ - for golden artifacts
  • logs/ - for execution logs
  • docs/ - for documentation
  • framework/ - for implementation

Baseline already exists

Problem: ERROR: Baseline exists at baselines/<site>.expected.json

Solution:

  • Use scraper-spec rebaseline <site> "<query>" to promote a debug run, OR
  • Set SCRAPER_SPEC_CONFIRM=1 environment variable to force overwrite

Missing templates after installation

Problem: Commands fail with template not found errors

Solution:

  1. Reinstall the package: pip install --force-reinstall scraper-spec
  2. Re-run scraper-spec setup in your project

Critical path validation failures

Problem: ERROR: Plan violates constitution (phase 'X' missing from critical path)

Solution: Ensure your framework/plan.md includes all phases from .scraper-spec/config.yaml in order. Default phases: Acquire → Identify → Collect → Extract

Empty logs/ directory

Note: The logs/ directory is empty until you run commands. Logs populate after:

  • baseline → creates logs/<site>_<timestamp>.log.json
  • test → creates logs/regressions/<site>_<query>.diff.json
  • debug → creates files in logs/debug/

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Submit a pull request

License

MIT License - See LICENSE file for details

Acknowledgments

Built for automation engineers who need standardized, testable scraper specifications.
