scraping-spec-kit
Site-agnostic specification framework for scraping automation engineers
A comprehensive toolkit for web scraping specifications and automation
What This Is
A governance and testing framework that helps automation engineers:
- Standardize scraper configurations
- Manage baselines and detect regressions
- Enforce best practices via constitutional rules
- Track versions and maintain audit trails
This is NOT a scraping library. It's a spec kit that works with any scraper you build.
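Because it is scraper-agnostic, your own code simply reads the specs the kit manages. Below is a minimal sketch, assuming the example spec format shown in the Workflow section and PyYAML (already a dependency); the scraper itself is entirely up to you.
import yaml  # PyYAML, listed under Requirements

with open("specs/example-site.yaml", encoding="utf-8") as fh:
    spec = yaml.safe_load(fh)

site = spec["site_config"]
base_url = site["base_url"]        # e.g. "https://example.com"
selectors = site["selectors"]      # e.g. {"search_box": "#search", ...}

# Hand these to whatever automation you use (browser driver, API client,
# feed parser); the spec kit only standardizes and version-controls them.
print(base_url, selectors["search_box"])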
Installation
Option 1: Install with pip
pip install scraper-spec
Option 2: Install with uv (Recommended)
# Install globally as a tool
uv tool install git+https://github.com/rainbowgore/scraping-spec-kit.git
# Or run directly without installing
uvx --from git+https://github.com/rainbowgore/scraping-spec-kit.git scraper-spec --help
Quick Start
# Initialize framework in your project
cd your-scraping-project
scraper-spec setup
# Create a site spec
scraper-spec init example-site
# Edit specs/example-site.yaml with your selectors
# Create baseline
scraper-spec baseline example-site "test query"
# Test for regressions
scraper-spec test example-site
# Release stable version
scraper-spec release example-site
Available Commands
Setup & Initialization
- scraper-spec setup - Initialize framework structure in current directory
- scraper-spec check - Check environment and framework setup
- scraper-spec init <site-name> - Create a new site specification file
Development & Testing
- scraper-spec discover <site-name> - Start site discovery (creates specs/<site>.discover.yaml with candidates)
- scraper-spec baseline <site-name> "<query>" - Create baseline artifacts
- scraper-spec test <site-name> - Run regression test against baseline
- scraper-spec debug <site-name> "<query>" - Generate debug artifacts
Version Control
- scraper-spec release <site-name> - Tag stable version and update changelog
- scraper-spec rollback <site-name> - Restore last stable spec and baseline
- scraper-spec rebaseline <site-name> "<query>" - Promote debug run to baseline
MCP Integration (Stubs)
- scraper-spec push <site-name> - Push spec to MCP context
- scraper-spec pull <site-name> - Pull spec from MCP context
How It Works
The framework enforces a strict spec lifecycle through multiple enforcement mechanisms:
1. Setup Gate
- Only setup and check commands work before .scraper-spec/ exists
- All other commands hard-fail with clear error: "Run: scraper-spec setup"
- This ensures proper initialization before any work begins
2. Path Rules (Constitution Compliance)
- All file writes must be in allowed directories: specs/, baselines/, logs/, docs/, framework/
- Commands abort with specific error if path violates rules
- Example: "ERROR: Output path X violates constitution (top-level 'bad-dir' not allowed)"
3. Schema Validation
- All JSON/YAML artifacts are validated against templates on write:
  - specs/*.yaml → validated against selectors-template.yaml
  - baselines/*.json → validated against baseline-template.json
  - logs/*.json (runtime and debug) → validated against log-template.json / debug-log-template.json
  - Diffs → validated against diff-template.json
- HTML snapshots are advisory and not schema-validated
- If validation fails, command aborts with error
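Validation leans on jsonschema (see Requirements). The sketch below shows the kind of check that happens on write; treating the templates under .scraper-spec/templates/ as plain JSON Schema documents is an assumption for illustration, not a statement about the framework's internals.
import json
from jsonschema import ValidationError, validate

# Assumed locations: templates are copied into .scraper-spec/templates/ by
# `scraper-spec setup`; baseline artifacts live under baselines/.
with open(".scraper-spec/templates/baseline-template.json") as fh:
    schema = json.load(fh)
with open("baselines/example-site.expected.json") as fh:
    artifact = json.load(fh)

try:
    validate(instance=artifact, schema=schema)
except ValidationError as err:
    # At this point the real command aborts with an error.
    raise SystemExit(f"ERROR: artifact does not match template: {err.message}")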
4. Critical Path Enforcement
- Plans must include all phases defined in .scraper-spec/config.yaml
- Default abstract phases: Acquire → Identify → Collect → Extract
- Both presence AND order are validated (see the sketch below)
- Configurable for different scraping approaches (web, API, feed, file)
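The presence-and-order check is easy to picture. A rough illustration (not the framework's actual code), assuming the critical_path.phases layout of .scraper-spec/config.yaml shown further down:
import yaml

with open(".scraper-spec/config.yaml") as fh:
    phases = yaml.safe_load(fh)["critical_path"]["phases"]

with open("framework/plan.md", encoding="utf-8") as fh:
    plan_text = fh.read()

# Presence: every phase name must occur; order: first occurrences must be ascending.
positions = [plan_text.find(phase) for phase in phases]
if -1 in positions:
    raise SystemExit(f"ERROR: phase '{phases[positions.index(-1)]}' missing from plan")
if positions != sorted(positions):
    raise SystemExit("ERROR: phases are present but out of order")
print("Plan follows the critical path:", " -> ".join(phases))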
5. Versioning & Release Guards
- release command bumps spec version and updates changelog
- baseline command blocks if baseline exists (requires SCRAPER_SPEC_CONFIRM=1 or use rebaseline)
- This prevents accidental overwrites of golden data
6. Regression Testing Loop
- baseline creates golden artifacts (expected.json, snapshot.html, screenshots, log.json)
- test compares current outputs to baseline and logs diffs
- Structured diff files show exactly what changed (a conceptual sketch follows)
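Conceptually, the comparison that test performs looks like the sketch below. The real command writes its structured diff under logs/regressions/; the field names and the current-output path here are illustrative only.
import json

with open("baselines/example-site.expected.json") as fh:
    expected = json.load(fh)
# Hypothetical path for the fresh extraction being compared; the real tool
# produces and manages this itself during `scraper-spec test`.
with open("logs/example-site.current.json") as fh:
    current = json.load(fh)

# Assumes both artifacts are flat JSON objects; nested data needs a deeper walk.
diff = {
    "missing": sorted(set(expected) - set(current)),
    "unexpected": sorted(set(current) - set(expected)),
    "changed": {
        key: {"expected": expected[key], "actual": current[key]}
        for key in expected.keys() & current.keys()
        if expected[key] != current[key]
    },
}
print(json.dumps(diff, indent=2))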
7. Artifact Completeness Checks
- After baseline: framework verifies expected.json, snapshot.html, screenshot.png, and log.json exist; aborts if any missing
- After test: framework verifies the diff JSON exists; aborts if missing
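A sketch of what such a completeness check amounts to; glob patterns are used because the exact artifact filenames (timestamps, query slugs) are not spelled out here.
from pathlib import Path

site = "example-site"
checks = {
    "expected.json": list(Path("baselines").glob(f"{site}*.expected.json")),
    "snapshot.html": list(Path("baselines").glob(f"{site}*.html")),
    "screenshot.png": list(Path("baselines/screenshots").glob(f"{site}*.png")),
    "log.json": list(Path("logs").glob(f"{site}_*.log.json")),
}
missing = [name for name, matches in checks.items() if not matches]
if missing:
    raise SystemExit("ERROR: baseline incomplete, missing: " + ", ".join(missing))
print("All baseline artifacts present")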
Result: Every project follows the same reproducible loop:
setup → check → init → edit spec → baseline → test → release
This is why we call it a spec lifecycle enforcer, not a scraper library.
Framework Structure
After running scraper-spec setup, you'll have:
your-project/
├── .scraper-spec/ # Framework templates and constitution
│ ├── commands/ # Command documentation
│ ├── memory/ # Constitutional rules
│ └── templates/ # File templates
├── specs/ # Site specifications (.yaml files)
│ └── SCRAPER_SPEC.md # Framework charter
├── baselines/ # Baseline artifacts
│ └── screenshots/ # Visual baselines
├── logs/ # Execution logs
│ ├── debug/ # Debug artifacts
│ └── regressions/ # Test results
├── docs/ # Documentation
│ └── CHANGELOG.md # Version history
└── framework/ # Your scraper implementation
└── plan.md # Development plan
Workflow
1. Discover Phase
Explore the target site and identify selectors:
scraper-spec discover example-site
2. Specify Phase
Edit the generated spec file with your selectors:
# specs/example-site.yaml
site_config:
  base_url: "https://example.com"
  selectors:
    search_box: "#search"
    results_list: ".results"
    result_item: ".result-item"
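Before creating a baseline, it can help to confirm the selectors actually match something. An optional sketch using requests and BeautifulSoup, neither of which is a dependency of this kit:
import yaml
import requests                      # pip install requests beautifulsoup4
from bs4 import BeautifulSoup

with open("specs/example-site.yaml") as fh:
    site = yaml.safe_load(fh)["site_config"]

html = requests.get(site["base_url"], timeout=30).text
soup = BeautifulSoup(html, "html.parser")

for name, css in site["selectors"].items():
    matches = soup.select(css)
    print(f"{name}: {css} -> {len(matches)} match(es)")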
3. Baseline Phase
Create baseline artifacts with a test query:
scraper-spec baseline example-site "test query"
4. Test Phase
Run regression tests to detect changes:
scraper-spec test example-site
5. Release Phase
Lock stable versions:
scraper-spec release example-site
Development
Prerequisites
- Python 3.8+
- pip
Installing from Source
git clone https://github.com/rainbowgore/scraping-spec-kit.git
cd scraping-spec-kit
pip install -e .
Running Tests
python -m pytest tests/
Requirements
- jsonschema>=4.0.0 - For validating JSON artifacts
- PyYAML>=6.0 - For parsing YAML specifications
Use Cases
Automation Engineers
- Standardize scraper configurations across projects
- Detect breaking changes in target sites
- Maintain audit trails for compliance
- Enforce coding standards via constitutional rules
Quality Assurance
- Visual regression testing with screenshots
- Automated baseline comparison
- Historical tracking of site changes
Team Collaboration
- Shared specification format
- Version-controlled baselines
- Clear documentation of selectors and expectations
Constitutional Rules
The framework enforces rules defined in .scraper-spec/memory/constitution.md:
- Allowed top-level directories: specs/, baselines/, logs/, docs/, framework/
- All artifacts must comply with templates
- Version control is enforced for releases
- Critical path phases must be preserved in all plans (configurable in .scraper-spec/config.yaml)
Critical Path Enforcement
The framework validates that all implementation plans follow a defined critical path. By default, the abstract phases are:
- Acquire - Connect to data source (web page, API, feed, file)
- Identify - Locate target elements or endpoints
- Collect - Gather the raw data
- Extract - Transform raw data into structured output
These phases are tool-agnostic and work for any scraping approach (browser automation, API crawling, feed parsing, etc.).
Customization: Edit .scraper-spec/config.yaml to define your own phases:
critical_path:
  phases:
    - Connect
    - Discover
    - Retrieve
    - Transform
    - Validate
Advanced Features
Debug Mode
Generate detailed debug artifacts:
scraper-spec debug example-site "test query"
Rollback
Restore previous stable versions:
scraper-spec rollback example-site
Rebaseline
Promote a successful debug run to baseline:
SCRAPER_SPEC_CONFIRM=1 scraper-spec rebaseline example-site "test query"
Troubleshooting
Framework not initialized errors
Problem: ERROR: Framework not initialized in this directory
Solution: Run scraper-spec setup to initialize the framework structure.
Template loading warnings during setup
Problem: See "Error loading template" warnings when running setup
Solution: These are harmless. The validator tries to load templates before they're copied. The setup will complete successfully.
Path violation errors
Problem: ERROR: Output path X violates constitution (top-level 'Y' not allowed)
Solution: Ensure your output path starts with one of the allowed directories:
- specs/ - for site specifications
- baselines/ - for golden artifacts
- logs/ - for execution logs
- docs/ - for documentation
- framework/ - for implementation
Baseline already exists
Problem: ERROR: Baseline exists at baselines/<site>.expected.json
Solution:
- Use scraper-spec rebaseline <site> "<query>" to promote a debug run, OR
- Set the SCRAPER_SPEC_CONFIRM=1 environment variable to force overwrite
Missing templates after installation
Problem: Commands fail with template not found errors
Solution:
- Reinstall the package: pip install --force-reinstall scraper-spec
- Re-run scraper-spec setup in your project
Critical path validation failures
Problem: ERROR: Plan violates constitution (phase 'X' missing from critical path)
Solution: Ensure your framework/plan.md includes all phases from .scraper-spec/config.yaml in order.
Default phases: Acquire → Identify → Collect → Extract
Empty logs/ directory
Note: The logs/ directory is empty until you run commands. Logs populate after:
- baseline → creates logs/<site>_<timestamp>.log.json
- test → creates logs/regressions/<site>_<query>.diff.json
- debug → creates files in logs/debug/
Documentation
- Framework Charter - After running setup
- Command Reference - After running setup
- Constitutional Rules - After running setup
- Critical Path Guide - Detailed critical path documentation
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
License
MIT License - See LICENSE file for details
Support
- Issues: https://github.com/rainbowgore/scraping-spec-kit/issues
- Documentation: https://github.com/rainbowgore/scraping-spec-kit/blob/main/README.md
Acknowledgments
Built for automation engineers who need standardized, testable scraper specifications.
Download files
File details
Details for the file scraper_spec-0.1.0.tar.gz.
File metadata
- Download URL: scraper_spec-0.1.0.tar.gz
- Upload date:
- Size: 27.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b36ac000e2fc6f7617961d4e31147156cb1cc7ef011aa956ba45fd0be9400e6e |
| MD5 | 70c5e246b4925c4c127e8e11df2b2ce8 |
| BLAKE2b-256 | c77865a22070d32b41bcf0bae1b846d67d7318ba815f573d6082a8541b8ae075 |
File details
Details for the file scraper_spec-0.1.0-py3-none-any.whl.
File metadata
- Download URL: scraper_spec-0.1.0-py3-none-any.whl
- Upload date:
- Size: 25.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ec089139a3cc936bd5a28b40e8a61adddb0316ae3a70c89615cdb558a763d0f1 |
| MD5 | a706d7c1f600ef767ed09cd0cf7ed572 |
| BLAKE2b-256 | ae820f826a53d2de792b543a398d09c74e51cb0a510aea0b526896c156ef9107 |