Lightweight web crawler library that converts HTML to Markdown

docu-crawler: Python Web Crawler for HTML to Markdown Conversion

docu-crawler is a fast, lightweight, production-ready Python library for crawling websites and converting HTML to clean, readable Markdown. It extracts, converts, and stores web content with minimal dependencies, making it well suited to documentation extraction, content migration, and offline reading.

What is docu-crawler?

docu-crawler is a specialized Python web crawler library designed for:

  • Web Crawling: Systematically crawl websites while respecting robots.txt
  • HTML to Markdown Conversion: Convert HTML pages to clean, readable Markdown format
  • Content Extraction: Extract and preserve website structure and content
  • Multi-Cloud Storage: Store crawled content locally or in cloud storage (AWS S3, Google Cloud Storage, Azure Blob Storage, SFTP)
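
As a quick illustration of the HTML to Markdown conversion (indicative only; the exact output can differ), an HTML fragment such as:

<h1>Getting Started</h1>
<p>Install with <code>pip</code>, then run the crawler.</p>

comes out as Markdown along these lines:

# Getting Started

Install with `pip`, then run the crawler.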

Why Choose docu-crawler?

  • Minimal Dependencies: Only requires requests and beautifulsoup4 for core functionality
  • Easy to Use: Simple Python API and CLI interface
  • Production Ready: Built-in retry logic, rate limiting, and error handling
  • Extensible: Plugin-based storage system, easy to add custom backends
  • Cross-Platform: Works on Linux, Windows, and macOS

Key Features

  • 🚀 Lightweight: Minimal dependencies (only requests and beautifulsoup4 required)
  • ☁️ Multi-Cloud Storage: Support for local filesystem, AWS S3, Google Cloud Storage, Azure Blob Storage, and SFTP
  • 🔄 Flexible API: Use as a Python library or CLI tool
  • 📝 HTML to Markdown: Intelligent conversion preserving structure and formatting
  • 🤖 Robots.txt Support: Respects robots.txt and crawl-delay directives
  • ⚡ Performance: Configurable rate limiting and retry logic
  • 🛡️ Cross-Platform: Works on Linux, Windows, and macOS
  • 📦 Optional Dependencies: Install only what you need

Installation

Basic Installation

pip install docu-crawler

This installs only the core dependencies (requests and beautifulsoup4).

With Optional Features

pip install docu-crawler[yaml]      # YAML config file support
pip install docu-crawler[s3]        # AWS S3 storage
pip install docu-crawler[gcs]       # Google Cloud Storage
pip install docu-crawler[azure]     # Azure Blob Storage
pip install docu-crawler[sftp]      # SFTP storage
pip install docu-crawler[async]     # Async support
pip install docu-crawler[all]       # Install everything

Quick Start Guide

Python Library Usage

from docu_crawler import crawl_to_local

result = crawl_to_local("https://docs.example.com", output_dir="my_docs")
print(f"Crawled {result['pages_crawled']} pages")

Command Line Interface

docu-crawler https://docs.example.com --output my-docs --delay 2 --max-pages 100

Usage Examples

Basic Web Crawling

from docu_crawler import DocuCrawler

crawler = DocuCrawler(
    start_url="https://docs.example.com",
    output_dir="downloaded_docs",
    delay=1.0,
    max_pages=100
)
crawler.crawl()

Crawl to Cloud Storage

from docu_crawler import crawl_to_s3, crawl_to_gcs, crawl_to_azure, crawl_to_sftp

# AWS S3
crawl_to_s3(url="https://docs.example.com", bucket="my-bucket", region="us-east-1")

# Google Cloud Storage
crawl_to_gcs(url="https://docs.example.com", bucket="my-bucket", project="my-project")

# Azure Blob Storage
crawl_to_azure(url="https://docs.example.com", container="my-container")

# SFTP
crawl_to_sftp(url="https://docs.example.com", host="sftp.example.com", user="username")

Advanced Usage with Callbacks

from docu_crawler import crawl

def on_page_crawled(url, page_count):
    print(f"Page {page_count}: {url}")

def on_error(url, error):
    print(f"Error crawling {url}: {error}")

result = crawl(
    url="https://docs.example.com",
    output_dir="my_docs",
    on_page_crawled=on_page_crawled,
    on_error=on_error
)

Storage Backends

Local Filesystem (Default)

No additional dependencies required. Files are saved to the specified output directory.

from docu_crawler import crawl_to_local
result = crawl_to_local("https://docs.example.com", output_dir="docs")

AWS S3

Requires: pip install docu-crawler[s3]

from docu_crawler import crawl_to_s3

result = crawl_to_s3(
    url="https://docs.example.com",
    bucket="my-bucket",
    region="us-east-1"
)

Credentials via: Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY), AWS credentials file, or IAM roles.

Google Cloud Storage

Requires: pip install docu-crawler[gcs]

from docu_crawler import crawl_to_gcs

result = crawl_to_gcs(
    url="https://docs.example.com",
    bucket="my-bucket",
    project="my-project"
)

Credentials via: GOOGLE_APPLICATION_CREDENTIALS environment variable or service account key file.

Azure Blob Storage

Requires: pip install docu-crawler[azure]

from docu_crawler import crawl_to_azure

result = crawl_to_azure(
    url="https://docs.example.com",
    container="my-container"
)

Credentials via: AZURE_STORAGE_CONNECTION_STRING environment variable or connection string parameter.

SFTP

Requires: pip install docu-crawler[sftp]

from docu_crawler import crawl_to_sftp

result = crawl_to_sftp(
    url="https://docs.example.com",
    host="sftp.example.com",
    user="username",
    password="password"  # or key_file="/path/to/key"
)

API Reference

DocuCrawler Class

Main crawler class for programmatic use.

from docu_crawler import DocuCrawler

DocuCrawler(
    start_url: str,
    output_dir: str = "downloaded_docs",
    delay: float = 1.0,
    max_pages: int = 0,
    timeout: int = 10,
    storage_config: Optional[Dict[str, Any]] = None
)

Call crawl() on an instance to start crawling (see Basic Web Crawling above for a complete example).

Convenience Functions

  • crawl(url, output_dir, delay, max_pages, timeout, storage_config, on_page_crawled, on_error) - General-purpose crawl function
  • crawl_to_local(url, output_dir, **kwargs) - Crawl to local filesystem
  • crawl_to_s3(url, bucket, region=None, **kwargs) - Crawl to AWS S3
  • crawl_to_gcs(url, bucket, project=None, credentials=None, **kwargs) - Crawl to Google Cloud Storage
  • crawl_to_azure(url, container, connection_string=None, **kwargs) - Crawl to Azure Blob Storage
  • crawl_to_sftp(url, host, user, password=None, port=22, key_file=None, remote_path='', **kwargs) - Crawl via SFTP
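
All of these accept additional crawl options via **kwargs. A sketch of one such call (parameter values are illustrative, and it assumes extra keywords are forwarded to the underlying crawl, as the signatures suggest):

from docu_crawler import crawl_to_s3

result = crawl_to_s3(
    url="https://docs.example.com",
    bucket="my-bucket",
    region="us-east-1",
    delay=2.0,      # forwarded crawl option
    max_pages=50,   # forwarded crawl option
)
print(f"Crawled {result['pages_crawled']} pages")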

Configuration

Environment Variables

  • AWS S3: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION
  • GCS: GOOGLE_APPLICATION_CREDENTIALS
  • Azure: AZURE_STORAGE_CONNECTION_STRING
  • SFTP: SFTP_PASSWORD, SFTP_KEY_FILE
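
For example, AWS credentials can be supplied through the environment before starting a crawl (the values below are placeholders; in practice they come from your shell, a .env file, or CI secrets):

import os
from docu_crawler import crawl_to_s3

os.environ["AWS_ACCESS_KEY_ID"] = "your-access-key"        # placeholder
os.environ["AWS_SECRET_ACCESS_KEY"] = "your-secret-key"    # placeholder
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"

result = crawl_to_s3(url="https://docs.example.com", bucket="my-bucket")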

Configuration File

Create crawler_config.yaml (YAML config support requires pip install docu-crawler[yaml]):

url: https://docs.example.com
output: downloaded_docs
delay: 1.0
max_pages: 0
timeout: 10
log_level: INFO
storage_type: s3
s3_bucket: my-bucket
s3_region: us-east-1

Config file locations (checked in order):

  1. ./crawler_config.yaml
  2. ./config/crawler_config.yaml
  3. ~/.config/docu-crawler/config.yaml
  4. /etc/docu-crawler/config.yaml
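
The lookup amounts to a first-match search over these paths. An illustrative sketch (not the library's internal code):

from pathlib import Path
from typing import Optional

# Documented search order; the first existing file wins
CONFIG_PATHS = [
    Path("crawler_config.yaml"),
    Path("config/crawler_config.yaml"),
    Path.home() / ".config" / "docu-crawler" / "config.yaml",
    Path("/etc/docu-crawler/config.yaml"),
]

def find_config() -> Optional[Path]:
    for path in CONFIG_PATHS:
        if path.is_file():
            return path
    return None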

Advanced Features

Robots.txt Support

Automatically respects robots.txt files and crawl-delay directives.
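
No extra code is needed for this; for reference, the standard-library mechanism behind such checks looks roughly like the following (an illustrative sketch, with a hypothetical user-agent string):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://docs.example.com/robots.txt")
rp.read()

# Check whether a URL may be fetched, then honor any crawl-delay directive
if rp.can_fetch("docu-crawler", "https://docs.example.com/page"):
    delay = rp.crawl_delay("docu-crawler")  # None if the site sets no delay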

Rate Limiting

Built-in rate limiting to respect server limits:

from docu_crawler.utils.rate_limiter import RateLimiter

limiter = RateLimiter(rate=10, per=60)  # 10 requests per minute
limiter.wait_if_needed(domain="example.com")

Retry Logic

Automatic retry with exponential backoff:

import requests
from docu_crawler.utils.retry import retry_with_backoff

@retry_with_backoff(max_retries=3, initial_delay=1.0)
def fetch_url(url):
    # Raise on HTTP errors so failed requests are retried with backoff
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text

Use Cases

Documentation Sites

Crawl and convert documentation websites to Markdown for offline reading or migration.

Content Migration

Migrate content from one platform to another by converting HTML to Markdown.

Offline Documentation

Create offline versions of online documentation for local access.

Content Analysis

Extract and analyze web content programmatically.

Frequently Asked Questions

How do I install docu-crawler?

pip install docu-crawler

For all features: pip install docu-crawler[all]

What Python versions are supported?

Python 3.9 and above.

Can I use it without cloud storage?

Yes! Local filesystem storage is the default and requires no additional dependencies.

Does it respect robots.txt?

Yes, docu-crawler automatically respects robots.txt files and crawl-delay directives.

Can I customize the HTML to Markdown conversion?

Yes, the HTML processor can be extended and customized for your specific needs.

Is it fast?

docu-crawler keeps its overhead low with minimal dependencies; in practice, crawl speed is bounded mainly by the rate limiting you configure out of politeness to servers.

Troubleshooting

Import Error: Install docu-crawler: pip install docu-crawler

Storage Backend Error: Install the required backend: pip install docu-crawler[s3]

YAML Config Error: Install YAML support: pip install docu-crawler[yaml]

SSL Certificate Error: The crawler verifies SSL certificates by default. Make sure the target site presents a valid certificate and that your local CA bundle is up to date.

Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines.

Changelog

See CHANGELOG.md for version history.

License

MIT License - see LICENSE file for details.
