Sitemap crawler and URL discovery tool built with Rust (256 workers, sharded frontier, WAL persistence, optional Redis)

Project description

Rust Sitemap Crawler

A concurrent web crawler written in Rust: 256 concurrent workers, a sharded frontier, persistent state backed by a write-ahead log (WAL), and optional distributed crawling via Redis.

Available as both a standalone CLI tool and a Python package.

Install

Python Package (Recommended)

pip install rustmapper

From Source (Rust)

cargo build --release

Usage

Python Package

Once installed via pip, use the rustmapper command:

# Basic crawl
rustmapper crawl --start-url example.com

# With options
rustmapper crawl --start-url example.com --workers 128 --timeout 10

# Resume
rustmapper resume --data-dir ./data

# Export sitemap
rustmapper export-sitemap --data-dir ./data --output sitemap.xml

Python API

from rustmapper import Crawler

# Create a crawler instance
crawler = Crawler(
    start_url="https://example.com",
    data_dir="./data",
    workers=256,
    timeout=20,
    ignore_robots=False
)

# Start crawling
results = crawler.crawl()
print(f"Discovered: {results['discovered']}, Processed: {results['processed']}")

# Export to sitemap
crawler.export_sitemap(
    output="sitemap.xml",
    include_lastmod=True,
    include_changefreq=True,
    default_priority=0.5
)

Rust CLI (from source)

# Basic crawl
cargo run --release -- crawl --start-url example.com

# With options
cargo run --release -- crawl --start-url example.com --workers 128 --timeout 10

# Resume
cargo run --release -- resume --data-dir ./data

# Export sitemap
cargo run --release -- export-sitemap --data-dir ./data --output sitemap.xml

Options

Flag                Default   Description
--start-url         required  Starting URL
--workers           256       Number of concurrent requests
--timeout           20        Request timeout in seconds
--data-dir          ./data    Storage location for crawl state
--seeding-strategy  all       One of none / sitemap / ct / commoncrawl / all
--ignore-robots     false     Skip robots.txt checks
--enable-redis      false     Enable distributed mode via Redis
--redis-url         -         Redis connection URL

Seeding Strategies

  • none - Only start URL
  • sitemap - Discover from sitemap.xml
  • ct - Certificate Transparency logs (finds subdomains; see the sketch after this list)
  • commoncrawl - Query Common Crawl index
  • all - Use all methods
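
For illustration, the ct strategy amounts to asking a Certificate Transparency index which hostnames appear in certificates issued for the target domain. The sketch below does this against the public crt.sh JSON endpoint using the requests package; it is a standalone approximation of the idea, not the crawler's internal code.

import requests

def ct_subdomains(domain: str) -> set[str]:
    # Query crt.sh for certificates matching *.domain and collect the names they cover
    resp = requests.get(
        "https://crt.sh/",
        params={"q": f"%.{domain}", "output": "json"},
        timeout=30,
    )
    resp.raise_for_status()
    names = set()
    for entry in resp.json():
        # name_value can contain several newline-separated hostnames
        for name in entry.get("name_value", "").splitlines():
            name = name.strip().lstrip("*.")
            if name.endswith(domain):
                names.add(name)
    return names

print(sorted(ct_subdomains("example.com"))[:10])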

Performance

Timing breakdown per URL:

  • Body download: 700-900ms (70-90%)
  • Network fetch: 50-550ms (10-20%)
  • Everything else: <50ms (<5%)

Throughput: 50-200 URLs/minute, depending on page size; the crawl is network I/O bound.
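
As a rough sanity check on those figures, per-URL latency and effective parallelism multiply out to the quoted throughput range. The parallelism value below is hypothetical, chosen only to show the relationship.

per_url_seconds = 0.8   # body download of ~700-900 ms dominates each URL
urls_in_flight = 2      # assumed effective parallelism once per-host politeness applies
print(60 / per_url_seconds * urls_in_flight)   # 150.0 URLs/minute, inside the 50-200 range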

Recommended settings:

# Focused crawl (skip subdomains)
--timeout 10 --seeding-strategy sitemap

# University sites (avoid internal hosts)
--timeout 5 --seeding-strategy sitemap --start-url www.university.edu

# Maximum discovery (all seeders)
--workers 256 --timeout 10 --seeding-strategy all

Output

JSONL (automatic): ./data/sitemap.jsonl

{"url":"https://example.com/","depth":0,"status_code":200,"content_length":1024,"title":"Example","link_count":5}

XML sitemap:

cargo run --release -- export-sitemap --data-dir ./data --output sitemap.xml

Distributed Crawling

# Instance 1
cargo run --release -- crawl --start-url example.com --enable-redis --redis-url redis://localhost:6379

# Instance 2
cargo run --release -- crawl --start-url example.com --enable-redis --redis-url redis://localhost:6379

Instances coordinate through Redis, which provides automatic URL deduplication, work stealing, and distributed locks.
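
Conceptually, a shared Redis instance is what lets several crawler processes agree on which URL belongs to whom. The sketch below illustrates the deduplication idea with the redis-py client; the key name is invented for this example and is not the crawler's actual Redis schema.

import redis

r = redis.Redis.from_url("redis://localhost:6379")

def claim(url: str) -> bool:
    # SADD returns 1 only for the first process to add the URL,
    # so each URL is claimed by exactly one instance.
    return r.sadd("crawler:seen", url) == 1

for url in ["https://example.com/", "https://example.com/about"]:
    if claim(url):
        print("this instance crawls:", url)
    else:
        print("already claimed elsewhere:", url)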

Architecture

  • Frontier: Sharded queues (14 shards), bloom filter dedup, per-host politeness (see the sketch after this list)
  • State: Embedded redb database + WAL for crash recovery
  • Governor: Adaptive concurrency control based on commit latency
  • Workers: Async task pool with semaphore-based backpressure
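
A rough Python sketch of the frontier idea (not the Rust implementation): incoming URLs are checked against a "seen" filter and then routed to one of the shards. Hashing by host and using a plain set in place of the bloom filter are simplifications made here for illustration.

import zlib
from collections import deque
from urllib.parse import urlparse

NUM_SHARDS = 14  # matches the shard count above

class Frontier:
    def __init__(self):
        self.shards = [deque() for _ in range(NUM_SHARDS)]
        self.seen = set()  # stand-in for the bloom filter

    def push(self, url: str) -> bool:
        if url in self.seen:
            return False  # duplicate, dropped before it reaches a queue
        self.seen.add(url)
        host = urlparse(url).netloc
        shard = zlib.crc32(host.encode()) % NUM_SHARDS  # assumed routing key: keeps a host's URLs together
        self.shards[shard].append(url)
        return True

frontier = Frontier()
frontier.push("https://example.com/a")
frontier.push("https://example.com/a")  # rejected as a duplicate
print(sum(len(s) for s in frontier.shards))  # 1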

Troubleshooting

Issue               Cause                                          Solution
Slow crawling       Large pages take ~1 s to download              Expected; the crawl is network I/O bound
Many timeouts       Internal/unreachable hosts found via CT logs   Reduce the timeout (--timeout 5) or use --seeding-strategy sitemap
Out of memory       Too many large pages in flight at once         Reduce workers (--workers 64)
Stops unexpectedly  The crawl may have completed (frontier empty)  Use resume to continue if needed

Testing

cargo test

Docs

License

MIT

Download files

Download the file for your platform.

Source Distribution

rustmapper-0.1.3.tar.gz (100.0 kB)

Uploaded: Source

Built Distribution

rustmapper-0.1.3-cp313-cp313-macosx_11_0_arm64.whl (2.7 MB)

Uploaded: CPython 3.13, macOS 11.0+ ARM64

File details

Details for the file rustmapper-0.1.3.tar.gz.

File metadata

  • Download URL: rustmapper-0.1.3.tar.gz
  • Upload date:
  • Size: 100.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.9.6

File hashes

Hashes for rustmapper-0.1.3.tar.gz
Algorithm Hash digest
SHA256 a586e1e10931b088ccb5d88681510de744e04eb9ff5c9ac98b3764c1d702f649
MD5 c4bae416ea8c04284f1681f4a8efd39a
BLAKE2b-256 7ea00079bd52619db9f61b41cad5ea2ab62c4220ce7e0f20a82a30975bbcab18

File details

Details for the file rustmapper-0.1.3-cp313-cp313-macosx_11_0_arm64.whl.

File hashes

Hashes for rustmapper-0.1.3-cp313-cp313-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 7faf0e56b62d932018b0b3d07b370f9a385dce925975661a6dd3653924b56dce
MD5 75657292dda893fe0ad2902bb2c9e139
BLAKE2b-256 fad3d9f598a3aa0db2b6f60bf8e7f45954a9bfd75b819d1bad17e85b4f415ed5
