Sitemap crawler and URL discovery tool built with Rust (256 workers, sharded frontier, WAL persistence, optional Redis)
Project description
Rust Sitemap Crawler
A concurrent web crawler in Rust: 256 concurrent workers, a sharded frontier, persistent state with a write-ahead log (WAL), and optional distributed crawling via Redis.
Available as both a standalone CLI tool and a Python package.
Install
Python Package (Recommended)
```bash
pip install rustmapper
```
From Source (Rust)
```bash
cargo build --release
```
Usage
Python Package
Once installed via pip, use the rustmapper command:
```bash
# Basic crawl
rustmapper crawl --start-url example.com

# With options
rustmapper crawl --start-url example.com --workers 128 --timeout 10

# Resume
rustmapper resume --data-dir ./data

# Export sitemap
rustmapper export-sitemap --data-dir ./data --output sitemap.xml
```
Python API
```python
from rustmapper import Crawler

# Create a crawler instance
crawler = Crawler(
    start_url="https://example.com",
    data_dir="./data",
    workers=256,
    timeout=20,
    ignore_robots=False,
)

# Start crawling
results = crawler.crawl()
print(f"Discovered: {results['discovered']}, Processed: {results['processed']}")

# Export to sitemap
crawler.export_sitemap(
    output="sitemap.xml",
    include_lastmod=True,
    include_changefreq=True,
    default_priority=0.5,
)
```
Rust CLI (from source)
```bash
# Basic crawl
cargo run --release -- crawl --start-url example.com

# With options
cargo run --release -- crawl --start-url example.com --workers 128 --timeout 10

# Resume
cargo run --release -- resume --data-dir ./data

# Export sitemap
cargo run --release -- export-sitemap --data-dir ./data --output sitemap.xml
```
Options
| Flag | Default | Description |
|---|---|---|
| `--start-url` | required | Starting URL |
| `--workers` | 256 | Concurrent requests |
| `--timeout` | 20 | Request timeout (seconds) |
| `--data-dir` | `./data` | Storage location |
| `--seeding-strategy` | all | none/sitemap/ct/commoncrawl/all |
| `--ignore-robots` | false | Skip robots.txt |
| `--enable-redis` | false | Distributed mode |
| `--redis-url` | - | Redis connection |
Seeding Strategies
- `none` - Only start URL
- `sitemap` - Discover from sitemap.xml
- `ct` - Certificate Transparency logs (finds subdomains)
- `commoncrawl` - Query Common Crawl index
- `all` - Use all methods
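For context on what the `ct` strategy does: Certificate Transparency logs record every issued certificate, so querying them surfaces subdomains that never appear in links or sitemaps. As a rough illustration only (not rustmapper's implementation), the public crt.sh JSON endpoint can be queried like this:

```python
# Illustrative sketch of CT-log seeding, not rustmapper's actual code.
# Queries the public crt.sh JSON endpoint for certificates matching a
# domain and collects the subdomains those certificates name.
import json
import urllib.request

def ct_seed(domain: str) -> set[str]:
    url = f"https://crt.sh/?q=%25.{domain}&output=json"
    with urllib.request.urlopen(url, timeout=20) as resp:
        entries = json.load(resp)
    hosts = set()
    for entry in entries:
        # name_value may hold several newline-separated names per certificate
        for name in entry.get("name_value", "").splitlines():
            name = name.strip().lstrip("*.")
            if name.endswith(domain):
                hosts.add(name)
    return hosts

print(sorted(ct_seed("example.com"))[:10])
```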
Performance
Timing breakdown per URL:
- Body download: 700-900ms (70-90%)
- Network fetch: 50-550ms (10-20%)
- Everything else: <50ms (<5%)
Throughput: 50-200 URLs/minute depending on page size; the crawler is network I/O bound.
Recommended settings:
```bash
# Focused crawl (skip subdomains)
--timeout 10 --seeding-strategy sitemap

# University sites (avoid internal hosts)
--timeout 5 --seeding-strategy sitemap --start-url www.university.edu

# Maximum discovery (all seeders)
--workers 256 --timeout 10 --seeding-strategy all
```
Output
JSONL (automatic): ./data/sitemap.jsonl
{"url":"https://example.com/","depth":0,"status_code":200,"content_length":1024,"title":"Example","link_count":5}
XML sitemap:
```bash
cargo run --release -- export-sitemap --data-dir ./data --output sitemap.xml
```
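The exported file follows the sitemaps.org protocol. As an illustrative sketch only (not the tool's exact serializer), the export options above map onto the standard tags roughly like this:

```python
# Illustrative sketch: build one sitemaps.org <url> entry, mirroring the
# export_sitemap() options documented above. Values here are placeholders.
import xml.etree.ElementTree as ET

urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
entry = ET.SubElement(urlset, "url")
ET.SubElement(entry, "loc").text = "https://example.com/"
ET.SubElement(entry, "lastmod").text = "2024-01-01"    # emitted if include_lastmod
ET.SubElement(entry, "changefreq").text = "weekly"     # emitted if include_changefreq
ET.SubElement(entry, "priority").text = "0.5"          # default_priority

print(ET.tostring(urlset, encoding="unicode"))
```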
Distributed Crawling
```bash
# Instance 1
cargo run --release -- crawl --start-url example.com --enable-redis --redis-url redis://localhost:6379

# Instance 2
cargo run --release -- crawl --start-url example.com --enable-redis --redis-url redis://localhost:6379
```
Instances share the frontier through Redis, with automatic URL deduplication, work stealing, and distributed locks.
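The Redis key layout is internal to the crawler, but the deduplication idea can be illustrated with a shared set: `SADD` is atomic, so exactly one instance claims each URL. The key name below is hypothetical:

```python
# Illustration of cross-instance URL dedup with a shared Redis set.
# The key name "frontier:seen" is hypothetical, not rustmapper's schema.
import redis

r = redis.Redis.from_url("redis://localhost:6379")

def claim(url: str) -> bool:
    # SADD returns 1 only for the first client to add the member,
    # so exactly one crawler instance wins each URL.
    return r.sadd("frontier:seen", url) == 1

if claim("https://example.com/page"):
    print("this instance crawls the URL")
else:
    print("another instance already claimed it")
```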
Architecture
- Frontier: Sharded queues (14 shards), bloom filter dedup, per-host politeness (see the sketch after this list)
- State: Embedded redb database + WAL for crash recovery
- Governor: Adaptive concurrency control based on commit latency
- Workers: Async task pool with semaphore-based backpressure
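As a rough Python sketch of the frontier design only (the real implementation is Rust and differs in detail), the core ideas are: hash each URL's host into one of the 14 shards, drop duplicates with a Bloom-style filter, and delay requests to hosts that were hit recently. The politeness delay below is an assumed value:

```python
# Sketch of the frontier design, not the Rust implementation: URLs are
# hashed by host into one of 14 shard queues, a Bloom-style bit array
# drops duplicates, and a per-host timestamp enforces politeness.
import hashlib
import time
from collections import deque
from urllib.parse import urlparse

NUM_SHARDS = 14
POLITENESS_DELAY = 1.0  # seconds between hits to the same host (assumed value)

shards = [deque() for _ in range(NUM_SHARDS)]
seen_bits = bytearray(1 << 20)  # 1 MiB bit array (single hash, for brevity)
next_allowed: dict[str, float] = {}

def _bit(url: str) -> int:
    digest = hashlib.sha256(url.encode()).digest()
    return int.from_bytes(digest[:4], "big") % (len(seen_bits) * 8)

def push(url: str) -> None:
    b = _bit(url)
    if seen_bits[b // 8] & (1 << (b % 8)):
        return  # probably seen (false positives possible, as with any Bloom filter)
    seen_bits[b // 8] |= 1 << (b % 8)
    host = urlparse(url).netloc
    shard = int(hashlib.md5(host.encode()).hexdigest(), 16) % NUM_SHARDS
    shards[shard].append(url)

def pop(shard_id: int) -> str | None:
    queue = shards[shard_id]
    for _ in range(len(queue)):
        url = queue.popleft()
        host = urlparse(url).netloc
        now = time.monotonic()
        if now >= next_allowed.get(host, 0.0):
            next_allowed[host] = now + POLITENESS_DELAY
            return url
        queue.append(url)  # host still cooling down; requeue at the back
    return None

push("https://example.com/a")
push("https://example.com/a")  # duplicate, dropped by the filter
print(pop(int(hashlib.md5(b"example.com").hexdigest(), 16) % NUM_SHARDS))
```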
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| Slow crawling | Normal - large pages take ~1s to download | Network I/O bound, expected |
| Many timeouts | Internal/unreachable hosts (CT log discovery) | Reduce timeout: --timeout 5 or use --seeding-strategy sitemap |
| Out of memory | Too many concurrent large pages | Reduce workers: --workers 64 |
| Stops unexpectedly | Check if naturally completed (frontier empty) | Use resume to continue |
Testing
```bash
cargo test
```
Docs
- PERFORMANCE_ANALYSIS.md - Detailed timing breakdown
- BOTTLENECK_SUMMARY.md - Where time is spent
License
MIT
Project details
Download files
Source Distribution
Built Distribution
File details
Details for the file rustmapper-0.1.3.tar.gz.
File metadata
- Download URL: rustmapper-0.1.3.tar.gz
- Upload date:
- Size: 100.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `a586e1e10931b088ccb5d88681510de744e04eb9ff5c9ac98b3764c1d702f649` |
| MD5 | `c4bae416ea8c04284f1681f4a8efd39a` |
| BLAKE2b-256 | `7ea00079bd52619db9f61b41cad5ea2ab62c4220ce7e0f20a82a30975bbcab18` |
File details
Details for the file rustmapper-0.1.3-cp313-cp313-macosx_11_0_arm64.whl.
File metadata
- Download URL: rustmapper-0.1.3-cp313-cp313-macosx_11_0_arm64.whl
- Upload date:
- Size: 2.7 MB
- Tags: CPython 3.13, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `7faf0e56b62d932018b0b3d07b370f9a385dce925975661a6dd3653924b56dce` |
| MD5 | `75657292dda893fe0ad2902bb2c9e139` |
| BLAKE2b-256 | `fad3d9f598a3aa0db2b6f60bf8e7f45954a9bfd75b819d1bad17e85b4f415ed5` |