Python bindings around rust-scraper/scraper with PyO3
scraper-rs
Python bindings for the Rust scraper crate via PyO3. It provides a lightweight Document/Element API with CSS selectors, XPath (via xee-xpath), handy helpers, and zero Python-side parsing work.
Quick start
from scraper_rs import Document, first, prettify, select, select_first, xpath
html = """
<html><body>
<div class="item" data-id="1"><a href="/a">First</a></div>
<div class="item" data-id="2"><a href="/b">Second</a></div>
</body></html>
"""
doc = Document(html)
print(doc.text) # "First Second"
items = doc.select(".item")
print(items[0].attr("data-id")) # "1"
print(items[0].to_dict()) # {"tag": "div", "text": "First", "html": "<a...>", ...}
first_link = doc.select_first("a[href]") # alias: doc.find(...)
print(first_link.text, first_link.attr("href")) # First /a
links_within_first = first_link.select("a[href]")
print([link.attr("href") for link in links_within_first]) # ["/a"]
# XPath (element results only)
xpath_items = doc.xpath("//div[@class='item']/a")
print([link.text for link in xpath_items]) # ["First", "Second"]
print(doc.xpath_first("//div[@data-id='1']/a").attr("href")) # "/a"
# Functional helpers
links = select(html, "a[href]")
print([link.attr("href") for link in links]) # ["/a", "/b"]
print(first(html, "a[href]").text) # First
print(select_first(html, "a[href]").text) # First
print([link.text for link in xpath(html, "//div[@class='item']/a")]) # ["First", "Second"]
print(prettify(html))
For runnable samples, see examples/demo.py and examples/demo_prettify_url.py.
Quick URL prettify demo:
python examples/demo_prettify_url.py https://example.com --max-lines 80
Async usage
The scraper_rs.asyncio module exposes an async-first surface for coroutine code. AsyncDocument stores shareable HTML/text state instead of a thread-affine sync Document, and all selectors are awaitable so that async code keeps a consistent calling style:
import asyncio
from scraper_rs import asyncio as scraping_async
html = "<div class='item'><a href='/a'>First</a></div>"
async def main():
    async with await scraping_async.parse(html) as doc:
        items = await doc.select(".item")
        first_link = await items[0].select_first("a[href]")
        print(first_link.text) # First
    links = await scraping_async.select(html, "a[href]")
    print([link.attr("href") for link in links]) # ["/a"]
asyncio.run(main())
All async functions accept the same keyword arguments as their sync counterparts (max_size_bytes, truncate_on_limit, etc.).
AsyncDocument supports async with for automatic cleanup in coroutine code, and AsyncElement / AsyncDocument both expose async .prettify() helpers.
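Because the module-level helpers are awaitable, several pages can be queried in one asyncio.gather call. The sketch below is a hypothetical helper (collect_hrefs is not part of scraper_rs) that takes the selector coroutine as a parameter, so scraper_rs.asyncio.select can be passed in. It assumes only what the example above shows: the call takes (html, css) and returns elements with .attr(name). Whether the calls actually overlap depends on how the library schedules its work, which is not stated here.

```python
import asyncio


async def collect_hrefs(pages, select, css="a[href]"):
    """Run one awaitable selector call per page and collect href values.

    pages:  mapping of page name -> HTML string
    select: awaitable callable (html, css) -> list of elements with .attr(name);
            assumed to be scraper_rs.asyncio.select or anything with that shape
    """
    names = list(pages)
    # gather() preserves argument order, so results line up with names.
    results = await asyncio.gather(*(select(pages[name], css) for name in names))
    return {
        name: [element.attr("href") for element in elements]
        for name, elements in zip(names, results)
    }
```

Inside a coroutine this might be called as hrefs = await collect_hrefs({"home": html}, scraping_async.select).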
Large documents and memory safety
To avoid runaway allocations, parsing defaults to a 1 GiB cap. Pass max_size_bytes to override:
from scraper_rs import Document, select
doc = Document(html, max_size_bytes=5_000_000) # 5 MB guard
links = select(html, "a[href]", max_size_bytes=5_000_000)
If you want to parse a limited portion of an oversized HTML document instead of rejecting it entirely, use truncate_on_limit=True:
# Parse only the first 100KB of a large HTML document
doc = Document(large_html, max_size_bytes=100_000, truncate_on_limit=True)
links = doc.select("a[href]") # Will only find links in the first 100KB
# Also works with top-level functions
items = select(large_html, ".item", max_size_bytes=100_000, truncate_on_limit=True)
Note: Truncation happens at valid UTF-8 character boundaries to prevent encoding errors.
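To make the two behaviours concrete, here is a pure-Python sketch of the semantics described above: reject oversized input by default, or cut it at a UTF-8 character boundary when truncate_on_limit is set. It is an illustration only, not the library's Rust implementation; the function name limit_html and the ValueError are assumptions, and it assumes the limit is measured in UTF-8 encoded bytes.

```python
def limit_html(html: str, max_size_bytes: int, truncate_on_limit: bool = False) -> str:
    """Return html unchanged if it fits, otherwise reject or truncate it."""
    data = html.encode("utf-8")
    if len(data) <= max_size_bytes:
        return html
    if not truncate_on_limit:
        # Assumption: the real library may raise a different exception type.
        raise ValueError(f"HTML is {len(data)} bytes, limit is {max_size_bytes}")
    cut = max_size_bytes
    # UTF-8 continuation bytes look like 0b10xxxxxx. If the byte at the cut
    # position is one, the cut would split a character, so step back until
    # the cut lands just before the start of a character.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut].decode("utf-8")


sample = "héllo"  # "é" takes two bytes in UTF-8
print(limit_html(sample, 3, truncate_on_limit=True))  # hé
print(limit_html(sample, 2, truncate_on_limit=True))  # h
```

With a limit of 2 the cut would fall in the middle of "é", so the whole character is dropped rather than leaving a dangling byte.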
API highlights
- Document(html: str) / Document.from_html(html) parse HTML for CSS and keep the DOM; XPath parsing is initialized lazily on first XPath query.
- .select(css) → list[Element], .select_first(css) / .find(css) → first Element | None, .css(css) is an alias.
- .xpath(expr) / .xpath_first(expr) evaluate XPath expressions that return element nodes.
- .prettify() renders the current DOM as an indented string for readable output/debugging.
- .text returns normalized text; Document.html is the original input HTML; Element.html is inner HTML.
- scraper_rs.asyncio exposes async parse / select / xpath wrappers plus awaitable AsyncDocument / AsyncElement methods.
- Element exposes .tag, .text, .html, .attrs plus helpers .attr(name), .get(name, default), .to_dict().
- Elements support nested CSS and XPath selection via .select(css), .select_first(css), .find(css), .css(css), .xpath(expr), .xpath_first(expr).
- Elements also expose .prettify() to format element HTML with indentation.
- Top-level helpers mirror the class methods: parse(html), prettify(html), select(html, css), select_first(html, css) / first(html, css), xpath(html, expr), xpath_first(html, expr).
- max_size_bytes lets you fail fast on oversized HTML; defaults to a 1 GiB limit.
- truncate_on_limit allows parsing a truncated version (limited to max_size_bytes) of oversized HTML instead of raising an error.
- Call doc.close() (or with Document(html) as doc: ...) to free parsed DOM resources when you're done.
- In async workflows, use async with await scraper_rs.asyncio.parse(html) as doc: ... for automatic AsyncDocument cleanup.
Installation
Built wheels target abi3 (CPython 3.10+). To build locally:
# Install maturin (uv is used in this repo, but pip works too)
pip install maturin
# Build a wheel
maturin build --release --compatibility linux
# Install the generated wheel
pip install target/wheels/scraper_rust-*.whl
If you have the just command runner installed, the repo includes helper recipes: just build (local wheel), just install-wheel (install the built wheel), and just build_manylinux (via the official maturin Docker image).
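The abi3 target mentioned above is visible in the wheel file name, which follows the standard layout distribution-version-python tag-abi tag-platform tag (with an optional build tag after the version). As a small sketch, the hypothetical helper below splits a name such as scraper_rust-0.4.3-cp310-abi3-win_amd64.whl into those fields; it is only meant to show how to read the tags, not to replace a real parser such as the one in the packaging library.

```python
from typing import NamedTuple


class WheelTags(NamedTuple):
    distribution: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_name(filename: str) -> WheelTags:
    """Split a wheel file name into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 6:
        # An optional build tag sits between the version and the python tag.
        del parts[2]
    if len(parts) != 5:
        raise ValueError(f"unexpected wheel file name: {filename!r}")
    return WheelTags(*parts)


tags = parse_wheel_name("scraper_rust-0.4.3-cp310-abi3-win_amd64.whl")
print(tags.python_tag, tags.abi_tag)  # cp310 abi3
```

Here cp310 together with abi3 means the wheel is built against the stable ABI and is intended for CPython 3.10 and later.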
Projects Using scraper-rs
- silkworm - Async web scraping framework on top of Rust
- silkworm-mcp - An MCP server for silkworm
Development
Requirements: Rust toolchain, Python 3.10+, maturin, pytest, and pytest-asyncio for tests.
- Run tests: just test or uv run pytest tests/
- Format code: just fmt (or cargo fmt --all and uv run ruff format)
- Lint Rust: just lint (or cargo clippy --all-targets --all-features -- -D warnings)
- The PyO3 module name is scraper_rs; the Rust crate is built as cdylib.
Contributions and issues are welcome. If you add public API, please extend tests/test_scraper.py and the example script accordingly.