Python bindings around rust-scraper/scraper with PyO3

scraper-rs

Python bindings for the Rust scraper crate via PyO3. It provides a lightweight Document/Element API with CSS selectors, XPath (via xee-xpath), handy helpers, and zero Python-side parsing work.

Quick start

from scraper_rs import Document, first, prettify, select, select_first, xpath

html = """
<html><body>
  <div class="item" data-id="1"><a href="/a">First</a></div>
  <div class="item" data-id="2"><a href="/b">Second</a></div>
</body></html>
"""

doc = Document(html)
print(doc.text)  # "First Second"

items = doc.select(".item")
print(items[0].attr("data-id"))  # "1"
print(items[0].to_dict())        # {"tag": "div", "text": "First", "html": "<a...>", ...}

first_link = doc.select_first("a[href]")  # alias: doc.find(...)
print(first_link.text, first_link.attr("href"))  # First /a
links_within_first = first_link.select("a[href]")
print([link.attr("href") for link in links_within_first])  # ["/a"]

# XPath (element results only)
xpath_items = doc.xpath("//div[@class='item']/a")
print([link.text for link in xpath_items])  # ["First", "Second"]
print(doc.xpath_first("//div[@data-id='1']/a").attr("href"))  # "/a"

# Functional helpers
links = select(html, "a[href]")
print([link.attr("href") for link in links])  # ["/a", "/b"]
print(first(html, "a[href]").text)            # First
print(select_first(html, "a[href]").text)     # First
print([link.text for link in xpath(html, "//div[@class='item']/a")])  # ["First", "Second"]
print(prettify(html))

For runnable samples, see examples/demo.py and examples/demo_prettify_url.py. Quick URL prettify demo:

python examples/demo_prettify_url.py https://example.com --max-lines 80

Async usage

The scraper_rs.asyncio module exposes an async-first surface for coroutine code. AsyncDocument stores shareable HTML/text state instead of a thread-affine sync Document, and all selectors are awaitable for consistent async calling style:

import asyncio
from scraper_rs import asyncio as scraping_async

html = "<div class='item'><a href='/a'>First</a></div>"


async def main():
    async with await scraping_async.parse(html) as doc:
        items = await doc.select(".item")
        first_link = await items[0].select_first("a[href]")
        print(first_link.text)  # First

    links = await scraping_async.select(html, "a[href]")
    print([link.attr("href") for link in links])  # ["/a"]


asyncio.run(main())

All async functions accept the same keyword arguments as their sync counterparts (max_size_bytes, truncate_on_limit, etc.). AsyncDocument supports async with for automatic cleanup in coroutine code, and AsyncElement / AsyncDocument both expose async .prettify() helpers.
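Because the selectors are awaitable, several queries can be issued together with asyncio.gather. The helper below is a minimal sketch: hrefs_per_page and its select_fn parameter are hypothetical names introduced for illustration, and select_fn is assumed to behave like scraper_rs.asyncio.select (an awaitable taking HTML and a CSS selector and returning elements with .attr()). Whether the underlying work actually overlaps depends on how scraper_rs schedules it, which this README does not state.

```python
import asyncio
from typing import Any, Awaitable, Callable

# Any awaitable with the shape select(html, css) -> list of elements.
SelectFn = Callable[[str, str], Awaitable[list[Any]]]


async def hrefs_per_page(select_fn: SelectFn, pages: list[str]) -> list[list[str]]:
    """Run one "a[href]" query per page and collect the href values.

    The queries are awaited together; results keep the order of `pages`.
    """
    results = await asyncio.gather(
        *(select_fn(page, "a[href]") for page in pages)
    )
    return [[link.attr("href") for link in links] for links in results]
```

With the library installed, a call might look like asyncio.run(hrefs_per_page(scraping_async.select, [html_a, html_b])), where html_a and html_b are your own HTML strings.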

Large documents and memory safety

To avoid runaway allocations, parsing defaults to a 1 GiB cap. Pass max_size_bytes to override:

from scraper_rs import Document, select

doc = Document(html, max_size_bytes=5_000_000)  # 5 MB guard
links = select(html, "a[href]", max_size_bytes=5_000_000)
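To get a feel for what a byte limit means for a Python string, the sketch below measures the UTF-8 size of a document and mimics a fail-fast check. It is not the library's implementation: utf8_size and check_size are hypothetical helpers, and it is an assumption that the limit is counted in UTF-8 bytes and that an oversized input is rejected with ValueError.

```python
ONE_GIB = 1024 ** 3  # 1 GiB = 1,073,741,824 bytes


def utf8_size(html: str) -> int:
    """Size of the document in bytes once encoded as UTF-8."""
    return len(html.encode("utf-8"))


def check_size(html: str, max_size_bytes: int = ONE_GIB) -> int:
    """Return the byte size, or raise if it exceeds the limit.

    ValueError is a placeholder; the exception scraper_rs raises may differ.
    """
    size = utf8_size(html)
    if size > max_size_bytes:
        raise ValueError(f"HTML is {size} bytes, limit is {max_size_bytes}")
    return size
```

Note that non-ASCII characters take more than one byte, so len(html) can understate the size, and that 5_000_000 bytes is 5 MB in decimal units, slightly less than 5 MiB.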

If you want to parse a limited portion of an oversized HTML document instead of rejecting it entirely, use truncate_on_limit=True:

# Parse only the first 100 KB of a large HTML document
doc = Document(large_html, max_size_bytes=100_000, truncate_on_limit=True)
links = doc.select("a[href]")  # Only finds links within the first 100 KB

# Also works with top-level functions
items = select(large_html, ".item", max_size_bytes=100_000, truncate_on_limit=True)

Note: Truncation happens at valid UTF-8 character boundaries to prevent encoding errors.
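The boundary rule in the note can be illustrated in pure Python. This is a sketch of the general technique, not the code scraper_rs runs internally; truncate_utf8 is a hypothetical helper. Cutting a UTF-8 byte string at an arbitrary offset can split a multi-byte character, so the incomplete tail is dropped.

```python
def truncate_utf8(text: str, max_size_bytes: int) -> str:
    """Return the longest prefix of `text` whose UTF-8 encoding fits the limit."""
    if max_size_bytes < 0:
        raise ValueError("max_size_bytes must be non-negative")
    data = text.encode("utf-8")
    if len(data) <= max_size_bytes:
        return text
    # The slice may end in the middle of a multi-byte character. Since the
    # input was valid UTF-8, the only invalid bytes are that incomplete tail,
    # and errors="ignore" drops them.
    return data[:max_size_bytes].decode("utf-8", errors="ignore")
```

For example, "héllo" is six bytes in UTF-8 because "é" takes two. A two-byte limit would cut "é" in half, so only "h" is kept, while a three-byte limit keeps "hé".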

API highlights

  • Document(html: str) / Document.from_html(html) parse HTML for CSS and keep the DOM; XPath parsing is initialized lazily on first XPath query.
  • .select(css) → list[Element], .select_first(css) / .find(css) → first Element | None, .css(css) is an alias.
  • .xpath(expr) / .xpath_first(expr) evaluate XPath expressions that return element nodes.
  • .prettify() renders the current DOM as an indented string for readable output/debugging.
  • .text returns normalized text; Document.html is the original input HTML; Element.html is inner HTML.
  • scraper_rs.asyncio exposes async parse/select/xpath wrappers plus awaitable AsyncDocument / AsyncElement methods.
  • Element exposes .tag, .text, .html, .attrs plus helpers .attr(name), .get(name, default), .to_dict().
  • Elements support nested CSS and XPath selection via .select(css), .select_first(css), .find(css), .css(css), .xpath(expr), .xpath_first(expr).
  • Elements also expose .prettify() to format element HTML with indentation.
  • Top-level helpers mirror the class methods: parse(html), prettify(html), select(html, css), select_first(html, css) / first(html, css), xpath(html, expr), xpath_first(html, expr).
  • max_size_bytes lets you fail fast on oversized HTML; defaults to a 1 GiB limit.
  • truncate_on_limit allows parsing a truncated version (limited to max_size_bytes) of oversized HTML instead of raising an error.
  • Call doc.close() (or with Document(html) as doc: ...) to free parsed DOM resources when you're done.
  • In async workflows, use async with await scraper_rs.asyncio.parse(html) as doc: ... for automatic AsyncDocument cleanup.

Installation

Built wheels target abi3 (CPython 3.10+). To build locally:

# Install maturin (uv is used in this repo, but pip works too)
pip install maturin

# Build a wheel
maturin build --release --compatibility linux

# Install the generated wheel
pip install target/wheels/scraper_rust-*.whl

If you have the just command runner installed, the repo includes recipes: just build (local wheel), just install-wheel (install the built wheel), and just build_manylinux (via the official maturin Docker image).

Development

Requirements: Rust toolchain, Python 3.10+, maturin, pytest, and pytest-asyncio for tests.

  • Run tests: just test or uv run pytest tests/
  • Format code: just fmt (or cargo fmt --all and uv run ruff format)
  • Lint Rust: just lint (or cargo clippy --all-targets --all-features -- -D warnings)
  • The PyO3 module name is scraper_rs; the Rust crate is built as cdylib.

Contributions and issues are welcome. If you add public API, please extend tests/test_scraper.py and the example script accordingly.

Download files

Source distribution:

  • scraper_rust-0.4.3.tar.gz (85.6 kB)

Built distributions:

  • scraper_rust-0.4.3-cp314-cp314t-macosx_11_0_arm64.whl (2.3 MB): CPython 3.14t, macOS 11.0+ ARM64
  • scraper_rust-0.4.3-cp313-cp313t-macosx_11_0_arm64.whl (2.3 MB): CPython 3.13t, macOS 11.0+ ARM64
  • scraper_rust-0.4.3-cp310-abi3-win_amd64.whl (2.4 MB): CPython 3.10+, Windows x86-64
  • scraper_rust-0.4.3-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.6 MB): CPython 3.10+, manylinux (glibc 2.17+) x86-64
  • scraper_rust-0.4.3-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (2.5 MB): CPython 3.10+, manylinux (glibc 2.17+) ARM64
  • scraper_rust-0.4.3-cp310-abi3-macosx_11_0_arm64.whl (2.3 MB): CPython 3.10+, macOS 11.0+ ARM64
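
The wheel names above follow the standard wheel naming scheme: distribution, version, an optional build tag, then the Python, ABI, and platform tags. The sketch below splits a name into those parts, which can help when picking the right file; parse_wheel_name and WheelTags are hypothetical names, and the parser is deliberately minimal rather than a full validator.

```python
from typing import NamedTuple


class WheelTags(NamedTuple):
    distribution: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_name(filename: str) -> WheelTags:
    """Split a wheel file name into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 6:
        # An optional build tag sits between the version and the Python tag.
        del parts[2]
    if len(parts) != 5:
        raise ValueError(f"unexpected wheel file name: {filename!r}")
    return WheelTags(*parts)
```

For scraper_rust-0.4.3-cp310-abi3-win_amd64.whl this yields the Python tag cp310 and the ABI tag abi3, which matches the "CPython 3.10+" label: an abi3 wheel built for 3.10 is usable on later CPython versions as well.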
