A high-performance HTML ad cleaner using Adblock rules (Pure Python + lxml).
Renovation-Ad
Renovation-Ad is a high-performance Python library designed to clean HTML by removing ad elements based on standard Adblock rules (e.g., EasyList).
Unlike libraries that slow down sharply when handling tens of thousands of rules, Renovation-Ad combines a "Content-Aware Filtering" strategy with lxml, and can process complex pages against 13,000+ rules in under 0.2 seconds.
✨ Features
- Extreme Performance: Optimized with a DOM-content-aware pre-filter (Bloom Filter strategy).
- Lightweight: Pure Python rule engine. No Rust or C++ compiler required for installation.
- EasyList Support: Supports standard Adblock Plus / EasyList cosmetic rules (`##selector`).
- Domain Intelligence: Correctly handles domain-specific rules (`example.com##.ad`) and exclusions (`~example.com##.ad`).
- Flexible Input: Automatically handles rule lists from URLs, local files, or raw strings.
- Hybrid Parser: Uses `lxml` for maximum speed with an automatic fallback to `BeautifulSoup4`.
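The parser fallback mentioned in the last feature can be pictured with a small availability check. This is only a sketch, not the library's actual code: `pick_dom_parser` and the backend names it returns are hypothetical, and the final fallback to the standard library's `html.parser` is an assumption added for illustration.

```python
import importlib.util


def pick_dom_parser(preferred: str = "lxml") -> str:
    """Return the first available parser backend (hypothetical helper)."""
    # Backend name -> module that must be importable for it to work.
    candidates = {"lxml": "lxml", "bs4": "bs4"}
    # Try the preferred backend first, then the remaining ones.
    order = [preferred] + [name for name in candidates if name != preferred]
    for name in order:
        module = candidates.get(name)
        if module and importlib.util.find_spec(module) is not None:
            return name
    # Assumption: fall back to the stdlib parser if nothing else is installed.
    return "html.parser"


print(pick_dom_parser())
```

Checking with `importlib.util.find_spec` avoids importing a heavy module just to learn whether it is installed.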
🚀 Performance Comparison
In real-world testing on highly commercialized news pages (e.g., Yahoo News) with 13,000+ active rules:
| Method | Time |
|---|---|
| Standard BeautifulSoup + Naive Loop | ~115.0 seconds |
| Renovation-Ad (LXML + Content-Aware) | 0.14 seconds |
Optimization: By scanning the DOM for existing IDs and Classes first, we reduce the number of CSS queries by over 98%.
📦 Installation
pip install renovation-ad
Note: lxml and cssselect are highly recommended for the best performance:
pip install lxml cssselect
🛠 Usage
Quick Start (Function Interface)
from renovation_ad import clean_html
rules = [
    "https://easylist-downloads.adblockplus.org/easylist.txt",  # Remote URL
    "./my_custom_rules.txt",                                    # Local file
    "##.top-banner-ads",                                        # Raw rule string
]
html_content = "<html><body><div class='top-banner-ads'>Ad</div><p>Content</p></body></html>"
page_url = "https://example.com/article"
cleaned_html = clean_html(html_content, page_url, rules)
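To illustrate how a mixed `rules` list like the one above might be sorted into URLs, files, and raw rules, here is a minimal sketch. The helper `classify_rule_source` is hypothetical and its heuristics (an `http`/`https` scheme means URL, a `##` separator means raw rule, anything else is treated as a path) are assumptions, not the library's documented behavior.

```python
from urllib.parse import urlparse


def classify_rule_source(entry: str) -> str:
    """Guess whether an entry is a URL, a raw rule, or a local file path."""
    # Remote lists are recognised by their scheme.
    if urlparse(entry).scheme in ("http", "https"):
        return "url"
    # Cosmetic rules contain the "##" separator; file paths normally do not.
    if "##" in entry:
        return "rule"
    # Assumption: everything else is a path on the local filesystem.
    return "file"


for entry in (
    "https://easylist-downloads.adblockplus.org/easylist.txt",
    "./my_custom_rules.txt",
    "##.top-banner-ads",
):
    print(entry, "->", classify_rule_source(entry))
```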
Advanced Usage (Class Interface)
Initializing the Renovator once is more efficient if you are processing multiple pages with the same rule set.
from renovation_ad import Renovator
# Initialize and load rules (downloads and parses)
renovator = Renovator(
    rules_list=["https://easylist-downloads.adblockplus.org/easylist.txt"],
    dom_parser="lxml",  # Default is lxml
)
# Clean multiple contents
html_1 = renovator.clean(raw_html_1, "https://site-a.com")
html_2 = renovator.clean(raw_html_2, "https://site-b.com")
🔍 How it Works
- Rule Parsing: The library parses EasyList files into an internal map of domain-specific and generic cosmetic rules.
- Content-Aware Filtering: Before running CSS selectors, Renovation-Ad scans the HTML for all present `id` and `class` attributes.
- Selector Pruning: Rules targeting classes or IDs not present in the current document are skipped entirely.
- Batch Execution: Remaining selectors are bundled into large batches (e.g., 500 per group) and executed via `lxml`'s highly optimized C engine.
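The scan, prune, and batch steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's implementation: `TokenCollector`, `prune_selectors`, and `batch_selectors` are hypothetical names, only bare `.class` and `#id` selectors are pruned, and the final execution of each batch through lxml is left out.

```python
import re
from html.parser import HTMLParser


class TokenCollector(HTMLParser):
    """Collect every id and class value that appears in a document."""

    def __init__(self):
        super().__init__()
        self.ids = set()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue
            if name == "id":
                self.ids.add(value)
            elif name == "class":
                self.classes.update(value.split())


# Matches only the simplest selectors: ".name" or "#name".
SIMPLE = re.compile(r"^([.#])([\w-]+)$")


def prune_selectors(selectors, ids, classes):
    """Drop simple selectors whose class or id is absent from the page."""
    kept = []
    for sel in selectors:
        m = SIMPLE.match(sel)
        if m is None:
            kept.append(sel)  # complex selector: cannot be ruled out cheaply
        elif m.group(1) == "#" and m.group(2) in ids:
            kept.append(sel)
        elif m.group(1) == "." and m.group(2) in classes:
            kept.append(sel)
    return kept


def batch_selectors(selectors, size=500):
    """Join selectors into comma-separated groups of at most `size`."""
    return [", ".join(selectors[i:i + size])
            for i in range(0, len(selectors), size)]


html = ("<html><body><div class='top-banner-ads big'>Ad</div>"
        "<p id='main'>Content</p></body></html>")
collector = TokenCollector()
collector.feed(html)

selectors = [".top-banner-ads", ".sidebar-ad", "#main", "#ad-id", "div[data-ad]"]
kept = prune_selectors(selectors, collector.ids, collector.classes)
print(kept)
print(batch_selectors(kept, size=2))
```

Each batch is a single comma-separated selector group, so one query per batch can replace hundreds of individual queries.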
📜 Supported Rule Syntax
| Syntax | Description |
|---|---|
| `##.ad-class` | Hide all elements with class `ad-class` (Generic) |
| `###ad-id` | Hide element with ID `ad-id` |
| `example.com##.sidebar-ad` | Hide only on example.com |
| `~example.com##.global-ad` | Hide everywhere EXCEPT example.com |
| `domain1.com,domain2.com##.ad` | Hide on multiple specific domains |
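The rule forms in the table can be parsed by splitting on the first `##` and sorting the domain prefix into included and excluded sets. The sketch below is one possible reading, not the library's internal representation: `CosmeticRule`, `parse_cosmetic_rule`, and `applies_to` are hypothetical names, and matching subdomains (so `www.example.com` counts as `example.com`) is an assumption borrowed from how Adblock Plus treats domain-specific filters.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlparse


class CosmeticRule(NamedTuple):
    include: frozenset  # domains the rule is limited to (empty = all)
    exclude: frozenset  # domains the rule must not apply to
    selector: str


def parse_cosmetic_rule(line: str) -> Optional[CosmeticRule]:
    """Parse 'domains##selector'; return None for comments and other lines."""
    line = line.strip()
    if not line or line.startswith("!") or "##" not in line:
        return None
    domains, selector = line.split("##", 1)
    include, exclude = set(), set()
    for domain in (d.strip().lower() for d in domains.split(",")):
        if not domain:
            continue
        if domain.startswith("~"):
            exclude.add(domain[1:])
        else:
            include.add(domain)
    return CosmeticRule(frozenset(include), frozenset(exclude), selector)


def _matches(host: str, domain: str) -> bool:
    # Assumption: a rule for example.com also covers its subdomains.
    return host == domain or host.endswith("." + domain)


def applies_to(rule: CosmeticRule, page_url: str) -> bool:
    """Decide whether a parsed rule should run on the given page."""
    host = (urlparse(page_url).hostname or "").lower()
    if any(_matches(host, d) for d in rule.exclude):
        return False
    return not rule.include or any(_matches(host, d) for d in rule.include)


rule = parse_cosmetic_rule("~example.com##.global-ad")
print(applies_to(rule, "https://example.com/article"))
print(applies_to(rule, "https://site-a.com"))
```

Note that `###ad-id` splits into an empty domain list and the selector `#ad-id`, so ID rules need no special case.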
🛠 Dependencies
- `requests`: For fetching remote rule lists.
- `lxml`: For high-speed DOM manipulation.
- `cssselect`: For translating CSS selectors to XPath.
- `beautifulsoup4`: Provided as a fallback parser.
📄 License
This project is licensed under the MIT License. See the LICENSE file for details.
🤝 Contributing
Contributions, issues, and feature requests are welcome! Feel free to check the issues page.
Download files
File details
Details for the file renovation_ad-0.1.1.tar.gz.
File metadata
- Download URL: renovation_ad-0.1.1.tar.gz
- Upload date:
- Size: 8.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3194eb487f1c2600f6ef0d380477a932cfe73de75506740b6c455909134db69d |
| MD5 | 5cabb36cc7d6f07615ef41c92e05a871 |
| BLAKE2b-256 | 1633af50d623908c5b7ef14dfe036bc00db5a1385b1d86529acf93bbfbfdf68c |
File details
Details for the file renovation_ad-0.1.1-py3-none-any.whl.
File metadata
- Download URL: renovation_ad-0.1.1-py3-none-any.whl
- Upload date:
- Size: 9.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d7e41873b1b17816b0b124416df698abf78d41db276262716465379e7d751969 |
| MD5 | c6055d8f0a869bce5a4118b2b5c703d8 |
| BLAKE2b-256 | 495038a471b958809c4165f02ff3f53696b5a5f53544ce1c01be8852fe5957d9 |