WordLift Python SDK
A Python toolkit for orchestrating WordLift imports: fetch URLs from sitemaps, Google Sheets, or explicit lists, filter out already imported pages, enqueue search console jobs, push RDF graphs, and call the WordLift APIs to import web pages.
Features
- URL sources: XML sitemaps (with optional regex filtering), Google Sheets (url column), or Python lists.
- Change detection: skips URLs that are already imported unless OVERWRITE is enabled; re-imports when lastmod is newer.
- Web page imports: sends URLs to WordLift with embedding requests, output types, retry logic, and pluggable callbacks.
- Search Console refresh: triggers analytics imports when top queries are stale.
- Graph templates: renders .ttl.liquid templates under data/templates with account data and uploads the resulting RDF graphs.
- Extensible: override protocols via WORDLIFT_OVERRIDE_DIR without changing the library code.
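The change detection rule above can be sketched in a few lines. This is only an illustration of the stated behavior, not the SDK's implementation: the function name needs_import, the imported mapping, and the handling of a missing lastmod are all assumptions made for the example.

```python
from datetime import datetime, timezone


def needs_import(url, lastmod, imported, overwrite=False):
    """Return True when `url` should be (re-)imported.

    `imported` maps already-imported URLs to the time of their last import.
    Every name here is hypothetical and not part of the SDK.
    """
    if overwrite:
        return True  # OVERWRITE forces a re-import
    if url not in imported:
        return True  # never imported before
    if lastmod is None:
        return False  # assumption: without lastmod the page counts as unchanged
    return lastmod > imported[url]  # re-import only when the page is newer


imported = {"https://example.com/a": datetime(2024, 1, 10, tzinfo=timezone.utc)}
older = datetime(2024, 1, 1, tzinfo=timezone.utc)
newer = datetime(2024, 2, 1, tzinfo=timezone.utc)

print(needs_import("https://example.com/a", older, imported))
print(needs_import("https://example.com/a", newer, imported))
```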
Installation
pip install wordlift-sdk
# or
poetry add wordlift-sdk
Requires Python 3.10–3.14.
Configuration
Settings are read in order: config/default.py (or a custom path you pass to ConfigurationProvider.create), environment variables, then (when available) Google Colab userdata.
Common options:
- WORDLIFT_KEY (required): WordLift API key.
- API_URL: WordLift API base URL, defaults to https://api.wordlift.io.
- SITEMAP_URL: XML sitemap to crawl; SITEMAP_URL_PATTERN: optional regex to filter URLs.
- SHEETS_URL, SHEETS_NAME, SHEETS_SERVICE_ACCOUNT: use a Google Sheet as source; the service account setting points to a credentials file.
- URLS: list of URLs (e.g., ["https://example.com/a", "https://example.com/b"]).
- OVERWRITE: re-import URLs even if already present (default False).
- WEB_PAGE_IMPORT_WRITE_STRATEGY: WordLift write strategy (default createOrUpdateModel).
- EMBEDDING_PROPERTIES: list of schema properties to embed.
- WEB_PAGE_TYPES: output schema types, defaults to ["http://schema.org/Article"].
- GOOGLE_SEARCH_CONSOLE: enable/disable the Search Console handler (default True).
- CONCURRENCY: max concurrent handlers, defaults to min(cpu_count(), 4).
- WORDLIFT_OVERRIDE_DIR: folder containing protocol overrides (default app/overrides).
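To make the CONCURRENCY option more concrete, here is a minimal sketch of how a cap on concurrent handlers is commonly enforced with asyncio. It is not taken from the SDK: default_concurrency, run_bounded, and fake_handler are hypothetical names, and the semaphore approach is an assumption about one reasonable way to apply such a limit.

```python
import asyncio
import os


def default_concurrency():
    # Mirrors the documented default min(cpu_count(), 4);
    # os.cpu_count() may return None, so fall back to 1 in that case.
    return min(os.cpu_count() or 1, 4)


async def run_bounded(handler, urls, concurrency):
    # At most `concurrency` handlers hold the semaphore at any time.
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(url):
        async with semaphore:
            return await handler(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(guarded(u) for u in urls))


async def fake_handler(url):
    # Stand-in for a real import handler.
    await asyncio.sleep(0)
    return url.upper()


results = asyncio.run(run_bounded(fake_handler, ["a", "b", "c"], default_concurrency()))
print(results)
```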
TLS/SSL
The SDK enforces SSL verification. On macOS it uses the system CA bundle when available and falls back to certifi if needed. You can override the CA bundle path explicitly in code:
from pathlib import Path

from wordlift_sdk.client import ClientConfigurationFactory
from wordlift_sdk.structured_data import CreateRequest

# Client configuration with an explicit CA bundle.
factory = ClientConfigurationFactory(
    key="your-api-key",
    api_url="https://api.wordlift.io",
    ssl_ca_cert="/path/to/ca.pem",
)
configuration = factory.create()

# Structured data request using the same CA bundle override.
request = CreateRequest(
    url="https://example.com",
    target_type="Thing",
    output_dir=Path("."),
    base_name="structured-data",
    jsonld_path=None,
    yarrml_path=None,
    api_key="your-api-key",
    base_url=None,
    ssl_ca_cert="/path/to/ca.pem",
    debug=False,
    headed=False,
    timeout_ms=30000,
    max_retries=2,
    quality_check=True,
    max_xhtml_chars=40000,
    max_text_node_chars=400,
    max_nesting_depth=2,
    verbose=True,
    validate=True,
    wait_until="networkidle",
)
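For readers unfamiliar with CA bundle overrides, the following standard-library sketch shows what such an override usually amounts to: a verifying SSL context that trusts either an explicit bundle or the platform defaults. The helper build_ssl_context is hypothetical and does not reproduce the SDK's macOS or certifi fallback logic.

```python
import ssl


def build_ssl_context(ca_bundle=None):
    """Return a verifying SSL context (hypothetical helper, not the SDK's code).

    With `ca_bundle` set, only that bundle is trusted; with None, Python
    loads the platform's default trust store.
    """
    return ssl.create_default_context(cafile=ca_bundle)


context = build_ssl_context()
# Certificate and hostname verification stay enabled in both cases.
print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)
```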
Example config/default.py:
WORDLIFT_KEY = "your-api-key"
SITEMAP_URL = "https://example.com/sitemap.xml"
SITEMAP_URL_PATTERN = r"^https://example.com/article/.*$"
GOOGLE_SEARCH_CONSOLE = True
WEB_PAGE_TYPES = ["http://schema.org/Article"]
EMBEDDING_PROPERTIES = [
    "http://schema.org/headline",
    "http://schema.org/abstract",
    "http://schema.org/text",
]
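It can help to check a SITEMAP_URL_PATTERN against a few sample URLs before running an import. The sketch below uses only the standard re module; filter_urls is a hypothetical helper, and whether the SDK matches with search or match is an assumption (with the ^ and $ anchors used here the result is the same).

```python
import re

SITEMAP_URL_PATTERN = r"^https://example.com/article/.*$"


def filter_urls(urls, pattern):
    # Keep only the URLs that the regular expression matches.
    compiled = re.compile(pattern)
    return [url for url in urls if compiled.search(url)]


urls = [
    "https://example.com/article/hello",
    "https://example.com/about",
    "https://example.com/article/world",
]
matched = filter_urls(urls, SITEMAP_URL_PATTERN)
print(matched)
```

Note that the unescaped dots in the pattern match any character; write example\.com if a literal dot is required.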
Running the import workflow
import asyncio

from wordlift_sdk import run_kg_import_workflow

if __name__ == "__main__":
    asyncio.run(run_kg_import_workflow())
The workflow:
- Renders and uploads RDF graphs from data/templates/*.ttl.liquid using account info.
- Builds the configured URL source and filters out unchanged URLs (unless OVERWRITE is enabled).
- Sends each URL to WordLift for import with retries and optional Search Console refresh.
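The retry behavior in the last step can be pictured with a generic retry loop. This is a sketch under stated assumptions rather than the SDK's code: import_with_retries, flaky_send, the exponential backoff, and the default of two retries are all hypothetical choices for the example.

```python
import asyncio


async def import_with_retries(send, url, max_retries=2, base_delay=0.01):
    # Call `send(url)` once, then retry up to `max_retries` more times.
    attempt = 0
    while True:
        try:
            return await send(url)
        except Exception:
            if attempt >= max_retries:
                raise  # retries exhausted: surface the last error
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            await asyncio.sleep(base_delay * 2 ** attempt)
            attempt += 1


calls = []


async def flaky_send(url):
    # Stand-in for a real import call: fails twice, then succeeds.
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return f"imported {url}"


result = asyncio.run(import_with_retries(flaky_send, "https://example.com/a"))
print(result, len(calls))
```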
You can build components yourself when you need more control:
import asyncio

from wordlift_sdk.container.application_container import ApplicationContainer


async def main():
    container = ApplicationContainer()
    workflow = await container.create_kg_import_workflow()
    await workflow.run()


asyncio.run(main())
Custom callbacks and overrides
Override the web page import callback by placing web_page_import_protocol.py with a WebPageImportProtocol class under WORDLIFT_OVERRIDE_DIR (default app/overrides). The callback receives a WebPageImportResponse and can push to graph_queue or entity_patch_queue.
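The general shape of such an override might look like the sketch below. It is self-contained and deliberately does not import the SDK: FakeWebPageImportResponse stands in for WebPageImportResponse, and the constructor argument, the callback method name, and the response fields are all assumptions. Check the SDK's own protocol definition for the real signature before writing an override.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeWebPageImportResponse:
    # Stand-in for the SDK's WebPageImportResponse; the real fields may differ.
    url: str
    graph: str


class WebPageImportProtocol:
    # Hypothetical shape: constructor arguments and method name are assumptions.
    def __init__(self, graph_queue):
        self.graph_queue = graph_queue

    async def callback(self, response):
        # Forward the imported page's graph to the queue for later upload.
        await self.graph_queue.put((response.url, response.graph))


async def demo():
    queue = asyncio.Queue()
    protocol = WebPageImportProtocol(queue)
    response = FakeWebPageImportResponse("https://example.com/a", "<turtle>")
    await protocol.callback(response)
    return await queue.get()


item = asyncio.run(demo())
print(item)
```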
Templates
Add .ttl.liquid files under data/templates. Templates render with account fields available (e.g., {{ account.dataset_uri }}) and are uploaded before URL handling begins.
Validation
SHACL validation utilities and generated Google Search Gallery shapes are included. When a feature includes both container types (for example ItemList, BreadcrumbList, QAPage, FAQPage, Quiz, ProfilePage, Product, Recipe, Course, Review) and their contained types (ListItem, Question, Answer, Comment, Offer, HowToStep, Person, Organization, Rating, AggregateRating), the generator scopes the contained constraints under the container properties to avoid enforcing them on unrelated nodes. Schema.org grammar checks are intentionally permissive and accept URL/text literals for all properties.
Testing
poetry install --with dev
poetry run pytest
Documentation
- Google Sheets Lookup: Utility for O(1) lookups from Google Sheets.
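As a rough illustration of what O(1) lookups over sheet data usually mean, the sketch below indexes rows in a dictionary keyed by one column. It does not use the SDK's utility or the Google Sheets API; build_index, the sample rows, and the choice of url as the key column are hypothetical.

```python
def build_index(rows, key_column):
    # Build the index once in O(n); each later lookup is O(1) on average.
    index = {}
    for row in rows:
        index[row[key_column]] = row  # a later duplicate key overwrites an earlier one
    return index


rows = [
    {"url": "https://example.com/a", "title": "A"},
    {"url": "https://example.com/b", "title": "B"},
]
index = build_index(rows, "url")
print(index["https://example.com/b"]["title"])
```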