Unified dataset loaders for e-SNLI, QASC, and WorldTree.
theoremkb
theoremkb is a Python package with a unified API for loading:
- e-SNLI
- QASC
- WorldTree
The package uses Hugging Face datasets as the backend and keeps loader parameters consistent across all datasets.
Installation
pip install theoremkb
For pandas output:
pip install "theoremkb[pandas]"
Quick Start
from theoremkb import load_qasc
records = load_qasc(
    split="train",
    as_format="records",
    max_samples=100,
    shuffle=True,
    seed=7,
)
print(records[0])
Filter eSNLI by label:
from theoremkb import load_esnli
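filtered_rows = load_esnli(
    split="train",
    as_format="records",
    trust_remote_code=True,
    # Integer class id. The upstream esnli dataset lists its classes as
    # 0=entailment, 1=neutral, 2=contradiction, so confirm the id you need.
    label=1,
)
print(filtered_rows[0])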
Generic loader:
from theoremkb import load
dataset = load("worldtree", split="train")
Terminal demo: print the 30th eSNLI sample
PYTHONPATH=src python examples/esnli_thirtieth_record.py
Notebook demo
notebooks/theoremkb_esnli_demo.ipynb
Unified Parameters
All dataset-specific loaders have the same signature:
- split: dataset split (train, validation, test; dev/val auto-mapped to validation)
- subset: optional HF config name
- cache_dir: custom cache directory
- revision: dataset revision/commit
- token: HF auth token
- force_download: force redownload (True -> force_redownload)
- streaming: use streaming mode
- trust_remote_code: required for some datasets (currently esnli)
- shuffle: shuffle loaded samples
- seed: random seed for shuffle
- label: optional filter on the label field (supported on datasets that include a label column, e.g. eSNLI)
- max_samples: truncate sample size (required when streaming=True and as_format is records/pandas)
- as_format: one of datasets, records, pandas
- drop_empty_fields: when True (default), remove fields whose value is "" or None in records/pandas output
- validate_split: query and validate split name before loading
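The split aliasing and drop_empty_fields behavior listed above can be illustrated with a small stand-alone sketch. The helper names normalize_split and drop_empty are hypothetical and are not part of theoremkb; they only mirror the documented behavior and make no claim about how the package implements it.

```python
# Illustrative only: these helpers are NOT theoremkb's implementation.
SPLIT_ALIASES = {"dev": "validation", "val": "validation"}  # documented aliases


def normalize_split(split: str) -> str:
    """Map dev/val to validation; leave other split names unchanged."""
    return SPLIT_ALIASES.get(split, split)


def drop_empty(record: dict) -> dict:
    """Remove fields whose value is "" or None (the documented drop_empty_fields rule)."""
    return {k: v for k, v in record.items() if v is not None and v != ""}


# Made-up record for demonstration; field names are assumptions.
row = {"premise": "A man sleeps.", "explanation_2": "", "explanation_3": None, "label": 0}
print(normalize_split("dev"))
print(drop_empty(row))
```

Under this rule a label of 0 is kept, because only "" and None count as empty.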
Included Dataset Mapping
- esnli -> esnli/esnli
- qasc -> allenai/qasc
- worldtree -> nguyen-brat/worldtree
Use list_datasets() to list canonical names.
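As a rough picture of how a canonical name could resolve to a Hugging Face repository id, here is a minimal registry sketch built from the mapping above. The names DATASET_REPOS and resolve_repo are hypothetical, and the case-insensitive lookup is an assumption rather than documented theoremkb behavior.

```python
# Illustrative only: a minimal registry mirroring the documented mapping.
DATASET_REPOS = {
    "esnli": "esnli/esnli",
    "qasc": "allenai/qasc",
    "worldtree": "nguyen-brat/worldtree",
}


def resolve_repo(name: str) -> str:
    """Return the Hugging Face repo id for a canonical dataset name."""
    key = name.strip().lower()  # assumption: lookups ignore case and padding
    if key not in DATASET_REPOS:
        known = ", ".join(sorted(DATASET_REPOS))
        raise ValueError(f"Unknown dataset {name!r}; expected one of: {known}")
    return DATASET_REPOS[key]


print(resolve_repo("worldtree"))
```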
Publishing
python -m pip install --upgrade build twine
python -m build
python -m twine check dist/*
python -m twine upload dist/*
Notes
- load_esnli(...) requires trust_remote_code=True in the current Hugging Face setup.
- If you use streaming=True with as_format="records" or as_format="pandas", set max_samples to avoid unbounded materialization.
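To see why a cap matters when a stream is turned into a list, consider this stand-alone sketch. It uses a plain Python generator as a stand-in for a streaming dataset; fake_stream and take are hypothetical names, not theoremkb functions.

```python
from itertools import count, islice


def fake_stream():
    """Stand-in for a streaming dataset: yields records with no known end."""
    for i in count():
        yield {"id": i, "text": f"example {i}"}


def take(stream, max_samples: int) -> list:
    """Materialize at most max_samples records from an iterable."""
    if max_samples < 0:
        raise ValueError("max_samples must be non-negative")
    return list(islice(stream, max_samples))


# Without the cap, list(fake_stream()) would never finish.
records = take(fake_stream(), max_samples=3)
print(len(records))
```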
Troubleshooting
If you see an error like Dataset scripts are no longer supported, but found ...,
the installed version of datasets is likely too new to run script-based datasets.
Run:
python -m pip install "datasets<3"
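Before or after pinning, it can help to confirm which datasets version is actually installed. The sketch below uses only the standard library; major_version and datasets_is_below_3 are hypothetical helper names, and the simple split on "." assumes an ordinary version string such as 2.21.0.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string: str) -> int:
    """Extract the leading major number from a version string such as '2.21.0'."""
    return int(version_string.split(".", 1)[0])


def datasets_is_below_3():
    """Return True/False for the installed datasets package, or None if it is missing."""
    try:
        return major_version(version("datasets")) < 3
    except PackageNotFoundError:
        return None


print(datasets_is_below_3())
```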
If you see TLS/SSL errors like UNEXPECTED_EOF_WHILE_READING, your machine cannot establish a secure connection to Hugging Face.
Check:
python - <<'PY'
import requests
print(requests.get("https://huggingface.co", timeout=15).status_code)
PY
If needed, configure proxy/certs (HTTPS_PROXY, HTTP_PROXY, REQUESTS_CA_BUNDLE).
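A quick way to check which of these variables the current process sees is sketched below. network_settings is a hypothetical helper; it only reads the three variables named above and does not contact Hugging Face.

```python
import os

NETWORK_VARS = ("HTTPS_PROXY", "HTTP_PROXY", "REQUESTS_CA_BUNDLE")


def network_settings(env=os.environ) -> dict:
    """Return the proxy/CA variables named above, with None for unset ones."""
    return {name: env.get(name) for name in NETWORK_VARS}


for name, value in network_settings().items():
    print(f"{name} = {value if value else '(not set)'}")
```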
For temporary debugging only, you can disable SSL verification:
export HF_HUB_DISABLE_SSL_VERIFICATION=1
File details
Details for the file theoremkb-0.3.0.tar.gz.
File metadata
- Download URL: theoremkb-0.3.0.tar.gz
- Upload date:
- Size: 11.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2e1eb41115452e92a4a9f341d4de9ca1eae0d8bfbac79a8a34bc3b0126810035 |
| MD5 | 5eca918a6a96d2fe92c5b8d3d7310201 |
| BLAKE2b-256 | b78f9bbe4a33dfc4aa03d934df92a536a0506977d95097d95b7479b356a8bac0 |
File details
Details for the file theoremkb-0.3.0-py3-none-any.whl.
File metadata
- Download URL: theoremkb-0.3.0-py3-none-any.whl
- Upload date:
- Size: 9.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0f8929c5219ebf7e12a9fa32102b8cbfcc6d0e37edae38fe0f75521ddcaa932d |
| MD5 | 2e92fbd2c4e8a199a815b12e7fbfb65a |
| BLAKE2b-256 | 93934004c6e48bdd2bf7e6498bdb4bacc2c6d3d74cf9dd218e3e275259287136 |