Unified dataset loaders for e-SNLI, QASC, and WorldTree.

theoremkb

theoremkb is a Python package with a unified API for loading:

  • e-SNLI
  • QASC
  • WorldTree

The package uses Hugging Face datasets as the backend and keeps loader parameters consistent across all datasets.

Installation

pip install theoremkb

For pandas output:

pip install theoremkb[pandas]

Quick Start

from theoremkb import load_qasc

records = load_qasc(
    split="train",
    as_format="records",
    max_samples=100,
    shuffle=True,
    seed=7,
)
print(records[0])

Filter e-SNLI by label:

from theoremkb import load_esnli

entailment_rows = load_esnli(
    split="train",
    as_format="records",
    trust_remote_code=True,
    label=0,  # assumes the Hugging Face e-SNLI encoding: 0 = entailment, 1 = neutral, 2 = contradiction
)
print(entailment_rows[0])

Generic loader:

from theoremkb import load

dataset = load("worldtree", split="train")

Terminal demo: print the 30th e-SNLI sample

PYTHONPATH=src python examples/esnli_thirtieth_record.py

Notebook demo

  • notebooks/theoremkb_esnli_demo.ipynb

Unified Parameters

All dataset-specific loaders have the same signature:

  • split: dataset split (train, validation, test; dev/val auto-mapped to validation)
  • subset: optional HF config name
  • cache_dir: custom cache directory
  • revision: dataset revision/commit
  • token: HF auth token
  • force_download: force a re-download (True maps to force_redownload)
  • streaming: use streaming mode
  • trust_remote_code: required for some datasets (currently esnli)
  • shuffle: shuffle loaded samples
  • seed: random seed for shuffle
  • label: optional filter on the label field (supported on datasets that include a label column, e.g. e-SNLI)
  • max_samples: cap the number of samples returned (required when streaming=True and as_format is records or pandas)
  • as_format: one of datasets, records, pandas
  • drop_empty_fields: when True (default), remove fields whose value is "" or None in records/pandas output
  • validate_split: query the available splits and validate the split name before loading
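
To make two of the parameters above concrete, here is a minimal pure-Python sketch of the documented behavior: the split aliases dev and val resolve to validation, and drop_empty_fields removes fields whose value is "" or None. The helper names normalize_split and drop_empty are hypothetical and are not part of the theoremkb API; the package's actual implementation may differ.

```python
# Illustrative sketch only: these helpers are hypothetical, not theoremkb internals.

SPLIT_ALIASES = {"dev": "validation", "val": "validation"}


def normalize_split(split):
    """Map the aliases dev/val to validation; leave other names unchanged."""
    key = split.strip().lower()
    return SPLIT_ALIASES.get(key, key)


def drop_empty(record):
    """Remove fields whose value is "" or None, as documented for drop_empty_fields=True."""
    return {k: v for k, v in record.items() if v is not None and v != ""}


# A made-up row; the field names are for illustration only.
row = {"question": "What do plants need?", "fact1": "", "fact2": None, "label": 0}
print(normalize_split("dev"))
print(drop_empty(row))
```

Note that a label of 0 survives the filter in this sketch: only "" and None count as empty, so falsy but meaningful values are kept.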

Included Dataset Mapping

  • esnli -> esnli/esnli
  • qasc -> allenai/qasc
  • worldtree -> nguyen-brat/worldtree

Use list_datasets() to list canonical names.
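
As a rough illustration of how a canonical name could be resolved to a Hugging Face repo ID, the sketch below mirrors the mapping above in a plain dictionary. DATASET_REPOS and resolve_repo are hypothetical names chosen for this example, not functions exported by theoremkb.

```python
# Illustrative sketch only: a plain dict mirroring the mapping listed above.
DATASET_REPOS = {
    "esnli": "esnli/esnli",
    "qasc": "allenai/qasc",
    "worldtree": "nguyen-brat/worldtree",
}


def resolve_repo(name):
    """Return the Hugging Face repo ID for a canonical name (hypothetical helper)."""
    key = name.strip().lower()
    if key not in DATASET_REPOS:
        known = ", ".join(sorted(DATASET_REPOS))
        raise ValueError(f"Unknown dataset {name!r}; expected one of: {known}")
    return DATASET_REPOS[key]


print(resolve_repo("qasc"))
```

Failing early with the list of known names keeps a typo in the dataset name from surfacing later as an opaque download error.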

Publishing

python -m pip install --upgrade build twine
python -m build
python -m twine check dist/*
python -m twine upload dist/*

Notes

  • load_esnli(...) requires trust_remote_code=True with the current Hugging Face setup.
  • If you use streaming=True with as_format="records" or as_format="pandas", set max_samples to avoid unbounded materialization.
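
The reason for the second note can be sketched without any dependencies: a streaming source is an iterator with no known length, so converting it to a list only terminates if the number of rows is capped. The functions fake_stream and materialize below are hypothetical stand-ins written for this example, not theoremkb code.

```python
from itertools import islice


def fake_stream():
    """Stand-in for a streaming dataset: an endless iterator of rows."""
    i = 0
    while True:
        yield {"id": i}
        i += 1


def materialize(stream, max_samples):
    """Collect at most max_samples rows; refuse to drain an unbounded stream."""
    if max_samples is None:
        raise ValueError("max_samples is required when materializing a stream")
    return list(islice(stream, max_samples))


rows = materialize(fake_stream(), max_samples=3)
print(rows)
```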

Troubleshooting

If you see an error like Dataset scripts are no longer supported, but found ..., your environment is likely using a version of datasets that is too new to run script-based datasets.

Run:

python -m pip install "datasets<3"
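
To confirm which version is active after pinning, a small standard-library check along these lines may help. It is only a sketch: major_version and installed_datasets_version are hypothetical helper names, and the threshold of 3 simply mirrors the pin shown above.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string):
    """Return the leading integer of a version string such as '2.21.0'."""
    return int(version_string.split(".")[0])


def installed_datasets_version():
    """Return the installed datasets version string, or None if it is not installed."""
    try:
        return version("datasets")
    except PackageNotFoundError:
        return None


found = installed_datasets_version()
if found is None:
    print("datasets is not installed")
elif major_version(found) >= 3:
    print(f"datasets {found} is installed; the pin above asks for a version below 3")
else:
    print(f"datasets {found} satisfies datasets<3")
```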

If you see TLS/SSL errors like UNEXPECTED_EOF_WHILE_READING, your machine cannot establish a secure connection to Hugging Face.

Check:

python - <<'PY'
import requests
print(requests.get("https://huggingface.co", timeout=15).status_code)
PY

If needed, configure proxy/certs (HTTPS_PROXY, HTTP_PROXY, REQUESTS_CA_BUNDLE). For temporary debugging only, you can disable SSL verification:

export HF_HUB_DISABLE_SSL_VERIFICATION=1
