# 🏗️ Hyped Crane

Lift (process) and place (write) data streams, seamlessly and in parallel.
Hyped Crane is a Python library that simplifies working with HuggingFace's iterable datasets. It provides tools for applying transformations to data streams, processing them in parallel, and writing the results in various formats.
## Features
- Streaming-Friendly Transformations: Apply lazy, streaming-friendly transformations to iterable datasets without preloading data into memory.
- Seamless Multiprocessing: Effortlessly process and write datasets using multiple processes, improving performance on large datasets.
- Easily Extendable: Provides a straightforward interface to implement support for custom data formats.
- Interoperability with HuggingFace Datasets: Write datasets in formats directly loadable with HuggingFace's `load_from_disk` function.
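To illustrate the extension idea in the abstract: a custom-format writer boils down to consuming a (possibly lazy) stream of examples and serializing them one at a time. The sketch below is purely illustrative; the class and method names are hypothetical and are not crane's actual extension API.

```python
import json
import os
import tempfile


class JsonLinesWriter:
    """Hypothetical writer sketch: streams examples to a .jsonl file.

    Not crane's real interface -- just an illustration of a writer that
    consumes an iterable of examples without materializing the dataset.
    """

    def __init__(self, path):
        self.path = path

    def write(self, examples):
        # Consume the stream example by example, keeping memory flat.
        with open(self.path, "w", encoding="utf-8") as f:
            for example in examples:
                f.write(json.dumps(example) + "\n")


out_path = os.path.join(tempfile.mkdtemp(), "dummy.jsonl")
writer = JsonLinesWriter(out_path)
writer.write({"a": i} for i in range(3))  # accepts any iterable, even a generator
```

The writer never calls `len()` or indexes into its input, which is what makes it compatible with unbounded or lazily produced streams.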
## Installation
To install the library from PyPI, run:

```bash
pip install hyped-crane
```

To install from source, clone the repository and run:

```bash
git clone https://github.com/open-hyped/crane.git
cd crane
pip install .
```
## Getting Started

Here's a quick example to illustrate how Hyped Crane works.
### Step 1: "Load" the Dataset

crane is designed to work seamlessly with HuggingFace's iterable datasets. Let's start by creating one:

```python
import datasets

# Create a dummy iterable dataset
dummy_data = [
    {"a": 0, "b": [1, 2, 3, 4]},
    {"a": 1, "b": [5, 6]},
    {"a": 1, "b": [7, 8, 9, 10]},
]
ds = datasets.Dataset.from_list(dummy_data)
ds = ds.to_iterable_dataset()
```
### Step 2: Apply a Lazy Transformation

Transformations on iterable datasets are applied lazily, meaning the data isn't processed until it is actually read:

```python
# Apply a transformation to compute the maximum of list "b"
features = datasets.Features(ds.features | {"max(b)": ds.features["b"].feature})
ds = ds.map(lambda x: {"max(b)": max(x["b"])}, features=features)
```

Note: Some writers, including the ArrowDatasetWriter, require the dataset features to be well defined.
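The laziness itself is easy to demonstrate with plain Python generators, which behave analogously to iterable datasets. This sketch uses no crane or datasets APIs, just the standard library:

```python
processed = []  # records which examples have actually been transformed


def lazy_map(stream, fn):
    # Generator: fn runs only when the result is iterated, not when
    # lazy_map is called.
    for example in stream:
        processed.append(example["a"])
        yield fn(example)


data = [{"a": 0, "b": [1, 2, 3, 4]}, {"a": 1, "b": [5, 6]}]
mapped = lazy_map(data, lambda x: {**x, "max(b)": max(x["b"])})

# Defining the pipeline processed nothing:
assert processed == []

# Reading the stream triggers the transformation:
results = list(mapped)
assert processed == [0, 1]
```

This is the same contract `ds.map(...)` follows above: building the pipeline is cheap, and the work happens when a consumer (such as a writer) iterates the dataset.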
### Step 3: Write the Dataset to Disk

Use crane's ArrowDatasetWriter to save the transformed dataset to disk. You can enable multiprocessing to speed up the transformation and writing:

```python
from crane import ArrowDatasetWriter

# Write the transformed dataset to disk with multiprocessing
writer = ArrowDatasetWriter("data", overwrite=True, num_proc=3)
writer.write(ds)
```
Key Benefits:

- Data-Parallel Transformations: The transformations defined by `map` operations are moved into the workers, so the transformation workload is distributed evenly.
- Efficient Writing: Each worker writes its own shard to disk in parallel, reducing I/O bottlenecks.
crane handles worker communication, task distribution, and writing operations, so you can focus on defining your transformation logic without worrying about parallelization details.
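Conceptually, this is a map-shard-write pipeline: the input stream is split across workers, and each worker both transforms its slice and writes it to its own shard file. The following stdlib-only sketch illustrates the idea (it is not crane's implementation; it uses threads and round-robin sharding purely for demonstration):

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_shard(args):
    shard_id, examples, out_dir = args
    # Each worker independently transforms and writes its own shard,
    # so no cross-worker coordination is needed during the write.
    path = os.path.join(out_dir, f"shard-{shard_id}.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for ex in examples:
            ex = {**ex, "max(b)": max(ex["b"])}  # the "map" step
            f.write(json.dumps(ex) + "\n")
    return path


data = [
    {"a": 0, "b": [1, 2, 3, 4]},
    {"a": 1, "b": [5, 6]},
    {"a": 1, "b": [7, 8, 9, 10]},
]
out_dir = tempfile.mkdtemp()
num_proc = 3

# Round-robin the examples into one slice per worker.
shards = [(i, data[i::num_proc], out_dir) for i in range(num_proc)]
with ThreadPoolExecutor(max_workers=num_proc) as pool:
    paths = list(pool.map(write_shard, shards))
```

Because each shard is an independent file, the workers never contend for a shared output handle; a final manifest (or, in crane's case, an Arrow dataset layout) ties the shards back together.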
Note: Datasets saved with the ArrowDatasetWriter are fully compatible with HuggingFace's `load_from_disk` function. You can reload the dataset and continue working with it:

```python
# Reload the dataset from disk
ds = datasets.load_from_disk("data")
```
## Contributions
Contributions are welcome! Feel free to submit a pull request or open an issue to discuss your ideas.
## File details

Details for the file hyped_crane-0.1.4.tar.gz.

### File metadata

- Download URL: hyped_crane-0.1.4.tar.gz
- Size: 59.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | be1b7404c18231ecaac53063f9cf3ef6d5376020377b52818fb290921454ad85 |
| MD5 | 1154cc173938e359c5264476c617cf2b |
| BLAKE2b-256 | 3046e113b8eec162de5c62272ccd94b2229da49e1b4f19575fc14b31e2344539 |
### Provenance

The following attestation bundles were made for hyped_crane-0.1.4.tar.gz:

Publisher: publish.yml on open-hyped/crane

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: hyped_crane-0.1.4.tar.gz
- Subject digest: be1b7404c18231ecaac53063f9cf3ef6d5376020377b52818fb290921454ad85
- Sigstore transparency entry: 646798634
- Permalink: open-hyped/crane@ea53d6e878fef4fe9cb8c93db4019b1ea7d0d2f0
- Branch / Tag: refs/tags/v0.1.4
- Owner: https://github.com/open-hyped
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@ea53d6e878fef4fe9cb8c93db4019b1ea7d0d2f0
- Trigger Event: create
## File details

Details for the file hyped_crane-0.1.4-py3-none-any.whl.

### File metadata

- Download URL: hyped_crane-0.1.4-py3-none-any.whl
- Size: 52.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | bdbb25372a977f1ee31ef72cea72bdb4cc1a81ae9c9fec06311c2897baa41724 |
| MD5 | 39a945719b87d58ddde47de7ce23b0a9 |
| BLAKE2b-256 | cfecbd5992845af193f967fa978ebe4a5692dc258d740ac865c91d8c8d7997a9 |
### Provenance

The following attestation bundles were made for hyped_crane-0.1.4-py3-none-any.whl:

Publisher: publish.yml on open-hyped/crane

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: hyped_crane-0.1.4-py3-none-any.whl
- Subject digest: bdbb25372a977f1ee31ef72cea72bdb4cc1a81ae9c9fec06311c2897baa41724
- Sigstore transparency entry: 646798653
- Permalink: open-hyped/crane@ea53d6e878fef4fe9cb8c93db4019b1ea7d0d2f0
- Branch / Tag: refs/tags/v0.1.4
- Owner: https://github.com/open-hyped
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@ea53d6e878fef4fe9cb8c93db4019b1ea7d0d2f0
- Trigger Event: create