Skip to main content

The file format behind TACO.

Project description

The file format behind TACO. 🫓

PyPI License Black isort


GitHub: https://github.com/tacofoundation/tortilla-python 🌐

PyPI: https://pypi.tw.martin98.com/project/pytortilla/ 🛠️


Tortilla 🫓

Hello! I'm a Tortilla, a format to serialize your EO data 🤗.

pytortilla is a Python package that simplifies the creation and management of .tortilla files—these files are designed to encapsulate metadata, dataset information, and links to relevant files in remote sensing or AI workflows.

This package is “re-exported” within tacotoolbox, specifically under tacotoolbox.tortilla. Therefore, by installing and using pytortilla, you can also leverage it from tacotoolbox.tortilla.

Goals

  1. Metadata handling: Defines classes (Sample, Samples) to describe and structure your data’s information.
  2. Dataset structuring: Easily generate training, validation, and testing splits, and store them in .tortilla files.
  3. Internal validation: Validate your dataset’s integrity (e.g., opening each file with rasterio).
  4. Integration with Earth Engine (ee): Combines local data operations with GEE functionalities.
  5. Unified usage with tacotoolbox: Load and manipulate these datasets with tacoreader and other helper functions from tacotoolbox.

Table of Contents

Installation

pip install pytortilla

or from source:

git clone https://github.com/tacofoundation/tortilla-python.git
cd tortilla-python
pip install .

Note: You may also install it as part of tacotoolbox, where pytortilla is included as a dependency.

Usage guide

In this guide, we delve deeper into the step-by-step creation of .tortilla files, providing tips and best practices.

import pathlib
import rasterio
import pandas as pd
from sklearn.model_selection import train_test_split
import pytortilla

If you need Earth Engine:

import ee
ee.Initialize()  # Requires prior authentication if not done already

Files

Move the Files from Hugging Face to Your Local Machine

import os

# URL path to the Hugging Face repository
path = "https://huggingface.co/datasets/tacofoundation/tortilla_demo/resolve/main/"

# List of demo files to download
files = [
    "demo/high__test__ROI_0010__20190125T112341_20190125T112624_T28QFG.tif",
    "demo/high__test__ROI_0011__20190130T103251_20190130T104108_T31REP.tif",
    "demo/high__test__ROI_0011__20190830T102029_20190830T102552_T31REP.tif",
    "demo/high__test__ROI_0064__20190317T015619_20190317T020354_T51JVH.tif",
    "demo/high__test__ROI_0120__20191219T045209_20191219T045214_T45TXE.tif",
    "demo/high__test__ROI_0141__20190316T141049_20190316T142437_T19FDE.tif",
    "demo/high__test__ROI_0159__20200403T143721_20200403T144642_T19HBV.tif",
    "demo/high__test__ROI_0235__20200402T053639_20200402T053638_T44UNV.tif"
]

# Create a local folder called 'demo' (if not already existing)
os.system("mkdir -p demo")

# Download each file to the 'demo' folder
for file in files:
    os.system(f"wget {path}{file} -O {file}")

Note: Depending on your environment, you might prefer using requests or urllib instead of os.system for downloading files.

At this point, you should have a demo/ folder populated with several .tif files.

Creating samples

Now, we will create samples using pytortilla:

import pathlib
import pandas as pd
from sklearn.model_selection import train_test_split
import rasterio
from pytortilla.datamodel import Sample, Samples

# Define the local path containing the TIFF files
demo_path = pathlib.Path("./demo")

# Collect all .tif files in the demo folder
all_files = list(demo_path.glob("*.tif"))

# Split into train, val, and test
train_files, test_files = train_test_split(all_files, test_size=0.2, random_state=42)
train_files, val_files = train_test_split(train_files, test_size=0.2, random_state=42)

train_df = pd.DataFrame({"path": train_files, "split": "train"})
val_df = pd.DataFrame({"path": val_files, "split": "validation"})
test_df = pd.DataFrame({"path": test_files, "split": "test"})
dataset_full = pd.concat([train_df, val_df, test_df], ignore_index=True)

# Build a list of Sample objects
samples_list = []
for _, row in dataset_full.iterrows():
    with rasterio.open(row.path) as src:
        metadata = src.profile
        sample_obj = Sample(
            id=row.path.stem,
            path=str(row.path),
            file_format="GTiff",
            data_split=row.split,
            stac_data={
                "crs": str(metadata["crs"]),
                "geotransform": metadata["transform"].to_gdal(),
                "raster_shape": (metadata["height"], metadata["width"])
            }
        )
        samples_list.append(sample_obj)

samples_obj = Samples(samples=samples_list)

Validation and adding metadata

Validate each .tif file by trying to open it:

samples_obj.deep_validator(read_function=lambda x: rasterio.open(x))

If you need RAI metadata (or any other additional metadata) in your workflow, you can include it:

samples_obj = samples_obj.include_rai_metadata(
    sample_footprint=5120,  # Example footprint value
    cache=False,
    quiet=False
)

Generating the .tortilla file

Use pytortilla.create.main.create() (or the equivalent tacotoolbox.tortilla.create if you have tacotoolbox installed):

from pytortilla.create.main import create

# Generate the .tortilla file
output_file = create(
    samples=samples_obj,
    output="demo_dataset.tortilla"
)

print(f"Tortilla file generated: {output_file}")

The .tortilla might split into multiple files (.0000.part.tortilla, etc.) for large datasets.

Loading and using the .tortilla file

Finally, load the .tortilla file (or its parts) with tacoreader:

import tacoreader
import pandas as pd

dataset_chunks = []

# Try loading .part.tortilla files (assuming a maximum of 4 parts for this example)
for i in range(4):
    part_file = f"demo_dataset.{i:04d}.part.tortilla"
    try:
        dataset_part = tacoreader.load(part_file)
        dataset_chunks.append(dataset_part)
    except FileNotFoundError:
        break  # Stop if no more parts

if dataset_chunks:
    dataset = pd.concat(dataset_chunks, ignore_index=True)
    print(dataset.head())
else:
    print("No tortilla parts found.")

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pytortilla-0.5.1.tar.gz (672.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pytortilla-0.5.1-py3-none-any.whl (668.6 kB view details)

Uploaded Python 3

File details

Details for the file pytortilla-0.5.1.tar.gz.

File metadata

  • Download URL: pytortilla-0.5.1.tar.gz
  • Upload date:
  • Size: 672.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 colorama/0.4.4 importlib-metadata/7.0.1 keyring/24.3.1 pkginfo/1.9.6 readme-renderer/34.0 requests-toolbelt/1.0.0 requests/2.32.3 rfc3986/1.5.0 tqdm/4.67.1 urllib3/2.3.0 CPython/3.10.12

File hashes

Hashes for pytortilla-0.5.1.tar.gz
Algorithm Hash digest
SHA256 cc37f4c9182afabf3fe99eccecb30e11e39209a24b0727b413770ba89fe2093c
MD5 e1ed8ea707bab3d77d87546e0af35785
BLAKE2b-256 1edbb2592e430f8c2a04005909d72b2f8ef8e551b86f75052dd63fbf26974e7a

See more details on using hashes here.

File details

Details for the file pytortilla-0.5.1-py3-none-any.whl.

File metadata

  • Download URL: pytortilla-0.5.1-py3-none-any.whl
  • Upload date:
  • Size: 668.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 colorama/0.4.4 importlib-metadata/7.0.1 keyring/24.3.1 pkginfo/1.9.6 readme-renderer/34.0 requests-toolbelt/1.0.0 requests/2.32.3 rfc3986/1.5.0 tqdm/4.67.1 urllib3/2.3.0 CPython/3.10.12

File hashes

Hashes for pytortilla-0.5.1-py3-none-any.whl
Algorithm Hash digest
SHA256 27b1d66db576a02c95070978c23e5410fbb8576938bf8c6da92c43c3106f367e
MD5 39a35b68bbb1268dc839c9fbf6be3e83
BLAKE2b-256 7296543816b82935f09d33038ea35210b5aaa47ba1b1ef5b2cebf277506d40cb

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page