
A declarative, contract-driven medallion pipeline engine for data mesh architectures. Write once. Run on Spark, Polars, or DuckDB.

Project description

LakeLogic

Your Data Estate. Under Contract.


Executable + Enforceable Data Contracts. Validate at runtime. Block bad merges in CI/CD.

Describe your data products in YAML — LakeLogic materializes them as Delta/Iceberg tables with lineage, quality, and SCD2 built in.

Write once. Run on Spark, Polars, or DuckDB. Data Contracts as Code — the executable layer for data mesh.

LakeLogic Architecture

One contract. Executed at runtime. Enforced in CI/CD. Every row flows through the same gates — across Spark, Polars, or DuckDB — with bad data quarantined and breaking changes blocked before merge.


Data Mesh Alignment

LakeLogic is the missing runtime layer for Data Mesh — where domain ownership and federated governance stop being principles and start being enforced.

Pillar | How LakeLogic Delivers
Domain Ownership | Contracts are owned and defined by domain teams (e.g., CRM, Finance) who know the data best.
Data as a Product | The contract IS the product interface — a versioned, schema-enforced, SLA-backed guarantee that consuming teams can depend on.
Self-Serve Platform | A standardized runtime that any team can use to deploy quality gates without infra silos.
Federated Governance | PII masking rules, SLA thresholds, and schema standards defined once in a central registry — automatically enforced at every domain pipeline.

Quick Start

pip install lakelogic

Runtime — execute a contract against your data:

from lakelogic import DataProcessor

result = DataProcessor("contract.yaml").run_source()
print(f"Valid: {result.good_count}  |  Quarantined: {result.bad_count}  |  Quality: {result.quality_score:.1f}")

CI/CD — block bad contract changes before they merge:

# Static validation — no data needed
lakelogic validate \
  --contract contract.yaml \
  --gates breaking_change,pii_classification,lineage_break

Drop lakelogic validate into your GitHub Actions workflow to enforce schema, PII, and lineage standards on every pull request.


Technical Capabilities

Data Quality & Trust

  • 100% Reconciliation — Mathematically guaranteed: source = good + bad. Every row is accounted for — nothing silently dropped
  • Pydantic-Powered Validation — All contract, system, and domain configs are parsed through strict Pydantic models with Literal type enforcement — invalid YAML is caught at load time, not at runtime
  • SQL-First Rules — Define business logic in the language your team already speaks — no SDK, no custom DSL
  • SLO Monitoring & Anomaly Detection — Native freshness, row-count, and statistical anomaly detection with automatic multi-channel alerting when thresholds are breached

✏️ Try it out in Google Colab: Data Quality & Trust

Compliance & Governance

  • Contract Gates (CI/CD Enforcement) — Static-analysis gates that block PRs introducing breaking schema changes, unmasked PII, or broken lineage. Run via lakelogic validate --gates breaking_change,pii_classification,lineage_break before any data touches your pipeline
  • GDPR & HIPAA Compliance — Contract-driven forget_subjects() with nullify, hash, or redact strategies and an immutable audit trail (see the sketch after this list)
  • Zero-Retention Architecture — Built-in zero_retention_days enforcement for transient data layers, automatically purging micro-batches after successful downstream processing
  • Automated PII Handling — Declarative encryption and hashing (pii: true, masking: "encrypt") applied at the Bronze layer, before the data is ever persisted at rest
  • Pipeline Cost Intelligence — Per-entity compute cost attribution with domain-level budget governance, autoscaling-aware estimation, and Databricks Unity Catalog billing integration
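
A minimal sketch of an erasure request, for illustration only. forget_subjects() is named in the bullet above, but attaching it to DataProcessor and the subject_ids / strategy parameter names are assumptions, not the documented API:

from lakelogic import DataProcessor

# Hedged sketch of a GDPR/HIPAA "forget" request driven by the contract.
# Assumption: forget_subjects() is exposed on the processor and accepts the
# subject identifiers plus one of the strategies listed above.
processor = DataProcessor("contract.yaml")
processor.forget_subjects(
    subject_ids=["customer-123", "customer-456"],  # hypothetical parameter name
    strategy="redact",                             # nullify | hash | redact
)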

✏️ Try it out in Google Colab: Compliance & Governance

Engine & Scale

  • Engine Agnostic — Write once, run on Spark, Polars, or DuckDB — same contract, zero code changes
  • Multi-Format Materialization — Natively output validated data to Apache Iceberg or Delta Lake open-table formats without requiring pipeline rewrites
  • Dimensional Modeling — Native SCD Type 2 (slowly changing dimensions), merge/upsert (SCD1), append-only fact tables, periodic snapshot overwrites, and partition-aware writes — all declared in YAML, no manual MERGE INTO SQL required
  • Incremental-First — Built-in watermarking, CDC, and file-mtime tracking
  • Parallel Processing — Concurrent multi-contract execution with data-layer-aware orchestration and topological dependency ordering
  • Backfill & Reprocessing — Targeted late-arriving data reprocessing with partition-aware filters — no full reload required
  • External Logic — Plug in custom Python scripts or notebooks for complex Gold-layer transformations while preserving full contract validation and lineage
  • Production Resilience — Built-in exponential-backoff retries, per-entity timeouts, and circuit-breaker thresholds (max_consecutive_failures) — pipelines self-heal transient failures without operator intervention

✏️ Try it out in Google Colab: Engine & Scale

Developer Experience

  • Structured Diagnostics & Observability — Deep contextual logging out-of-the-box (powered by loguru) featuring precise timestamps, severity levels, exact function paths, and execution tags to drastically cut troubleshooting time
  • Dry Run Mode — Validate contracts, resolve dependencies, and preview execution plans without touching any data
  • DDL-Only Mode — Generate and apply schema DDL (CREATE/ALTER) from contracts without running the pipeline — perfect for CI/CD migrations
  • DAG Dependency Viewer — Visualize cross-contract lineage and execution order before running — understand your pipeline graph at a glance
  • Data Reset & Reload — Surgically reset and reload specific entities or data layers (Bronze/Silver/Gold) without impacting the rest of the lakehouse
  • Multi-Channel Alerts — Powered by Apprise for Slack, Email (SMTP/SendGrid), Teams, and Webhook notifications with ownership-based auto-routing and full Jinja2 templating support for custom formatting

✏️ Try it out in Google Colab: Developer Experience

Data Generation & AI

  • Synthetic Data — Built-in DataGenerator (powered by Faker) with streaming simulation, time-windowed output, referential integrity, and edge-case injection — generate realistic error rows (SQL injection, type confusion, boundary values) for stress testing and quarantine validation (see the sketch after this list)
  • Descriptive AI Test Data — Steer synthetic data generation with natural language prompts (e.g. "Generate users who are French or Japanese only, enterprise-tier, over 60 years old with SQL injection attempts in email fields") — output strictly adheres to the YAML contract schema
  • AI Contract Onboarding — lakelogic infer auto-generates contracts from sample data with LLM-powered enrichment: automatic PII detection, column labelling, and quality rule suggestions
  • Unstructured Processing — LLM extraction from PDFs, images, and audio with the same contract validation and lineage
  • Automated Run Logs — Every pipeline run emits structured JSON with row counts, quality scores, durations, and error details — queryable as a Delta table
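
A rough sketch of the synthetic-data workflow, assuming DataGenerator is built from a contract and exposes row-count and error-injection knobs; the argument names below are assumptions, not the documented signature:

from lakelogic import DataGenerator

# Hedged sketch: produce contract-conformant synthetic rows with injected edge cases.
generator = DataGenerator("contract.yaml")
sample = generator.generate(
    rows=1_000,        # hypothetical: number of synthetic rows to emit
    error_rate=0.05,   # hypothetical: share of rows seeded with edge-case values
)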

✏️ Try it out in Google Colab: Data Generation & AI

Integrations

  • dbt Adapter — Import dbt schema.yml models and sources as LakeLogic contracts — reuse existing dbt definitions without rewriting
  • dlt (Data Load Tool) — Native DltAdapter supporting 100+ verified sources (Stripe, Shopify, SQL databases, Google Analytics, and more) plus declarative REST API ingestion — all with contract-driven quality gates on arrival
  • Native Streaming Connectors — Built-in WebSocketConnector, SSEConnector, KafkaConnector, WebhookConnector (plus Azure Event Grid, Service Bus, AWS SQS, GCP Pub/Sub) with pre-validation rename transformations for real-time feeds
  • Native Database Ingestion — High-performance SQL extraction via Polars/ConnectorX and DuckDB — PostgreSQL, MySQL, SQL Server, SQLite with automatic dialect detection
  • Incremental CDC — Watermark-based change data capture with automatic state tracking — only processes rows newer than the last run
  • Batch Processing — Memory-safe chunked ingestion via fetch_size for massive initial loads — handles 100GB+ tables without OOM
  • Column Projection Pushdown — Automatically constructs SELECT queries from model.fields — only extracts what the contract declares
  • Cloud Data Sources — Native abfss://, s3://, gs:// URI support with automatic credential resolution via CloudCredentialResolver — Azure AD, AWS IAM, GCP ADC, service principals, and Databricks secret scopes

✏️ Try it out in Google Colab: Integrations


What a Contract Looks Like

One YAML file replaces hundreds of lines of ingestion, validation, and materialization code:

version: "1.0"
info:
  title: "Silver Customers"
  domain: "CRM"
  system: "Salesforce"

model:
  fields:
    - name: customer_id
      type: integer
      required: true
    - name: email
      type: string
      pii: true
      masking: "hash"
    - name: status
      type: string

transformations:
  - deduplicate: [customer_id]
  - sql: "SELECT *, UPPER(status) AS status_norm FROM source"
    phase: pre

quality:
  row_rules:
    - sql: "email LIKE '%@%.%'"
    - sql: "status IN ('active', 'churned', 'pending')"
  dataset_rules:
    - unique: customer_id

materialization:
  strategy: merge
  merge_keys: [customer_id]
  format: iceberg  # natively supports iceberg, delta, parquet, csv

Same contract, any engine — swap engine="polars" for "spark" or "duckdb". Zero code changes.
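
A minimal sketch of the swap, assuming engine is accepted as a keyword argument where the contract is executed (it may live elsewhere in the API):

from lakelogic import DataProcessor

# Same contract, different engine: only the engine name changes.
result = DataProcessor("contract.yaml", engine="polars").run_source()  # or "spark" / "duckdb"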

Analogy: A contract is like a building inspection checklist. The inspector (LakeLogic) checks every room (row) against the blueprint (schema), flags violations (quarantine), and stamps a certificate (lineage) — regardless of whether the building was constructed with bricks (Spark), timber (Polars), or prefab (DuckDB).

What this buys you

Without LakeLogic | With LakeLogic
500+ lines of PySpark/Pandas validation per table | 40 lines of YAML
Bad rows silently dropped or crash the pipeline | Bad rows quarantined with error reasons
Schema drift discovered in production dashboards | Schema drift caught at ingestion and blocked in CI/CD
Manual dedup scripts per team | deduplicate: [key] — one line
PII scattered across notebooks | pii: true, masking: hash — automatic
Breaking contract changes shipped to prod | lakelogic validate --gates breaking_change blocks the PR
No audit trail | Every row stamped with run ID, source path, timestamp

Tip: View the Complete Contract Reference for every available configuration option.


Architecture

LakeLogic enforces Data Contracts as quality gates across the Medallion Architecture (Bronze → Silver → Gold). Each layer uses its own contract:

Layer | Role | Guarantee
Bronze | Capture everything raw, no validation | Immutable record of source
Silver | Full validation, business rules, dedup | Trusted, queryable data
Gold | Aggregations, KPIs, ML features | Analytics-ready datasets
Quarantine | Failed rows isolated with error reasons | Nothing silently dropped

Key Guarantee: source_count = good_count + bad_count — 100% reconciliation, always.
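
That guarantee can be spot-checked after any run. A small sketch, assuming the result object exposes a source_count counter alongside the good_count and bad_count fields shown in the Quick Start:

from lakelogic import DataProcessor

# 100% reconciliation: every source row ends up either validated or quarantined.
result = DataProcessor("contract.yaml").run_source()
assert result.source_count == result.good_count + result.bad_count  # source_count is assumed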

Examples

For a complete list of runnable guides and end-to-end notebooks, please visit the Examples section of our Documentation.


Documentation

For full guides, API references, tutorials, and contract templates, please visit the LakeLogic Documentation Site.

Contributing

See CONTRIBUTING.md to get started, or docs/installation.md#developer-installation for environment setup.


License

Apache-2.0

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lakelogic-1.31.0.tar.gz (4.3 MB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

lakelogic-1.31.0-py3-none-any.whl (642.9 kB)

Uploaded Python 3

File details

Details for the file lakelogic-1.31.0.tar.gz.

File metadata

  • Download URL: lakelogic-1.31.0.tar.gz
  • Upload date:
  • Size: 4.3 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.14 {"installer":{"name":"uv","version":"0.11.14","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for lakelogic-1.31.0.tar.gz
Algorithm Hash digest
SHA256 f70363ed77346738d295ddf1cd5368d33b2a5a4c3b52049535c41806fe7d4189
MD5 a405bc76b85e41f90c25c9745b5fa5ac
BLAKE2b-256 f8acaa9360409fa9e6a975fb0da46c0a087655794f00c615b2f7c771fe33de99

See more details on using hashes here.

File details

Details for the file lakelogic-1.31.0-py3-none-any.whl.

File metadata

  • Download URL: lakelogic-1.31.0-py3-none-any.whl
  • Upload date:
  • Size: 642.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.14 {"installer":{"name":"uv","version":"0.11.14","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for lakelogic-1.31.0-py3-none-any.whl
Algorithm Hash digest
SHA256 51c786defc1ea76438bf6c9a01b1752b73732994937c2d736ebe2270480cbc7b
MD5 fd599b5cd68d350f83fc625108a91db3
BLAKE2b-256 f12df733ef58ccbe6e274ce6010a6b89a1d482d058b6caf761b328abc9824898

See more details on using hashes here.
