
A declarative, contract-driven medallion pipeline engine for data mesh architectures. Write once. Run on Spark, Polars, or DuckDB.

Project description

LakeLogic

Your Data Estate. Under Contract.


Executable + Enforceable Data Contracts. Validate at runtime. Block bad merges in CI/CD.

Describe your data products in YAML — LakeLogic materializes them as Delta/Iceberg tables with lineage, quality, and SCD2 built in.

Write once. Run on Spark, Polars, or DuckDB. Data Contracts as Code — the executable layer for data mesh.

LakeLogic Architecture

One contract. Executed at runtime. Enforced in CI/CD. Every row flows through the same gates — across Spark, Polars, or DuckDB — with bad data quarantined and breaking changes blocked before merge.


Data Mesh Alignment

LakeLogic is the missing runtime layer for Data Mesh — where domain ownership and federated governance stop being principles and start being enforced.

Pillar | How LakeLogic Delivers
Domain Ownership | Contracts are owned and defined by domain teams (e.g., CRM, Finance) who know the data best.
Data as a Product | The contract IS the product interface — a versioned, schema-enforced, SLA-backed guarantee that consuming teams can depend on.
Self-Serve Platform | A standardized runtime that any team can use to deploy quality gates without infra silos.
Federated Governance | PII masking rules, SLA thresholds, and schema standards defined once in a central registry — automatically enforced at every domain pipeline.

Quick Start

pip install lakelogic

Runtime — execute a contract against your data:

from lakelogic import DataProcessor

result = DataProcessor("contract.yaml").run_source()
print(f"Valid: {result.good_count}  |  Quarantined: {result.bad_count}  |  Quality: {result.quality_score:.1f}")

CI/CD — block bad contract changes before they merge:

# Static validation — no data needed
lakelogic validate \
  --contract contract.yaml \
  --gates breaking_change,pii_classification,lineage_break

Drop lakelogic validate into your GitHub Actions workflow to enforce schema, PII, and lineage standards on every pull request.


Technical Capabilities

Data Quality & Trust

  • 100% Reconciliation — Mathematically guaranteed: source = good + bad. Every row is accounted for — nothing silently dropped
  • Pydantic-Powered Validation — All contract, system, and domain configs are parsed through strict Pydantic models with Literal type enforcement — invalid YAML is caught at load time, not at runtime
  • SQL-First Rules — Define business logic in the language your team already speaks — no SDK, no custom DSL
  • SLO Monitoring & Anomaly Detection — Native freshness, row-count, and statistical anomaly detection with automatic multi-channel alerting when thresholds are breached

✏️ Try it out in Google Colab: Data Quality & Trust

Compliance & Governance

  • Contract Gates (CI/CD Enforcement) — Static-analysis gates that block PRs introducing breaking schema changes, unmasked PII, or broken lineage. Run via lakelogic validate --gates breaking_change,pii_classification,lineage_break before any data touches your pipeline
  • GDPR & HIPAA Compliance — Contract-driven forget_subjects() with nullify, hash, or redact strategies and an immutable audit trail (see the sketch after this list)
  • Zero-Retention Architecture — Built-in zero_retention_days enforcement for transient data layers, automatically purging micro-batches after successful downstream processing
  • Automated PII Handling — Declarative encryption and hashing (pii: true, masking: "encrypt") applied at the Bronze layer, before the data is ever persisted at rest
  • Pipeline Cost Intelligence — Per-entity compute cost attribution with domain-level budget governance, autoscaling-aware estimation, and Databricks Unity Catalog billing integration
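
A minimal sketch of an erasure request, for illustration only. forget_subjects() is named in the bullet above, but attaching it to DataProcessor and the subject_ids / strategy parameter names are assumptions, not the documented API:

from lakelogic import DataProcessor

# Hedged sketch of a GDPR/HIPAA "forget" request driven by the contract.
# Assumption: forget_subjects() is exposed on the processor and accepts the
# subject identifiers plus one of the strategies listed above.
processor = DataProcessor("contract.yaml")
processor.forget_subjects(
    subject_ids=["customer-123", "customer-456"],  # hypothetical parameter name
    strategy="redact",                             # nullify | hash | redact
)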

✏️ Try it out in Google Colab: Compliance & Governance

Engine & Scale

  • Engine Agnostic — Write once, run on Spark, Polars, or DuckDB — same contract, zero code changes
  • Multi-Format Materialization — Natively output validated data to Apache Iceberg or Delta Lake open-table formats without requiring pipeline rewrites
  • Dimensional Modeling — Native SCD Type 2 (slowly changing dimensions), merge/upsert (SCD1), append-only fact tables, periodic snapshot overwrites, and partition-aware writes — all declared in YAML, no manual MERGE INTO SQL required
  • Incremental-First — Built-in watermarking, CDC, and file-mtime tracking
  • Parallel Processing — Concurrent multi-contract execution with data-layer-aware orchestration and topological dependency ordering
  • Backfill & Reprocessing — Targeted late-arriving data reprocessing with partition-aware filters — no full reload required
  • External Logic — Plug in custom Python scripts or notebooks for complex Gold-layer transformations while preserving full contract validation and lineage
  • Production Resilience — Built-in exponential-backoff retries, per-entity timeouts, and circuit-breaker thresholds (max_consecutive_failures) — pipelines self-heal transient failures without operator intervention

✏️ Try it out in Google Colab: Engine & Scale

Developer Experience

  • Structured Diagnostics & Observability — Deep contextual logging out-of-the-box (powered by loguru) featuring precise timestamps, severity levels, exact function paths, and execution tags to drastically cut troubleshooting time
  • Dry Run Mode — Validate contracts, resolve dependencies, and preview execution plans without touching any data
  • DDL-Only Mode — Generate and apply schema DDL (CREATE/ALTER) from contracts without running the pipeline — perfect for CI/CD migrations
  • DAG Dependency Viewer — Visualize cross-contract lineage and execution order before running — understand your pipeline graph at a glance
  • Data Reset & Reload — Surgically reset and reload specific entities or data layers (Bronze/Silver/Gold) without impacting the rest of the lakehouse
  • Multi-Channel Alerts — Powered by Apprise for Slack, Email (SMTP/SendGrid), Teams, and Webhook notifications with ownership-based auto-routing and full Jinja2 templating support for custom formatting

✏️ Try it out in Google Colab: Developer Experience

Data Generation & AI

  • Synthetic Data — Built-in DataGenerator (powered by Faker) with streaming simulation, time-windowed output, referential integrity, and edge-case injection — generate realistic error rows (SQL injection, type confusion, boundary values) for stress testing and quarantine validation (see the sketch after this list)
  • Descriptive AI Test Data — Steer synthetic data generation with natural language prompts (e.g. "Generate users who are French or Japanese only, enterprise-tier, over 60 years old with SQL injection attempts in email fields") — output strictly adheres to the YAML contract schema
  • AI Contract Onboarding — lakelogic infer auto-generates contracts from sample data with LLM-powered enrichment: automatic PII detection, column labelling, and quality rule suggestions
  • Unstructured Processing — LLM extraction from PDFs, images, and audio with the same contract validation and lineage
  • Automated Run Logs — Every pipeline run emits structured JSON with row counts, quality scores, durations, and error details — queryable as a Delta table
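
A rough sketch of the synthetic-data workflow, assuming DataGenerator is built from a contract and exposes row-count and error-injection knobs; the argument names below are assumptions, not the documented signature:

from lakelogic import DataGenerator

# Hedged sketch: produce contract-conformant synthetic rows with injected edge cases.
generator = DataGenerator("contract.yaml")
sample = generator.generate(
    rows=1_000,        # hypothetical: number of synthetic rows to emit
    error_rate=0.05,   # hypothetical: share of rows seeded with edge-case values
)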

✏️ Try it out in Google Colab: Data Generation & AI

Integrations

  • dbt Adapter — Import dbt schema.yml models and sources as LakeLogic contracts — reuse existing dbt definitions without rewriting
  • dlt (Data Load Tool) — Native DltAdapter supporting 100+ verified sources (Stripe, Shopify, SQL databases, Google Analytics, and more) plus declarative REST API ingestion — all with contract-driven quality gates on arrival
  • Native Streaming Connectors — Built-in WebSocketConnector, SSEConnector, KafkaConnector, WebhookConnector (plus Azure Event Grid, Service Bus, AWS SQS, GCP Pub/Sub) with pre-validation rename transformations for real-time feeds
  • Native Database Ingestion — High-performance SQL extraction via Polars/ConnectorX and DuckDB — PostgreSQL, MySQL, SQL Server, SQLite with automatic dialect detection
  • Incremental CDC — Watermark-based change data capture with automatic state tracking — only processes rows newer than the last run
  • Batch Processing — Memory-safe chunked ingestion via fetch_size for massive initial loads — handles 100GB+ tables without OOM
  • Column Projection Pushdown — Automatically constructs SELECT queries from model.fields — only extracts what the contract declares
  • Cloud Data Sources — Native abfss://, s3://, gs:// URI support with automatic credential resolution via CloudCredentialResolver — Azure AD, AWS IAM, GCP ADC, service principals, and Databricks secret scopes

✏️ Try it out in Google Colab: Integrations


What a Contract Looks Like

One YAML file replaces hundreds of lines of ingestion, validation, and materialization code:

version: "1.0"
info:
  title: "Silver Customers"
  domain: "CRM"
  system: "Salesforce"

model:
  fields:
    - name: customer_id
      type: integer
      required: true
    - name: email
      type: string
      pii: true
      masking: "hash"
    - name: status
      type: string

transformations:
  - deduplicate: [customer_id]
  - sql: "SELECT *, UPPER(status) AS status_norm FROM source"
    phase: pre

quality:
  row_rules:
    - sql: "email LIKE '%@%.%'"
    - sql: "status IN ('active', 'churned', 'pending')"
  dataset_rules:
    - unique: customer_id

materialization:
  strategy: merge
  merge_keys: [customer_id]
  format: iceberg  # natively supports iceberg, delta, parquet, csv

Same contract, any engine — swap engine="polars" for "spark" or "duckdb". Zero code changes.
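
A minimal sketch of the swap, assuming engine is accepted as a keyword argument where the contract is executed (it may live elsewhere in the API):

from lakelogic import DataProcessor

# Same contract, different engine: only the engine name changes.
result = DataProcessor("contract.yaml", engine="polars").run_source()  # or "spark" / "duckdb"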

Analogy: A contract is like a building inspection checklist. The inspector (LakeLogic) checks every room (row) against the blueprint (schema), flags violations (quarantine), and stamps a certificate (lineage) — regardless of whether the building was constructed with bricks (Spark), timber (Polars), or prefab (DuckDB).

What this buys you

Without LakeLogic | With LakeLogic
500+ lines of PySpark/Pandas validation per table | 40 lines of YAML
Bad rows silently dropped or crash the pipeline | Bad rows quarantined with error reasons
Schema drift discovered in production dashboards | Schema drift caught at ingestion and blocked in CI/CD
Manual dedup scripts per team | deduplicate: [key] — one line
PII scattered across notebooks | pii: true, masking: hash — automatic
Breaking contract changes shipped to prod | lakelogic validate --gates breaking_change blocks the PR
No audit trail | Every row stamped with run ID, source path, timestamp

Tip: View the Complete Contract Reference for every available configuration option.


Architecture

LakeLogic enforces Data Contracts as quality gates across the Medallion Architecture (Bronze → Silver → Gold). Each layer uses its own contract:

Layer | Role | Guarantee
Bronze | Capture everything raw, no validation | Immutable record of source
Silver | Full validation, business rules, dedup | Trusted, queryable data
Gold | Aggregations, KPIs, ML features | Analytics-ready datasets
Quarantine | Failed rows isolated with error reasons | Nothing silently dropped

Key Guarantee: source_count = good_count + bad_count — 100% reconciliation, always.
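
That guarantee can be spot-checked after any run. A small sketch, assuming the result object exposes a source_count counter alongside the good_count and bad_count fields shown in the Quick Start:

from lakelogic import DataProcessor

# 100% reconciliation: every source row ends up either validated or quarantined.
result = DataProcessor("contract.yaml").run_source()
assert result.source_count == result.good_count + result.bad_count  # source_count is assumed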

Examples

For a complete list of runnable guides and end-to-end notebooks, please visit the Examples section of our Documentation.


Documentation

For full guides, API references, tutorials, and contract templates, please visit the LakeLogic Documentation Site.

Contributing

See CONTRIBUTING.md to get started, or docs/installation.md#developer-installation for environment setup.


License

Apache-2.0

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lakelogic-1.31.0.tar.gz (4.3 MB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

lakelogic-1.31.0-py3-none-any.whl (642.9 kB)

Uploaded Python 3

File details

Details for the file lakelogic-1.31.0.tar.gz.

File metadata

  • Download URL: lakelogic-1.31.0.tar.gz
  • Upload date:
  • Size: 4.3 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.14 {"installer":{"name":"uv","version":"0.11.14","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for lakelogic-1.31.0.tar.gz
Algorithm Hash digest
SHA256 f70363ed77346738d295ddf1cd5368d33b2a5a4c3b52049535c41806fe7d4189
MD5 a405bc76b85e41f90c25c9745b5fa5ac
BLAKE2b-256 f8acaa9360409fa9e6a975fb0da46c0a087655794f00c615b2f7c771fe33de99

See more details on using hashes here.

File details

Details for the file lakelogic-1.31.0-py3-none-any.whl.

File metadata

  • Download URL: lakelogic-1.31.0-py3-none-any.whl
  • Upload date:
  • Size: 642.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.14 {"installer":{"name":"uv","version":"0.11.14","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for lakelogic-1.31.0-py3-none-any.whl
Algorithm Hash digest
SHA256 51c786defc1ea76438bf6c9a01b1752b73732994937c2d736ebe2270480cbc7b
MD5 fd599b5cd68d350f83fc625108a91db3
BLAKE2b-256 f12df733ef58ccbe6e274ce6010a6b89a1d482d058b6caf761b328abc9824898

See more details on using hashes here.
