Simple, production-ready Python profiling for data pipelines

PipelineScope

Production-ready profiling and performance monitoring for Python data pipelines, ML workflows, and ETL systems

PipelineScope is a lightweight Python profiling library that instruments data pipelines, ETL systems, and ML workflows to identify bottlenecks, track resource consumption, and extrapolate performance metrics across scales.


✨ Features

  • Zero-Configuration Profiling - Works out of the box with sensible defaults
  • Scalable Insights - Sample at 100 functions, extrapolate to 1M+ with statistical confidence
  • Resource Monitoring - CPU, GPU, and memory tracking with per-function attribution
  • Production Ready - Minimal overhead, deterministic profiling via sys.setprofile
  • Static HTML Reports - Modern glassmorphism UI, no server dependencies
  • CLI Utilities - Diff profiling runs, compare baseline vs. current performance
  • YAML Configuration - Flexible runtime configuration with auto-discovery
  • Comprehensive Logging - Built on py-logex for structured, production-grade logging
  • Realistic Examples - Three end-to-end examples: simple linear, nested calls, complex graphs
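The baseline-vs-current comparison from the CLI utilities bullet can be sketched as a per-function time delta. This is an illustration only, not PipelineScope's actual diff implementation; `diff_runs` and the stat keys here are hypothetical:

```python
# Illustrative sketch of a profile-diff utility; diff_runs and the
# "total_time_ms" key are hypothetical, not PipelineScope's real API.
def diff_runs(baseline: dict, current: dict) -> dict:
    """Return per-function time delta (ms) between two profiling runs."""
    deltas = {}
    for name, stats in current.items():
        base_ms = baseline.get(name, {}).get("total_time_ms", 0.0)
        deltas[name] = stats["total_time_ms"] - base_ms
    return deltas

baseline = {"etl:transform": {"total_time_ms": 120.0}}
current = {"etl:transform": {"total_time_ms": 180.0},
           "etl:load": {"total_time_ms": 40.0}}
print(diff_runs(baseline, current))  # {'etl:transform': 60.0, 'etl:load': 40.0}
```

A positive delta flags a regression relative to the baseline; functions absent from the baseline show their full current time.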

📦 Installation

pip install pipelinescope

Requirements:

  • Python >= 3.8
  • psutil >= 5.8.0
  • GPUtil >= 1.4.0
  • pyyaml >= 6.0
  • jinja2 >= 3.0.0
  • py-logex-enhanced >= 0.1.3

🚀 Quickstart

Minimal Integration (2 Lines)

from pipelinescope import profile_pipeline

if __name__ == "__main__":
    profile_pipeline.start()
    # Your pipeline code here
    process_data()
    train_model()
    export_results()

PipelineScope automatically:

  1. Profiles all function calls via sys.setprofile
  2. Monitors CPU, GPU, and memory per function
  3. Extrapolates metrics from your sample to expected scale
  4. Generates an interactive HTML dashboard
  5. Logs detailed profiling data as JSON
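The sys.setprofile mechanism behind step 1 can be illustrated with a minimal standalone profiler. This is a sketch of the technique, not PipelineScope's internals:

```python
import sys
import time

# Minimal deterministic profiler built on sys.setprofile (illustrative).
stats = {}    # function name -> (call_count, total_seconds)
_starts = []  # stack of (name, start_time) for currently active calls

def _hook(frame, event, arg):
    if event == "call":
        _starts.append((frame.f_code.co_name, time.perf_counter()))
    elif event == "return" and _starts:
        name, t0 = _starts.pop()
        count, total = stats.get(name, (0, 0.0))
        stats[name] = (count + 1, total + time.perf_counter() - t0)

def work():
    return sum(range(1000))

sys.setprofile(_hook)   # interpreter now reports call/return events
work()
sys.setprofile(None)    # stop profiling
print(stats["work"])    # (1, <elapsed seconds>)
```

The hook receives a "call" and "return" event for every Python-level function, which is what makes this approach deterministic rather than sampling-based.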

Output: .pipelinescope_output/run_<timestamp>/

  • summary.html - Interactive profiling dashboard
  • profile_data.json - Raw profiling data and statistics
  • pipelinescope.log - Detailed execution logs

📋 Configuration

Create .pipelinescope.yaml in your project root (auto-discovered):

# Profiling behavior
sample_size: 100                    # Number of functions sampled
expected_size: 1000000              # Expected function count at production scale
min_time_threshold_ms: 1.0          # Minimum function duration to report (ms)
min_time_percentage: 0.5            # Minimum % of total time to report

# Output
output_dir: .pipelinescope_output
dashboard_title: "My Pipeline"
enable_dashboard: true

# Resource monitoring
enable_cpu_monitoring: true
enable_gpu_monitoring: true

# Filtering
collapse_stdlib: true               # Hide standard library frames
ignore_modules:                     # Exclude module patterns
  - venv
  - site-packages
  - .venv
  - env

# Logging
enable_console_logging: false
log_file: pipelinescope.log
log_level: INFO
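To illustrate how such a file would interact with built-in defaults, here is a hedged sketch; `DEFAULTS` and `load_config` are hypothetical names, and in practice the YAML would first be parsed (e.g. with `yaml.safe_load`):

```python
# Hedged sketch of defaults-plus-overrides config loading. Key names
# mirror the YAML above; the helper itself is hypothetical.
DEFAULTS = {
    "sample_size": 100,
    "expected_size": 1_000_000,
    "min_time_threshold_ms": 1.0,
    "output_dir": ".pipelinescope_output",
    "enable_gpu_monitoring": True,
}

def load_config(user_config: dict) -> dict:
    """Overlay user-supplied keys (e.g. parsed from YAML) onto defaults."""
    merged = dict(DEFAULTS)
    merged.update(user_config)
    return merged

cfg = load_config({"expected_size": 500_000, "enable_gpu_monitoring": False})
print(cfg["expected_size"], cfg["sample_size"])  # 500000 100
```

Any key omitted from the YAML keeps its default, so a minimal config file only needs the values you want to change.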

Configuration Discovery

If no config_path is passed to profile_pipeline.start(), PipelineScope searches the current directory and up to five parent levels for .pipelinescope.yaml, falling back to built-in defaults if none is found.
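The discovery walk can be sketched as follows; `discover_config` is a hypothetical helper, not the library's actual code:

```python
from pathlib import Path
from typing import Optional

# Illustrative sketch of config auto-discovery: check the starting
# directory and up to five parent levels for .pipelinescope.yaml.
def discover_config(start: Path, max_levels: int = 6) -> Optional[Path]:
    current = start.resolve()
    for _ in range(max_levels):
        candidate = current / ".pipelinescope.yaml"
        if candidate.is_file():
            return candidate
        if current.parent == current:  # reached the filesystem root
            return None
        current = current.parent
    return None
```

Because the walk stops at a fixed depth, a config file deeper than the search window is silently ignored, which is one reason to pass an explicit path when in doubt.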


📊 Architecture Overview

User Code
    ↓
PipelineScope.start()
    ↓
┌─────────────────────────────────────────┐
│         Profiler (sys.setprofile)       │
│  ├─ Tracks function calls and timing    │
│  ├─ Manages call stack depth            │
│  └─ Integrates with ResourceMonitor     │
└─────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────┐
│      ResourceMonitor (psutil/GPUtil)    │
│  ├─ Samples CPU % per function          │
│  ├─ Tracks memory (RSS) per function    │
│  └─ Monitors GPU utilization            │
└─────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────┐
│      Stats & Extrapolation Module       │
│  ├─ Aggregates call counts and timing   │
│  ├─ Calculates percentiles              │
│  └─ Extrapolates to production scale    │
└─────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────┐
│       Report Generation (Jinja2)        │
│  ├─ Renders static HTML dashboard       │
│  ├─ Serializes data as JSON             │
│  └─ Writes logs via py-logex            │
└─────────────────────────────────────────┘
    ↓
Output: HTML + JSON + Logs

💻 API Reference

profile_pipeline.start(config_path=None)

Activate profiling for the entire pipeline.

from pipelinescope import profile_pipeline
from pathlib import Path

# Auto-discover config (.pipelinescope.yaml)
profile_pipeline.start()

# Or specify explicit path
profile_pipeline.start(config_path=Path("./config/custom.yaml"))

# Then run your pipeline
my_pipeline()

Behavior:

  • Starts a global singleton profiler (one per process)
  • Registers atexit handler to finalize on process exit
  • Can be called multiple times; subsequent calls are no-ops

profile_pipeline.stop()

Manually finalize profiling and generate outputs. Usually not needed (automatic on process exit).

profile_pipeline.start()
my_pipeline()
profile_pipeline.stop()  # Generate outputs immediately

🔧 End-to-End Usage Example

Simple Linear Pipeline

# pipeline.py
def extract(data):
    """Load data"""
    return data * 2

def transform(data):
    """Clean and validate"""
    return [x for x in data if x > 0]

def load(data):
    """Save results"""
    return len(data)

def run_pipeline(n):
    data = list(range(n))
    data = extract(data)
    data = transform(data)
    result = load(data)
    return result

# main.py
from pipelinescope import profile_pipeline
from pipeline import run_pipeline

if __name__ == "__main__":
    profile_pipeline.start()
    for i in range(10):
        run_pipeline(1000)
    # Profiling completes automatically on exit
    # Check .pipelinescope_output/run_<timestamp>/ for results

Running:

python main.py
# Output:
# PipelineScope profiling started
# Configuration loaded from: defaults
# Tracked 12 functions
# Extrapolating from 100 to 1000000
# Report generated: .pipelinescope_output/run_1704461234/summary.html
# JSON data saved: .pipelinescope_output/run_1704461234/profile_data.json
# PipelineScope profiling complete

Nested Calls (Complex Graph)

See examples/nested_calls/ for a pipeline with multiple function call layers and interdependencies.

Complex DAG Pipeline

See examples/complex_graph/ for a realistic pipeline with parallel-like execution patterns.


🧪 Testing

Local Development

# Clone repository
git clone https://github.com/sherozshaikh/pipelinescope.git
cd pipelinescope

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install in development mode
pip install -e ".[dev]"

# Run test suite
pytest

# Run with coverage report
pytest --cov=pipelinescope --cov-report=html --cov-report=term-missing

# Run specific test file
pytest tests/test_profiler.py -v

# Run specific test
pytest tests/test_profiler.py::TestProfilerBasic::test_profiler_start_stop -v

Test Structure

  • Unit Tests: test_config.py, test_stats.py, test_serializer.py, test_extrapolation.py, test_logger.py
  • Integration Tests: test_profiler.py, test_resource_monitor.py, test_analyzer.py, test_generator.py
  • E2E Tests: test_entrypoint.py (global profiler lifecycle)
  • CLI Tests: test_diff.py (diff utility)
  • Fixtures: conftest.py (mock psutil/GPUtil, singleton reset, temp directories)

🔄 CI/CD

This project uses GitHub Actions for continuous testing.

Local CI Simulation

# Install test dependencies
pip install -e ".[dev]"

# Run tests with coverage (as CI does)
pytest --cov=pipelinescope --cov-report=xml --cov-report=term-missing

# Run linting (optional, not in CI)
ruff check src/ tests/
black --check src/ tests/

GitHub Actions Workflow

The repository includes .github/workflows/tests.yml which:

  • Runs on push to main and develop
  • Runs on pull_request to main and develop
  • Tests on Python 3.8 (primary)
  • Uploads coverage to Codecov
  • Uses pip caching for fast builds

๐Ÿ› ๏ธ Troubleshooting

No Data Collected

Issue: Dashboard is empty or "No profiling data collected" warning.

Cause: Pipeline finishes before profiler registers meaningful function calls (e.g., pipeline runs too fast or only calls built-in functions).

Solution:

  • Ensure your pipeline calls user-defined functions (not just built-ins)
  • Profile longer-running pipelines with more function overhead
  • Check the min_time_threshold_ms and min_time_percentage config values; lower them if needed

High Overhead / Slow Profiling

Issue: Profiler adds significant latency to pipeline execution.

Cause: System is profiling too many functions (e.g., stdlib, venv).

Solution:

  • Set collapse_stdlib: true (default)
  • Extend ignore_modules to exclude unnecessary paths
  • Increase min_time_threshold_ms to skip short-lived functions
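The effect of these thresholds can be illustrated with a simple filter; `filter_stats` is a hypothetical helper, though the parameter names are the actual config keys:

```python
# Illustrative threshold filter; filter_stats is hypothetical, but the
# parameter names match the config keys documented above.
def filter_stats(stats: dict, total_time_ms: float,
                 min_time_threshold_ms: float = 1.0,
                 min_time_percentage: float = 0.5) -> dict:
    """Keep functions above both the absolute and relative time floors."""
    return {
        name: s for name, s in stats.items()
        if s["total_time_ms"] >= min_time_threshold_ms
        and 100.0 * s["total_time_ms"] / total_time_ms >= min_time_percentage
    }

stats = {"fast": {"total_time_ms": 0.2}, "slow": {"total_time_ms": 90.0}}
print(filter_stats(stats, total_time_ms=100.0))  # only 'slow' survives
```

Raising either floor trims short-lived functions from the report, which is why it also reduces profiling overhead and output size.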

Missing GPU Data

Issue: GPU metrics not appearing in dashboard.

Cause: GPUtil not installed, or no NVIDIA GPU detected.

Solution:

  • Verify GPU driver: nvidia-smi
  • Set enable_gpu_monitoring: false if GPU unavailable
  • Check pipelinescope.log for GPUtil errors

Config Not Loading

Issue: Custom .pipelinescope.yaml ignored; defaults used instead.

Cause: Config file path incorrect or not in search path (current directory or 5 parent levels).

Solution:

  • Verify file exists: ls -la .pipelinescope.yaml
  • Check YAML syntax: python -c "import yaml; yaml.safe_load(open('.pipelinescope.yaml'))"
  • Pass explicit path: profile_pipeline.start(config_path=Path("./config/custom.yaml"))

Memory Bloat in Long-Running Pipelines

Issue: Memory usage grows unbounded.

Cause: Profiler accumulates function stats indefinitely.

Solution:

  • Profile in segments (restart process between segments)
  • Lower expected_size to trigger extrapolation earlier
  • Review function_stats dict size in logs

📖 Example Outputs

HTML Dashboard

The generated summary.html includes:

  • Function call tree - Shows nested call hierarchy and timing
  • Top 20 by time - Slowest functions across the pipeline
  • Resource usage - CPU, memory, and GPU per function
  • Call statistics - Counts, percentiles, extrapolated metrics

JSON Profile Data

profile_data.json structure:

{
  "metadata": {
    "sample_size": 100,
    "expected_size": 1000000,
    "profiling_duration_seconds": 12.34,
    "total_functions": 45
  },
  "function_stats": {
    "module:function_name": {
      "call_count": 1000,
      "total_time_ms": 5000.0,
      "cpu_percent": 45.2,
      "memory_mb": 256.5,
      "gpu_memory_mb": 512.0
    }
  },
  "extrapolated_stats": {
    "module:function_name": {
      "extrapolated_call_count": 10000000,
      "extrapolated_total_time_ms": 50000000.0
    }
  }
}
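The extrapolated figures above are consistent with a linear scale factor of expected_size / sample_size (here 10,000). A hedged sketch of that arithmetic follows; it is not necessarily the library's exact statistical model:

```python
# Hedged sketch of the linear extrapolation the JSON above implies
# (scale factor = expected_size / sample_size); illustrative only.
def extrapolate(function_stats: dict, sample_size: int,
                expected_size: int) -> dict:
    factor = expected_size / sample_size
    return {
        name: {
            "extrapolated_call_count": int(stats["call_count"] * factor),
            "extrapolated_total_time_ms": stats["total_time_ms"] * factor,
        }
        for name, stats in function_stats.items()
    }

stats = {"module:function_name": {"call_count": 1000, "total_time_ms": 5000.0}}
print(extrapolate(stats, 100, 1_000_000))
# {'module:function_name': {'extrapolated_call_count': 10000000,
#   'extrapolated_total_time_ms': 50000000.0}}
```

With sample_size 100 and expected_size 1,000,000, the factor is 10,000, which reproduces the call count and timing shown in the example JSON.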

๐Ÿค Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/your-feature)
  3. Make your changes and add tests
  4. Run the test suite: pytest --cov=pipelinescope
  5. Format and lint the code (isort ., black ., ruff check --fix ., ruff format .)
  6. Submit a pull request

📄 License

MIT License - see LICENSE file for details.


๐Ÿ™ Acknowledgments


๐Ÿ“ง Support


Made with ❤️ for production data engineering
