Simple, production-ready Python profiling for data pipelines
Project description
PipelineScope
Production-ready profiling and performance monitoring for Python data pipelines, ML workflows, and ETL systems
PipelineScope is a lightweight Python profiling library that instruments data pipelines, ETL systems, and ML workflows to identify bottlenecks, track resource consumption, and extrapolate performance metrics across scales.
Features
- Zero-Configuration Profiling - Works out of the box with sensible defaults
- Scalable Insights - Sample at 100 functions, extrapolate to 1M+ with statistical confidence
- Resource Monitoring - CPU, GPU, and memory tracking with per-function attribution
- Production Ready - Minimal overhead, deterministic profiling via sys.setprofile
- Static HTML Reports - Modern glassmorphism UI, no server dependencies
- CLI Utilities - Diff profiling runs, compare baseline vs. current performance
- YAML Configuration - Flexible runtime configuration with auto-discovery
- Comprehensive Logging - Built on py-logex for structured, production-grade logging
- Realistic Examples - Three end-to-end examples: simple linear, nested calls, complex graphs
Installation
pip install pipelinescope
Requirements:
- Python >= 3.8
- psutil >= 5.8.0
- GPUtil >= 1.4.0
- pyyaml >= 6.0
- jinja2 >= 3.0.0
- py-logex-enhanced >= 0.1.3
Quickstart
Minimal Integration (2 Lines)
from pipelinescope import profile_pipeline
if __name__ == "__main__":
profile_pipeline.start()
# Your pipeline code here
process_data()
train_model()
export_results()
PipelineScope automatically:
- Profiles all function calls via sys.setprofile
- Monitors CPU, GPU, and memory per function
- Extrapolates metrics from your sample to expected scale
- Generates an interactive HTML dashboard
- Logs detailed profiling data as JSON
Output: .pipelinescope_output/run_<timestamp>/
- summary.html - Interactive profiling dashboard
- profile_data.json - Raw profiling data and statistics
- pipelinescope.log - Detailed execution logs
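The JSON output is easy to post-process. As a small sketch, a helper that ranks the slowest functions from profile_data.json (assuming the function_stats schema shown under Example Outputs below):

```python
import json

def top_functions(profile_path, n=5):
    """Load PipelineScope JSON output and return the n slowest functions."""
    with open(profile_path) as f:
        data = json.load(f)
    stats = data.get("function_stats", {})
    # Rank by total wall time, descending
    ranked = sorted(
        stats.items(),
        key=lambda kv: kv[1].get("total_time_ms", 0.0),
        reverse=True,
    )
    return [(name, s["total_time_ms"]) for name, s in ranked[:n]]
```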
Configuration
Create .pipelinescope.yaml in your project root (auto-discovered):
# Profiling behavior
sample_size: 100 # Number of functions sampled
expected_size: 1000000 # Expected function count at production scale
min_time_threshold_ms: 1.0 # Minimum function duration to report (ms)
min_time_percentage: 0.5 # Minimum % of total time to report
# Output
output_dir: .pipelinescope_output
dashboard_title: "My Pipeline"
enable_dashboard: true
# Resource monitoring
enable_cpu_monitoring: true
enable_gpu_monitoring: true
# Filtering
collapse_stdlib: true # Hide standard library frames
ignore_modules: # Exclude module patterns
- venv
- site-packages
- .venv
- env
# Logging
enable_console_logging: false
log_file: pipelinescope.log
log_level: INFO
Configuration Discovery
If no config_path is specified in profile_pipeline.start(), PipelineScope walks up 6 directory levels searching for .pipelinescope.yaml. Falls back to defaults if not found.
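The discovery logic amounts to a bounded upward walk. A minimal sketch of the idea (illustrative only, not the library's actual code):

```python
from pathlib import Path

def discover_config(start=None, max_levels=6, name=".pipelinescope.yaml"):
    """Walk up from start (default: cwd) looking for the config file."""
    current = Path(start or Path.cwd()).resolve()
    for _ in range(max_levels):
        candidate = current / name
        if candidate.is_file():
            return candidate
        if current.parent == current:  # hit the filesystem root
            break
        current = current.parent
    return None  # caller falls back to defaults
```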
Architecture Overview
User Code
   |
PipelineScope.start()
   |
+------------------------------------------+
| Profiler (sys.setprofile)                |
|  - Tracks function calls and timing      |
|  - Manages call stack depth              |
|  - Integrates with ResourceMonitor       |
+------------------------------------------+
   |
+------------------------------------------+
| ResourceMonitor (psutil/GPUtil)          |
|  - Samples CPU % per function            |
|  - Tracks memory (RSS) per function      |
|  - Monitors GPU utilization              |
+------------------------------------------+
   |
+------------------------------------------+
| Stats & Extrapolation Module             |
|  - Aggregates call counts and timing     |
|  - Calculates percentiles                |
|  - Extrapolates to production scale      |
+------------------------------------------+
   |
+------------------------------------------+
| Report Generation (Jinja2)               |
|  - Renders static HTML dashboard         |
|  - Serializes data as JSON               |
|  - Writes logs via py-logex              |
+------------------------------------------+
   |
Output: HTML + JSON + Logs
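At its core, deterministic profiling of this kind hangs a callback on the interpreter via the standard sys.setprofile hook. A stripped-down sketch of the mechanism (illustrative; PipelineScope's internals are more involved):

```python
import sys
import time
from collections import defaultdict

class MiniProfiler:
    """Accumulate per-function call counts and wall time via sys.setprofile."""

    def __init__(self):
        self.stats = defaultdict(lambda: {"calls": 0, "total_s": 0.0})
        self._starts = {}

    def _hook(self, frame, event, arg):
        # sys.setprofile fires "call"/"return" for Python functions
        key = f"{frame.f_globals.get('__name__')}:{frame.f_code.co_name}"
        if event == "call":
            self._starts[id(frame)] = time.perf_counter()
        elif event == "return":
            start = self._starts.pop(id(frame), None)
            if start is not None:
                self.stats[key]["calls"] += 1
                self.stats[key]["total_s"] += time.perf_counter() - start

    def start(self):
        sys.setprofile(self._hook)

    def stop(self):
        sys.setprofile(None)
```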
API Reference
profile_pipeline.start(config_path=None)
Activate profiling for the entire pipeline.
from pipelinescope import profile_pipeline
from pathlib import Path
# Auto-discover config (.pipelinescope.yaml)
profile_pipeline.start()
# Or specify explicit path
profile_pipeline.start(config_path=Path("./config/custom.yaml"))
# Then run your pipeline
my_pipeline()
Behavior:
- Starts a global singleton profiler (one per process)
- Registers an atexit handler to finalize on process exit
- Can be called multiple times; subsequent calls are no-ops
profile_pipeline.stop()
Manually finalize profiling and generate outputs. Usually not needed (automatic on process exit).
profile_pipeline.start()
my_pipeline()
profile_pipeline.stop() # Generate outputs immediately
End-to-End Usage Example
Simple Linear Pipeline
# pipeline.py
def extract(data):
"""Load data"""
return data * 2
def transform(data):
"""Clean and validate"""
return [x for x in data if x > 0]
def load(data):
"""Save results"""
return len(data)
def run_pipeline(n):
data = list(range(n))
data = extract(data)
data = transform(data)
result = load(data)
return result
# main.py
from pipelinescope import profile_pipeline
from pipeline import run_pipeline
if __name__ == "__main__":
profile_pipeline.start()
for i in range(10):
run_pipeline(1000)
# Profiling completes automatically on exit
# Check .pipelinescope_output/run_<timestamp>/ for results
Running:
python main.py
# Output:
# PipelineScope profiling started
# Configuration loaded from: defaults
# Tracked 12 functions
# Extrapolating from 100 to 1000000
# Report generated: .pipelinescope_output/run_1704461234/summary.html
# JSON data saved: .pipelinescope_output/run_1704461234/profile_data.json
# PipelineScope profiling complete
Nested Calls (Complex Graph)
See examples/nested_calls/ for a pipeline with multiple function call layers and interdependencies.
Complex DAG Pipeline
See examples/complex_graph/ for a realistic pipeline with parallel-like execution patterns.
Testing
Local Development
# Clone repository
git clone https://github.com/sherozshaikh/pipelinescope.git
cd pipelinescope
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install in development mode
pip install -e ".[dev]"
# Run test suite
pytest
# Run with coverage report
pytest --cov=pipelinescope --cov-report=html --cov-report=term-missing
# Run specific test file
pytest tests/test_profiler.py -v
# Run specific test
pytest tests/test_profiler.py::TestProfilerBasic::test_profiler_start_stop -v
Test Structure
- Unit Tests: test_config.py, test_stats.py, test_serializer.py, test_extrapolation.py, test_logger.py
- Integration Tests: test_profiler.py, test_resource_monitor.py, test_analyzer.py, test_generator.py
- E2E Tests: test_entrypoint.py (global profiler lifecycle)
- CLI Tests: test_diff.py (diff utility)
- Fixtures: conftest.py (mock psutil/GPUtil, singleton reset, temp directories)
CI/CD
This project uses GitHub Actions for continuous testing.
Local CI Simulation
# Install test dependencies
pip install -e ".[dev]"
# Run tests with coverage (as CI does)
pytest --cov=pipelinescope --cov-report=xml --cov-report=term-missing
# Run linting (optional, not in CI)
ruff check src/ tests/
black --check src/ tests/
GitHub Actions Workflow
The repository includes .github/workflows/tests.yml which:
- Runs on push to main and develop
- Runs on pull_request to main and develop
pull_requesttomainanddevelop - Tests on Python 3.8 (primary)
- Uploads coverage to Codecov
- Uses pip caching for fast builds
Troubleshooting
No Data Collected
Issue: Dashboard is empty or "No profiling data collected" warning.
Cause: Pipeline finishes before profiler registers meaningful function calls (e.g., pipeline runs too fast or only calls built-in functions).
Solution:
- Ensure your pipeline calls user-defined functions (not just built-ins)
- Profile longer-running pipelines with more function overhead
- Check min_time_threshold_ms and min_time_percentage config and lower them if needed
High Overhead / Slow Profiling
Issue: Profiler adds significant latency to pipeline execution.
Cause: System is profiling too many functions (e.g., stdlib, venv).
Solution:
- Set collapse_stdlib: true (default)
- Extend ignore_modules to exclude unnecessary paths
- Increase min_time_threshold_ms to skip short-lived functions
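Putting those three levers together, a low-overhead configuration might look like (values are illustrative):

```yaml
collapse_stdlib: true
min_time_threshold_ms: 5.0   # skip functions under 5 ms
ignore_modules:
  - venv
  - site-packages
  - tests        # e.g. also skip your own test helpers
```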
Missing GPU Data
Issue: GPU metrics not appearing in dashboard.
Cause: GPUtil not installed, or no NVIDIA GPU detected.
Solution:
- Verify GPU driver: nvidia-smi
- Set enable_gpu_monitoring: false if GPU unavailable
- Check pipelinescope.log for GPUtil errors
Config Not Loading
Issue: Custom .pipelinescope.yaml ignored; defaults used instead.
Cause: Config file path incorrect or not in search path (current directory or 5 parent levels).
Solution:
- Verify file exists: ls -la .pipelinescope.yaml
- Check YAML syntax: python -c "import yaml; yaml.safe_load(open('.pipelinescope.yaml'))"
- Pass explicit path: profile_pipeline.start(config_path=Path("./config/custom.yaml"))
Memory Bloat in Long-Running Pipelines
Issue: Memory usage grows unbounded.
Cause: Profiler accumulates function stats indefinitely.
Solution:
- Profile in segments (restart process between segments)
- Lower expected_size to trigger extrapolation earlier
- Review function_stats dict size in logs
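Segmenting with a fresh process per segment can be scripted from a small driver; a sketch (assumes each segment is runnable as a standalone script, so profiler state is discarded between segments):

```python
import subprocess
import sys

def run_segment(script_path):
    """Run one pipeline segment in a fresh Python process."""
    result = subprocess.run([sys.executable, script_path], check=False)
    return result.returncode

# e.g.: for script in ["segment_1.py", "segment_2.py"]: run_segment(script)
```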
Example Outputs
HTML Dashboard
The generated summary.html includes:
- Function call tree - Shows nested call hierarchy and timing
- Top 20 by time - Slowest functions across the pipeline
- Resource usage - CPU, memory, and GPU per function
- Call statistics - Counts, percentiles, extrapolated metrics
JSON Profile Data
profile_data.json structure:
{
"metadata": {
"sample_size": 100,
"expected_size": 1000000,
"profiling_duration_seconds": 12.34,
"total_functions": 45
},
"function_stats": {
"module:function_name": {
"call_count": 1000,
"total_time_ms": 5000.0,
"cpu_percent": 45.2,
"memory_mb": 256.5,
"gpu_memory_mb": 512.0
}
},
"extrapolated_stats": {
"module:function_name": {
"extrapolated_call_count": 10000000,
"extrapolated_total_time_ms": 50000000.0
}
}
}
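The extrapolated values above follow from linear scaling by expected_size / sample_size. A sketch of that arithmetic (matching the sample numbers shown; the library may apply additional statistics):

```python
def extrapolate(stats, sample_size, expected_size):
    """Linearly scale sampled call counts and times to the expected scale."""
    scale = expected_size / sample_size
    return {
        name: {
            "extrapolated_call_count": int(s["call_count"] * scale),
            "extrapolated_total_time_ms": s["total_time_ms"] * scale,
        }
        for name, s in stats.items()
    }
```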
Contributing
Contributions welcome! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/your-feature)
- Make your changes and add tests
- Run the test suite: pytest --cov=pipelinescope
- Ensure code formatting and linting: isort . && black . && ruff check --fix . && ruff format .
- Submit a pull request
License
MIT License - see LICENSE file for details.
Acknowledgments
- Built with psutil for system resource monitoring
- GPU tracking via GPUtil
- Configuration via PyYAML
- Templating with Jinja2
- Production logging via py-logex
Support
- Issues: GitHub Issues
- PyPI: https://pypi.tw.martin98.com/project/pipelinescope/
- Documentation: See this README and inline code docstrings
Made with ❤️ for production data engineering
Project details
Download files
Download the file for your platform.
Source Distribution
Built Distribution
File details
Details for the file pipelinescope-0.1.0.tar.gz.
File metadata
- Download URL: pipelinescope-0.1.0.tar.gz
- Upload date:
- Size: 50.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 40883da940e28e1c5693dc09f939002c5b63340690ee0076f3d6cd7143d5b1da |
| MD5 | afb4581b4ac712175f01e3b80dfe0ae2 |
| BLAKE2b-256 | add17d8c6fa98eaa51e09578b6f13d09989263ee348e2638f1b1f643e853e272 |
File details
Details for the file pipelinescope-0.1.0-py3-none-any.whl.
File metadata
- Download URL: pipelinescope-0.1.0-py3-none-any.whl
- Upload date:
- Size: 34.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 811603a191c70f6fa896b8f733a716dbaded62e161d218f96be169aa1e7033ea |
| MD5 | f59d351c45ff169dbb0762c82088d865 |
| BLAKE2b-256 | 75e80a8d09f266ae31ff68b16699ef18b481d98fc765e6d6d92823f9e6eeb41e |