slowai
One command to find why your PyTorch model is slow — and fix it.
slowai diagnoses which performance regime your workload is stuck in (compute-bound, memory-bound, or overhead-bound), prescribes the right fix, auto-applies it, and proves the speedup with before/after measurements. No guesswork. No manual profiler interpretation.
Install
pip install slowai
That's it. Works on any NVIDIA GPU — desktop, server, or Jetson edge devices.
Quick start
slowai fix model.py # diagnose + auto-fix + measure speedup
slowai fix model.py --export # ^ plus save winning config for production
slowai optimize model.py # export ONNX → TensorRT .engine (production)
slowai optimize model.py --precision=int8 # INT8 engine for max throughput
slowai optimize model.py --dla # V8: DLA offloading (Orin)
slowai validate model.py # safety validation (numerical equivalence)
slowai validate model.py --strict # V8: hard gate (any non-SAFE = fail)
slowai report model.py # HTML diagnostic report with charts
slowai scan ./workloads/ # batch-scan an entire directory
slowai capabilities # show GPU + acceleration backends
slowai power # V8: Jetson power/thermal status
Example output
$ slowai fix model.py
==============================================================
BASELINE: model.py: COMPUTE_BOUND (confidence: 0.85)
wall time: 7.523s
==============================================================
Tried 4 remedies:
1. [10.00x] bf16_autocast ** BEST **
Run under bfloat16 automatic mixed precision
7.523s >>> 0.752s
regime: compute (confidence: 0.85)
2. [6.32x] tf32_tensor_cores
Enable TF32 tensor cores (~2x matmul throughput on Ampere+)
7.523s >>> 1.191s
3. [6.22x] high_matmul_precision
Set float32 matmul precision to 'high'
7.523s >>> 1.210s
4. [1.31x] cudnn_benchmark
Enable cuDNN auto-tuner for conv kernels
7.523s >>> 5.719s
--------------------------------------------------------------
WINNER: bf16_autocast
7.523s >>> 0.752s (10.00x, +900% faster)
How: Run under bfloat16 automatic mixed precision
--------------------------------------------------------------
Why this exists
Every deep learning workload is stuck in one of three performance regimes (Horace He, 2022):
| Regime | What's happening | Wrong fix = no speedup |
|---|---|---|
| Compute-bound | GPU is saturated doing math (matmuls, convolutions) | Fusing ops won't help — the math itself is the bottleneck |
| Memory-bound | GPU is waiting for data (pointwise ops, activations) | Smaller model won't help — you need less data movement |
| Overhead-bound | GPU is idle waiting for Python/dispatcher (tiny ops) | Lower precision won't help — you need fewer, bigger ops |
The fix for each regime is different. Applying a compute-bound fix to a memory-bound workload does nothing. Engineers waste hours in profiler UIs figuring this out manually.
slowai does it in one command.
How it works
Under the hood:
- Profile — Runs your workload under torch.profiler with CUDA timing, a warmup pass, and op-level statistics
- Classify — A heuristic classifier analyzes op shares (matmul, normalization, pointwise, tiny-op fraction) to determine the dominant regime
- Prescribe — Returns a ranked list of fixes for that regime, cheapest first
- Remediate — Auto-applies each applicable fix, re-profiles, and ranks by measured speedup
No code changes required. Remedies are environment-level transforms — they modify PyTorch's runtime settings, not your model code.
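To make the Classify step concrete, a share-based regime classifier can be sketched as a pure function over op-level time fractions. The field names and thresholds below are illustrative assumptions, not slowai's actual implementation:

```python
# Illustrative sketch of a share-based regime classifier.
# Field names and thresholds are assumptions, not slowai's actual code.
from dataclasses import dataclass

@dataclass
class OpShares:
    matmul: float            # fraction of GPU time in matmul/conv kernels
    pointwise: float         # fraction in elementwise/activation kernels
    tiny_op_fraction: float  # fraction of ops too small to saturate the GPU

def classify(shares: OpShares) -> tuple[str, float]:
    """Return (regime, confidence) from op-level time shares."""
    if shares.tiny_op_fraction > 0.5:
        return "OVERHEAD_BOUND", shares.tiny_op_fraction
    if shares.matmul > shares.pointwise:
        return "COMPUTE_BOUND", shares.matmul
    return "MEMORY_BOUND", shares.pointwise
```

Because a function like this depends only on profiler output, it can be unit-tested without a GPU, which is the design choice the Architecture section below calls out.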
V8 Features (NEW)
Hard Safety Gates — CI/CD Deployment Blocking
slowai validate model.py --strict # non-zero exit on ANY non-SAFE
slowai validate model.py --json # machine-readable for pipelines
echo $? # 0=SAFE, 1=CAUTION, 2=UNSAFE, 3=ERROR
V8 turns safety validation into a hard deployment gate. In --strict mode, anything non-SAFE blocks your CI/CD pipeline with a non-zero exit code. No more "warnings" that engineers ignore — UNSAFE means DO NOT DEPLOY.
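A minimal way to consume those exit codes from a pipeline script, assuming the 0/1/2/3 mapping documented above:

```python
# Minimal CI gate around `slowai validate --strict`.
# Exit-code meanings follow the mapping documented above.
import subprocess
import sys

result = subprocess.run(["slowai", "validate", "model.py", "--strict"])
labels = {0: "SAFE", 1: "CAUTION", 2: "UNSAFE", 3: "ERROR"}
print(f"validation: {labels.get(result.returncode, 'UNKNOWN')}")
sys.exit(result.returncode)  # any non-zero code blocks the deployment stage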
DLA Offloading (Orin Hardware Accelerator)
slowai optimize model.py --dla # offload to DLA core 0
slowai optimize model.py --dla --dla-core=1 # use DLA core 1
The Jetson Orin family includes dedicated Deep Learning Accelerator cores on select hardware (Orin NX, AGX Orin, Thor). DLA offloading routes compatible layers to the DLA, freeing CUDA cores for other tasks. Note: Orin Nano does not have DLA hardware — slowai will detect this and report it via slowai capabilities.
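For context, this is the kind of build that `slowai optimize --dla` automates. A hand-rolled equivalent using standard trtexec flags (independent of slowai) looks roughly like:

```python
# Hand-rolled DLA engine build via trtexec (standard TensorRT flags).
# This sketch shows what `slowai optimize --dla` automates for you.
import subprocess

subprocess.run([
    "trtexec",
    "--onnx=model.onnx",
    "--saveEngine=model_dla.engine",
    "--fp16",              # DLA requires FP16 or INT8 precision
    "--useDLACore=0",      # route supported layers to DLA core 0
    "--allowGPUFallback",  # unsupported layers fall back to CUDA cores
], check=True)
```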
Power & Thermal Awareness
slowai power # show power mode + thermal status
slowai power --json # machine-readable
Edge devices care about performance-per-watt as much as latency. The new power command detects nvpmodel mode, jetson_clocks state, GPU/CPU frequencies, and thermal throttling. If your Jetson is throttling, your benchmarks are unreliable — slowai tells you.
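The raw signals behind this are readable by hand on any Jetson. A rough sketch using the stock nvpmodel tool and standard sysfs thermal paths (not slowai's internals):

```python
# Rough by-hand equivalent of a Jetson power/thermal check,
# using the stock nvpmodel tool and standard sysfs paths.
import glob
import subprocess

# Current nvpmodel power mode
mode = subprocess.run(["nvpmodel", "-q"], capture_output=True, text=True)
print(mode.stdout.strip())

# Thermal zones report temperature in millidegrees Celsius
for zone in sorted(glob.glob("/sys/devices/virtual/thermal/thermal_zone*")):
    with open(f"{zone}/type") as f:
        name = f.read().strip()
    with open(f"{zone}/temp") as f:
        temp_c = int(f.read()) / 1000.0
    print(f"{name}: {temp_c:.1f} C")
```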
V7 Features
TensorRT Engine Export (Production Deployment)
slowai optimize model.py # ONNX → TensorRT FP16 engine
slowai optimize model.py --precision=int8 # INT8 for maximum throughput
slowai optimize model.py --target=onnx # ONNX export only
The #1 feature NVIDIA engineers demand. Exports your model to a production-grade TensorRT .engine file via the reliable ONNX → trtexec path (bypasses broken torch_tensorrt wheels). Auto-detects JetPack, CUDA, and TensorRT versions. Outputs engine + metadata JSON with size, build time, and environment info.
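The underlying path is plain torch.onnx.export followed by trtexec. A minimal manual equivalent of what slowai optimize automates, sketched with standard APIs:

```python
# Manual version of the ONNX -> trtexec path that `slowai optimize`
# automates (standard torch.onnx and trtexec usage).
import subprocess
import torch
from torchvision.models import resnet50

model = resnet50().cuda().eval()
dummy = torch.randn(1, 3, 224, 224, device="cuda")

# 1. Export the model to ONNX
torch.onnx.export(model, dummy, "model.onnx", opset_version=17)

# 2. Build an FP16 TensorRT engine with trtexec
subprocess.run([
    "trtexec", "--onnx=model.onnx",
    "--saveEngine=model.engine", "--fp16",
], check=True)
```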
Safety Validation Suite
slowai validate model.py # test all winning remedies
slowai validate model.py --atol=1e-6 # strict tolerance
slowai validate model.py --strict # V8: hard gate
Every remedy that changes precision risks silent accuracy degradation. The safety suite catches it before you ship: runs baseline vs. remedied, compares output tensors (max absolute diff, max relative diff, cosine similarity), and grades each remedy as SAFE / CAUTION / UNSAFE. Includes determinism checking for flight-ready validation.
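The comparison metrics themselves are straightforward tensor math. A sketch of the three metrics described above (illustrative, not slowai's exact implementation or grading thresholds):

```python
# Sketch of the equivalence metrics described above: max absolute diff,
# max relative diff, and cosine similarity between baseline and remedied
# outputs. Illustrative, not slowai's exact implementation.
import torch

def compare(baseline: torch.Tensor, remedied: torch.Tensor) -> dict:
    diff = (baseline - remedied).abs()
    max_abs = diff.max().item()
    max_rel = (diff / baseline.abs().clamp_min(1e-12)).max().item()
    cosine = torch.nn.functional.cosine_similarity(
        baseline.flatten(), remedied.flatten(), dim=0
    ).item()
    return {"max_abs_diff": max_abs, "max_rel_diff": max_rel, "cosine": cosine}
```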
HTML Diagnostic Reports
slowai report model.py -o report.html
Generates a self-contained HTML report with hardware info, regime classification, interactive speedup charts, and a remedy leaderboard. Share with your team or include in CI artifacts.
Batch Scanning
slowai scan ./workloads/
Profiles every .py file in a directory and prints a summary table with regime, best remedy, and speedup for each workload. Great for auditing an entire model zoo.
Acceleration Backends
As of V6, slowai includes 10 remedies across all acceleration tiers:
| Remedy | Type | Best for |
|---|---|---|
| torch.compile | JIT compilation | Overhead-bound (fuses ops, eliminates dispatch) |
| TensorRT | Inference optimizer | Compute-bound (layer fusion, kernel auto-tuning) |
| TensorRT FP16 | Half-precision TRT | Memory-bound (maximum throughput) |
| INT8 quantization | Dynamic quantization | Linear/LSTM-heavy models (2-4x on Ampere) |
| bf16 autocast | Mixed precision | Matmul-heavy architectures |
| fp16 autocast | Mixed precision | Memory-bound models |
| TF32 tensor cores | Precision relaxation | Transformer workloads |
| matmul precision | Internal downcasting | General compute |
| cuDNN benchmark | Kernel auto-tuner | CNN-heavy models |
| channels_last | Memory layout | Convolution pipelines |
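Several of these remedies map onto one-line PyTorch runtime settings. For reference, the stock toggles behind a few of them, applied here by hand (this is what the remediate engine automates):

```python
# Stock PyTorch toggles behind several remedies in the table above,
# applied by hand for reference.
import torch
import torch.nn as nn

torch.backends.cudnn.benchmark = True          # cuDNN benchmark remedy
torch.backends.cuda.matmul.allow_tf32 = True   # TF32 tensor cores remedy
torch.set_float32_matmul_precision("high")     # matmul precision remedy

model = nn.Conv2d(3, 16, 3).cuda()
data = torch.randn(8, 3, 64, 64, device="cuda")

# channels_last remedy: NHWC memory layout for conv pipelines
model = model.to(memory_format=torch.channels_last)
data = data.to(memory_format=torch.channels_last)

# bf16 autocast remedy: mixed precision for matmul-heavy workloads
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(data)
```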
Capability Detection
slowai capabilities
Shows your GPU, CUDA capability, Jetson status, power mode, and which acceleration backends (torch.compile, TensorRT, INT8 tensor cores, DLA) are available.
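A few of these checks can be reproduced by hand with standard torch APIs; the TensorRT and DLA probes are slowai-specific and not shown here:

```python
# By-hand version of some checks `slowai capabilities` performs,
# using standard torch APIs only.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    major, minor = torch.cuda.get_device_capability(0)
    print(f"compute capability: {major}.{minor}")
    print("bf16 supported:", torch.cuda.is_bf16_supported())
```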
Export to production
The --export flag saves the winning remedy as a drop-in Python module:
slowai fix model.py --export
# Creates slowai_config.py in the current directory
Then in your production code:
import slowai_config
slowai_config.apply() # Set globally before your model runs
# Or as a context manager:
with slowai_config.optimized():
    model(data)
The exported config includes the exact PyTorch settings, speedup metadata, and both a global apply() function and an optimized() context manager. Zero dependencies beyond PyTorch.
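To make the shape concrete, a generated slowai_config.py plausibly looks something like the sketch below. The exact contents depend on which remedy won; this example assumes bf16_autocast did, and is illustrative rather than the file slowai actually emits:

```python
# Illustrative shape of a generated slowai_config.py, assuming the
# bf16_autocast remedy won. The real file is produced by `slowai fix --export`.
import contextlib
import torch

def apply():
    """Set the winning runtime flags globally."""
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.set_float32_matmul_precision("high")

@contextlib.contextmanager
def optimized():
    """Run the wrapped block under the winning remedy."""
    apply()
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        yield
```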
CI/CD mode
Catch performance regressions on every commit:
# Fail if no remedy achieves at least 1.5x speedup
slowai fix model.py --ci --threshold 1.5
# Returns exit code 0 (pass) or 1 (fail)
# Outputs JSON for pipeline consumption
echo $?
Combine with --export to auto-generate optimized configs in your pipeline:
slowai fix model.py --ci --threshold 2.0 --export slowai_config.py
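A pipeline script can consume the result like this. The JSON field names here are assumptions for illustration; check your version's actual output schema:

```python
# Consuming `slowai fix --ci` from a pipeline script. The JSON field
# names below are assumptions for illustration, not a documented schema.
import json
import subprocess
import sys

proc = subprocess.run(
    ["slowai", "fix", "model.py", "--ci", "--threshold", "1.5"],
    capture_output=True, text=True,
)
report = json.loads(proc.stdout)                     # assumed: JSON on stdout
print("best speedup:", report.get("best_speedup"))   # assumed field name
sys.exit(proc.returncode)  # 0 = threshold met, 1 = regression
```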
Benchmarks
Tested on NVIDIA Jetson Orin Nano Super (Ampere GPU, 1024 CUDA cores, 8GB unified RAM, JetPack 6.2, CUDA 12.6, PyTorch 2.8.0). 27 workloads across 18 industry verticals — the most comprehensive edge AI performance benchmark suite available.
Synthetic workloads (regime validation)
| Workload | Regime | Baseline | Best remedy | After | Speedup |
|---|---|---|---|---|---|
| Dense GEMM (4096x4096) | Compute | 7.523s | bf16_autocast | 0.752s | 10.00x |
| Pointwise chain (8192x8192) | Memory | 2.400s | tf32_tensor_cores | 0.575s | 4.17x |
| Tiny ops (5000 micro-ops) | Overhead | 3.281s | tf32_tensor_cores | 1.141s | 2.88x |
Production models — standard architectures
| Workload | Industry | Baseline | Best remedy | After | Speedup |
|---|---|---|---|---|---|
| MobileNetV2 | Mobile / Edge | 1.778s | cudnn_benchmark | 1.681s | 1.06x |
| ResNet-50 | Classification | 2.105s | bf16_autocast | 1.969s | 1.07x |
| EfficientNet-B0 | Drones / Aerospace | 1.445s | cudnn_benchmark | 1.404s | 1.03x |
| R3D-18 (video) | Surveillance / Defense | 6.032s | bf16_autocast | 3.443s | 1.75x |
| Transformer (12L/768d/12H) | NLP / LLMs | 6.207s | bf16_autocast | 2.060s | 3.01x |
Production models — industry-specific pipelines
| Workload | Industry | Baseline | Best remedy | After | Speedup |
|---|---|---|---|---|---|
| Underwater AUV (sonar + camera + nav) | Oil & Gas / Navy | 2.232s | cudnn_benchmark | 0.122s | 18.34x |
| LiDAR 3D point cloud (PointNet-style) | Autonomous vehicles | 2.288s | bf16_autocast | 0.138s | 16.57x |
| Agriculture drone (multispectral + NDVI) | Precision agriculture | — | tf32_tensor_cores | — | 13.78x |
| Pose estimation (FPN + PAF, multi-person) | Retail / AR-VR / Sports | 2.394s | bf16_autocast | 0.287s | 8.34x |
| Satellite imaging (change detection + priority) | Space / Defense | — | bf16_autocast | — | 7.72x |
| Robotics pick-and-place (RGB-D + 7-DOF) | Industrial robotics | — | cudnn_benchmark | — | 7.39x |
| GNN smart grid (message-passing + pooling) | Energy / Telecom | 2.569s | tf32_tensor_cores | 0.453s | 5.67x |
| Medical imaging (DenseNet + multi-task) | Healthcare | — | bf16_autocast | — | 5.54x |
| 1D ConvNet (signal processing) | Navy radar / sonar | 2.990s | bf16_autocast | 0.757s | 3.95x |
| Time Series Transformer | Predictive maintenance | 3.814s | bf16_autocast | 1.072s | 3.56x |
| Edge diffusion (UNet denoiser, 128x128) | Generative AI on device | 2.904s | bf16_autocast | 0.918s | 3.17x |
| Fly-by-wire control (sensor + transformer) | Aviation / eVTOL | — | tf32_tensor_cores | — | 3.08x |
| Cybersecurity anomaly (flow transformer) | Network defense / SOC | 3.749s | tf32_tensor_cores | 1.308s | 2.87x |
| Speech-to-text (Whisper-style encoder-decoder) | Consumer / Accessibility | — | tf32_tensor_cores | — | 2.50x |
| RL policy network (LSTM + multi-modal, 200Hz) | Industrial robotics / Logistics | 4.370s | tf32_tensor_cores | 2.071s | 2.11x |
| Mamba SSM (selective state-space, 4-layer) | Telecom / IoT | 30.582s | tf32_tensor_cores | 27.448s | 1.11x |
| DeepLabV3 (MobileNetV3) | Autonomous driving | 4.019s | bf16_autocast | 3.623s | 1.11x |
| Detection + Segmentation pipeline | Autonomous driving | 5.744s | bf16_autocast | 5.289s | 1.09x |
| SSD-Lite (MobileNetV3) | Autonomous driving | 1.918s | cudnn_benchmark | 1.816s | 1.06x |
What the results tell you
Massive gains (5-18x) on custom multi-stream pipelines — AUV sensor fusion, LiDAR 3D processing, agriculture multispectral, pose estimation, GNN smart grid. These architectures use unique compute patterns (point clouds, multi-modal fusion, feature pyramids, scatter/gather ops) that PyTorch doesn't optimize by default.
Strong gains (2-5x) on transformer-based models and recurrent policies — cybersecurity flow analysis, speech-to-text, time series, BERT, RL policy networks, edge diffusion. Mixed precision and TF32 dramatically reduce matmul cost.
Modest gains (1-1.1x) on already-optimized architectures and sequential workloads — MobileNet, EfficientNet, SSD-Lite, Mamba SSM. Mobile architectures use depthwise separable convolutions that are already fast; sequential scan models are overhead-bound and need torch.compile (V5+).
The real value is that slowai finds the right fix automatically — cuDNN benchmark wins for convolution-heavy models, bf16 autocast wins for matmul-heavy architectures, TF32 wins for transformer workloads. Different models, different winners, zero guesswork.
Industries covered
Autonomous vehicles, aviation/eVTOL, oil & gas, Navy/defense, marine science, space, healthcare, industrial robotics, precision agriculture, cybersecurity, consumer/AR-VR, sports analytics, retail, generative AI, predictive maintenance, energy/smart grid, telecom/IoT, warehouse logistics.
Writing a workload
slowai profiles any Python script that exposes a main() function:
# my_model.py
import torch
from torchvision.models import resnet50
model = resnet50().cuda().eval()
data = torch.randn(8, 3, 224, 224, device="cuda")
def main():
    with torch.no_grad():
        for _ in range(30):
            model(data)
slowai fix my_model.py
Architecture
slowai/
schema.py # Regime enum, Diagnosis dataclass — the product thesis in types
profiler.py # torch.profiler wrapper -> ProfileResult (op stats + wall time)
diagnose.py # Heuristic classifier -> Diagnosis (regime + confidence + prescriptions)
remediate.py # Auto-fix engine -> FixReport (before/after speedup per remedy)
optimize.py # ONNX → TensorRT engine export pipeline (trtexec + DLA offloading)
safety.py # Numerical validation suite (equivalence testing, hard safety gates)
power.py # V8: Jetson power management, thermal monitoring, perf-per-watt
report.py # HTML report generator with Chart.js visualizations
export.py # Production config exporter (slowai_config.py)
cli.py # CLI: diagnose, fix, optimize, validate, report, scan, capabilities, power
The classifier is a pure function of profiler output — no torch dependency, fully unit-testable. The remediate engine applies fixes as environment transforms (global flags, autocast context managers, JIT compilation) so it never modifies user code.
What's different
Other tools in this space are profiler UIs that show you data and leave interpretation to you. slowai is the only tool that goes from raw workload to regime classification to ranked prescriptions to auto-applied fixes to measured speedup in a single CLI command.
| Tool | Profiles | Classifies | Prescribes | Auto-fixes | Measures | Reports |
|---|---|---|---|---|---|---|
| PyTorch Profiler | Yes | No | No | No | No | No |
| NVIDIA Nsight | Yes | No | No | No | No | No |
| torch.utils.bottleneck | Yes | No | No | No | No | No |
| DeepSpeed Flops Profiler | Yes | No | No | No | No | No |
| slowai | Yes | Yes | Yes | Yes | Yes | Yes |
Roadmap
- V1 (shipped) — Profile + classify regime for synthetic workloads
- V2 (shipped) — Noise filtering, normalization-aware classification, real model support
- V3 (shipped) — Auto-remediate: apply fixes and measure before/after speedup
- V3.1 (shipped) — --export flag: save winning config as production-ready Python module
- V3.2 (shipped) — --ci mode: CI/CD integration with threshold-based pass/fail
- V4 (shipped) — channels_last, Jetson power mode detection, numerical accuracy validation, 27 workloads / 18 industries
- V5 (shipped) — TensorRT via torch.compile, torch.compile inductor, INT8 dynamic quantization, DLA detection
- V6 (shipped) — HTML diagnostic reports with Chart.js, batch directory scanning, README glow-up, PyPI 0.6.0
- V7 (shipped) — Production TensorRT engine export (ONNX → trtexec), safety validation suite (numerical equivalence + flight-readiness), slowai optimize + slowai validate commands, PyPI 0.7.0
- V8 (shipped) — Hard safety gates (non-zero exit codes block deployment), DLA offloading (Orin hardware accelerator), slowai power thermal/power monitoring, --strict validation mode, --dla engine export, PyPI 0.8.0
- V9 (next) — Hardware-aware detection (gate --dla on SKU), ONNX → trtexec hardening, INT8 calibration pipeline with dataset, streaming inference benchmarker
Built by
Rico Allen — @ricojallen37-sketch
Built and tested on NVIDIA Jetson Orin Nano Super Developer Kit.