slowai

PyPI version · Python 3.10+ · License: MIT · Edge AI

One command to find why your PyTorch model is slow — and fix it.

slowai diagnoses which performance regime your workload is stuck in (compute-bound, memory-bound, or overhead-bound), prescribes the right fix, auto-applies it, and proves the speedup with before/after measurements. No guesswork. No manual profiler interpretation.

Install

pip install slowai

That's it. Works on any NVIDIA GPU — desktop, server, or Jetson edge devices.

Quick start

slowai fix model.py                # diagnose + auto-fix + measure speedup
slowai fix model.py --export       # ^ plus save winning config for production
slowai optimize model.py           # export ONNX → TensorRT .engine (production)
slowai optimize model.py --precision=int8  # INT8 engine for max throughput
slowai optimize model.py --dla     # V8: DLA offloading (Orin)
slowai validate model.py           # safety validation (numerical equivalence)
slowai validate model.py --strict  # V8: hard gate (any non-SAFE = fail)
slowai report model.py             # HTML diagnostic report with charts
slowai scan ./workloads/           # batch-scan an entire directory
slowai capabilities                # show GPU + acceleration backends
slowai power                       # V8: Jetson power/thermal status

Example output

$ slowai fix model.py

==============================================================
  BASELINE: model.py: COMPUTE_BOUND (confidence: 0.85)
  wall time: 7.523s
==============================================================

  Tried 4 remedies:

  1. [10.00x] bf16_autocast  ** BEST **
     Run under bfloat16 automatic mixed precision
     7.523s >>> 0.752s
     regime: compute (confidence: 0.85)

  2. [6.32x] tf32_tensor_cores
     Enable TF32 tensor cores (~2x matmul throughput on Ampere+)
     7.523s >>> 1.191s

  3. [6.22x] high_matmul_precision
     Set float32 matmul precision to 'high'
     7.523s >>> 1.210s

  4. [1.31x] cudnn_benchmark
     Enable cuDNN auto-tuner for conv kernels
     7.523s >>> 5.719s

--------------------------------------------------------------
  WINNER: bf16_autocast
  7.523s >>> 0.752s  (10.00x, +900% faster)
  How: Run under bfloat16 automatic mixed precision
--------------------------------------------------------------

Why this exists

Every deep learning workload is stuck in one of three performance regimes (Horace He, 2022):

| Regime | What's happening | Wrong fix = no speedup |
| --- | --- | --- |
| Compute-bound | GPU is saturated doing math (matmuls, convolutions) | Fusing ops won't help — the math itself is the bottleneck |
| Memory-bound | GPU is waiting for data (pointwise ops, activations) | Smaller model won't help — you need less data movement |
| Overhead-bound | GPU is idle waiting for Python/dispatcher (tiny ops) | Lower precision won't help — you need fewer, bigger ops |

The fix for each regime is different. Applying a compute-bound fix to a memory-bound workload does nothing. Engineers waste hours in profiler UIs figuring this out manually.

slowai does it in one command.
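
For intuition, minimal PyTorch snippets that tend to land in each regime look like this (illustrative toy kernels, not slowai's benchmark workloads):

import torch

x = torch.randn(4096, 4096, device="cuda")

# Compute-bound: one large matmul saturates the GPU's math units.
y = x @ x

# Memory-bound: a chain of pointwise ops moves far more bytes than it computes.
z = x.relu().sin().exp()

# Overhead-bound: thousands of tiny kernels leave the GPU idle
# between Python-side launches.
for _ in range(5000):
    _ = torch.randn(8, 8, device="cuda").sum()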

How it works

Under the hood:

  1. Profile — Runs your workload under torch.profiler with CUDA timing, warmup pass, and op-level statistics
  2. Classify — A heuristic classifier analyzes op shares (matmul, normalization, pointwise, tiny-op fraction) to determine the dominant regime
  3. Prescribe — Returns a ranked list of fixes for that regime, cheapest first
  4. Remediate — Auto-applies each applicable fix, re-profiles, and ranks by measured speedup

No code changes required. Remedies are environment-level transforms — they modify PyTorch's runtime settings, not your model code.
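
A minimal sketch of the classification idea, using hypothetical op-share fields and thresholds; slowai's real classifier uses more signals (normalization shares, noise filtering) and different cutoffs:

from dataclasses import dataclass
from enum import Enum

class Regime(Enum):
    COMPUTE_BOUND = "compute"
    MEMORY_BOUND = "memory"
    OVERHEAD_BOUND = "overhead"

@dataclass
class OpShares:
    matmul: float     # fraction of GPU time in matmuls/convolutions
    pointwise: float  # fraction in elementwise/activation ops
    tiny_op: float    # fraction of ops too small to fill the GPU

def classify(shares: OpShares) -> Regime:
    # Tiny ops dominating means the GPU idles between kernel launches.
    if shares.tiny_op > 0.5:
        return Regime.OVERHEAD_BOUND
    # Math-heavy ops dominating means the ALUs are the bottleneck.
    if shares.matmul > shares.pointwise:
        return Regime.COMPUTE_BOUND
    # Otherwise data movement dominates.
    return Regime.MEMORY_BOUND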

V8 Features (NEW)

Hard Safety Gates — CI/CD Deployment Blocking

slowai validate model.py --strict             # non-zero exit on ANY non-SAFE
slowai validate model.py --json               # machine-readable for pipelines
echo $?                                       # 0=SAFE, 1=CAUTION, 2=UNSAFE, 3=ERROR

V8 turns safety validation into a hard deployment gate. In --strict mode, anything non-SAFE blocks your CI/CD pipeline with a non-zero exit code. No more "warnings" that engineers ignore — UNSAFE means DO NOT DEPLOY.
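
A sketch of a Python deployment gate built on those documented exit codes; the deploy step is a placeholder for whatever your pipeline does next:

import subprocess
import sys

# Run the strict gate; exit codes: 0=SAFE, 1=CAUTION, 2=UNSAFE, 3=ERROR.
result = subprocess.run(["slowai", "validate", "model.py", "--strict"])
if result.returncode != 0:
    print(f"slowai blocked deployment (exit code {result.returncode})")
    sys.exit(result.returncode)
print("validation SAFE; proceeding to deploy")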

DLA Offloading (Orin Hardware Accelerator)

slowai optimize model.py --dla                # offload to DLA core 0
slowai optimize model.py --dla --dla-core=1   # use DLA core 1

Select Jetson hardware (Orin NX, AGX Orin, Thor) includes dedicated Deep Learning Accelerator (DLA) cores. DLA offloading routes compatible layers to the DLA, freeing CUDA cores for other tasks. Note: the Orin Nano has no DLA hardware — slowai detects this and reports it via slowai capabilities.

Power & Thermal Awareness

slowai power                                  # show power mode + thermal status
slowai power --json                           # machine-readable

Edge devices care about performance-per-watt as much as latency. The new power command detects nvpmodel mode, jetson_clocks state, GPU/CPU frequencies, and thermal throttling. If your Jetson is throttling, your benchmarks are unreliable — slowai tells you.

V7 Features

TensorRT Engine Export (Production Deployment)

slowai optimize model.py                      # ONNX → TensorRT FP16 engine
slowai optimize model.py --precision=int8     # INT8 for maximum throughput
slowai optimize model.py --target=onnx        # ONNX export only

The #1 feature NVIDIA engineers demand. Exports your model to a production-grade TensorRT .engine file via the reliable ONNX → trtexec path (bypasses broken torch_tensorrt wheels). Auto-detects JetPack, CUDA, and TensorRT versions. Outputs engine + metadata JSON with size, build time, and environment info.
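
slowai drives the whole pipeline for you; as a rough sketch, the ONNX leg it automates resembles the standard export call below (the model, input shape, and opset here are assumptions):

import torch
from torchvision.models import resnet50

model = resnet50().cuda().eval()
dummy = torch.randn(1, 3, 224, 224, device="cuda")

# Export to ONNX, the intermediate format trtexec consumes;
# slowai then invokes trtexec on model.onnx to build the .engine file.
torch.onnx.export(model, dummy, "model.onnx", opset_version=17)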

Safety Validation Suite

slowai validate model.py                      # test all winning remedies
slowai validate model.py --atol=1e-6          # strict tolerance
slowai validate model.py --strict             # V8: hard gate

Every remedy that changes precision risks silent accuracy degradation. The safety suite catches it before you ship: runs baseline vs. remedied, compares output tensors (max absolute diff, max relative diff, cosine similarity), and grades each remedy as SAFE / CAUTION / UNSAFE. Includes determinism checking for flight-ready validation.
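
A sketch of the three equivalence metrics; the grading thresholds below are illustrative assumptions, not slowai's actual cutoffs:

import torch

def grade(baseline: torch.Tensor, remedied: torch.Tensor) -> str:
    diff = (baseline - remedied).abs()
    max_abs = diff.max().item()
    max_rel = (diff / baseline.abs().clamp_min(1e-12)).max().item()
    cos = torch.nn.functional.cosine_similarity(
        baseline.flatten().float(), remedied.flatten().float(), dim=0
    ).item()
    if max_abs < 1e-3 and cos > 0.9999:
        return "SAFE"
    if max_rel < 0.1 and cos > 0.999:
        return "CAUTION"
    return "UNSAFE"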

HTML Diagnostic Reports

slowai report model.py -o report.html

Generates a self-contained HTML report with hardware info, regime classification, interactive speedup charts, and a remedy leaderboard. Share with your team or include in CI artifacts.

Batch Scanning

slowai scan ./workloads/

Profiles every .py file in a directory and prints a summary table with regime, best remedy, and speedup for each workload. Great for auditing an entire model zoo.

Acceleration Backends

As of V6, slowai includes 10 remedies across all acceleration tiers:

| Remedy | Type | Best for |
| --- | --- | --- |
| torch.compile | JIT compilation | Overhead-bound (fuses ops, eliminates dispatch) |
| TensorRT | Inference optimizer | Compute-bound (layer fusion, kernel auto-tuning) |
| TensorRT FP16 | Half-precision TRT | Memory-bound (maximum throughput) |
| INT8 quantization | Dynamic quantization | Linear/LSTM-heavy models (2-4x on Ampere) |
| bf16 autocast | Mixed precision | Matmul-heavy architectures |
| fp16 autocast | Mixed precision | Memory-bound models |
| TF32 tensor cores | Precision relaxation | Transformer workloads |
| matmul precision | Internal downcasting | General compute |
| cuDNN benchmark | Kernel auto-tuner | CNN-heavy models |
| channels_last | Memory layout | Convolution pipelines |

Capability Detection

slowai capabilities

Shows your GPU, CUDA capability, Jetson status, power mode, and which acceleration backends (torch.compile, TensorRT, INT8 tensor cores, DLA) are available.
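
A rough sketch of the kind of probing the command performs (the real command reports much more, including Jetson power mode and DLA status):

import torch

if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    major, minor = torch.cuda.get_device_capability(0)
    print(f"{name}: compute capability {major}.{minor}")
    # bf16 autocast and TF32 tensor cores require Ampere (8.0) or newer.
    print("bf16 / TF32 tensor cores:", "available" if major >= 8 else "unavailable")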

Export to production

The --export flag saves the winning remedy as a drop-in Python module:

slowai fix model.py --export
# Creates slowai_config.py in the current directory

Then in your production code:

import slowai_config
slowai_config.apply()  # Set globally before your model runs

# Or as a context manager:
with slowai_config.optimized():
    model(data)

The exported config includes the exact PyTorch settings, speedup metadata, and both a global apply() function and an optimized() context manager. Zero dependencies beyond PyTorch.
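
For a sense of the output, a generated config for a TF32 win might look roughly like this (illustrative shape only; the real file is produced by slowai and embeds the measured speedup metadata):

# slowai_config.py (illustrative sketch)
import contextlib
import torch

def apply():
    # Environment-level settings; call once before the model runs.
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True

@contextlib.contextmanager
def optimized():
    prev = (torch.backends.cuda.matmul.allow_tf32, torch.backends.cudnn.allow_tf32)
    apply()
    try:
        yield
    finally:
        torch.backends.cuda.matmul.allow_tf32, torch.backends.cudnn.allow_tf32 = prev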

CI/CD mode

Catch performance regressions on every commit:

# Fail if no remedy achieves at least 1.5x speedup
slowai fix model.py --ci --threshold 1.5

# Returns exit code 0 (pass) or 1 (fail)
# Outputs JSON for pipeline consumption
echo $?

Combine with --export to auto-generate optimized configs in your pipeline:

slowai fix model.py --ci --threshold 2.0 --export slowai_config.py

Benchmarks

Tested on NVIDIA Jetson Orin Nano Super (Ampere GPU, 1024 CUDA cores, 8GB unified RAM, JetPack 6.2, CUDA 12.6, PyTorch 2.8.0). 27 workloads across 18 industry verticals — the most comprehensive edge AI performance benchmark suite available.

Synthetic workloads (regime validation)

| Workload | Regime | Baseline | Best remedy | After | Speedup |
| --- | --- | --- | --- | --- | --- |
| Dense GEMM (4096x4096) | Compute | 7.523s | bf16_autocast | 0.752s | 10.00x |
| Pointwise chain (8192x8192) | Memory | 2.400s | tf32_tensor_cores | 0.575s | 4.17x |
| Tiny ops (5000 micro-ops) | Overhead | 3.281s | tf32_tensor_cores | 1.141s | 2.88x |

Production models — standard architectures

| Workload | Industry | Baseline | Best remedy | After | Speedup |
| --- | --- | --- | --- | --- | --- |
| MobileNetV2 | Mobile / Edge | 1.778s | cudnn_benchmark | 1.681s | 1.06x |
| ResNet-50 | Classification | 2.105s | bf16_autocast | 1.969s | 1.07x |
| EfficientNet-B0 | Drones / Aerospace | 1.445s | cudnn_benchmark | 1.404s | 1.03x |
| R3D-18 (video) | Surveillance / Defense | 6.032s | bf16_autocast | 3.443s | 1.75x |
| Transformer (12L/768d/12H) | NLP / LLMs | 6.207s | bf16_autocast | 2.060s | 3.01x |

Production models — industry-specific pipelines

| Workload | Industry | Baseline | Best remedy | After | Speedup |
| --- | --- | --- | --- | --- | --- |
| Underwater AUV (sonar + camera + nav) | Oil & Gas / Navy | 2.232s | cudnn_benchmark | 0.122s | 18.34x |
| LiDAR 3D point cloud (PointNet-style) | Autonomous vehicles | 2.288s | bf16_autocast | 0.138s | 16.57x |
| Agriculture drone (multispectral + NDVI) | Precision agriculture | | tf32_tensor_cores | | 13.78x |
| Pose estimation (FPN + PAF, multi-person) | Retail / AR-VR / Sports | 2.394s | bf16_autocast | 0.287s | 8.34x |
| Satellite imaging (change detection + priority) | Space / Defense | | bf16_autocast | | 7.72x |
| Robotics pick-and-place (RGB-D + 7-DOF) | Industrial robotics | | cudnn_benchmark | | 7.39x |
| GNN smart grid (message-passing + pooling) | Energy / Telecom | 2.569s | tf32_tensor_cores | 0.453s | 5.67x |
| Medical imaging (DenseNet + multi-task) | Healthcare | | bf16_autocast | | 5.54x |
| 1D ConvNet (signal processing) | Navy radar / sonar | 2.990s | bf16_autocast | 0.757s | 3.95x |
| Time Series Transformer | Predictive maintenance | 3.814s | bf16_autocast | 1.072s | 3.56x |
| Edge diffusion (UNet denoiser, 128x128) | Generative AI on device | 2.904s | bf16_autocast | 0.918s | 3.17x |
| Fly-by-wire control (sensor + transformer) | Aviation / eVTOL | | tf32_tensor_cores | | 3.08x |
| Cybersecurity anomaly (flow transformer) | Network defense / SOC | 3.749s | tf32_tensor_cores | 1.308s | 2.87x |
| Speech-to-text (Whisper-style encoder-decoder) | Consumer / Accessibility | | tf32_tensor_cores | | 2.50x |
| RL policy network (LSTM + multi-modal, 200Hz) | Industrial robotics / Logistics | 4.370s | tf32_tensor_cores | 2.071s | 2.11x |
| Mamba SSM (selective state-space, 4-layer) | Telecom / IoT | 30.582s | tf32_tensor_cores | 27.448s | 1.11x |
| DeepLabV3 (MobileNetV3) | Autonomous driving | 4.019s | bf16_autocast | 3.623s | 1.11x |
| Detection + Segmentation pipeline | Autonomous driving | 5.744s | bf16_autocast | 5.289s | 1.09x |
| SSD-Lite (MobileNetV3) | Autonomous driving | 1.918s | cudnn_benchmark | 1.816s | 1.06x |

What the results tell you

Massive gains (5-18x) on custom multi-stream pipelines — AUV sensor fusion, LiDAR 3D processing, agriculture multispectral, pose estimation, GNN smart grid. These architectures use unique compute patterns (point clouds, multi-modal fusion, feature pyramids, scatter/gather ops) that PyTorch doesn't optimize by default.

Strong gains (2-5x) on transformer-based models and recurrent policies — cybersecurity flow analysis, speech-to-text, time series, BERT, RL policy networks, edge diffusion. Mixed precision and TF32 dramatically reduce matmul cost.

Modest gains (1-1.1x) on already-optimized architectures and sequential workloads — MobileNet, EfficientNet, SSD-Lite, Mamba SSM. Mobile architectures use depthwise separable convolutions that are already fast; sequential scan models are overhead-bound and need torch.compile (V5+).

The real value is that slowai finds the right fix automatically — cuDNN benchmark wins for convolution-heavy models, bf16 autocast wins for matmul-heavy architectures, TF32 wins for transformer workloads. Different models, different winners, zero guesswork.

Industries covered

Autonomous vehicles, aviation/eVTOL, oil & gas, Navy/defense, marine science, space, healthcare, industrial robotics, precision agriculture, cybersecurity, consumer/AR-VR, sports analytics, retail, generative AI, predictive maintenance, energy/smart grid, telecom/IoT, warehouse logistics.

Writing a workload

slowai profiles any Python script that exposes a main() function:

# my_model.py
import torch
from torchvision.models import resnet50

model = resnet50().cuda().eval()
data = torch.randn(8, 3, 224, 224, device="cuda")

def main():
    with torch.no_grad():
        for _ in range(30):
            model(data)

Then run:

slowai fix my_model.py

Architecture

slowai/
  schema.py      # Regime enum, Diagnosis dataclass — the product thesis in types
  profiler.py    # torch.profiler wrapper -> ProfileResult (op stats + wall time)
  diagnose.py    # Heuristic classifier -> Diagnosis (regime + confidence + prescriptions)
  remediate.py   # Auto-fix engine -> FixReport (before/after speedup per remedy)
  optimize.py    # ONNX → TensorRT engine export pipeline (trtexec + DLA offloading)
  safety.py      # Numerical validation suite (equivalence testing, hard safety gates)
  power.py       # V8: Jetson power management, thermal monitoring, perf-per-watt
  report.py      # HTML report generator with Chart.js visualizations
  export.py      # Production config exporter (slowai_config.py)
  cli.py         # CLI: diagnose, fix, optimize, validate, report, scan, capabilities, power

The classifier is a pure function of profiler output — no torch dependency, fully unit-testable. The remediate engine applies fixes as environment transforms (global flags, autocast context managers, JIT compilation) so it never modifies user code.
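
As a sketch, the three remedy styles look like this; none of them touch the user's script:

import torch

# 1. Global flags (e.g. the cudnn_benchmark and tf32_tensor_cores remedies).
torch.backends.cudnn.benchmark = True
torch.backends.cuda.matmul.allow_tf32 = True

# 2. Autocast context wrapped around the workload's main() (bf16_autocast).
def run_bf16(user_main):
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        user_main()

# 3. JIT compilation applied to the model object (torch.compile remedy).
# compiled_model = torch.compile(model)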

What's different

Other tools in this space are profiler UIs that show you data and leave interpretation to you. slowai is the only tool that goes from raw workload to regime classification to ranked prescriptions to auto-applied fixes to measured speedup in a single CLI command.

| Tool | Profiles | Classifies | Prescribes | Auto-fixes | Measures | Reports |
| --- | --- | --- | --- | --- | --- | --- |
| PyTorch Profiler | Yes | No | No | No | No | No |
| NVIDIA Nsight | Yes | No | No | No | No | No |
| torch.utils.bottleneck | Yes | No | No | No | No | No |
| DeepSpeed Flops Profiler | Yes | No | No | No | No | No |
| slowai | Yes | Yes | Yes | Yes | Yes | Yes |

Roadmap

  • V1 (shipped) — Profile + classify regime for synthetic workloads
  • V2 (shipped) — Noise filtering, normalization-aware classification, real model support
  • V3 (shipped) — Auto-remediate: apply fixes and measure before/after speedup
  • V3.1 (shipped) — --export flag: save winning config as production-ready Python module
  • V3.2 (shipped) — --ci mode: CI/CD integration with threshold-based pass/fail
  • V4 (shipped) — channels_last, Jetson power mode detection, numerical accuracy validation, 27 workloads / 18 industries
  • V5 (shipped) — TensorRT via torch.compile, torch.compile inductor, INT8 dynamic quantization, DLA detection
  • V6 (shipped) — HTML diagnostic reports with Chart.js, batch directory scanning, README glow-up, PyPI 0.6.0
  • V7 (shipped) — Production TensorRT engine export (ONNX → trtexec), safety validation suite (numerical equivalence + flight-readiness), slowai optimize + slowai validate commands, PyPI 0.7.0
  • V8 (shipped) — Hard safety gates (non-zero exit codes block deployment), DLA offloading (Orin hardware accelerator), slowai power thermal/power monitoring, --strict validation mode, --dla engine export, PyPI 0.8.0
  • V9 (next) — Hardware-aware detection (gate --dla on SKU), ONNX→trtexec hardening, INT8 calibration pipeline with dataset, streaming inference benchmarker

Built by

Rico Allen — @ricojallen37-sketch

Built and tested on NVIDIA Jetson Orin Nano Super Developer Kit.
