slowai

PyPI version · Python 3.10+ · License: MIT · Edge AI

One command to find why your PyTorch model is slow — and fix it.

slowai diagnoses which performance regime your workload is stuck in (compute-bound, memory-bound, or overhead-bound), prescribes the right fix, auto-applies it, and proves the speedup with before/after measurements. No guesswork. No manual profiler interpretation.

Install

pip install slowai

That's it. Works on any NVIDIA GPU — desktop, server, or Jetson edge devices.

Quick start

slowai fix model.py                # diagnose + auto-fix + measure speedup
slowai fix model.py --export       # ^ plus save winning config for production
slowai optimize model.py           # export ONNX → TensorRT .engine (production)
slowai optimize model.py --precision=int8  # INT8 engine for max throughput
slowai optimize model.py --dla     # V8: DLA offloading (Orin)
slowai validate model.py           # safety validation (numerical equivalence)
slowai validate model.py --strict  # V8: hard gate (any non-SAFE = fail)
slowai report model.py             # HTML diagnostic report with charts
slowai scan ./workloads/           # batch-scan an entire directory
slowai capabilities                # show GPU + acceleration backends
slowai power                       # V8: Jetson power/thermal status

Example output

$ slowai fix model.py

==============================================================
  BASELINE: model.py: COMPUTE_BOUND (confidence: 0.85)
  wall time: 7.523s
==============================================================

  Tried 4 remedies:

  1. [10.00x] bf16_autocast  ** BEST **
     Run under bfloat16 automatic mixed precision
     7.523s >>> 0.752s
     regime: compute (confidence: 0.85)

  2. [6.32x] tf32_tensor_cores
     Enable TF32 tensor cores (~2x matmul throughput on Ampere+)
     7.523s >>> 1.191s

  3. [6.22x] high_matmul_precision
     Set float32 matmul precision to 'high'
     7.523s >>> 1.210s

  4. [1.31x] cudnn_benchmark
     Enable cuDNN auto-tuner for conv kernels
     7.523s >>> 5.719s

--------------------------------------------------------------
  WINNER: bf16_autocast
  7.523s >>> 0.752s  (10.00x, +900% faster)
  How: Run under bfloat16 automatic mixed precision
--------------------------------------------------------------

Why this exists

Every deep learning workload is stuck in one of three performance regimes (Horace He, 2022):

| Regime | What's happening | Wrong fix = no speedup |
| --- | --- | --- |
| Compute-bound | GPU is saturated doing math (matmuls, convolutions) | Fusing ops won't help — the math itself is the bottleneck |
| Memory-bound | GPU is waiting for data (pointwise ops, activations) | Smaller model won't help — you need less data movement |
| Overhead-bound | GPU is idle waiting for Python/dispatcher (tiny ops) | Lower precision won't help — you need fewer, bigger ops |

The fix for each regime is different. Applying a compute-bound fix to a memory-bound workload does nothing. Engineers waste hours in profiler UIs figuring this out manually.

slowai does it in one command.
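
For intuition, minimal PyTorch snippets that tend to land in each regime look like this (illustrative toy kernels, not slowai's benchmark workloads):

import torch

x = torch.randn(4096, 4096, device="cuda")

# Compute-bound: one large matmul saturates the GPU's math units.
y = x @ x

# Memory-bound: a chain of pointwise ops moves far more bytes than it computes.
z = x.relu().sin().exp()

# Overhead-bound: thousands of tiny kernels leave the GPU idle
# between Python-side launches.
for _ in range(5000):
    _ = torch.randn(8, 8, device="cuda").sum()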

How it works

Under the hood:

  1. Profile — Runs your workload under torch.profiler with CUDA timing, warmup pass, and op-level statistics
  2. Classify — A heuristic classifier analyzes op shares (matmul, normalization, pointwise, tiny-op fraction) to determine the dominant regime
  3. Prescribe — Returns a ranked list of fixes for that regime, cheapest first
  4. Remediate — Auto-applies each applicable fix, re-profiles, and ranks by measured speedup

No code changes required. Remedies are environment-level transforms — they modify PyTorch's runtime settings, not your model code.
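
A minimal sketch of the classification idea, using hypothetical op-share fields and thresholds; slowai's real classifier uses more signals (normalization shares, noise filtering) and different cutoffs:

from dataclasses import dataclass
from enum import Enum

class Regime(Enum):
    COMPUTE_BOUND = "compute"
    MEMORY_BOUND = "memory"
    OVERHEAD_BOUND = "overhead"

@dataclass
class OpShares:
    matmul: float     # fraction of GPU time in matmuls/convolutions
    pointwise: float  # fraction in elementwise/activation ops
    tiny_op: float    # fraction of ops too small to fill the GPU

def classify(shares: OpShares) -> Regime:
    # Tiny ops dominating means the GPU idles between kernel launches.
    if shares.tiny_op > 0.5:
        return Regime.OVERHEAD_BOUND
    # Math-heavy ops dominating means the ALUs are the bottleneck.
    if shares.matmul > shares.pointwise:
        return Regime.COMPUTE_BOUND
    # Otherwise data movement dominates.
    return Regime.MEMORY_BOUND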

V8 Features (NEW)

Hard Safety Gates — CI/CD Deployment Blocking

slowai validate model.py --strict             # non-zero exit on ANY non-SAFE
slowai validate model.py --json               # machine-readable for pipelines
echo $?                                       # 0=SAFE, 1=CAUTION, 2=UNSAFE, 3=ERROR

V8 turns safety validation into a hard deployment gate. In --strict mode, anything non-SAFE blocks your CI/CD pipeline with a non-zero exit code. No more "warnings" that engineers ignore — UNSAFE means DO NOT DEPLOY.
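
A sketch of a Python deployment gate built on those documented exit codes; the deploy step is a placeholder for whatever your pipeline does next:

import subprocess
import sys

# Run the strict gate; exit codes: 0=SAFE, 1=CAUTION, 2=UNSAFE, 3=ERROR.
result = subprocess.run(["slowai", "validate", "model.py", "--strict"])
if result.returncode != 0:
    print(f"slowai blocked deployment (exit code {result.returncode})")
    sys.exit(result.returncode)
print("validation SAFE; proceeding to deploy")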

DLA Offloading (Orin Hardware Accelerator)

slowai optimize model.py --dla                # offload to DLA core 0
slowai optimize model.py --dla --dla-core=1   # use DLA core 1

Select Jetson hardware (Orin NX, AGX Orin, Thor) includes dedicated Deep Learning Accelerator (DLA) cores. DLA offloading routes compatible layers to the DLA, freeing CUDA cores for other tasks. Note: the Orin Nano has no DLA hardware — slowai detects this and reports it via slowai capabilities.

Power & Thermal Awareness

slowai power                                  # show power mode + thermal status
slowai power --json                           # machine-readable

Edge devices care about performance-per-watt as much as latency. The new power command detects nvpmodel mode, jetson_clocks state, GPU/CPU frequencies, and thermal throttling. If your Jetson is throttling, your benchmarks are unreliable — slowai tells you.

V7 Features

TensorRT Engine Export (Production Deployment)

slowai optimize model.py                      # ONNX → TensorRT FP16 engine
slowai optimize model.py --precision=int8     # INT8 for maximum throughput
slowai optimize model.py --target=onnx        # ONNX export only

The #1 feature NVIDIA engineers demand. Exports your model to a production-grade TensorRT .engine file via the reliable ONNX → trtexec path (bypasses broken torch_tensorrt wheels). Auto-detects JetPack, CUDA, and TensorRT versions. Outputs engine + metadata JSON with size, build time, and environment info.
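
slowai drives the whole pipeline for you; as a rough sketch, the ONNX leg it automates resembles the standard export call below (the model, input shape, and opset here are assumptions):

import torch
from torchvision.models import resnet50

model = resnet50().cuda().eval()
dummy = torch.randn(1, 3, 224, 224, device="cuda")

# Export to ONNX, the intermediate format trtexec consumes;
# slowai then invokes trtexec on model.onnx to build the .engine file.
torch.onnx.export(model, dummy, "model.onnx", opset_version=17)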

Safety Validation Suite

slowai validate model.py                      # test all winning remedies
slowai validate model.py --atol=1e-6          # strict tolerance
slowai validate model.py --strict             # V8: hard gate

Every remedy that changes precision risks silent accuracy degradation. The safety suite catches it before you ship: runs baseline vs. remedied, compares output tensors (max absolute diff, max relative diff, cosine similarity), and grades each remedy as SAFE / CAUTION / UNSAFE. Includes determinism checking for flight-ready validation.
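
A sketch of the three equivalence metrics; the grading thresholds below are illustrative assumptions, not slowai's actual cutoffs:

import torch

def grade(baseline: torch.Tensor, remedied: torch.Tensor) -> str:
    diff = (baseline - remedied).abs()
    max_abs = diff.max().item()
    max_rel = (diff / baseline.abs().clamp_min(1e-12)).max().item()
    cos = torch.nn.functional.cosine_similarity(
        baseline.flatten().float(), remedied.flatten().float(), dim=0
    ).item()
    if max_abs < 1e-3 and cos > 0.9999:
        return "SAFE"
    if max_rel < 0.1 and cos > 0.999:
        return "CAUTION"
    return "UNSAFE"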

HTML Diagnostic Reports

slowai report model.py -o report.html

Generates a self-contained HTML report with hardware info, regime classification, interactive speedup charts, and a remedy leaderboard. Share with your team or include in CI artifacts.

Batch Scanning

slowai scan ./workloads/

Profiles every .py file in a directory and prints a summary table with regime, best remedy, and speedup for each workload. Great for auditing an entire model zoo.

Acceleration Backends

As of V6, slowai includes 10 remedies across all acceleration tiers:

| Remedy | Type | Best for |
| --- | --- | --- |
| torch.compile | JIT compilation | Overhead-bound (fuses ops, eliminates dispatch) |
| TensorRT | Inference optimizer | Compute-bound (layer fusion, kernel auto-tuning) |
| TensorRT FP16 | Half-precision TRT | Memory-bound (maximum throughput) |
| INT8 quantization | Dynamic quantization | Linear/LSTM-heavy models (2-4x on Ampere) |
| bf16 autocast | Mixed precision | Matmul-heavy architectures |
| fp16 autocast | Mixed precision | Memory-bound models |
| TF32 tensor cores | Precision relaxation | Transformer workloads |
| matmul precision | Internal downcasting | General compute |
| cuDNN benchmark | Kernel auto-tuner | CNN-heavy models |
| channels_last | Memory layout | Convolution pipelines |

Capability Detection

slowai capabilities

Shows your GPU, CUDA capability, Jetson status, power mode, and which acceleration backends (torch.compile, TensorRT, INT8 tensor cores, DLA) are available.
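
A rough sketch of the kind of probing the command performs (the real command reports much more, including Jetson power mode and DLA status):

import torch

if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    major, minor = torch.cuda.get_device_capability(0)
    print(f"{name}: compute capability {major}.{minor}")
    # bf16 autocast and TF32 tensor cores require Ampere (8.0) or newer.
    print("bf16 / TF32 tensor cores:", "available" if major >= 8 else "unavailable")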

Export to production

The --export flag saves the winning remedy as a drop-in Python module:

slowai fix model.py --export
# Creates slowai_config.py in the current directory

Then in your production code:

import slowai_config
slowai_config.apply()  # Set globally before your model runs

# Or as a context manager:
with slowai_config.optimized():
    model(data)

The exported config includes the exact PyTorch settings, speedup metadata, and both a global apply() function and an optimized() context manager. Zero dependencies beyond PyTorch.
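
For a sense of the output, a generated config for a TF32 win might look roughly like this (illustrative shape only; the real file is produced by slowai and embeds the measured speedup metadata):

# slowai_config.py (illustrative sketch)
import contextlib
import torch

def apply():
    # Environment-level settings; call once before the model runs.
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True

@contextlib.contextmanager
def optimized():
    prev = (torch.backends.cuda.matmul.allow_tf32, torch.backends.cudnn.allow_tf32)
    apply()
    try:
        yield
    finally:
        torch.backends.cuda.matmul.allow_tf32, torch.backends.cudnn.allow_tf32 = prev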

CI/CD mode

Catch performance regressions on every commit:

# Fail if no remedy achieves at least 1.5x speedup
slowai fix model.py --ci --threshold 1.5

# Returns exit code 0 (pass) or 1 (fail)
# Outputs JSON for pipeline consumption
echo $?

Combine with --export to auto-generate optimized configs in your pipeline:

slowai fix model.py --ci --threshold 2.0 --export slowai_config.py

Benchmarks

Tested on NVIDIA Jetson Orin Nano Super (Ampere GPU, 1024 CUDA cores, 8GB unified RAM, JetPack 6.2, CUDA 12.6, PyTorch 2.8.0). 27 workloads across 18 industry verticals — the most comprehensive edge AI performance benchmark suite available.

Synthetic workloads (regime validation)

| Workload | Regime | Baseline | Best remedy | After | Speedup |
| --- | --- | --- | --- | --- | --- |
| Dense GEMM (4096x4096) | Compute | 7.523s | bf16_autocast | 0.752s | 10.00x |
| Pointwise chain (8192x8192) | Memory | 2.400s | tf32_tensor_cores | 0.575s | 4.17x |
| Tiny ops (5000 micro-ops) | Overhead | 3.281s | tf32_tensor_cores | 1.141s | 2.88x |

Production models — standard architectures

| Workload | Industry | Baseline | Best remedy | After | Speedup |
| --- | --- | --- | --- | --- | --- |
| MobileNetV2 | Mobile / Edge | 1.778s | cudnn_benchmark | 1.681s | 1.06x |
| ResNet-50 | Classification | 2.105s | bf16_autocast | 1.969s | 1.07x |
| EfficientNet-B0 | Drones / Aerospace | 1.445s | cudnn_benchmark | 1.404s | 1.03x |
| R3D-18 (video) | Surveillance / Defense | 6.032s | bf16_autocast | 3.443s | 1.75x |
| Transformer (12L/768d/12H) | NLP / LLMs | 6.207s | bf16_autocast | 2.060s | 3.01x |

Production models — industry-specific pipelines

| Workload | Industry | Baseline | Best remedy | After | Speedup |
| --- | --- | --- | --- | --- | --- |
| Underwater AUV (sonar + camera + nav) | Oil & Gas / Navy | 2.232s | cudnn_benchmark | 0.122s | 18.34x |
| LiDAR 3D point cloud (PointNet-style) | Autonomous vehicles | 2.288s | bf16_autocast | 0.138s | 16.57x |
| Agriculture drone (multispectral + NDVI) | Precision agriculture | | tf32_tensor_cores | | 13.78x |
| Pose estimation (FPN + PAF, multi-person) | Retail / AR-VR / Sports | 2.394s | bf16_autocast | 0.287s | 8.34x |
| Satellite imaging (change detection + priority) | Space / Defense | | bf16_autocast | | 7.72x |
| Robotics pick-and-place (RGB-D + 7-DOF) | Industrial robotics | | cudnn_benchmark | | 7.39x |
| GNN smart grid (message-passing + pooling) | Energy / Telecom | 2.569s | tf32_tensor_cores | 0.453s | 5.67x |
| Medical imaging (DenseNet + multi-task) | Healthcare | | bf16_autocast | | 5.54x |
| 1D ConvNet (signal processing) | Navy radar / sonar | 2.990s | bf16_autocast | 0.757s | 3.95x |
| Time Series Transformer | Predictive maintenance | 3.814s | bf16_autocast | 1.072s | 3.56x |
| Edge diffusion (UNet denoiser, 128x128) | Generative AI on device | 2.904s | bf16_autocast | 0.918s | 3.17x |
| Fly-by-wire control (sensor + transformer) | Aviation / eVTOL | | tf32_tensor_cores | | 3.08x |
| Cybersecurity anomaly (flow transformer) | Network defense / SOC | 3.749s | tf32_tensor_cores | 1.308s | 2.87x |
| Speech-to-text (Whisper-style encoder-decoder) | Consumer / Accessibility | | tf32_tensor_cores | | 2.50x |
| RL policy network (LSTM + multi-modal, 200Hz) | Industrial robotics / Logistics | 4.370s | tf32_tensor_cores | 2.071s | 2.11x |
| Mamba SSM (selective state-space, 4-layer) | Telecom / IoT | 30.582s | tf32_tensor_cores | 27.448s | 1.11x |
| DeepLabV3 (MobileNetV3) | Autonomous driving | 4.019s | bf16_autocast | 3.623s | 1.11x |
| Detection + Segmentation pipeline | Autonomous driving | 5.744s | bf16_autocast | 5.289s | 1.09x |
| SSD-Lite (MobileNetV3) | Autonomous driving | 1.918s | cudnn_benchmark | 1.816s | 1.06x |

What the results tell you

Massive gains (5-18x) on custom multi-stream pipelines — AUV sensor fusion, LiDAR 3D processing, agriculture multispectral, pose estimation, GNN smart grid. These architectures use unique compute patterns (point clouds, multi-modal fusion, feature pyramids, scatter/gather ops) that PyTorch doesn't optimize by default.

Strong gains (2-5x) on transformer-based models and recurrent policies — cybersecurity flow analysis, speech-to-text, time series, BERT, RL policy networks, edge diffusion. Mixed precision and TF32 dramatically reduce matmul cost.

Modest gains (1-1.1x) on already-optimized architectures and sequential workloads — MobileNet, EfficientNet, SSD-Lite, Mamba SSM. Mobile architectures use depthwise separable convolutions that are already fast; sequential scan models are overhead-bound and need torch.compile (V5+).

The real value is that slowai finds the right fix automatically — cuDNN benchmark wins for convolution-heavy models, bf16 autocast wins for matmul-heavy architectures, TF32 wins for transformer workloads. Different models, different winners, zero guesswork.

Industries covered

Autonomous vehicles, aviation/eVTOL, oil & gas, Navy/defense, marine science, space, healthcare, industrial robotics, precision agriculture, cybersecurity, consumer/AR-VR, sports analytics, retail, generative AI, predictive maintenance, energy/smart grid, telecom/IoT, warehouse logistics.

Writing a workload

slowai profiles any Python script that exposes a main() function:

# my_model.py
import torch
from torchvision.models import resnet50

model = resnet50().cuda().eval()
data = torch.randn(8, 3, 224, 224, device="cuda")

def main():
    with torch.no_grad():
        for _ in range(30):
            model(data)

Then run:

slowai fix my_model.py

Architecture

slowai/
  schema.py      # Regime enum, Diagnosis dataclass — the product thesis in types
  profiler.py    # torch.profiler wrapper -> ProfileResult (op stats + wall time)
  diagnose.py    # Heuristic classifier -> Diagnosis (regime + confidence + prescriptions)
  remediate.py   # Auto-fix engine -> FixReport (before/after speedup per remedy)
  optimize.py    # ONNX → TensorRT engine export pipeline (trtexec + DLA offloading)
  safety.py      # Numerical validation suite (equivalence testing, hard safety gates)
  power.py       # V8: Jetson power management, thermal monitoring, perf-per-watt
  report.py      # HTML report generator with Chart.js visualizations
  export.py      # Production config exporter (slowai_config.py)
  cli.py         # CLI: diagnose, fix, optimize, validate, report, scan, capabilities, power

The classifier is a pure function of profiler output — no torch dependency, fully unit-testable. The remediate engine applies fixes as environment transforms (global flags, autocast context managers, JIT compilation) so it never modifies user code.
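
As a sketch, the three remedy styles look like this; none of them touch the user's script:

import torch

# 1. Global flags (e.g. the cudnn_benchmark and tf32_tensor_cores remedies).
torch.backends.cudnn.benchmark = True
torch.backends.cuda.matmul.allow_tf32 = True

# 2. Autocast context wrapped around the workload's main() (bf16_autocast).
def run_bf16(user_main):
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        user_main()

# 3. JIT compilation applied to the model object (torch.compile remedy).
# compiled_model = torch.compile(model)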

What's different

Other tools in this space are profiler UIs that show you data and leave interpretation to you. slowai is the only tool that goes from raw workload to regime classification to ranked prescriptions to auto-applied fixes to measured speedup in a single CLI command.

| Tool | Profiles | Classifies | Prescribes | Auto-fixes | Measures | Reports |
| --- | --- | --- | --- | --- | --- | --- |
| PyTorch Profiler | Yes | No | No | No | No | No |
| NVIDIA Nsight | Yes | No | No | No | No | No |
| torch.utils.bottleneck | Yes | No | No | No | No | No |
| DeepSpeed Flops Profiler | Yes | No | No | No | No | No |
| slowai | Yes | Yes | Yes | Yes | Yes | Yes |

Roadmap

  • V1 (shipped) — Profile + classify regime for synthetic workloads
  • V2 (shipped) — Noise filtering, normalization-aware classification, real model support
  • V3 (shipped) — Auto-remediate: apply fixes and measure before/after speedup
  • V3.1 (shipped) — --export flag: save winning config as production-ready Python module
  • V3.2 (shipped) — --ci mode: CI/CD integration with threshold-based pass/fail
  • V4 (shipped) — channels_last, Jetson power mode detection, numerical accuracy validation, 27 workloads / 18 industries
  • V5 (shipped) — TensorRT via torch.compile, torch.compile inductor, INT8 dynamic quantization, DLA detection
  • V6 (shipped) — HTML diagnostic reports with Chart.js, batch directory scanning, README glow-up, PyPI 0.6.0
  • V7 (shipped) — Production TensorRT engine export (ONNX → trtexec), safety validation suite (numerical equivalence + flight-readiness), slowai optimize + slowai validate commands, PyPI 0.7.0
  • V8 (shipped) — Hard safety gates (non-zero exit codes block deployment), DLA offloading (Orin hardware accelerator), slowai power thermal/power monitoring, --strict validation mode, --dla engine export, PyPI 0.8.0
  • V9 (next) — Hardware-aware detection (gate --dla on SKU), ONNX→trtexec hardening, INT8 calibration pipeline with dataset, streaming inference benchmarker

Built by

Rico Allen — @ricojallen37-sketch

Built and tested on NVIDIA Jetson Orin Nano Super Developer Kit.
