LabNeura
SIMD-accelerated tensor operations for neural networks
High-performance C++ tensor library with Python bindings, featuring hardware-accelerated SIMD operations optimized for modern CPUs. Supports AVX2 (x86_64), NEON (ARM64/Apple Silicon), and provides mixed-precision computation with optional INT8 quantization.
Features
Hardware Acceleration
- Multi-backend Architecture: Runtime detection and selection of optimal SIMD backend
- AVX2 Backend: 8-wide FP32 or 32-wide INT8 vectorization on Intel/AMD x86_64 CPUs
- NEON Backend: 4-wide FP32 or 16-wide INT8 vectorization on Apple Silicon (M1/M2/M3/M4) and ARM64
- Generic Backend: Portable fallback for all architectures
- Automatic Backend Selection: Detects CPU capabilities at runtime using detect_backend()
- Compile-time Optimization: CMake detects hardware and enables appropriate compiler flags
Tensor Operations
- Mixed-Precision Support: FP32 (32-bit float), INT8 (8-bit quantized), with future support for FP64, FP16, INT32, INT16
- In-place Operations: add_inplace(), mul_inplace(), sub_inplace() for memory efficiency
- Immutable Operations: add() returns new tensors without modifying the originals
- Quantization: Built-in INT8 quantization with 4x memory reduction
Python Integration
- Easy Installation: pip install labneura2 (PyPI package name; import as labneura)
- pybind11 Bindings: Zero-copy data access between C++ and Python
- NumPy Compatible: Seamless integration with NumPy workflows (see the sketch after this list)
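A minimal sketch of round-tripping data through NumPy, assuming data_fp32() returns a Python sequence as it does in the Quick Start below (the zero-copy path itself is not shown here):

import numpy as np
import labneura

# Build a tensor from a NumPy array's values.
arr = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
t = labneura.Tensor(arr.tolist(), labneura.QuantizationMode.FP32)
t.add_inplace(labneura.Tensor([0.5] * 4, labneura.QuantizationMode.FP32))

# Convert the result back into a NumPy array for downstream work.
result = np.asarray(t.data_fp32(), dtype=np.float32)
print(result)  # [1.5 2.5 3.5 4.5]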
Installation
Python Package (Recommended)
pip install labneura2
Note: The PyPI package is named labneura2, but you import it as labneura:
import labneura
Build from Source
Prerequisites
- CMake >= 3.14
- C++17 compatible compiler (Clang, GCC, MSVC)
- Python 3.9+ (for Python bindings)
- pybind11 >= 3.0.1
C++ Library
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release
cmake --install .
Python Bindings
cd python
pip install .
For development with coverage:
LABNEURA_COVERAGE=1 pip install -e .
Quick Start
Python Usage
import labneura
# Detect available backend
print(f"Using backend: {labneura.detect_backend()}") # "AVX2", "NEON", or "GENERIC"
# FP32 operations
t1 = labneura.Tensor([1.0, 2.0, 3.0, 4.0], labneura.QuantizationMode.FP32)
t2 = labneura.Tensor([0.5, 0.5, 0.5, 0.5], labneura.QuantizationMode.FP32)
# In-place addition (modifies t1)
t1.add_inplace(t2)
print(t1.data_fp32()) # [1.5, 2.5, 3.5, 4.5]
# Non-destructive addition (returns new tensor)
t3 = t1.add(t2)
# Other operations
t1.mul_inplace(t2) # Element-wise multiplication
t1.sub_inplace(t2) # Element-wise subtraction
# INT8 quantization for memory efficiency
t_int8 = labneura.Tensor([10, 20, 30, 40], labneura.QuantizationMode.INT8)
print(t_int8.data_int8()) # [10, 20, 30, 40]
C++ Usage
#include "labneura/tensor.h"
#include "labneura/backends/backend_factory.h"
#include <iostream>
int main() {
// Detect backend at runtime
std::cout << "Backend: " << labneura::detect_backend() << std::endl;
// Create tensors
std::vector<float> data1 = {1.0f, 2.0f, 3.0f, 4.0f};
std::vector<float> data2 = {0.5f, 0.5f, 0.5f, 0.5f};
labneura::Tensor t1(data1, labneura::QuantizationMode::FP32);
labneura::Tensor t2(data2, labneura::QuantizationMode::FP32);
// In-place operations
t1.add_inplace(t2);
// Access data
const float* result = t1.data_fp32();
for (size_t i = 0; i < t1.size(); ++i) {
std::cout << result[i] << " ";
}
return 0;
}
Compile (the -mavx2 flag applies to x86_64 builds; on ARM64, NEON support is enabled by default):
g++ -std=c++17 -O3 -mavx2 main.cpp -I/path/to/labneura/include -L/path/to/labneura/lib -llabneura
Performance
LabNeura delivers SIMD-accelerated performance that significantly outperforms NumPy and is competitive with PyTorch, thanks to hardware-specific optimizations tuned for quantized inference on edge devices.
Key Results (1M Elements, Apple Silicon)
| Operation | LabNeura | vs NumPy | vs PyTorch |
|---|---|---|---|
| INT16 ADD | 67.98 μs | 6.1x faster | 2.9x faster |
| INT16 MUL | 71.49 μs | 5.8x faster | 1.96x faster |
| FP16 ADD | 67.94 μs | 60.4x faster | 1.18x slower* |
| FP16 MUL | 67.94 μs | 134.8x faster | 1.23x slower* |
*PyTorch benefits from GPU pipeline optimization for large batches; LabNeura is superior for CPU-only deployment.
Complete Benchmarks
View Full Benchmark Report
- INT16 & FP16 operations vs NumPy and PyTorch
- Multiple tensor sizes (1K to 1M elements)
- Comprehensive analysis and optimization recommendations
- Quick Reference Table
- Executive Summary
Run Benchmarks Locally
# INT16/FP16 comparison
python benchmarking/benchmark_int16_fp16.py
# View results
cat benchmarking/benchmark_int16_fp16_results.json
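For a quick ad-hoc comparison outside the bundled scripts, a minimal timing sketch (the size and repetition count here are arbitrary choices, not the project's benchmark methodology):

import time
import numpy as np
import labneura

N, REPS = 1_000_000, 100
a = labneura.Tensor([1.0] * N, labneura.QuantizationMode.FP32)
b = labneura.Tensor([0.5] * N, labneura.QuantizationMode.FP32)
x = np.full(N, 1.0, dtype=np.float32)
y = np.full(N, 0.5, dtype=np.float32)

# Time labneura's in-place SIMD add.
start = time.perf_counter()
for _ in range(REPS):
    a.add_inplace(b)
lab_us = (time.perf_counter() - start) / REPS * 1e6

# Time the equivalent NumPy in-place add.
start = time.perf_counter()
for _ in range(REPS):
    x += y
np_us = (time.perf_counter() - start) / REPS * 1e6

print(f"labneura add_inplace: {lab_us:.1f} us | numpy +=: {np_us:.1f} us")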
Performance Characteristics
- NEON (ARM64): Processes 4 FP32 or 16 INT8 elements per instruction
- AVX2 (x86_64): Processes 8 FP32 or 32 INT8 elements per instruction (lane arithmetic shown after this list)
- Memory Efficiency: INT8 mode uses 4x less memory, improving cache utilization
- Zero Threading Overhead: Single-threaded SIMD maximizes per-core performance
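The per-instruction counts above follow directly from register width divided by element width; a quick check:

# Elements per SIMD instruction = register bits // element bits.
for backend, reg_bits in [("AVX2", 256), ("NEON", 128)]:
    for dtype, elem_bits in [("FP32", 32), ("INT8", 8)]:
        print(f"{backend} {dtype}: {reg_bits // elem_bits} elements/instruction")
# AVX2 FP32: 8, AVX2 INT8: 32, NEON FP32: 4, NEON INT8: 16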
Architecture
Backend Selection Hierarchy
Runtime Detection → Best Available Backend
1. Check AVX2 support (x86_64) → AVX2Backend
2. Check NEON support (ARM64) → NEONBackend
3. Fallback → GenericBackend
Directory Structure
LabNeura/
├── include/labneura/         # Public C++ headers
│   ├── tensor.h              # Main Tensor API
│   ├── types.hpp             # Type definitions
│   └── backends/             # Backend interfaces
│       ├── base.h            # TensorBackend abstract base
│       ├── avx2.h            # AVX2 SIMD backend
│       ├── neon.h            # NEON SIMD backend
│       ├── generic.h         # Portable fallback
│       ├── backend_factory.h # Factory + detect_backend()
│       └── cpu_features.h    # Runtime CPU feature detection
├── src/labneura/             # Implementation
│   ├── tensor.cpp            # Tensor class implementation
│   └── backends/*.cpp        # Backend implementations
├── python/                   # Python bindings
│   ├── labneura_py.cpp       # pybind11 bindings
│   └── setup.py              # Python package config
├── examples/                 # Usage examples
│   ├── main.cpp              # C++ example
│   ├── test_tensor.py        # Python example
│   └── benchmark_numpy.py    # Performance comparison
├── tests/                    # Test suite
├── CMakeLists.txt            # CMake build configuration
└── Makefile                  # Build automation
Testing
Run Tests
# Python tests
pytest tests/
# Specific test files
pytest tests/test_tensor_ops.py
pytest tests/test_quantization.py
# With coverage
pytest --cov=labneura tests/
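As a hedged illustration of what a minimal test could look like against the public API (the file and test names here are invented for the example, not the project's actual suite):

# test_add_sketch.py -- illustrative only
import labneura

def test_add_inplace_fp32():
    t1 = labneura.Tensor([1.0, 2.0], labneura.QuantizationMode.FP32)
    t2 = labneura.Tensor([0.5, 0.5], labneura.QuantizationMode.FP32)
    t1.add_inplace(t2)
    assert list(t1.data_fp32()) == [1.5, 2.5]

def test_backend_is_known():
    assert labneura.detect_backend() in {"AVX2", "NEON", "GENERIC"}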
Run Examples
# C++ example
./build/labneura_example
# Python examples
python examples/test_tensor.py
python examples/benchmark_numpy.py
Development
Build Commands
# Clean build
make clean
# Build Python package
make build
# Install locally
make install
# Run tests
make test
# Build distribution packages
make build-dist
# Publish to TestPyPI
TEST_PYPI_TOKEN=pypi-... make publish-testpypi
# Publish to PyPI
PYPI_TOKEN=pypi-... make publish
Build Flags
# Enable AVX2 (x86_64)
cmake .. -DCMAKE_CXX_FLAGS="-mavx2"
# Enable LLVM coverage
LABNEURA_COVERAGE=1 cmake ..
# Release build with optimizations
cmake .. -DCMAKE_BUILD_TYPE=Release
Implementation Details
SIMD Optimization
AVX2 Backend (x86_64)
- Register Width: 256 bits
- FP32: Processes 8 floats per instruction (_mm256_add_ps, _mm256_mul_ps)
- INT8: Processes 32 bytes per instruction (_mm256_add_epi8)
- Alignment: Pads arrays to 8-element boundaries
NEON Backend (ARM64/Apple Silicon)
- Register Width: 128 bits
- FP32: Processes 4 floats per instruction (vaddq_f32, vmulq_f32)
- INT8: Processes 16 bytes per instruction (vaddq_s8)
- Alignment: Pads arrays to 4-element boundaries (see the padding sketch after this list)
- Optimization: Single-threaded design maximizes instruction-level parallelism
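A conceptual sketch of the boundary padding both backends describe (plain Python for illustration; not the library's internal code):

def pad_to_lane_boundary(values, lanes):
    """Pad to a multiple of the SIMD lane count so every vector load is full."""
    remainder = len(values) % lanes
    if remainder:
        values = values + [0.0] * (lanes - remainder)
    return values

# 6 FP32 elements padded for NEON (4 lanes) -> length 8 = 2 full vectors;
# for AVX2 FP32 (8 lanes) the same data would pad to one full 8-lane vector.
print(pad_to_lane_boundary([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], lanes=4))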
Quantization Strategy
- INT8 Mode: Stores data as 8-bit signed integers (-128 to 127)
- Memory Reduction: 4x smaller than FP32 (1 byte vs 4 bytes per element)
- Saturation Arithmetic: Prevents overflow by clamping to [-128, 127] (sketched after this list)
- Use Cases: Neural network inference, mobile deployment
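A minimal sketch of the clamping step behind saturation arithmetic (plain Python; the library's scale/zero-point scheme, if any, is not documented here):

def saturate_int8(x):
    # Clamp into the signed 8-bit range instead of wrapping on overflow.
    return max(-128, min(127, int(round(x))))

print(saturate_int8(300))   # 127
print(saturate_int8(-200))  # -128
print(saturate_int8(42))    # 42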
Backend Detection
// Runtime detection in backend_factory.cpp
std::string detect_backend() {
if (cpu_supports_avx2()) return "AVX2";
if (cpu_supports_neon()) return "NEON";
return "GENERIC";
}
CPU feature detection uses:
- x86_64: __builtin_cpu_supports("avx2")
- ARM64: Compile-time __ARM_NEON macro (NEON is mandatory on ARMv8)
Publishing
Package Names
- PyPI Name: labneura2 (install with pip install labneura2)
- Python Module: labneura (import with import labneura)
- GitHub: https://github.com/gokatharun/LabNeura
CI/CD Pipeline
GitHub Actions workflow automates:
- Build distribution packages
- Run test suite
- Publish to TestPyPI (manual trigger)
- Publish to PyPI (on release tags)
License
MIT License - see LICENSE file for details.
Contributing
Contributions welcome! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit changes (git commit -m 'Add amazing feature')
- Push to branch (git push origin feature/amazing-feature)
- Open a Pull Request
Support
- Issues: https://github.com/gokatharun/LabNeura/issues
- Discussions: https://github.com/gokatharun/LabNeura/discussions
Acknowledgments
Built with:
- pybind11 - Seamless Python/C++ bindings
- CMake - Cross-platform build system
- SIMD intrinsics: Intel AVX2, ARM NEON
Author: LabNeura Contributors
Version: 0.2.6
Python: 3.9+ | C++: 17+ | Platforms: macOS (Apple Silicon & Intel), Linux (x86_64, ARM64)
Download files
Source Distribution
Details for the file labneura2-0.2.6.tar.gz.
File metadata
- Download URL: labneura2-0.2.6.tar.gz
- Size: 42.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6c6df13a530e583fed3d509da97ce3cc78805121ec16f04a37ff4d101c60f494 |
| MD5 | fa936619e6a4ff6165cce2203a0de8a9 |
| BLAKE2b-256 | 2ac3857d25b195d36ff5363b24074b4a2d5560d99d16cac3234b4d65b9f1441d |
Built Distribution
Details for the file labneura2-0.2.6-cp39-cp39-macosx_10_9_universal2.whl.
File metadata
- Download URL: labneura2-0.2.6-cp39-cp39-macosx_10_9_universal2.whl
- Size: 884.6 kB
- Tags: CPython 3.9, macOS 10.9+ universal2 (ARM64, x86-64)
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7aa52b6a2a8f74d02b23063efef1e9d0d8e388dcb3be2d7724664ce3980cca77 |
| MD5 | b48b8e44f76ba7a99474b21caa0b21e3 |
| BLAKE2b-256 | dc5a6dc73f9b51b7045efd6bd926675034054d78a10a9e3700121fdf9f5d948a |