SIMD-accelerated tensor operations for neural networks


LabNeura


High-performance C++ tensor library with Python bindings, featuring hardware-accelerated SIMD operations optimized for modern CPUs. Supports AVX2 (x86_64), NEON (ARM64/Apple Silicon), and provides mixed-precision computation with optional INT8 quantization.

🚀 Features

Hardware Acceleration

  • Multi-backend Architecture: Runtime detection and selection of optimal SIMD backend
    • AVX2 Backend: 8-wide FP32 or 32-wide INT8 vectorization on Intel/AMD x86_64 CPUs
    • NEON Backend: 4-wide FP32 or 16-wide INT8 vectorization on Apple Silicon (M1/M2/M3/M4) and ARM64
    • Generic Backend: Portable fallback for all architectures
  • Automatic Backend Selection: Detects CPU capabilities at runtime using detect_backend()
  • Compile-time Optimization: CMake detects hardware and enables appropriate compiler flags

Tensor Operations

  • Mixed-Precision Support: FP32 (32-bit float) and INT8 (8-bit quantized), with future support planned for FP64, FP16, INT32, and INT16
  • In-place Operations: add_inplace(), mul_inplace(), sub_inplace() for memory efficiency
  • Immutable Operations: add() returns new tensors without modifying originals
  • Quantization: Built-in INT8 quantization with 4x memory reduction

Python Integration

  • Easy Installation: pip install labneura2 (PyPI package name, import as labneura)
  • pybind11 Bindings: Zero-copy data access between C++ and Python (sketched after this list)
  • NumPy Compatible: Seamless integration with NumPy workflows
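
As a sketch of how such zero-copy access can look with pybind11 (an assumed shape, not the project's actual labneura_py.cpp; as_numpy and its parameters are hypothetical), an existing C++ buffer can be exposed to NumPy without copying:

#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>

namespace py = pybind11;

// Wraps an existing float buffer as a NumPy array. Passing `owner` as the
// base object keeps the underlying C++ storage alive; no data is copied.
py::array_t<float> as_numpy(float* ptr, py::ssize_t n, py::handle owner) {
    return py::array_t<float>({n}, {static_cast<py::ssize_t>(sizeof(float))}, ptr, owner);
}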

📦 Installation

Python Package (Recommended)

pip install labneura2

Note: The PyPI package is named labneura2, but you import it as labneura:

import labneura  # note: not 'import labneura2'

Build from Source

Prerequisites

  • CMake >= 3.14
  • C++17 compatible compiler (Clang, GCC, MSVC)
  • Python 3.9+ (for Python bindings)
  • pybind11 >= 3.0.1

C++ Library

mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release
cmake --install .

Python Bindings

cd python
pip install .

For development with coverage:

LABNEURA_COVERAGE=1 pip install -e .

🎯 Quick Start

Python Usage

import labneura

# Detect available backend
print(f"Using backend: {labneura.detect_backend()}")  # "AVX2", "NEON", or "GENERIC"

# FP32 operations
t1 = labneura.Tensor([1.0, 2.0, 3.0, 4.0], labneura.QuantizationMode.FP32)
t2 = labneura.Tensor([0.5, 0.5, 0.5, 0.5], labneura.QuantizationMode.FP32)

# In-place addition (modifies t1)
t1.add_inplace(t2)
print(t1.data_fp32())  # [1.5, 2.5, 3.5, 4.5]

# Non-destructive addition (returns new tensor)
t3 = t1.add(t2)

# Other operations
t1.mul_inplace(t2)  # Element-wise multiplication
t1.sub_inplace(t2)  # Element-wise subtraction

# INT8 quantization for memory efficiency
t_int8 = labneura.Tensor([10, 20, 30, 40], labneura.QuantizationMode.INT8)
print(t_int8.data_int8())  # [10, 20, 30, 40]

C++ Usage

#include "labneura/tensor.h"
#include "labneura/backends/backend_factory.h"
#include <iostream>

int main() {
    // Detect backend at runtime
    std::cout << "Backend: " << labneura::detect_backend() << std::endl;
    
    // Create tensors
    std::vector<float> data1 = {1.0f, 2.0f, 3.0f, 4.0f};
    std::vector<float> data2 = {0.5f, 0.5f, 0.5f, 0.5f};
    
    labneura::Tensor t1(data1, labneura::QuantizationMode::FP32);
    labneura::Tensor t2(data2, labneura::QuantizationMode::FP32);
    
    // In-place operations
    t1.add_inplace(t2);
    
    // Access data
    const float* result = t1.data_fp32();
    for (size_t i = 0; i < t1.size(); ++i) {
        std::cout << result[i] << " ";
    }
    
    return 0;
}

Compile:

# x86_64 with AVX2 shown; on ARM64 omit -mavx2 (NEON is enabled by default)
g++ -std=c++17 -O3 -mavx2 main.cpp -I/path/to/labneura/include -L/path/to/labneura/lib -llabneura

⚡ Performance

LabNeura delivers SIMD-accelerated performance that significantly outperforms NumPy and is competitive with PyTorch, driven by hardware-specific optimizations targeted at quantized inference on edge devices.

Key Results (1M Elements, Apple Silicon)

Operation   LabNeura   vs NumPy        vs PyTorch
INT16 ADD   67.98 μs   6.1x faster     2.9x faster
INT16 MUL   71.49 μs   5.8x faster     1.96x faster
FP16 ADD    67.94 μs   60.4x faster    1.18x slower*
FP16 MUL    67.94 μs   134.8x faster   1.23x slower*

*PyTorch benefits from GPU pipeline optimization for large batches; LabNeura is superior for CPU-only deployment.

Complete Benchmarks

📊 View Full Benchmark Report

Run Benchmarks Locally

# INT16/FP16 comparison
python benchmarking/benchmark_int16_fp16.py

# View results
cat benchmarking/benchmark_int16_fp16_results.json

Performance Characteristics

  • NEON (ARM64): Processes 4 FP32 or 16 INT8 elements per instruction
  • AVX2 (x86_64): Processes 8 FP32 or 32 INT8 elements per instruction
  • Memory Efficiency: INT8 mode uses 4x less memory, improving cache utilization
  • Zero Threading Overhead: Single-threaded SIMD maximizes per-core performance

๐Ÿ—๏ธ Architecture

Backend Selection Hierarchy

Runtime Detection → Best Available Backend
1. Check AVX2 support (x86_64) → AVX2Backend
2. Check NEON support (ARM64) → NEONBackend
3. Fallback → GenericBackend

Directory Structure

LabNeura/
├── include/labneura/          # Public C++ headers
│   ├── tensor.h               # Main Tensor API
│   ├── types.hpp              # Type definitions
│   └── backends/              # Backend interfaces
│       ├── base.h             # TensorBackend abstract base
│       ├── avx2.h             # AVX2 SIMD backend
│       ├── neon.h             # NEON SIMD backend
│       ├── generic.h          # Portable fallback
│       ├── backend_factory.h  # Factory + detect_backend()
│       └── cpu_features.h     # Runtime CPU feature detection
├── src/labneura/              # Implementation
│   ├── tensor.cpp             # Tensor class implementation
│   └── backends/*.cpp         # Backend implementations
├── python/                    # Python bindings
│   ├── labneura_py.cpp        # pybind11 bindings
│   └── setup.py               # Python package config
├── examples/                  # Usage examples
│   ├── main.cpp               # C++ example
│   ├── test_tensor.py         # Python example
│   └── benchmark_numpy.py     # Performance comparison
├── tests/                     # Test suite
├── CMakeLists.txt             # CMake build configuration
└── Makefile                   # Build automation

🧪 Testing

Run Tests

# Python tests
pytest tests/

# Specific test files
pytest tests/test_tensor_ops.py
pytest tests/test_quantization.py

# With coverage
pytest --cov=labneura tests/

Run Examples

# C++ example
./build/labneura_example

# Python examples
python examples/test_tensor.py
python examples/benchmark_numpy.py

๐Ÿ› ๏ธ Development

Build Commands

# Clean build
make clean

# Build Python package
make build

# Install locally
make install

# Run tests
make test

# Build distribution packages
make build-dist

# Publish to TestPyPI
TEST_PYPI_TOKEN=pypi-... make publish-testpypi

# Publish to PyPI
PYPI_TOKEN=pypi-... make publish

Build Flags

# Enable AVX2 (x86_64)
cmake .. -DCMAKE_CXX_FLAGS="-mavx2"

# Enable LLVM coverage
LABNEURA_COVERAGE=1 cmake ..

# Release build with optimizations
cmake .. -DCMAKE_BUILD_TYPE=Release

📊 Implementation Details

SIMD Optimization

AVX2 Backend (x86_64)

  • Register Width: 256 bits
  • FP32: Processes 8 floats per instruction (_mm256_add_ps, _mm256_mul_ps)
  • INT8: Processes 32 bytes per instruction (_mm256_add_epi8)
  • Alignment: Pads arrays to 8-element boundaries

NEON Backend (ARM64/Apple Silicon)

  • Register Width: 128 bits
  • FP32: Processes 4 floats per instruction (vaddq_f32, vmulq_f32; see the sketch below)
  • INT8: Processes 16 bytes per instruction (vaddq_s8)
  • Alignment: Pads arrays to 4-element boundaries
  • Optimization: Single-threaded design maximizes instruction-level parallelism
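
To make the 4-wide vectorization concrete, here is a minimal NEON sketch (illustrative only, not the library's internal code; add_fp32_neon is a hypothetical name). The main loop consumes 4 floats per vaddq_f32, and a scalar tail covers lengths that are not multiples of 4; the library's own padding to 4-element boundaries makes such a tail unnecessary internally. The AVX2 version is analogous with 8-wide _mm256_loadu_ps / _mm256_add_ps / _mm256_storeu_ps.

#include <arm_neon.h>
#include <cstddef>

void add_fp32_neon(float* dst, const float* a, const float* b, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);      // load 4 floats from each input
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(dst + i, vaddq_f32(va, vb));  // one instruction adds all 4 lanes
    }
    for (; i < n; ++i) {                        // scalar tail for unpadded lengths
        dst[i] = a[i] + b[i];
    }
}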

Quantization Strategy

  • INT8 Mode: Stores data as 8-bit signed integers (-128 to 127)
  • Memory Reduction: 4x smaller than FP32 (1 byte vs 4 bytes per element)
  • Saturation Arithmetic: Prevents overflow by clamping to [-128, 127] (sketched after this list)
  • Use Cases: Neural network inference, mobile deployment
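
For reference, a minimal scalar sketch of the saturating INT8 add described above (add_saturating_int8 is an illustrative name); SIMD instruction sets provide the same clamping in a single instruction, e.g. vqaddq_s8 on NEON or _mm256_adds_epi8 on AVX2.

#include <algorithm>
#include <cstdint>

std::int8_t add_saturating_int8(std::int8_t a, std::int8_t b) {
    int sum = static_cast<int>(a) + static_cast<int>(b);  // widen so the sum cannot overflow
    sum = std::clamp(sum, -128, 127);                     // saturate to the INT8 range
    return static_cast<std::int8_t>(sum);
}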

Backend Detection

// Runtime detection in backend_factory.cpp
std::string detect_backend() {
    if (cpu_supports_avx2()) return "AVX2";
    if (cpu_supports_neon()) return "NEON";
    return "GENERIC";
}

CPU feature detection uses:

  • x86_64: __builtin_cpu_supports("avx2")
  • ARM64: Compile-time __ARM_NEON macro (NEON is mandatory on ARMv8); both paths are combined in the sketch below
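
A minimal sketch combining both mechanisms (assuming GCC/Clang builtins; detect_backend_sketch is an illustrative name, not the library's internal helper):

#include <string>

std::string detect_backend_sketch() {
#if defined(__x86_64__)
    if (__builtin_cpu_supports("avx2")) return "AVX2";  // runtime CPUID-based check
#elif defined(__ARM_NEON)
    return "NEON";  // compile-time: NEON is mandatory on ARMv8
#endif
    return "GENERIC";
}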

๐ŸŒ Publishing

Package Names

The package is published on PyPI as labneura2 (pip install labneura2), while the installed Python module is imported as labneura.

CI/CD Pipeline

GitHub Actions workflow automates:

  1. Build distribution packages
  2. Run test suite
  3. Publish to TestPyPI (manual trigger)
  4. Publish to PyPI (on release tags)

๐Ÿ“ License

MIT License - see LICENSE file for details.

๐Ÿค Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit changes (git commit -m 'Add amazing feature')
  4. Push to branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📮 Support

๐Ÿ™ Acknowledgments

Built with:

  • pybind11 - Seamless Python/C++ bindings
  • CMake - Cross-platform build system
  • SIMD intrinsics: Intel AVX2, ARM NEON

Author: LabNeura Contributors
Version: 0.2.6
Python: 3.9+ | C++: 17+ | Platforms: macOS (Apple Silicon & Intel), Linux (x86_64, ARM64)
