LabNeura
SIMD-accelerated tensor operations for neural networks
High-performance C++ tensor library with Python bindings, featuring hardware-accelerated SIMD operations optimized for modern CPUs. Supports AVX2 (x86_64), NEON (ARM64/Apple Silicon), and provides mixed-precision computation with optional INT8 quantization.
Features
Hardware Acceleration
- Multi-backend Architecture: Runtime detection and selection of optimal SIMD backend
- AVX2 Backend: 8-wide FP32 or 32-wide INT8 vectorization on Intel/AMD x86_64 CPUs
- NEON Backend: 4-wide FP32 or 16-wide INT8 vectorization on Apple Silicon (M1/M2/M3/M4) and ARM64
- Generic Backend: Portable fallback for all architectures
- Automatic Backend Selection: Detects CPU capabilities at runtime using detect_backend()
- Compile-time Optimization: CMake detects hardware and enables appropriate compiler flags
Tensor Operations
- Mixed-Precision Support: FP32 (32-bit float), INT8 (8-bit quantized), with future support for FP64, FP16, INT32, INT16
- In-place Operations: add_inplace(), mul_inplace(), sub_inplace() for memory efficiency
- Immutable Operations: add() returns new tensors without modifying the originals
- Quantization: Built-in INT8 quantization with 4x memory reduction
Python Integration
- Easy Installation: pip install labneura2 (PyPI package name; import as labneura)
- pybind11 Bindings: Zero-copy data access between C++ and Python
- NumPy Compatible: Seamless integration with NumPy workflows (see the sketch after this list)
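A minimal sketch of round-tripping data through NumPy, assuming data_fp32() returns a Python sequence as it does in the Quick Start below (the zero-copy path itself is not shown here):

import numpy as np
import labneura

# Build a tensor from a NumPy array's values.
arr = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
t = labneura.Tensor(arr.tolist(), labneura.QuantizationMode.FP32)
t.add_inplace(labneura.Tensor([0.5] * 4, labneura.QuantizationMode.FP32))

# Convert the result back into a NumPy array for downstream work.
result = np.asarray(t.data_fp32(), dtype=np.float32)
print(result)  # [1.5 2.5 3.5 4.5]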
Installation
Python Package (Recommended)
pip install labneura2
Note: The PyPI package is named labneura2, but you import it as labneura:
import labneura
Build from Source
Prerequisites
- CMake >= 3.14
- C++17 compatible compiler (Clang, GCC, MSVC)
- Python 3.9+ (for Python bindings)
- pybind11 >= 3.0.1
C++ Library
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release
cmake --install .
Python Bindings
cd python
pip install .
For development with coverage:
LABNEURA_COVERAGE=1 pip install -e .
Quick Start
Python Usage
import labneura
# Detect available backend
print(f"Using backend: {labneura.detect_backend()}") # "AVX2", "NEON", or "GENERIC"
# FP32 operations
t1 = labneura.Tensor([1.0, 2.0, 3.0, 4.0], labneura.QuantizationMode.FP32)
t2 = labneura.Tensor([0.5, 0.5, 0.5, 0.5], labneura.QuantizationMode.FP32)
# In-place addition (modifies t1)
t1.add_inplace(t2)
print(t1.data_fp32()) # [1.5, 2.5, 3.5, 4.5]
# Non-destructive addition (returns new tensor)
t3 = t1.add(t2)
# Other operations
t1.mul_inplace(t2) # Element-wise multiplication
t1.sub_inplace(t2) # Element-wise subtraction
# INT8 quantization for memory efficiency
t_int8 = labneura.Tensor([10, 20, 30, 40], labneura.QuantizationMode.INT8)
print(t_int8.data_int8()) # [10, 20, 30, 40]
C++ Usage
#include "labneura/tensor.h"
#include "labneura/backends/backend_factory.h"
#include <iostream>
int main() {
// Detect backend at runtime
std::cout << "Backend: " << labneura::detect_backend() << std::endl;
// Create tensors
std::vector<float> data1 = {1.0f, 2.0f, 3.0f, 4.0f};
std::vector<float> data2 = {0.5f, 0.5f, 0.5f, 0.5f};
labneura::Tensor t1(data1, labneura::QuantizationMode::FP32);
labneura::Tensor t2(data2, labneura::QuantizationMode::FP32);
// In-place operations
t1.add_inplace(t2);
// Access data
const float* result = t1.data_fp32();
for (size_t i = 0; i < t1.size(); ++i) {
std::cout << result[i] << " ";
}
return 0;
}
Compile (the -mavx2 flag applies to x86_64 builds; on ARM64, NEON support is enabled by default):
g++ -std=c++17 -O3 -mavx2 main.cpp -I/path/to/labneura/include -L/path/to/labneura/lib -llabneura
Performance
LabNeura delivers SIMD-accelerated performance that significantly outperforms NumPy and is competitive with PyTorch, thanks to hardware-specific optimizations tuned for quantized inference on edge devices.
Key Results (1M Elements, Apple Silicon)
| Operation | LabNeura | vs NumPy | vs PyTorch |
|---|---|---|---|
| INT16 ADD | 67.98 μs | 6.1x faster | 2.9x faster |
| INT16 MUL | 71.49 μs | 5.8x faster | 1.96x faster |
| FP16 ADD | 67.94 μs | 60.4x faster | 1.18x slower* |
| FP16 MUL | 67.94 μs | 134.8x faster | 1.23x slower* |
*PyTorch benefits from GPU pipeline optimization for large batches; LabNeura is superior for CPU-only deployment.
Complete Benchmarks
View Full Benchmark Report
- INT16 & FP16 operations vs NumPy and PyTorch
- Multiple tensor sizes (1K to 1M elements)
- Comprehensive analysis and optimization recommendations
- Quick Reference Table
- Executive Summary
Run Benchmarks Locally
# INT16/FP16 comparison
python benchmarking/benchmark_int16_fp16.py
# View results
cat benchmarking/benchmark_int16_fp16_results.json
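For a quick ad-hoc comparison outside the bundled scripts, a minimal timing sketch (the size and repetition count here are arbitrary choices, not the project's benchmark methodology):

import time
import numpy as np
import labneura

N, REPS = 1_000_000, 100
a = labneura.Tensor([1.0] * N, labneura.QuantizationMode.FP32)
b = labneura.Tensor([0.5] * N, labneura.QuantizationMode.FP32)
x = np.full(N, 1.0, dtype=np.float32)
y = np.full(N, 0.5, dtype=np.float32)

# Time labneura's in-place SIMD add.
start = time.perf_counter()
for _ in range(REPS):
    a.add_inplace(b)
lab_us = (time.perf_counter() - start) / REPS * 1e6

# Time the equivalent NumPy in-place add.
start = time.perf_counter()
for _ in range(REPS):
    x += y
np_us = (time.perf_counter() - start) / REPS * 1e6

print(f"labneura add_inplace: {lab_us:.1f} us | numpy +=: {np_us:.1f} us")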
Performance Characteristics
- NEON (ARM64): Processes 4 FP32 or 16 INT8 elements per instruction
- AVX2 (x86_64): Processes 8 FP32 or 32 INT8 elements per instruction (lane arithmetic shown after this list)
- Memory Efficiency: INT8 mode uses 4x less memory, improving cache utilization
- Zero Threading Overhead: Single-threaded SIMD maximizes per-core performance
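The per-instruction counts above follow directly from register width divided by element width; a quick check:

# Elements per SIMD instruction = register bits // element bits.
for backend, reg_bits in [("AVX2", 256), ("NEON", 128)]:
    for dtype, elem_bits in [("FP32", 32), ("INT8", 8)]:
        print(f"{backend} {dtype}: {reg_bits // elem_bits} elements/instruction")
# AVX2 FP32: 8, AVX2 INT8: 32, NEON FP32: 4, NEON INT8: 16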
Architecture
Backend Selection Hierarchy
Runtime Detection → Best Available Backend
1. Check AVX2 support (x86_64) → AVX2Backend
2. Check NEON support (ARM64) → NEONBackend
3. Fallback → GenericBackend
Directory Structure
LabNeura/
├── include/labneura/         # Public C++ headers
│   ├── tensor.h              # Main Tensor API
│   ├── types.hpp             # Type definitions
│   └── backends/             # Backend interfaces
│       ├── base.h            # TensorBackend abstract base
│       ├── avx2.h            # AVX2 SIMD backend
│       ├── neon.h            # NEON SIMD backend
│       ├── generic.h         # Portable fallback
│       ├── backend_factory.h # Factory + detect_backend()
│       └── cpu_features.h    # Runtime CPU feature detection
├── src/labneura/             # Implementation
│   ├── tensor.cpp            # Tensor class implementation
│   └── backends/*.cpp        # Backend implementations
├── python/                   # Python bindings
│   ├── labneura_py.cpp       # pybind11 bindings
│   └── setup.py              # Python package config
├── examples/                 # Usage examples
│   ├── main.cpp              # C++ example
│   ├── test_tensor.py        # Python example
│   └── benchmark_numpy.py    # Performance comparison
├── tests/                    # Test suite
├── CMakeLists.txt            # CMake build configuration
└── Makefile                  # Build automation
Testing
Run Tests
# Python tests
pytest tests/
# Specific test files
pytest tests/test_tensor_ops.py
pytest tests/test_quantization.py
# With coverage
pytest --cov=labneura tests/
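As a hedged illustration of what a minimal test could look like against the public API (the file and test names here are invented for the example, not the project's actual suite):

# test_add_sketch.py -- illustrative only
import labneura

def test_add_inplace_fp32():
    t1 = labneura.Tensor([1.0, 2.0], labneura.QuantizationMode.FP32)
    t2 = labneura.Tensor([0.5, 0.5], labneura.QuantizationMode.FP32)
    t1.add_inplace(t2)
    assert list(t1.data_fp32()) == [1.5, 2.5]

def test_backend_is_known():
    assert labneura.detect_backend() in {"AVX2", "NEON", "GENERIC"}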
Run Examples
# C++ example
./build/labneura_example
# Python examples
python examples/test_tensor.py
python examples/benchmark_numpy.py
Development
Build Commands
# Clean build
make clean
# Build Python package
make build
# Install locally
make install
# Run tests
make test
# Build distribution packages
make build-dist
# Publish to TestPyPI
TEST_PYPI_TOKEN=pypi-... make publish-testpypi
# Publish to PyPI
PYPI_TOKEN=pypi-... make publish
Build Flags
# Enable AVX2 (x86_64)
cmake .. -DCMAKE_CXX_FLAGS="-mavx2"
# Enable LLVM coverage
LABNEURA_COVERAGE=1 cmake ..
# Release build with optimizations
cmake .. -DCMAKE_BUILD_TYPE=Release
Implementation Details
SIMD Optimization
AVX2 Backend (x86_64)
- Register Width: 256 bits
- FP32: Processes 8 floats per instruction (_mm256_add_ps, _mm256_mul_ps)
- INT8: Processes 32 bytes per instruction (_mm256_add_epi8)
- Alignment: Pads arrays to 8-element boundaries
NEON Backend (ARM64/Apple Silicon)
- Register Width: 128 bits
- FP32: Processes 4 floats per instruction (vaddq_f32, vmulq_f32)
- INT8: Processes 16 bytes per instruction (vaddq_s8)
- Alignment: Pads arrays to 4-element boundaries (see the padding sketch after this list)
- Optimization: Single-threaded design maximizes instruction-level parallelism
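A conceptual sketch of the boundary padding both backends describe (plain Python for illustration; not the library's internal code):

def pad_to_lane_boundary(values, lanes):
    """Pad to a multiple of the SIMD lane count so every vector load is full."""
    remainder = len(values) % lanes
    if remainder:
        values = values + [0.0] * (lanes - remainder)
    return values

# 6 FP32 elements padded for NEON (4 lanes) -> length 8 = 2 full vectors;
# for AVX2 FP32 (8 lanes) the same data would pad to one full 8-lane vector.
print(pad_to_lane_boundary([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], lanes=4))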
Quantization Strategy
- INT8 Mode: Stores data as 8-bit signed integers (-128 to 127)
- Memory Reduction: 4x smaller than FP32 (1 byte vs 4 bytes per element)
- Saturation Arithmetic: Prevents overflow by clamping to [-128, 127] (sketched after this list)
- Use Cases: Neural network inference, mobile deployment
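A minimal sketch of the clamping step behind saturation arithmetic (plain Python; the library's scale/zero-point scheme, if any, is not documented here):

def saturate_int8(x):
    # Clamp into the signed 8-bit range instead of wrapping on overflow.
    return max(-128, min(127, int(round(x))))

print(saturate_int8(300))   # 127
print(saturate_int8(-200))  # -128
print(saturate_int8(42))    # 42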
Backend Detection
// Runtime detection in backend_factory.cpp
std::string detect_backend() {
if (cpu_supports_avx2()) return "AVX2";
if (cpu_supports_neon()) return "NEON";
return "GENERIC";
}
CPU feature detection uses:
- x86_64: __builtin_cpu_supports("avx2")
- ARM64: Compile-time __ARM_NEON macro (NEON is mandatory on ARMv8)
Publishing
Package Names
- PyPI Name: labneura2 (install with pip install labneura2)
- Python Module: labneura (import with import labneura)
- GitHub: https://github.com/gokatharun/LabNeura
CI/CD Pipeline
GitHub Actions workflow automates:
- Build distribution packages
- Run test suite
- Publish to TestPyPI (manual trigger)
- Publish to PyPI (on release tags)
License
MIT License - see LICENSE file for details.
Contributing
Contributions welcome! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit changes (git commit -m 'Add amazing feature')
- Push to branch (git push origin feature/amazing-feature)
- Open a Pull Request
Support
- Issues: https://github.com/gokatharun/LabNeura/issues
- Discussions: https://github.com/gokatharun/LabNeura/discussions
Acknowledgments
Built with:
- pybind11 - Seamless Python/C++ bindings
- CMake - Cross-platform build system
- SIMD intrinsics: Intel AVX2, ARM NEON
Author: LabNeura Contributors
Version: 0.2.6
Python: 3.9+ | C++: 17+ | Platforms: macOS (Apple Silicon & Intel), Linux (x86_64, ARM64)
Download files
Source Distribution
Details for the file labneura2-0.2.6.tar.gz.
File metadata
- Download URL: labneura2-0.2.6.tar.gz
- Size: 42.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6c6df13a530e583fed3d509da97ce3cc78805121ec16f04a37ff4d101c60f494 |
| MD5 | fa936619e6a4ff6165cce2203a0de8a9 |
| BLAKE2b-256 | 2ac3857d25b195d36ff5363b24074b4a2d5560d99d16cac3234b4d65b9f1441d |
Built Distribution
Details for the file labneura2-0.2.6-cp39-cp39-macosx_10_9_universal2.whl.
File metadata
- Download URL: labneura2-0.2.6-cp39-cp39-macosx_10_9_universal2.whl
- Size: 884.6 kB
- Tags: CPython 3.9, macOS 10.9+ universal2 (ARM64, x86-64)
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7aa52b6a2a8f74d02b23063efef1e9d0d8e388dcb3be2d7724664ce3980cca77 |
| MD5 | b48b8e44f76ba7a99474b21caa0b21e3 |
| BLAKE2b-256 | dc5a6dc73f9b51b7045efd6bd926675034054d78a10a9e3700121fdf9f5d948a |