Comprehensive document conversion library with batch processing, caching, and template rendering

Document Converter

A comprehensive Python library for document conversion with batch processing, intelligent caching, and template rendering.


✨ Features

🔄 Multi-Format Conversion

Convert between popular document formats:

  • PDF ↔ TXT, DOCX (with OCR support for scanned documents)
  • DOCX ↔ PDF, HTML, Markdown, TXT
  • HTML ↔ PDF, DOCX
  • Markdown ↔ HTML, PDF
  • ODT ↔ Multiple formats
  • TXT ↔ HTML, PDF

⚡ High Performance

  • Two-tier caching: In-memory LRU + persistent disk cache
  • Up to 138x speedup on repeated conversions
  • Parallel batch processing: 50-200 files/second
  • Streaming template rendering for memory efficiency
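
The two-tier caching idea can be sketched roughly as follows. This is a minimal illustration, not the library's CacheManager: the class name TwoTierCache, its methods, and the pickle-based disk format are all assumptions made for the example.

```python
import hashlib
import pickle
from collections import OrderedDict
from pathlib import Path


class TwoTierCache:
    """Hypothetical sketch: a small in-memory LRU in front of a disk cache."""

    def __init__(self, cache_dir, max_memory_items=128):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.max_memory_items = max_memory_items
        self._memory = OrderedDict()  # key order doubles as recency order

    def _disk_path(self, key):
        # Hash the key so that any string maps to a safe file name.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{digest}.pkl"

    def _remember(self, key, value):
        self._memory[key] = value
        self._memory.move_to_end(key)
        while len(self._memory) > self.max_memory_items:
            self._memory.popitem(last=False)  # evict the least recently used

    def get(self, key):
        if key in self._memory:  # tier 1: memory
            self._memory.move_to_end(key)
            return self._memory[key]
        path = self._disk_path(key)
        if path.exists():  # tier 2: disk, then promote to memory
            # pickle is only safe for cache files this process wrote itself.
            value = pickle.loads(path.read_bytes())
            self._remember(key, value)
            return value
        return None

    def set(self, key, value):
        self._remember(key, value)
        self._disk_path(key).write_bytes(pickle.dumps(value))
```

An entry evicted from memory is still found on disk, so a repeated lookup costs a file read rather than a full conversion.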

🛠️ Developer Friendly

  • Clean, extensible API
  • Comprehensive error handling with actionable suggestions
  • Transaction safety with automatic rollback
  • Full CLI with progress bars
  • 79% test coverage with 274+ tests
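
One common way to get rollback behaviour for file output is to write to a temporary file and only move it into place on success. The sketch below shows that pattern under stated assumptions: atomic_output is a hypothetical helper, not part of the library, and the library's own transaction module may work differently.

```python
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def atomic_output(destination):
    """Hypothetical helper: yield a temp path and commit it only on success."""
    destination = Path(destination)
    # Create the temp file next to the destination so the final rename
    # stays on the same filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=destination.parent, suffix=".tmp")
    os.close(fd)
    tmp_path = Path(tmp_name)
    try:
        yield tmp_path
        os.replace(tmp_path, destination)  # commit: swap the new file in
    except BaseException:
        tmp_path.unlink(missing_ok=True)  # roll back: discard partial output
        raise
```

If the body of the with block raises, the destination file is left exactly as it was before the conversion started.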

📦 Standalone Executable

  • Interactive mode: Double-click and use menu-driven interface
  • CLI mode: Full command-line support
  • Drag & Drop: Drop multiple files onto the .exe to convert them all at once
  • No Python installation required for end users

📋 Requirements

  • Python 3.9+
  • See requirements.txt for dependencies

🚀 Installation

From Source

# Clone the repository
git clone https://github.com/MikeAMSDev/document-converter
cd document-converter

# Create virtual environment
python -m venv venv
source venv/bin/activate    # Linux/macOS
# venv\Scripts\activate     # Windows: run this line instead of the one above

# Install dependencies
pip install -r requirements.txt

Verify Installation

python -c "from converter.engine import ConversionEngine; print('✓ Installation successful!')"

🎯 Quick Start

Basic Conversion

from converter.engine import ConversionEngine
from converter.formats.pdf_converter import PDFConverter

# Setup
engine = ConversionEngine()
engine.register_converter('pdf', PDFConverter)

# Convert
engine.convert('document.pdf', 'document.txt')

Batch Processing

from converter.batch_processor import BatchProcessor

processor = BatchProcessor(max_workers=8)
processor.scan_directory('./documents', './output', from_format='docx', to_format='pdf')
report = processor.process_queue()

print(f"Converted {report.success} files")
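
For readers curious how a worker pool can produce a report like this, here is a minimal sketch built on the standard library. It is not BatchProcessor itself: run_batch, BatchReport, and the convert_one callback are hypothetical names, and a thread pool is assumed (a process pool may suit CPU-bound conversions better).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass, field


@dataclass
class BatchReport:
    """Hypothetical report type; the library's real report may differ."""
    success: int = 0
    failed: dict = field(default_factory=dict)  # path -> error message


def run_batch(paths, convert_one, max_workers=8):
    """Apply convert_one to every path in parallel and tally the results."""
    report = BatchReport()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(convert_one, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                future.result()
                report.success += 1
            except Exception as exc:  # one bad file should not stop the batch
                report.failed[str(path)] = str(exc)
    return report
```

Collecting failures per file, instead of raising on the first error, lets the rest of the queue finish.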

With Caching (Up to 138x Faster)

from converter.engine import ConversionEngine
from core.cache_manager import CacheManager

cache = CacheManager(cache_dir=".cache")
engine = ConversionEngine(cache_manager=cache)

# First conversion: normal speed
engine.convert('large.pdf', 'large.txt')

# Second conversion: instant (from cache)
engine.convert('large.pdf', 'large_copy.txt')
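
The second call above writes to a different output path yet is served from the cache, which suggests the cache is keyed on the input rather than the output. One plausible scheme, shown here purely as an assumption, is to hash the input file's contents together with the target format. The function name cache_key is hypothetical.

```python
import hashlib


def cache_key(input_path, target_format, chunk_size=1 << 20):
    """Hypothetical key: SHA-256 of the file contents plus the target format."""
    digest = hashlib.sha256()
    with open(input_path, "rb") as handle:
        # Read in chunks so large documents are never loaded whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    # A separator byte keeps the format from blending into the content.
    digest.update(b"\0" + target_format.lower().encode("utf-8"))
    return digest.hexdigest()
```

With a key like this, renaming or copying the input still hits the cache, while editing the file or asking for a different target format does not.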

Template Rendering

from converter.template_engine import TemplateEngine

engine = TemplateEngine()
template = "Hello {{ name }}! {% for item in items %}{{ item }} {% endfor %}"
result = engine.render(template, {"name": "World", "items": ["A", "B", "C"]})
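
The features list mentions streaming rendering for memory efficiency. The general technique can be sketched with a generator that yields one rendered chunk per item. This is not TemplateEngine: render_stream is a hypothetical function, and it uses the standard library's string.Template ($item placeholders) rather than the {{ }} syntax shown above.

```python
from string import Template


def render_stream(item_template, items, header="", footer=""):
    """Yield rendered chunks one at a time instead of one large string."""
    compiled = Template(item_template)
    if header:
        yield header
    for item in items:  # items may itself be a lazy iterator
        yield compiled.substitute(item=item)
    if footer:
        yield footer


# Chunks can be joined for small inputs or written out as they arrive.
text = "".join(render_stream("$item ", ["A", "B", "C"], header="Hello World! "))
print(text)  # Hello World! A B C
```

For large inputs, passing the generator to a file object's writelines() keeps only one chunk in memory at a time.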

💻 CLI Usage

Single File Conversion

# Standard conversion
python -m cli.main convert input.pdf output.txt

# With options
python -m cli.main convert input.pdf --output output.txt --ocr

Drag & Drop Multiple Files (Windows)

# Drop files onto document-converter.exe, or run:
document-converter.exe file1.docx file2.pdf file3.txt --format pdf

# Result: Converts all to PDF in the same directory
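
To make the "same directory" behaviour concrete, the sketch below maps each input file to an output path that differs only in its extension. It covers the command-line form only and is an illustration under assumptions: plan_outputs is a hypothetical function, and skipping files that already have the target extension is a guess, not documented behaviour.

```python
import argparse
from pathlib import Path


def plan_outputs(argv):
    """Map each input file to an output path with the requested extension."""
    parser = argparse.ArgumentParser(prog="document-converter")
    parser.add_argument("files", nargs="+", type=Path)
    parser.add_argument("--format", required=True, dest="target_format")
    args = parser.parse_args(argv)

    suffix = "." + args.target_format.lower().lstrip(".")
    # with_suffix() keeps the directory and base name, so every output
    # lands next to its input. Assumption: files already in the target
    # format are skipped.
    return {
        src: src.with_suffix(suffix)
        for src in args.files
        if src.suffix.lower() != suffix
    }
```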

Batch Processing

python -m cli.main batch ./documents ./output --from-format docx --to-format pdf --workers 8

Cache Management

# View cache stats
python -m cli.main cache-stats

# Clear cache
python -m cli.main cache-clear

Standalone Executable

Download document-converter.exe from the dist/ folder:

# Interactive mode (double-click or run without arguments)
document-converter.exe

# CLI mode
document-converter.exe convert input.pdf output.txt

📚 Documentation

  • User Guide: Step-by-step tutorials and common use cases
  • API Reference: Complete API documentation
  • Developer Guide: Contributing to and extending the library
  • Examples: Ready-to-run example scripts
  • Changelog: Version history and changes

📁 Project Structure

document-converter/
├── converter/          # Core conversion logic
│   ├── engine.py       # Main conversion engine
│   ├── batch_processor.py
│   ├── template_engine.py
│   ├── formats/        # Format-specific converters
│   └── processors/     # OCR, images, styles
├── core/               # Core utilities
│   ├── cache_manager.py
│   ├── error_handler.py
│   ├── transaction.py
│   └── worker_pool.py
├── cli/                # Command-line interface
├── utils/              # Helper utilities
├── docs/               # Documentation
├── examples/           # Example scripts
├── tests/              # Test suite
└── dist/               # Standalone executable

🧪 Testing

# Run all tests
pytest

# With coverage
pytest --cov=converter --cov=core --cov-report=html

# Run specific test types
pytest -m unit
pytest -m integration

Current Coverage: 79% (274+ tests)


🤝 Contributing

Contributions are welcome! Please read our Developer Guide for:

  • Development setup
  • Code style guidelines
  • Testing requirements
  • How to add new format converters
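
As a rough idea of what adding a format converter might involve, the sketch below registers a toy converter with a stand-in engine. Everything here is hypothetical: MiniEngine only mimics the register_converter and convert calls from the Quick Start, and the single-method converter interface is an assumption. The Developer Guide describes the real base class and requirements.

```python
from pathlib import Path


class MiniEngine:
    """Stand-in for ConversionEngine, just enough to show registration."""

    def __init__(self):
        self._converters = {}

    def register_converter(self, source_format, converter_cls):
        self._converters[source_format.lower()] = converter_cls

    def convert(self, input_path, output_path):
        # Pick the converter from the input file's extension.
        source_format = Path(input_path).suffix.lstrip(".").lower()
        if source_format not in self._converters:
            raise ValueError(f"No converter registered for '{source_format}'")
        converter = self._converters[source_format]()
        return converter.convert(Path(input_path), Path(output_path))


class UpperTxtConverter:
    """Toy converter (hypothetical interface): upper-cases a text file."""

    def convert(self, input_path, output_path):
        text = input_path.read_text(encoding="utf-8")
        output_path.write_text(text.upper(), encoding="utf-8")
        return output_path


engine = MiniEngine()
engine.register_converter("txt", UpperTxtConverter)
```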

Quick Start for Contributors

# Fork the repository on GitHub, then clone it (substitute your fork's URL)
git clone https://github.com/MikeAMSDev/document-converter
cd document-converter

# Install dev dependencies
pip install -r requirements-dev.txt

# Create feature branch
git checkout -b feat/my-feature

# Make changes and test
pytest

# Submit pull request

📊 Performance Benchmarks

  • Cache speedup: up to 138x faster
  • Batch throughput: 50-200 files/sec
  • Memory cache lookup: <1 ms
  • Disk cache lookup: <100 ms
  • Template rendering (100K items): <5 seconds
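
Figures like the cache speedup depend on hardware and document size, so it can be useful to measure them locally. The helper below is one simple way to do that, assuming you can supply two callables: one that performs an uncached conversion and one that hits the cache. The name measure_speedup is hypothetical and is not part of the library.

```python
import time


def measure_speedup(cold_call, warm_call, repeats=5):
    """Return cold_time / warm_time, using the best of several timed runs."""

    def best_time(func):
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            func()
            timings.append(time.perf_counter() - start)
        return min(timings)  # the minimum is least affected by system noise

    cold = best_time(cold_call)
    warm = best_time(warm_call)
    return cold / warm if warm > 0 else float("inf")
```

The cold callable must bypass or clear the cache on every run, otherwise its later repeats are warm as well and the ratio is understated.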

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


🙏 Acknowledgments

  • Built with Python 3.13
  • PDF processing: PyPDF2, ReportLab
  • DOCX handling: python-docx
  • OCR: Tesseract via pytesseract
  • CLI: Click

Made with ❤️ by MikeAMSDev

⭐ Star this repo if you find it useful!

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

document_converter-1.2.0.tar.gz (226.8 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

document_converter-1.2.0-py3-none-any.whl (79.0 kB)

Uploaded Python 3

File details

Details for the file document_converter-1.2.0.tar.gz.

File metadata

  • Download URL: document_converter-1.2.0.tar.gz
  • Upload date:
  • Size: 226.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for document_converter-1.2.0.tar.gz
  • SHA256: f3a27a5a847eecdee43c5a21737b4c6e16ea11064594441dc51257609f95ea11
  • MD5: f3bad24270971365162a25c1ae20a83f
  • BLAKE2b-256: ab06fb4cf3b0626e15fd16ecc6f9a69984041d6a187b4dac9189e8f2a485d89f

File details

Details for the file document_converter-1.2.0-py3-none-any.whl.

File hashes

Hashes for document_converter-1.2.0-py3-none-any.whl
  • SHA256: cef345bbc9a14c876162ba17bc99b9e346b67fa02dd0c412927a5b29839cdf4e
  • MD5: 90bce745e68689e5dc4bd30fa586ca68
  • BLAKE2b-256: 4fcd3090a3998586cf96d48e6e9eb13d5ee63c128958fe44d7b8ad0e60cc496a
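
A downloaded file can be checked against the SHA256 digests listed above with a few lines of standard-library Python. The function names below are illustrative only.

```python
import hashlib
import hmac


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's digest with a published one, ignoring case and spaces."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())
```

For example, matches_published_hash("document_converter-1.2.0.tar.gz", "<SHA256 value from the list>") should return True for an intact download.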
