Comprehensive document conversion library with batch processing, caching, and template rendering
Document Converter
A comprehensive Python library for document conversion with batch processing, intelligent caching, and template rendering.
Features • Installation • Quick Start • Documentation • Contributing
✨ Features
🔄 Multi-Format Conversion
Convert between popular document formats:
- PDF ↔ TXT, DOCX (with OCR support for scanned documents)
- DOCX ↔ PDF, HTML, Markdown, TXT
- HTML ↔ PDF, DOCX
- Markdown ↔ HTML, PDF
- ODT ↔ Multiple formats
- TXT ↔ HTML, PDF
⚡ High Performance
- Two-tier caching: In-memory LRU + persistent disk cache
- Up to 138x speedup on repeated conversions
- Parallel batch processing: 50-200 files/second
- Streaming template rendering for memory efficiency
🛠️ Developer Friendly
- Clean, extensible API
- Comprehensive error handling with actionable suggestions
- Transaction safety with automatic rollback
- Full CLI with progress bars
- 79% test coverage with 274+ tests
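One common way to get "transaction safety with automatic rollback" for file output is to write to a temporary file and publish it only on success. The sketch below illustrates that idea with the standard library only; `atomic_write` is a hypothetical helper, not the API of `core/transaction.py`, whose actual design this README does not show.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def atomic_write(target_path):
    """Yield a temp file handle; publish it on success, discard it on failure."""
    directory = os.path.dirname(os.path.abspath(target_path))
    fd, temp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            yield handle
        # Rename happens only after the file was written and closed.
        os.replace(temp_path, target_path)
    except BaseException:
        # Roll back: drop the partial output, leave any existing target untouched.
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise


workdir = tempfile.mkdtemp()
output_path = os.path.join(workdir, "report.txt")

with atomic_write(output_path) as out:
    out.write("converted text")

try:
    with atomic_write(output_path) as out:
        out.write("partial")
        raise RuntimeError("simulated conversion failure")
except RuntimeError:
    pass

with open(output_path, encoding="utf-8") as handle:
    final_text = handle.read()
```

After the simulated failure, `final_text` still holds the output of the first, successful write, and no temporary file is left behind.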
📦 Standalone Executable
- Interactive mode: Double-click and use menu-driven interface
- CLI mode: Full command-line support
- Drag & Drop: Drop multiple files onto the .exe to convert them all at once
- No Python installation required for end users
📋 Requirements
- Python 3.9+
- See requirements.txt for dependencies
🚀 Installation
From Source
# Clone the repository
git clone https://github.com/MikeAMSDev/document-converter
cd document-converter
# Create virtual environment
python -m venv venv
source venv/bin/activate # Linux/Mac
venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
Verify Installation
python -c "from converter.engine import ConversionEngine; print('✓ Installation successful!')"
🎯 Quick Start
Basic Conversion
from converter.engine import ConversionEngine
from converter.formats.pdf_converter import PDFConverter
# Setup
engine = ConversionEngine()
engine.register_converter('pdf', PDFConverter)
# Convert
engine.convert('document.pdf', 'document.txt')
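The `register_converter('pdf', PDFConverter)` call suggests a registry keyed by source format. The following is a minimal sketch of how such a lookup could work, assuming the format is taken from the input file extension; `SketchEngine` and `UppercaseTextConverter` are hypothetical names for illustration and are not the library's classes.

```python
from pathlib import Path


class SketchEngine:
    """Hypothetical stand-in for a conversion engine (not ConversionEngine)."""

    def __init__(self):
        self._converters = {}

    def register_converter(self, source_format, converter_class):
        # Store the class under a normalized, lowercase format name.
        self._converters[source_format.lower()] = converter_class

    def resolve(self, input_path):
        # Assumption: the source format is derived from the file extension.
        source_format = Path(input_path).suffix.lstrip(".").lower()
        try:
            return self._converters[source_format]
        except KeyError:
            raise ValueError(
                f"No converter registered for '.{source_format}' files"
            ) from None


class UppercaseTextConverter:
    """Toy converter used only to exercise the registry."""

    def convert(self, text):
        return text.upper()


engine = SketchEngine()
engine.register_converter("txt", UppercaseTextConverter)

converter = engine.resolve("notes.TXT")()
result = converter.convert("hello")
```

Registering classes rather than instances keeps converter construction lazy, so a converter is only created for formats that are actually used.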
Batch Processing
from converter.batch_processor import BatchProcessor
processor = BatchProcessor(max_workers=8)
processor.scan_directory('./documents', './output', from_format='docx', to_format='pdf')
report = processor.process_queue()
print(f"Converted {report.success} files")
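To show what a parallel batch run with a summary report might look like internally, here is a rough sketch built on `concurrent.futures`. `run_batch`, `BatchReport`, and `fake_convert` are hypothetical; only the `max_workers` parameter and the `success` attribute mirror names used in the example above, and the real `BatchProcessor` may work differently.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass, field


@dataclass
class BatchReport:
    """Hypothetical summary of a batch run."""
    success: int = 0
    failed: int = 0
    errors: dict = field(default_factory=dict)


def run_batch(paths, convert_one, max_workers=8):
    """Convert every path in parallel and collect per-file outcomes."""
    report = BatchReport()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(convert_one, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                future.result()
                report.success += 1
            except Exception as exc:
                # One failing file does not abort the rest of the batch.
                report.failed += 1
                report.errors[path] = str(exc)
    return report


def fake_convert(path):
    """Placeholder conversion that rejects files ending in .bad."""
    if path.endswith(".bad"):
        raise ValueError("unsupported input")
    return path + ".pdf"


report = run_batch(["a.docx", "b.docx", "c.bad"], fake_convert, max_workers=2)
print(f"Converted {report.success} files, {report.failed} failed")
```

Threads suit conversions that spend most of their time in I/O or native libraries; CPU-bound pure-Python work would usually call for a process pool instead.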
With Caching (up to 138x faster on repeated conversions)
from converter.engine import ConversionEngine
from core.cache_manager import CacheManager
cache = CacheManager(cache_dir=".cache")
engine = ConversionEngine(cache_manager=cache)
# First conversion: normal speed
engine.convert('large.pdf', 'large.txt')
# Second conversion: instant (from cache)
engine.convert('large.pdf', 'large_copy.txt')
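The second call above can be served from cache even though the output path differs, which would be the case if the cache key depends on the source content and target format rather than on file names. The sketch below illustrates a two-tier cache (in-memory LRU in front of a disk store) under that assumption; `TwoTierCache` and its methods are hypothetical and do not describe `CacheManager`'s actual interface.

```python
import hashlib
import os
import tempfile
from collections import OrderedDict


class TwoTierCache:
    """Hypothetical in-memory LRU cache backed by files on disk."""

    def __init__(self, cache_dir, max_memory_items=128):
        self.cache_dir = cache_dir
        self.max_memory_items = max_memory_items
        self._memory = OrderedDict()
        os.makedirs(cache_dir, exist_ok=True)

    @staticmethod
    def make_key(source_bytes, target_format):
        # Assumption: key = hash of target format plus source content.
        payload = target_format.encode("utf-8") + b"\0" + source_bytes
        return hashlib.sha256(payload).hexdigest()

    def _disk_path(self, key):
        return os.path.join(self.cache_dir, key + ".cache")

    def _remember(self, key, value):
        # Insert as most recently used, then evict the oldest entries.
        self._memory[key] = value
        self._memory.move_to_end(key)
        while len(self._memory) > self.max_memory_items:
            self._memory.popitem(last=False)

    def get(self, key):
        if key in self._memory:
            self._memory.move_to_end(key)
            return self._memory[key]
        path = self._disk_path(key)
        if os.path.exists(path):
            with open(path, "rb") as handle:
                value = handle.read()
            self._remember(key, value)  # promote to the memory tier
            return value
        return None

    def put(self, key, value):
        with open(self._disk_path(key), "wb") as handle:
            handle.write(value)
        self._remember(key, value)


cache = TwoTierCache(tempfile.mkdtemp(), max_memory_items=1)
key_a = cache.make_key(b"first document", "txt")
key_b = cache.make_key(b"second document", "txt")
cache.put(key_a, b"converted A")
cache.put(key_b, b"converted B")  # evicts key_a from memory, not from disk
restored = cache.get(key_a)       # falls through to the disk tier
```

With `max_memory_items=1`, the lookup for `key_a` misses in memory and is answered from disk, which is the slower of the two cache paths listed in the benchmarks.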
Template Rendering
from converter.template_engine import TemplateEngine
engine = TemplateEngine()
template = "Hello {{ name }}! {% for item in items %}{{ item }} {% endfor %}"
result = engine.render(template, {"name": "World", "items": ["A", "B", "C"]})
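The template above uses Jinja-style `{{ variable }}` and `{% for %}` syntax. To make the expected expansion concrete, and to illustrate how output can be produced in chunks instead of as one large string, here is a deliberately tiny standard-library sketch. `render_stream` is hypothetical, handles only flat variables and non-nested loops, and is not the library's `TemplateEngine`.

```python
import re

FOR_BLOCK = re.compile(
    r"\{%\s*for\s+(\w+)\s+in\s+(\w+)\s*%\}(.*?)\{%\s*endfor\s*%\}",
    re.DOTALL,
)
VARIABLE = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def substitute(text, context):
    """Replace every {{ name }} with the matching context value."""
    return VARIABLE.sub(lambda match: str(context[match.group(1)]), text)


def render_stream(template, context):
    """Yield rendered chunks one at a time (non-nested loops only)."""
    position = 0
    for match in FOR_BLOCK.finditer(template):
        # Text before the loop block.
        yield substitute(template[position:match.start()], context)
        loop_var, iterable_name, body = match.groups()
        for item in context[iterable_name]:
            # One chunk per loop iteration keeps peak memory small.
            yield substitute(body, {**context, loop_var: item})
        position = match.end()
    # Text after the last loop block.
    yield substitute(template[position:], context)


template = "Hello {{ name }}! {% for item in items %}{{ item }} {% endfor %}"
context = {"name": "World", "items": ["A", "B", "C"]}
rendered = "".join(render_stream(template, context))
print(repr(rendered))  # 'Hello World! A B C '
```

Because `render_stream` is a generator, a caller could write each chunk straight to a file instead of joining them, which is the general idea behind streaming rendering of very large item lists.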
💻 CLI Usage
Single File Conversion
# Standard conversion
python -m cli.main convert input.pdf output.txt
# With options
python -m cli.main convert input.pdf --output output.txt --ocr
Drag & Drop Multiple Files (Windows)
# Drop files onto document-converter.exe, or run:
document-converter.exe file1.docx file2.pdf file3.txt --format pdf
# Result: Converts all to PDF in the same directory
Batch Processing
python -m cli.main batch ./documents ./output --from-format docx --to-format pdf --workers 8
Cache Management
# View cache stats
python -m cli.main cache-stats
# Clear cache
python -m cli.main cache-clear
Standalone Executable
Download document-converter.exe from the dist/ folder:
# Interactive mode (double-click or run without arguments)
document-converter.exe
# CLI mode
document-converter.exe convert input.pdf output.txt
📚 Documentation
| Document | Description |
|---|---|
| User Guide | Step-by-step tutorials and common use cases |
| API Reference | Complete API documentation |
| Developer Guide | Contributing and extending the library |
| Examples | Ready-to-run example scripts |
| Changelog | Version history and changes |
📁 Project Structure
document-converter/
├── converter/ # Core conversion logic
│ ├── engine.py # Main conversion engine
│ ├── batch_processor.py
│ ├── template_engine.py
│ ├── formats/ # Format-specific converters
│ └── processors/ # OCR, images, styles
├── core/ # Core utilities
│ ├── cache_manager.py
│ ├── error_handler.py
│ ├── transaction.py
│ └── worker_pool.py
├── cli/ # Command-line interface
├── utils/ # Helper utilities
├── docs/ # Documentation
├── examples/ # Example scripts
├── tests/ # Test suite
└── dist/ # Standalone executable
🧪 Testing
# Run all tests
pytest
# With coverage
pytest --cov=converter --cov=core --cov-report=html
# Run specific test types
pytest -m unit
pytest -m integration
Current Coverage: 79% (274+ tests)
🤝 Contributing
Contributions are welcome! Please read our Developer Guide for:
- Development setup
- Code style guidelines
- Testing requirements
- How to add new format converters
Quick Start for Contributors
# Fork and clone
git clone https://github.com/MikeAMSDev/document-converter
cd document-converter
# Install dev dependencies
pip install -r requirements-dev.txt
# Create feature branch
git checkout -b feat/my-feature
# Make changes and test
pytest
# Submit pull request
📊 Performance Benchmarks
| Operation | Performance |
|---|---|
| Cache Speedup | Up to 138x faster |
| Batch Throughput | 50-200 files/sec |
| Memory Cache Lookup | <1ms |
| Disk Cache Lookup | <100ms |
| Template Rendering (100K items) | <5 seconds |
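A speedup figure like the one in the table is the ratio of uncached to cached wall-clock time. The sketch below shows one way to measure such a ratio, using `functools.lru_cache` and a simulated 50 ms conversion as stand-ins; it does not reproduce the project's benchmark setup, and the number it prints will vary by machine.

```python
import time
from functools import lru_cache


def slow_convert(document_id):
    """Simulated conversion that takes roughly 50 ms."""
    time.sleep(0.05)
    return f"converted-{document_id}"


# Wrap the slow function in an in-memory LRU cache.
cached_convert = lru_cache(maxsize=128)(slow_convert)


def timed(func, *args):
    """Return (result, elapsed seconds) for one call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


first_result, cold_seconds = timed(cached_convert, "large.pdf")
second_result, warm_seconds = timed(cached_convert, "large.pdf")

# Guard against a zero denominator on very coarse timers.
speedup = cold_seconds / max(warm_seconds, 1e-9)
print(f"cold={cold_seconds:.4f}s warm={warm_seconds:.6f}s speedup={speedup:.0f}x")
```

Averaging over several runs and files would give a more stable figure than this single pair of calls.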
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
- Built with Python 3.13
- PDF processing: PyPDF2, ReportLab
- DOCX handling: python-docx
- OCR: Tesseract via pytesseract
- CLI: Click
Made with ❤️ by MikeAMSDev
File details
Details for the file document_converter-1.2.0.tar.gz.
File metadata
- Download URL: document_converter-1.2.0.tar.gz
- Upload date:
- Size: 226.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f3a27a5a847eecdee43c5a21737b4c6e16ea11064594441dc51257609f95ea11 |
| MD5 | f3bad24270971365162a25c1ae20a83f |
| BLAKE2b-256 | ab06fb4cf3b0626e15fd16ecc6f9a69984041d6a187b4dac9189e8f2a485d89f |
File details
Details for the file document_converter-1.2.0-py3-none-any.whl.
File metadata
- Download URL: document_converter-1.2.0-py3-none-any.whl
- Upload date:
- Size: 79.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cef345bbc9a14c876162ba17bc99b9e346b67fa02dd0c412927a5b29839cdf4e |
| MD5 | 90bce745e68689e5dc4bd30fa586ca68 |
| BLAKE2b-256 | 4fcd3090a3998586cf96d48e6e9eb13d5ee63c128958fe44d7b8ad0e60cc496a |
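The published digests can be used to check a downloaded file before installing it. Here is a small sketch using `hashlib`; the helper names are illustrative, and the demonstration runs against a throwaway file rather than the real archive.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large downloads need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare the file's SHA256 with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Demonstration on a temporary file with known content.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open(demo_path, "wb") as handle:
    handle.write(b"example payload")

expected = hashlib.sha256(b"example payload").hexdigest()
is_valid = matches_published_digest(demo_path, expected)
```

For a real download, pass the path of `document_converter-1.2.0.tar.gz` or the wheel together with the corresponding SHA256 value listed for that file.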