Convert legacy MS Word .doc files to Markdown — inspired by antiword
unword
Convert legacy Microsoft Word .doc files (OLE/CFB format) to Markdown. Inspired by antiword.
Extracts body text with heading levels, page breaks, and textbox contents. No external dependencies (no LibreOffice, no COM).
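Because only the OLE/CFB container is supported, it can be useful to check a file's leading bytes before handing it to the parser. The sketch below tests for the standard 8-byte CFB signature. `looks_like_cfb` is a hypothetical helper and not part of unword. A matching signature only shows that the file is a compound file: other legacy Office formats such as .xls use the same container, so this is a pre-filter rather than a guarantee.

```python
# Magic bytes found at the start of every OLE/CFB compound file.
CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_cfb(data: bytes) -> bool:
    """Return True if data starts with the OLE/CFB signature."""
    return data.startswith(CFB_SIGNATURE)


# A .docx file is a ZIP archive, so it starts with b"PK" instead.
print(looks_like_cfb(CFB_SIGNATURE + b"\x00" * 504))  # True
print(looks_like_cfb(b"PK\x03\x04"))  # False
```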
Installation
CLI (Rust)
cargo build --release
Python
Requires maturin and a virtual environment:
uv venv .venv && source .venv/bin/activate
maturin develop
Or build a wheel:
maturin build --release
pip install target/wheels/unword-*.whl
Usage
CLI
# Print to stdout
unword -i document.doc
# Write to file
unword -i document.doc -o output.md
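To convert a whole folder, the CLI can be driven from a short script. This is a minimal sketch that assumes the `unword` binary is on `PATH` and uses only the `-i` and `-o` flags shown above. `build_command` and `convert_all` are illustrative names, not part of the project.

```python
import subprocess
from pathlib import Path


def build_command(src: Path) -> list[str]:
    """Build the unword invocation that writes src's Markdown next to it."""
    return ["unword", "-i", str(src), "-o", str(src.with_suffix(".md"))]


def convert_all(folder: Path) -> None:
    """Convert every .doc file directly inside folder (not recursive)."""
    for src in sorted(folder.glob("*.doc")):
        # check=True raises CalledProcessError if unword exits non-zero.
        subprocess.run(build_command(src), check=True)
```

Calling `convert_all(Path("."))` would then write `report.md` next to `report.doc`, and so on for each .doc file in the current directory.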
Python
import unword
with open("document.doc", "rb") as f:  # close the file once it has been read
    doc = unword.parse_doc(f.read())
print(doc.body_text)  # Markdown string with headings
print(doc.textboxes)  # List of textbox strings
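Since textboxes come back separately from the body, a caller may want to merge the two into one document. The following sketch assumes `body_text` is a `str` and `textboxes` is a list of `str`, as the comments above suggest. Rendering textboxes as blockquotes is a choice made for this example, and `to_markdown` is a hypothetical helper rather than part of unword.

```python
def to_markdown(body_text: str, textboxes: list[str]) -> str:
    """Append textbox contents to the body as Markdown blockquotes."""
    parts = [body_text.rstrip()]
    for box in textboxes:
        lines = box.strip().splitlines()
        if lines:  # skip textboxes that are empty or whitespace-only
            parts.append("\n".join(f"> {line}" for line in lines))
    return "\n\n".join(parts) + "\n"
```

With a parsed document this would be called as `to_markdown(doc.body_text, doc.textboxes)`.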
Rust library
let data = std::fs::read("document.doc")?;
let doc = unword::parse_doc(&data)?;
println!("{}", doc.body_text);
Output format
- Headings are rendered as `#`, `##`, `###`, etc. based on Word styles
- Paragraphs are separated by blank lines
- Page breaks become `---`
- Textboxes are extracted separately
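These conventions make the output easy to post-process. As a rough sketch, the functions below split the Markdown into pages at `---` lines and collect the headings with their levels. Both functions are illustrative and not part of unword. They assume a line consisting only of `---` always marks a page break, which would misfire if the source document contained that text as a paragraph of its own.

```python
import re

# An ATX heading: one to six '#' characters, a space, then the title.
HEADING = re.compile(r"^(#{1,6}) (.+)$")


def split_pages(markdown: str) -> list[str]:
    """Split Markdown into pages at lines consisting only of '---'."""
    pages, current = [], []
    for line in markdown.splitlines():
        if line.strip() == "---":
            pages.append("\n".join(current).strip())
            current = []
        else:
            current.append(line)
    pages.append("\n".join(current).strip())
    return pages


def list_headings(markdown: str) -> list[tuple[int, str]]:
    """Return (level, title) for every heading line."""
    found = []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            found.append((len(match.group(1)), match.group(2).strip()))
    return found


sample = "# Intro\n\nFirst page.\n\n---\n\n## Details\n\nSecond page."
print(split_pages(sample))    # ['# Intro\n\nFirst page.', '## Details\n\nSecond page.']
print(list_headings(sample))  # [(1, 'Intro'), (2, 'Details')]
```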
Tests
# Rust
cargo test
# Python
pytest tests/test_python.py
License
MIT
File details
Details for the file unword-0.2.0.tar.gz.
File metadata
- Download URL: unword-0.2.0.tar.gz
- Upload date:
- Size: 17.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b3cda6971613f0b4a4b260718e92f52b4810b9c50e299e04e8ddbf327493dee3 |
| MD5 | 60dce6914202abcac6b50ab5ed4d6ac3 |
| BLAKE2b-256 | cada561902fa0d98a6a1a58152264541d68fdbb9123632d5962b43a72bd25bd2 |
File details
Details for the file unword-0.2.0-cp313-cp313-manylinux_2_34_x86_64.whl.
File metadata
- Download URL: unword-0.2.0-cp313-cp313-manylinux_2_34_x86_64.whl
- Upload date:
- Size: 447.6 kB
- Tags: CPython 3.13, manylinux: glibc 2.34+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 082b6ddca462d050a1cfb9aa9707d84bbb2c06f5607ef4373e31bbfceb44ca05 |
| MD5 | 5872521cad397e21870e6f0f1bd8b216 |
| BLAKE2b-256 | 22873c3bb4294e4b6f815cb17587a05e8f6d4adf841debd510595e829e3fe94e |