unword

Convert legacy Microsoft Word .doc files (OLE/CFB format) to Markdown. Inspired by antiword.

Extracts body text with heading levels, page breaks, and textbox contents. No external dependencies (no LibreOffice, no COM).
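Because the input must be an OLE/CFB container, a cheap pre-check can reject obviously wrong files (for example a .docx, which is a ZIP archive) before parsing. The sketch below is not part of unword; `looks_like_cfb` is a hypothetical helper that only compares the first eight bytes against the standard compound-file signature.

```python
# The 8-byte signature that opens every OLE/CFB compound file.
CFB_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_cfb(data: bytes) -> bool:
    """Return True if `data` starts with the OLE/CFB signature."""
    return data[:8] == CFB_MAGIC
```

Other compound files (legacy .xls, .ppt, .msi) share the same signature, so a match is necessary but not sufficient for a valid Word document.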

Installation

CLI (Rust)

cargo build --release

Python

Requires maturin and an activated virtual environment (the example below creates one with uv):

uv venv .venv && source .venv/bin/activate
maturin develop

Or build a wheel:

maturin build --release
pip install target/wheels/unword-*.whl

Usage

CLI

# Print to stdout
unword -i document.doc

# Write to file
unword -i document.doc -o output.md
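To convert a whole directory, the CLI can be driven from a small script. This is a sketch under assumptions: `unword` is on `PATH`, it signals failure with a non-zero exit code (not stated here), and the helper names `build_command`, `plan_conversions`, and `convert_tree` are hypothetical. Only the `-i` and `-o` flags shown above are used.

```python
import subprocess
from pathlib import Path


def build_command(src: Path, dst: Path) -> list[str]:
    """Build the argv for one conversion using the -i/-o flags."""
    return ["unword", "-i", str(src), "-o", str(dst)]


def plan_conversions(root: Path) -> list[tuple[Path, Path]]:
    """Pair every .doc file under `root` with a sibling .md output path."""
    return [
        (src, src.with_suffix(".md"))
        for src in sorted(root.rglob("*.doc"))
        if src.is_file()
    ]


def convert_tree(root: Path) -> None:
    """Run the CLI once per file; raises CalledProcessError on a non-zero exit."""
    for src, dst in plan_conversions(root):
        subprocess.run(build_command(src, dst), check=True)
```

Calling `convert_tree(Path("archive"))` would write `report.md` next to `archive/report.doc`, and so on for each file found.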

Python

from pathlib import Path

import unword

# parse_doc takes the raw bytes of the .doc file
doc = unword.parse_doc(Path("document.doc").read_bytes())

print(doc.body_text)      # Markdown string with headings
print(doc.textboxes)      # List of textbox strings
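The same API can be wrapped to write the Markdown to disk. A minimal sketch: `convert_file` is a hypothetical helper, and it relies only on `parse_doc`, `body_text`, and `textboxes` as used above. The parser is injectable so the helper can be exercised with a stub when the extension module is not built.

```python
from pathlib import Path
from typing import Callable, Optional


def convert_file(src: Path, dst: Path, parse: Optional[Callable] = None) -> int:
    """Parse `src`, write its body text to `dst`, and return the textbox count."""
    if parse is None:
        import unword  # imported lazily so a stub parser can be used in tests

        parse = unword.parse_doc
    doc = parse(src.read_bytes())
    dst.write_text(doc.body_text, encoding="utf-8")
    return len(doc.textboxes)
```

With unword installed, `convert_file(Path("document.doc"), Path("output.md"))` mirrors the CLI's `-o` behaviour for the body text; textbox contents are left for the caller to handle.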

Rust library

// `?` requires an enclosing function that returns a compatible Result
let data = std::fs::read("document.doc")?;
let doc = unword::parse_doc(&data)?;
println!("{}", doc.body_text);

Output format

  • Headings are rendered as #, ##, ###, etc. based on Word styles
  • Paragraphs are separated by blank lines
  • Page breaks become ---
  • Textboxes are extracted separately (exposed as textboxes in the Python API)
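The conventions above make the output easy to post-process. The sketch below assumes that a page break appears as a line containing only `---` and that headings are lines starting with one to six `#` characters followed by a space; `split_pages` and `outline` are hypothetical helpers, not part of unword.

```python
import re

# One to six '#' characters, whitespace, then the heading title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_pages(markdown: str) -> list[str]:
    """Split on lines that consist solely of '---' (assumed page-break marker)."""
    pages, current = [], []
    for line in markdown.splitlines():
        if line.strip() == "---":
            pages.append("\n".join(current).strip())
            current = []
        else:
            current.append(line)
    pages.append("\n".join(current).strip())
    return pages


def outline(markdown: str) -> list[tuple[int, str]]:
    """Return (level, title) for each heading line, in document order."""
    found = []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            found.append((len(match.group(1)), match.group(2)))
    return found
```

Body paragraphs that happen to begin with `# ` would be misread as headings, so treat the outline as a best-effort view of the document structure.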

Tests

# Rust
cargo test

# Python (run after `maturin develop` so the module is importable)
pytest tests/test_python.py

License

MIT
