Skip to main content

A Python CLI and utility library for manipulating EPUB files

Project description

epub-utils

PyPI Changelog Python 3.x License

A Python library and CLI tool for inspecting ePub from the terminal.

Features

  • Complete EPUB Support - Parse both EPUB 2.0.1 and EPUB 3.0+ specifications with container, package, manifest, spine, and table of contents inspection
  • Rich Metadata Extraction - Extract Dublin Core metadata (title, author, language, publisher) with key-value, XML, and raw output formats for easy scripting
  • Content Analysis - Access document content by manifest ID or file path, with plain text extraction for content analysis and word counting
  • File System Navigation - Browse and extract any file within EPUB archives (XHTML, CSS, images, fonts) with detailed file information including sizes and compression ratios
  • Multiple Output Formats - XML with syntax highlighting, raw content, key-value pairs, plain text, and formatted tables to suit different workflows
  • CLI and Python API - Comprehensive command-line tool for terminal workflows plus a clean Python library for programmatic access
  • Standards Compliance - Built-in validation capabilities and adherence to W3C/IDPF specifications for reliable EPUB processing
  • Performance Optimized - Lazy loading, efficient ZIP parsing, and optional lxml support for handling large EPUB collections

Installation

epub-utils is available as a PyPI package

pip install epub-utils

Use as a CLI tool

The basic format is:

epub-utils EPUB_PATH COMMAND [OPTIONS]

Commands

  • container - Display the container.xml contents

    # Show container.xml with syntax highlighting
    epub-utils book.epub container
    
    # Show container.xml as raw content
    epub-utils book.epub container --format raw
    
    # Show container.xml with pretty formatting
    epub-utils book.epub container --pretty-print
    
  • package - Display the package OPF file contents

    # Show package.opf with syntax highlighting
    epub-utils book.epub package
    
    # Show package.opf as raw content
    epub-utils book.epub package --format raw
    
  • toc - Display the table of contents file contents

    # Show toc.ncx/nav.xhtml with syntax highlighting (auto-detect)
    epub-utils book.epub toc
    
    # Show toc.ncx/nav.xhtml as raw content
    epub-utils book.epub toc --format raw
    
    # Force NCX format (EPUB 2 navigation control file)
    epub-utils book.epub toc --ncx
    
    # Force Navigation Document (EPUB 3 navigation file)
    epub-utils book.epub toc --nav
    
  • metadata - Display the metadata information from the package file

    # Show metadata with syntax highlighting
    epub-utils book.epub metadata
    
    # Show metadata as key-value pairs
    epub-utils book.epub metadata --format kv
    
    # Show metadata with pretty formatting
    epub-utils book.epub metadata --pretty-print
    
  • manifest - Display the manifest information from the package file

    # Show manifest with syntax highlighting
    epub-utils book.epub manifest
    
    # Show manifest as raw content
    epub-utils book.epub manifest --format raw
    
  • spine - Display the spine information from the package file

    # Show spine with syntax highlighting
    epub-utils book.epub spine
    
    # Show spine as raw content
    epub-utils book.epub spine --format raw
    
  • content - Display the content of a document by its manifest item ID

    # Show content with syntax highlighting
    epub-utils book.epub content chapter1
    
    # Show raw HTML/XML content
    epub-utils book.epub content chapter1 --format raw
    
    # Show plain text content (HTML tags stripped)
    epub-utils book.epub content chapter1 --format plain
    
  • files - List all files in the EPUB archive or display content of a specific file

    # List all files in table format (default)
    epub-utils book.epub files
    
    # List all files as simple paths
    epub-utils book.epub files --format raw
    
    # Display content of a specific file by path
    epub-utils book.epub files OEBPS/chapter1.xhtml
    
    # Display XHTML file content in different formats
    epub-utils book.epub files OEBPS/chapter1.xhtml --format raw
    epub-utils book.epub files OEBPS/chapter1.xhtml --format xml --pretty-print
    epub-utils book.epub files OEBPS/chapter1.xhtml --format plain
    
    # Display non-XHTML files (CSS, images, etc.)
    epub-utils book.epub files OEBPS/styles/main.css
    epub-utils book.epub files META-INF/container.xml
    

Options

  • -h, --help - Show help message and exit

  • -v, --version - Show program version and exit

  • -fmt, --format - Output format (default: xml)

    • xml - Display with XML syntax highlighting (default)
    • raw - Display raw content without formatting
    • plain - Display plain text content (HTML tags stripped, for content command only)
    • kv - Display key-value pairs (where supported)
  • -pp, --pretty-print - Pretty-print XML output (applies to xml and raw formats only)

    # Display as raw content
    epub-utils book.epub package --format raw
    
    # Display with XML syntax highlighting (default)
    epub-utils book.epub package --format xml
    
    # Display as key-value pairs (for supported commands)
    epub-utils book.epub metadata --format kv
    
    # Display plain text content (content command only)
    epub-utils book.epub content chapter1 --format plain
    
    # Pretty-print XML with proper indentation
    epub-utils book.epub package --pretty-print
    
    # Combine format and pretty-print options
    epub-utils book.epub metadata --format raw --pretty-print
    

Use as a Python library

from epub_utils import Document

# Load an EPUB document
doc = Document("path/to/book.epub")

Basic Document Access

Access the main components of an EPUB document:

# Get container information
container = doc.container
print(container.to_xml())  # Formatted XML with syntax highlighting
print(container.to_str())  # Raw XML content

# Get package information  
package = doc.package
print(package.to_xml())    # Formatted XML with syntax highlighting
print(package.to_str())    # Raw XML content

# Get table of contents
toc = doc.toc
if toc:  # TOC might be None if not present
    print(toc.to_xml())    # Formatted XML with syntax highlighting
    print(toc.to_str())    # Raw XML content

# Access specific navigation formats
ncx = doc.ncx  # NCX format (EPUB 2 or EPUB 3 with NCX)
if ncx:
    print("NCX navigation available")
    print(ncx.to_xml())

nav = doc.nav  # Navigation Document (EPUB 3 only)
if nav:
    print("Navigation Document available")
    print(nav.to_xml())
    print(toc.to_str())    # Raw XML content

Working with Metadata

Access and format metadata information:

# Access package metadata
metadata = doc.package.metadata

# Basic Dublin Core elements
print(f"Title: {metadata.title}")
print(f"Creator: {metadata.creator}")
print(f"Identifier: {metadata.identifier}")
print(f"Language: {metadata.language}")
print(f"Publisher: {metadata.publisher}")
print(f"Date: {metadata.date}")

# Dynamic attribute access for any metadata field
isbn = getattr(metadata, 'isbn', 'Not available')
series = getattr(metadata, 'series', 'Not available')

# Get formatted metadata output
print(metadata.to_xml())     # Formatted XML with syntax highlighting
print(metadata.to_str())     # Raw XML content  
print(metadata.to_kv())      # Key-value format for easy parsing

Working with Manifest

Access the manifest to see all files in the EPUB:

# Get manifest information
manifest = doc.package.manifest

# Access all manifest items
for item in manifest.items:
    print(f"ID: {item['id']}")
    print(f"File: {item['href']}")
    print(f"Type: {item['media_type']}")
    print(f"Properties: {item['properties']}")

# Find specific items
nav_item = manifest.find_by_property('nav')
chapter = manifest.find_by_id('chapter1')
xhtml_items = manifest.find_by_media_type('application/xhtml+xml')

# Get formatted manifest output
print(manifest.to_xml())     # Formatted XML with syntax highlighting
print(manifest.to_str())     # Raw XML content

Working with Spine

Access the spine to see the reading order:

# Get spine information
spine = doc.package.spine

# Access spine properties
print(f"TOC reference: {spine.toc}")
print(f"Page progression: {spine.page_progression_direction}")

# Access spine items in reading order
for itemref in spine.itemrefs:
    print(f"ID: {itemref['idref']}")
    print(f"Linear: {itemref['linear']}")
    print(f"Properties: {itemref['properties']}")

# Find specific spine item
spine_item = spine.find_by_idref('chapter1')

# Get formatted spine output
print(spine.to_xml())        # Formatted XML with syntax highlighting
print(spine.to_str())        # Raw XML content

Content Extraction

Extract content from specific documents within the EPUB:

# Access content by manifest item ID
try:
    content = doc.find_content_by_id('chapter1')
    
    # Get content in different formats
    print(content.to_xml())      # Formatted XHTML with syntax highlighting
    print(content.to_str())      # Raw XHTML content
    print(content.to_plain())    # Plain text with HTML tags stripped
    
    # Access the parsed content tree for advanced processing
    tree = content.tree
    inner_text = content.inner_text
    
except ValueError as e:
    print(f"Content not found: {e}")

# Find publication resources by ID (for non-spine items)
try:
    resource = doc.find_pub_resource_by_id('cover-image')
except ValueError as e:
    print(f"Resource not found: {e}")

File Operations

List and access files directly by their paths in the EPUB archive:

# Get information about all files
files_info = doc.get_files_info()
for file_info in files_info:
    print(f"Path: {file_info['path']}")
    print(f"Size: {file_info['size']} bytes")
    print(f"Compressed: {file_info['compressed_size']} bytes")
    print(f"Modified: {file_info['modified']}")

# Access specific file by path
try:
    # For XHTML files, returns XHTMLContent object
    xhtml_content = doc.get_file_by_path('OEBPS/chapter1.xhtml')
    print(xhtml_content.to_xml())
    print(xhtml_content.to_plain())
    
    # For other files, returns raw string content
    css_content = doc.get_file_by_path('OEBPS/styles/main.css')
    print(css_content)
    
except ValueError as e:
    print(f"File not found: {e}")

Output Formatting Options

All document components support flexible output formatting:

# Pretty-printed XML output
print(metadata.to_str(pretty_print=True))
print(manifest.to_xml(pretty_print=True))

# Syntax highlighting can be controlled
print(package.to_xml(highlight_syntax=True))   # With highlighting (default)
print(package.to_xml(highlight_syntax=False))  # Without highlighting

Industry Standards & Compliance

epub-utils provides comprehensive support for industry-standard ePub specifications and related technologies, ensuring broad compatibility across the digital publishing ecosystem.

Supported EPUB Standards

  • EPUB 2.0.1 (IDPF, 2010)

    • Complete OPF 2.0 package document support
    • NCX navigation control file support
    • Dublin Core metadata extraction
    • Legacy EPUB compatibility
  • EPUB 3.0+ (IDPF/W3C, 2011-present)

    • EPUB 3.3 specification compliance
    • HTML5-based content documents
    • Navigation document (nav.xhtml) support
    • Enhanced accessibility features
    • Media overlays and scripting support

Metadata Standards

  • Dublin Core Metadata Initiative (DCMI)

    • Dublin Core Metadata Element Set v1.1
    • Dublin Core Metadata Terms (DCTERMS)
  • Open Packaging Format (OPF)

    • OPF 2.0 specification (EPUB 2.0.1)
    • OPF 3.0 specification (EPUB 3.0+)

The library maintains strict adherence to published specifications while providing robust handling of real-world EPUB variations commonly found in commercial and open-source reading applications.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

epub_utils-0.1.0a1.tar.gz (37.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

epub_utils-0.1.0a1-py3-none-any.whl (35.6 kB view details)

Uploaded Python 3

File details

Details for the file epub_utils-0.1.0a1.tar.gz.

File metadata

  • Download URL: epub_utils-0.1.0a1.tar.gz
  • Upload date:
  • Size: 37.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for epub_utils-0.1.0a1.tar.gz
Algorithm Hash digest
SHA256 3930cfd58b1c2e526df774b94d800ffc19635b5e9a12e204d191046ed696987f
MD5 1fec2fc25ad9fa156ee56e6b721c82a8
BLAKE2b-256 0e3086f5b56f1d932fac1eb7bee558a6a9603bab53e54b233e1a99c2044d7ad9

See more details on using hashes here.

File details

Details for the file epub_utils-0.1.0a1-py3-none-any.whl.

File metadata

  • Download URL: epub_utils-0.1.0a1-py3-none-any.whl
  • Upload date:
  • Size: 35.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for epub_utils-0.1.0a1-py3-none-any.whl
Algorithm Hash digest
SHA256 4edc35b98c221d13926c402a2b644bc172e82f450b6c2a7a4face43c4cecc30c
MD5 bc26d99de9f390e437b273a03f0310f7
BLAKE2b-256 91bae6e7736fa1d98c6b850c5117f76fa6860e9321a15ab318bc513eaada1fc7

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page