Synthetic document rendering with parallel ALTO output

PangoLine

PangoLine is a basic tool to render raw (horizontal) text into PDF documents and create parallel ALTO files for each page containing baseline and bounding box information.

It is intended to support the rendering of most of the world's writing systems in order to create synthetic page-level training data for automatic text recognition systems. Functionality is fairly basic for now. PDF output is single column, justified text without word breaking. Paragraphs are split automatically once a page is full.

Installation

You'll need PyGObject and the Pango/Cairo libraries on your system. As PyGObject is only shipped in source form, installing it also requires a C compiler and the usual build environment dependencies. An easier way is to use conda:

~> conda create --name pangoline-py3.11 -c conda-forge python=3.11
~> conda activate pangoline-py3.11
~> conda install -c conda-forge pygobject pango cairo click jinja2 rich pypdfium2 lxml pillow

Afterwards, either install from PyPI:

~> pip install pangoline-tool

or directly from the checked out git repository:

~> pip install --no-deps .

Usage

Rendering

PangoLine renders text first into vector PDFs and ALTO facsimiles using some configurable "physical" dimensions.

~> pangoline render doc.txt
Rendering ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00

Various options to direct rendering such as page size, margins, language, and base direction can be manually set, for example:

~> pangoline render -p 216 279 -l en-us -f "Noto Sans 24" doc.txt

Text can also be styled with Pango Markup. Markup parsing is disabled by default but can be enabled with a switch. When it is enabled, any characters in the text that have special meaning in XML, such as &, <, >, and quotes, as well as various control characters, must be escaped as entities (for example &amp; for &).

~> pangoline render --markup doc.txt
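As a rough illustration of the escaping step, the sketch below prepares plain text for use with --markup using only the Python standard library. The helper name escape_for_pango_markup is made up for this example and is not part of PangoLine; control characters are not handled here.

```python
from xml.sax.saxutils import escape


def escape_for_pango_markup(text: str) -> str:
    """Escape the characters that are special in XML and Pango Markup.

    Hypothetical helper for illustration, not part of PangoLine.
    """
    # escape() replaces &, < and > itself; quotes are passed as extra entities.
    return escape(text, {'"': "&quot;", "'": "&apos;"})


raw = 'Tom & Jerry <1940> "classic"'
escaped = escape_for_pango_markup(raw)
print(escaped)

# Escaped text can then be wrapped in Pango Markup tags safely.
print(f"<b>{escaped}</b>")
```

The escaped text can be written to a file and passed to pangoline render --markup.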

It is possible to randomly insert stylization of Unicode word segments in the text. One or more styles will be randomly selected from a configurable list of styles:

~> pangoline render --random-markup-probability 0.01 doc.txt

The value is the probability that at least one style is applied to any particular segment. A subset of all available styles is enabled by default when a probability greater than 0 is given. To change the list of possible styles:

~> pangoline render --random-markup-probability 0.01 --random-markup style_italic --random-markup variant_smallcaps doc.txt

The semantics of each value can be found in the Pango documentation.
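To make the idea concrete, here is a minimal sketch of random styling. It is not PangoLine's actual implementation: the mapping from style names such as style_italic to Pango Markup span attributes is an assumption based on the naming pattern, and a plain whitespace split stands in for proper Unicode word segmentation.

```python
import random
from xml.sax.saxutils import escape

# Assumed mapping from style names to Pango Markup <span> attributes.
STYLES = {
    "style_italic": ("style", "italic"),
    "variant_smallcaps": ("variant", "smallcaps"),
}


def randomly_style(text, probability, styles=STYLES, rng=None):
    """Wrap randomly chosen words in <span> tags (illustrative sketch only)."""
    rng = rng or random.Random()
    names = list(styles)
    out = []
    # Whitespace splitting stands in for Unicode word segmentation.
    for word in text.split():
        escaped = escape(word)  # the source text itself is escaped internally
        if names and rng.random() < probability:
            # A selected segment gets at least one style, possibly several.
            chosen = rng.sample(names, k=rng.randint(1, len(names)))
            attrs = " ".join(f'{styles[n][0]}="{styles[n][1]}"' for n in chosen)
            escaped = f"<span {attrs}>{escaped}</span>"
        out.append(escaped)
    return " ".join(out)


print(randomly_style("Fish & chips are served daily", 0.5, rng=random.Random(1)))
```

With a probability of 0 the text is only escaped; with a probability of 1 every word receives at least one style.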

Styling with color is treated slightly differently than other styles. In general, colors are selected with the foreground_* style. As a large number of colors are known to Pango, the foreground_random alias exists that enables all possible colors:

~> pangoline render --random-markup-probability 0.01 --random-markup foreground_random doc.txt

When applying random styles to words, control characters in the source text should not be escaped as pangoline internally escapes any characters that require it.

Rasterization

In a second step, those vector files can be rasterized into PNGs, with the coordinates in the ALTO files scaled to the selected resolution (300 dpi by default):

~> pangoline rasterize doc.0.xml doc.1.xml ...
Rasterizing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
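For a feel of the resulting image sizes, the sketch below converts physical dimensions to pixels at a given resolution. It assumes that the page size passed to -p is in millimetres (216 x 279 matches US Letter); the function name mm_to_px is made up for this example.

```python
MM_PER_INCH = 25.4


def mm_to_px(value_mm: float, dpi: int = 300) -> int:
    """Convert a length in millimetres to pixels at the given resolution."""
    return round(value_mm / MM_PER_INCH * dpi)


# Assumed: a 216 x 279 mm page rasterized at the default 300 dpi.
width_px = mm_to_px(216)
height_px = mm_to_px(279)
print(width_px, height_px)  # 2551 3295
```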

Rasterized files and their ALTOs can be used as is as ATR training data.
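When building a custom training pipeline, the line geometry can be read back from the ALTO files with the standard library. The sketch below parses a small embedded sample; it assumes ALTO v4 with HPOS, VPOS, WIDTH, HEIGHT, and BASELINE attributes on each TextLine, and the sample values are invented. The exact structure of PangoLine's output may differ.

```python
import re
import xml.etree.ElementTree as ET

ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v4#"}

# Invented sample; real files may contain more elements and attributes.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout>
    <Page WIDTH="2551" HEIGHT="3295">
      <PrintSpace>
        <TextBlock ID="b0">
          <TextLine ID="l0" HPOS="118" VPOS="120" WIDTH="2300" HEIGHT="90"
                    BASELINE="118 190 2418 190">
            <String CONTENT="Hello world"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def read_lines(alto_xml: str):
    """Return text, bounding box, and baseline points for every TextLine."""
    root = ET.fromstring(alto_xml)
    lines = []
    for line in root.iterfind(".//alto:TextLine", ALTO_NS):
        # BASELINE is a flat list of coordinates; pair them up as (x, y).
        nums = [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", line.get("BASELINE", ""))]
        baseline = list(zip(nums[0::2], nums[1::2]))
        bbox = tuple(float(line.get(k)) for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
        text = " ".join(s.get("CONTENT", "") for s in line.iterfind("alto:String", ALTO_NS))
        lines.append({"text": text, "bbox": bbox, "baseline": baseline})
    return lines


for entry in read_lines(SAMPLE):
    print(entry)
```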

To obtain slightly more realistic input images, it is possible to overlay the rasterized text onto images of writing surfaces.

~> pangoline rasterize -w ~/background_1.jpg doc.0.xml doc.1.xml ...

Rasterization can be invoked with multiple background images in which case they will be sampled randomly for each output page. A tarball with 70 empty paper backgrounds of different origins, digitization qualities, and states of preservation can be found here.

For larger collections of texts it is advisable to parallelize processing, especially for rasterization with overlays:

~> pangoline --workers 8 render *.txt
~> pangoline --workers 8 rasterize *.xml
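When driving PangoLine from a Python script rather than a shell, wildcards are not expanded automatically. The sketch below builds the same command lines with pathlib and would run them with subprocess; it uses only the --workers option and subcommands shown above, and the helper names are made up for this example.

```python
import subprocess
from pathlib import Path


def build_command(subcommand, inputs, workers=8):
    """Build a pangoline command line as an argument list (no shell needed)."""
    files = sorted(str(p) for p in inputs)
    return ["pangoline", "--workers", str(workers), subcommand, *files]


def run_pipeline(directory=".", workers=8):
    """Render all text files, then rasterize all resulting ALTO files."""
    base = Path(directory)
    subprocess.run(build_command("render", base.glob("*.txt"), workers), check=True)
    subprocess.run(build_command("rasterize", base.glob("*.xml"), workers), check=True)


# Show the command that would be executed for two example inputs.
print(build_command("render", ["b.txt", "a.txt"]))
```

Calling run_pipeline() in a directory of text files performs both steps in sequence.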

Limitations

In order to achieve proper typesetting quality, Pango requires placing the whole text into a single layout before splitting it into individual pages by translating each line of the layout onto a page surface. This approach limits the maximum print space of a single text to 739.8 meters, roughly 3000 pages depending on paper size and margins, before the 32-bit integer holding the baseline position y-offset overflows.
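The 739.8 meter figure can be reproduced with a short calculation, assuming the y-offset is stored in Pango units (1024 units per point, 72 points per inch) in a signed 32-bit integer. The A4 page with 25 mm top and bottom margins is a hypothetical example used only to estimate the page count.

```python
PANGO_SCALE = 1024        # Pango units per point
INT32_MAX = 2**31 - 1     # largest signed 32-bit integer
POINTS_PER_INCH = 72
METERS_PER_INCH = 0.0254

# Largest representable y-offset, converted from Pango units to meters.
max_points = INT32_MAX / PANGO_SCALE
max_meters = max_points / POINTS_PER_INCH * METERS_PER_INCH
print(f"{max_meters:.1f} m")  # 739.8 m

# Hypothetical page: A4 (297 mm high) with 25 mm top and bottom margins.
print_height_m = (297 - 2 * 25) / 1000
pages = int(max_meters / print_height_m)
print(pages)  # 2995
```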

Funding

This project was funded in part by the European Union (ERC, MiDRASH, project number 101071829).
