Synthetic document rendering with parallel ALTO output
PangoLine
PangoLine is a basic tool to render raw (horizontal) text into PDF documents and create parallel ALTO files for each page containing baseline and bounding box information.
It is intended to support the rendering of most of the world's writing systems in order to create synthetic page-level training data for automatic text recognition systems. Functionality is fairly basic for now. PDF output is single column, justified text without word breaking. Paragraphs are split automatically once a page is full.
Installation
You'll need PyGObject and the Pango/Cairo libraries on your system. As PyGObject is only shipped in source form, this also requires a C compiler and the usual build dependencies. An easier way is to use conda:
~> conda create --name pangoline-py3.11 -c conda-forge python=3.11
~> conda activate pangoline-py3.11
~> conda install -c conda-forge pygobject pango cairo click jinja2 rich pypdfium2 lxml pillow
Afterwards either install from PyPI:
~> pip install pangoline-tool
or directly from the checked out git repository:
~> pip install --no-deps .
Usage
Rendering
PangoLine renders text first into vector PDFs and ALTO facsimiles using some configurable "physical" dimensions.
~> pangoline render doc.txt
Rendering ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
Various options to direct rendering such as page size, margins, language, and base direction can be manually set, for example:
~> pangoline render -p 216 279 -l en-us -f "Noto Sans 24" doc.txt
Text can also be styled with Pango Markup. Markup parsing is disabled by default but can be enabled with a switch. You'll need to escape any characters that are special in XML, such as &, <, >, and quotes, as well as various control characters, using entities.
~> pangoline render --markup doc.txt
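As a minimal sketch of the escaping step, assuming the input file is prepared with Python: the standard-library `html.escape` replaces `&`, `<`, `>`, and both quote characters with entities. The helper name `to_pango_markup` and the sample text are illustrative only, and control characters are not handled here.

```python
from html import escape


def to_pango_markup(text: str, bold: bool = False) -> str:
    """Escape raw text and optionally wrap it in a Pango <b> tag."""
    # quote=True also escapes " and ' in addition to &, < and >
    safe = escape(text, quote=True)
    return f"<b>{safe}</b>" if bold else safe


line = to_pango_markup('Fish & chips < "pie"', bold=True)
print(line)  # <b>Fish &amp; chips &lt; &quot;pie&quot;</b>
```

The escaped line can then be written to the text file that is passed to `pangoline render --markup`.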
It is possible to randomly insert stylization of Unicode word segments in the text. One or more styles will be randomly selected from a configurable list of styles:
~> pangoline render --random-markup-probability 0.01 doc.txt
The value is the probability that at least one style is applied to any particular segment. A subset of all available styles is enabled by default when a probability greater than 0 is given. To change the list of possible styles:
~> pangoline render --random-markup-probability 0.01 --random-markup style_italic --random-markup variant_smallcaps doc.txt
The semantics of each value can be found in the Pango documentation.
Styling with color is treated slightly differently from other styles. In general, colors are selected with the foreground_* style. As Pango knows a large number of colors, the foreground_random alias enables all of them at once:
~> pangoline render --random-markup-probability 0.01 --random-markup foreground_random doc.txt
When applying random styles to words, control characters in the source text should not be escaped as pangoline internally escapes any characters that require it.
Rasterization
In a second step, those vector files can be rasterized into PNGs and the coordinates in the ALTO files scaled to the selected resolution (300 dpi by default):
~> pangoline rasterize doc.0.xml doc.1.xml ...
Rasterizing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
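The coordinate scaling amounts to a unit conversion. The following is only a sketch, assuming that physical dimensions such as `-p 216 279` are given in millimetres and that results are rounded to whole pixels; neither is confirmed here, and `mm_to_px` is a hypothetical helper rather than part of pangoline.

```python
MM_PER_INCH = 25.4


def mm_to_px(value_mm: float, dpi: int = 300) -> int:
    """Convert a length in millimetres to pixels at the given resolution."""
    # Assumption: coordinates are rounded to the nearest whole pixel
    return round(value_mm / MM_PER_INCH * dpi)


# Page size from the rendering example above, at the default 300 dpi
width_px, height_px = mm_to_px(216), mm_to_px(279)
print(width_px, height_px)  # 2551 3295
```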
Rasterized files and their ALTOs can be used as-is as ATR training data.
To obtain slightly more realistic input images, it is possible to overlay the rasterized text onto images of writing surfaces.
~> pangoline rasterize -w ~/background_1.jpg doc.0.xml doc.1.xml ...
Rasterization can be invoked with multiple background images in which case they will be sampled randomly for each output page. A tarball with 70 empty paper backgrounds of different origins, digitization qualities, and states of preservation can be found here.
For larger collections of texts it is advisable to parallelize processing, especially for rasterization with overlays:
~> pangoline --workers 8 render *.txt
~> pangoline --workers 8 rasterize *.xml
Limitations
In order to achieve proper typesetting quality, Pango requires placing the whole text into a single layout before splitting it into individual pages by translating each line of the layout onto a page surface. This approach limits the maximum print space of a single text to 739.8 meters, roughly 3000 pages depending on paper size and margins, before the 32-bit integer y-offset of the baseline position overflows.
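The 739.8 meter figure can be reproduced with a short calculation, assuming the y-offset is a signed 32-bit count of Pango units (Pango uses 1024 units per point, and a point is 1/72 inch). The page height and margins in the second half are hypothetical values chosen for illustration, not pangoline defaults.

```python
PANGO_SCALE = 1024        # Pango units per typographic point
INT32_MAX = 2**31 - 1     # largest signed 32-bit integer
MM_PER_POINT = 25.4 / 72  # one point is 1/72 inch

# Largest representable offset, converted from Pango units to metres
max_points = INT32_MAX / PANGO_SCALE
max_metres = max_points * MM_PER_POINT / 1000
print(f"{max_metres:.1f} m")  # 739.8 m

# Assumed page: 297 mm high with 25 mm top and bottom margins
print_height_mm = 297 - 2 * 25
pages = int(max_metres * 1000 // print_height_mm)
print(pages)  # 2995
```

Smaller margins or taller pages lower the page count, which is why the limit is only given as roughly 3000 pages.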
Funding
This project was funded in part by the European Union (ERC, MiDRASH, project number 101071829).