Wordlist manipulation tools for non-ASCII characters

preparelist

Wordlist manipulation tools for non-ASCII characters. Designed for security professionals and penetration testers who need to work with wordlists containing special characters from various character encodings.

Features

splitlist: Split wordlists into files with and without special characters
transformlist: Transform special characters according to configurable rules
Support for multiple character encodings (UTF-8, ISO-8859-2, CP852, etc.)
Both command-line tools and Python library API
Flexible character transformation rules via JSON configuration

Installation

From PyPI

pip install preparelist

From source

git clone https://github.com/kost/preparelist
cd preparelist
pip install -e .

Command-Line Usage

splitlist

Split a wordlist into two files: one containing words with special characters and one without.

# Basic usage
splitlist -i wordlist.txt -s special.txt -n normal.txt

# With specific input encoding
splitlist -i wordlist.txt -s special.txt -n normal.txt --input-encoding iso-8859-2

# With verbose output
splitlist -i wordlist.txt -s special.txt -n normal.txt -v

Options:

-i, --input: Input wordlist file (required)
-s, --special: Output file for words with special characters (required)
-n, --normal: Output file for words without special characters (required)
--input-encoding: Input file character encoding (default: utf-8)
--output-encoding: Output file character encoding (default: utf-8)
-v, --verbose: Verbose output
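
The split itself can be approximated in a few lines of plain Python. This is a minimal sketch, not the package's implementation: it assumes that "special" simply means "contains a non-ASCII character", and the helper name split_words is hypothetical.

```python
def split_words(words):
    """Partition words into (special, normal) lists.

    Assumption: a word is "special" if it contains any non-ASCII character.
    """
    special, normal = [], []
    for word in words:
        # str.isascii() is True only when every character is 7-bit ASCII.
        (normal if word.isascii() else special).append(word)
    return special, normal


special, normal = split_words(["password", "Šime", "admin", "café"])
print(special)  # ['Šime', 'café']
print(normal)   # ['password', 'admin']
```

The real tool additionally reads and writes files using the encodings given on the command line.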

transformlist

Transform characters in a wordlist according to a configuration file.

# Basic usage
transformlist -i wordlist.txt -o output.txt -c transform_simple.json

# Case-insensitive transformations (applies to both cases)
transformlist -i wordlist.txt -o output.txt -c transform_phonetic.json --case-insensitive

# Only output lines where transformation occurred
transformlist -i wordlist.txt -o output.txt -c transform_to_unicode_digraphs.json --only-transformed

# With specific encodings
transformlist -i wordlist.txt -o output.txt -c config.json \
  --input-encoding iso-8859-2 --output-encoding ascii

# With verbose output
transformlist -i wordlist.txt -o output.txt -c config.json -v

Options:

-i, --input: Input wordlist file (required)
-o, --output: Output wordlist file (required)
-c, --config: Transformation configuration file in JSON format (required)
--input-encoding: Input file character encoding (default: utf-8)
--output-encoding: Output file character encoding (default: utf-8)
--case-insensitive: Apply transformations to both uppercase and lowercase
--handle-titlecase: Generate titlecase variants for multi-character sequences (e.g., "nj" also matches "Nj")
--only-transformed: Only output lines where transformation occurred
-v, --verbose: Verbose output

Transformation Configuration Files

Transformation rules are defined in JSON files. Four example configurations are provided:

transform_phonetic.json

Phonetic transformations that preserve sound:

{
  "Š": "Sh",
  "š": "sh",
  "Đ": "Dj",
  "đ": "dj",
  "Č": "Ch",
  "č": "ch",
  "Ć": "Ch",
  "ć": "ch",
  "Ž": "Z",
  "ž": "z",
  "Dž": "Dz",
  "dž": "dz"
}

transform_simple.json

Simple replacements with the closest ASCII letters:

{
  "Š": "S",
  "š": "s",
  "Đ": "D",
  "đ": "d",
  "Č": "C",
  "č": "c",
  "Ć": "C",
  "ć": "c",
  "Ž": "Z",
  "ž": "z",
  "Dž": "Dz",
  "dž": "dz"
}

transform_to_unicode_digraphs.json

Transform ASCII digraphs to Unicode equivalents (explicit all-case mapping):

{
  "NJ": "Ǌ",
  "Nj": "ǋ",
  "nj": "ǌ",
  "LJ": "Ǉ",
  "Lj": "ǈ",
  "lj": "ǉ",
  "DŽ": "Ǆ",
  "Dž": "ǅ",
  "dž": "ǆ"
}
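
A mapping with multi-character keys like this one is easiest to apply in a single pass that prefers longer keys. The sketch below shows one way to do that with the standard re module. The function name apply_mapping is hypothetical, and whether transformlist works this way internally is an assumption.

```python
import re


def apply_mapping(text, mapping):
    """Replace each occurrence of a mapping key in text with its value (single pass)."""
    if not mapping:
        return text
    # Longer keys go first so that a short key which is a prefix of a longer
    # one (for example "n" and "nj") cannot win the match.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: mapping[match.group(0)], text)


# Escapes keep the three look-alike code points unambiguous.
digraphs = {
    "NJ": "\u01ca",  # uppercase digraph
    "Nj": "\u01cb",  # titlecase digraph
    "nj": "\u01cc",  # lowercase digraph
}

print(apply_mapping("Njujork", digraphs))   # first two letters become U+01CB
print(apply_mapping("password", digraphs))  # unchanged, no key matches
```

Because the text is scanned once, a replacement value is never re-examined by another rule, which keeps the result independent of rule order.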

transform_from_unicode_digraphs.json

Simplified config for use with --handle-titlecase flag (only lowercase specified):

{
  "nj": "ǌ",
  "lj": "ǉ",
  "dž": "ǆ"
}

When used with --handle-titlecase, this automatically handles titlecase variants like "Nj" → "ǋ".

Titlecase Handling

The --handle-titlecase flag automatically generates titlecase variants for multi-character sequences. This is useful when you only want to specify lowercase mappings in your config file, and have the tool automatically handle titlecase forms.

Example:

# Config file only contains: "nj": "ǌ", "lj": "ǉ"
transformlist -i wordlist.txt -o output.txt \
  -c examples/transform_from_unicode_digraphs.json \
  --handle-titlecase

Input:

njujork
Njujork
Ljubljana

Output:

ǌujork    (matches "nj" from config)
ǋujork    (matches generated "Nj" → "ǋ" titlecase variant)
ǈubǉana   (matches generated "Lj" → "ǈ" titlecase variant, plus "lj" from config)

Note: --handle-titlecase only generates titlecase variants (first char upper, rest lower). For full uppercase support, use --case-insensitive or specify all variants explicitly in your config.
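
The idea of generating titlecase variants can be sketched with str.capitalize(), which since Python 3.8 titlecases the first character and lowercases the rest. This is an illustration under assumptions, not the tool's code: the helper name add_titlecase_variants is hypothetical, and deriving the replacement by titlecasing the configured value is a guess that happens to agree with the "Nj" → "ǋ" example above.

```python
def add_titlecase_variants(mapping):
    """Return a copy of mapping extended with titlecase variants of multi-character keys."""
    expanded = dict(mapping)
    for key, value in mapping.items():
        if len(key) > 1:
            # "nj".capitalize() gives "Nj"; for the single code point U+01CC
            # capitalize() gives its titlecase form U+01CB (Python 3.8+).
            # setdefault keeps any variant the config already spells out.
            expanded.setdefault(key.capitalize(), value.capitalize())
    return expanded


rules = add_titlecase_variants({"nj": "\u01cc", "lj": "\u01c9"})
print(sorted(rules))  # ['Lj', 'Nj', 'lj', 'nj']
```

Full-uppercase keys such as "NJ" are deliberately not generated, matching the note above.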

You can create your own configuration files with any character mappings you need.
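
As a sketch of what a custom configuration could look like, the snippet below writes a small rule set to a JSON file and reads it back. It assumes the format is a flat JSON object mapping strings to strings, as in the examples above. The German rules and the file name transform_german.json are hypothetical and do not ship with the package.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical rule set, for illustration only.
rules = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}

config_path = Path(tempfile.mkdtemp()) / "transform_german.json"
# ensure_ascii=False keeps the characters readable instead of \uXXXX escapes.
config_path.write_text(
    json.dumps(rules, ensure_ascii=False, indent=2), encoding="utf-8"
)

loaded = json.loads(config_path.read_text(encoding="utf-8"))
# Sanity check: every key and value must be a string.
assert all(isinstance(k, str) and isinstance(v, str) for k, v in loaded.items())
print(loaded == rules)  # True
```

Saving the file as UTF-8 avoids surprises when the same config is used on systems with a different default encoding.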

Python Library Usage

Splitting wordlists

from preparelist import split_wordlist

# Split wordlist
special_count, normal_count = split_wordlist(
    input_file='wordlist.txt',
    output_special='special.txt',
    output_normal='normal.txt',
    input_encoding='utf-8',
    output_encoding='utf-8'
)

print(f"Words with special chars: {special_count}")
print(f"Words without special chars: {normal_count}")

Transforming wordlists

from preparelist import load_transformation_config, transform_wordlist

# Load transformation rules
transformations = load_transformation_config('transform_simple.json')

# Transform wordlist
line_count = transform_wordlist(
    input_file='wordlist.txt',
    output_file='transformed.txt',
    transformations=transformations,
    input_encoding='utf-8',
    output_encoding='ascii',
    case_sensitive=False  # Apply to both cases
)

print(f"Processed {line_count} lines")

Transforming individual text

from preparelist import transform_text, load_transformation_config

# Load config
config = load_transformation_config('transform_phonetic.json')

# Transform text
original = "Željko Šarić"
transformed = transform_text(original, config, case_sensitive=False)
print(f"{original} -> {transformed}")
# Output: Željko Šarić -> Zeljko Sharich

Checking for special characters

from preparelist import has_special_chars

print(has_special_chars("hello"))     # False
print(has_special_chars("Šime"))      # True
print(has_special_chars("café"))      # True

Supported Character Encodings

Common encodings include:

utf-8 (default)
iso-8859-1 (Latin-1)
iso-8859-2 (Latin-2, Central European)
cp852 (DOS Latin-2)
cp1250 (Windows Central European)
ascii (US-ASCII, 7-bit)

For a complete list, see Python's codec documentation.
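
The reason the input encoding matters is that the same character is stored as different bytes in each codec. The following sketch uses only Python's built-in codecs and is independent of preparelist. The sample word is arbitrary.

```python
word = "šuma"

for codec in ("utf-8", "iso-8859-2", "cp852", "cp1250"):
    raw = word.encode(codec)
    # Decoding with the same codec round-trips the word exactly.
    assert raw.decode(codec) == word
    print(f"{codec:10} {raw!r}")

try:
    word.encode("ascii")
except UnicodeEncodeError as exc:
    # ASCII has no code for the first character, so encoding fails.
    print(f"ascii cannot represent {exc.object[exc.start]!r}")

# Decoding with the wrong codec does not fail here, it silently garbles the word.
mojibake = word.encode("cp852").decode("cp1250")
print(mojibake == word)  # False
```

This is why a wordlist read with the wrong --input-encoding can produce entries that look valid but never match.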

Use Cases

Password Cracking: Transform wordlists to account for different character representations
Security Testing: Generate variants of wordlists for comprehensive testing
Data Cleaning: Normalize character encodings in text files
Localization: Adapt wordlists for different locales and character sets

Examples

Example 1: Processing a Croatian wordlist

# Split into special and normal
splitlist -i croatian_words.txt -s croatian_special.txt -n croatian_normal.txt

# Transform special characters to phonetic equivalents
transformlist -i croatian_special.txt -o croatian_phonetic.txt \
  -c examples/transform_phonetic.json --case-insensitive

Example 2: Converting DOS encoding to UTF-8

# Read CP852 input, apply the simple replacements, and write UTF-8 output
transformlist -i dos_wordlist.txt -o utf8_wordlist.txt \
  -c examples/transform_simple.json \
  --input-encoding cp852 --output-encoding utf-8

Example 3: Library usage for batch processing

import preparelist
from pathlib import Path

# Load transformation config once
config = preparelist.load_transformation_config('transform_simple.json')

# Process multiple files
wordlists = Path('wordlists').glob('*.txt')
for wordlist in wordlists:
    output = f"transformed_{wordlist.name}"
    preparelist.transform_wordlist(
        str(wordlist),
        output,
        config,
        case_sensitive=False
    )
    print(f"Processed {wordlist.name} -> {output}")

Example 4: Filtering wordlists with --only-transformed

The --only-transformed flag is useful for extracting only entries that contain specific characters:

# Extract only entries with ASCII digraphs (NJ, Nj, nj, LJ, Lj, lj, etc.)
transformlist -i mixed_wordlist.txt -o digraph_entries.txt \
  -c examples/transform_to_unicode_digraphs.json --only-transformed

# Result: only words like "Njujork", "Ljubljana" are in output,
# words like "password", "admin" are skipped

Input (mixed_wordlist.txt):

password
Njujork
admin
Ljubljana
test123

Output (digraph_entries.txt):

ǋujork
ǈubǉana

This is particularly useful for:

Identifying entries with specific character patterns
Creating filtered wordlists for targeted testing
Extracting names or terms from a specific language
Quality control and validation
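
The filtering behaviour amounts to comparing each line before and after transformation. Here is a minimal sketch of that idea, assuming a line counts as "transformed" whenever the result differs from the input. The names only_transformed and simplify are hypothetical, and str.translate stands in for the real transformation.

```python
def only_transformed(lines, transform):
    """Yield transformed lines, skipping those that transform leaves unchanged."""
    for line in lines:
        result = transform(line)
        if result != line:
            yield result


# A tiny single-character mapping, applied with str.translate.
table = str.maketrans({"š": "s", "č": "c"})


def simplify(word):
    return word.translate(table)


words = ["password", "šuma", "admin", "čaj"]
print(list(only_transformed(words, simplify)))  # ['suma', 'caj']
```

Writing it as a generator means large wordlists can be streamed line by line instead of being loaded into memory.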

Development

Running tests

pip install -e ".[dev]"
pytest

Building the package

python -m build

License

MIT License - see LICENSE file for details

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Author

kost - https://github.com/kost

Project details


This version

0.1.0

Nov 25, 2025

Download files


Source Distribution

preparelist-0.1.0.tar.gz (13.8 kB view details)

Uploaded Nov 25, 2025 Source

File details

Details for the file preparelist-0.1.0.tar.gz.

File metadata

Download URL: preparelist-0.1.0.tar.gz
Upload date: Nov 25, 2025
Size: 13.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.3

File hashes

Hashes for preparelist-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`bb9fbf7c301f06b663f5517897502082e49302b9a136f2fad816b66d9ee823af`
MD5	`26538ad6dc4ddd53384f2d5aa66b851b`
BLAKE2b-256	`3f1871015166159402ab8730e5ae04e4d4aeb02e56354ac17411d24c1559124c`

