Wordlist manipulation tools for non-ASCII characters

preparelist

Wordlist manipulation tools for non-ASCII characters. Designed for security professionals and penetration testers who need to work with wordlists containing special characters from various character encodings.

Features

  • splitlist: Split wordlists into files with and without special characters
  • transformlist: Transform special characters according to configurable rules
  • Support for multiple character encodings (UTF-8, ISO-8859-2, CP852, etc.)
  • Both command-line tools and Python library API
  • Flexible character transformation rules via JSON configuration

Installation

From PyPI

pip install preparelist

From source

git clone https://github.com/kost/preparelist
cd preparelist
pip install -e .

Command-Line Usage

splitlist

Split a wordlist into two files: one containing words with special characters and one without.

# Basic usage
splitlist -i wordlist.txt -s special.txt -n normal.txt

# With specific input encoding
splitlist -i wordlist.txt -s special.txt -n normal.txt --input-encoding iso-8859-2

# With verbose output
splitlist -i wordlist.txt -s special.txt -n normal.txt -v

Options:

  • -i, --input: Input wordlist file (required)
  • -s, --special: Output file for words with special characters (required)
  • -n, --normal: Output file for words without special characters (required)
  • --input-encoding: Input file character encoding (default: utf-8)
  • --output-encoding: Output file character encoding (default: utf-8)
  • -v, --verbose: Verbose output
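The split described above can be pictured as a simple partition of the input. The following is a minimal sketch, not the tool's actual implementation: it assumes that "special" means "contains at least one non-ASCII character", and the helper names `is_plain_ascii` and `partition_words` are hypothetical.

```python
from typing import Iterable, List, Tuple


def is_plain_ascii(word: str) -> bool:
    """Return True when every character is in the 7-bit ASCII range."""
    return all(ord(ch) < 128 for ch in word)


def partition_words(words: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split words into (special, normal) lists, preserving input order."""
    special: List[str] = []
    normal: List[str] = []
    for word in words:
        word = word.rstrip("\r\n")
        if not word:
            continue  # skip blank lines
        # Assumption: any non-ASCII character makes a word "special".
        (normal if is_plain_ascii(word) else special).append(word)
    return special, normal


special, normal = partition_words(["hello", "Šime", "café", "admin"])
print(special)  # ['Šime', 'café']
print(normal)   # ['hello', 'admin']
```

The real tool may use a different definition of "special" (for example, it might also treat punctuation differently), so check its behavior on a small sample before relying on this picture.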

transformlist

Transform characters in a wordlist according to a configuration file.

# Basic usage
transformlist -i wordlist.txt -o output.txt -c transform_simple.json

# Case-insensitive transformations (applies to both cases)
transformlist -i wordlist.txt -o output.txt -c transform_phonetic.json --case-insensitive

# Only output lines where transformation occurred
transformlist -i wordlist.txt -o output.txt -c transform_to_unicode_digraphs.json --only-transformed

# With specific encodings
transformlist -i wordlist.txt -o output.txt -c config.json \
  --input-encoding iso-8859-2 --output-encoding ascii

# With verbose output
transformlist -i wordlist.txt -o output.txt -c config.json -v

Options:

  • -i, --input: Input wordlist file (required)
  • -o, --output: Output wordlist file (required)
  • -c, --config: Transformation configuration file in JSON format (required)
  • --input-encoding: Input file character encoding (default: utf-8)
  • --output-encoding: Output file character encoding (default: utf-8)
  • --case-insensitive: Apply transformations to both uppercase and lowercase
  • --handle-titlecase: Generate titlecase variants for multi-character sequences (e.g., "nj" also matches "Nj")
  • --only-transformed: Only output lines where transformation occurred
  • -v, --verbose: Verbose output

Transformation Configuration Files

Transformation rules are defined in JSON files that map each source string to its replacement. Four example configurations are provided:

transform_phonetic.json

Phonetic transformations that preserve sound:

{
  "Š": "Sh",
  "š": "sh",
  "Đ": "Dj",
  "đ": "dj",
  "Č": "Ch",
  "č": "ch",
  "Ć": "Ch",
  "ć": "ch",
  "Ž": "Z",
  "ž": "z",
  "Dž": "Dz",
  "dž": "dz"
}

transform_simple.json

Simple one-to-one character replacements:

{
  "Š": "S",
  "š": "s",
  "Đ": "D",
  "đ": "d",
  "Č": "C",
  "č": "c",
  "Ć": "C",
  "ć": "c",
  "Ž": "Z",
  "ž": "z",
  "Dž": "Dz",
  "dž": "dz"
}
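To illustrate how a mapping like the ones above could be applied to text, here is a minimal sketch. It is an assumption about the general technique, not the library's own code: the function name `apply_rules` is hypothetical, and it uses a single regular-expression pass in which longer keys are tried first.

```python
import json
import re
from typing import Dict


def apply_rules(text: str, rules: Dict[str, str]) -> str:
    """Replace every rule key found in text in a single left-to-right pass."""
    if not rules:
        return text
    # Try longer keys first so a long key is preferred over any
    # shorter key that is a prefix of it.
    keys = sorted(rules, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: rules[m.group(0)], text)


# Same shape as transform_simple.json, trimmed for brevity.
rules = json.loads('{"Š": "S", "š": "s", "Dž": "Dz", "ž": "z"}')
result = apply_rules("Šešir Džungla", rules)
print(result)  # Sesir Dzungla
```

A single pass has the advantage that text produced by one rule is never re-matched by another rule, which keeps the result independent of rule order.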

transform_to_unicode_digraphs.json

Transform two-letter digraphs into their single-code-point Unicode equivalents (explicit mapping for every case variant):

{
  "NJ": "Ǌ",
  "Nj": "ǋ",
  "nj": "ǌ",
  "LJ": "Ǉ",
  "Lj": "ǈ",
  "lj": "ǉ",
  "DŽ": "Ǆ",
  "Dž": "ǅ",
  "dž": "ǆ"
}

transform_from_unicode_digraphs.json

Simplified config for use with the --handle-titlecase flag (only lowercase keys are specified):

{
  "nj": "ǌ",
  "lj": "ǉ",
  "dž": "ǆ"
}

When used with --handle-titlecase, the tool also handles titlecase variants automatically, such as "Nj" → "ǋ".

Titlecase Handling

The --handle-titlecase flag automatically generates titlecase variants for multi-character sequences. This is useful when you only want to specify lowercase mappings in your config file, and have the tool automatically handle titlecase forms.

Example:

# Config file contains only lowercase rules: "nj": "ǌ", "lj": "ǉ", "dž": "ǆ"
transformlist -i wordlist.txt -o output.txt \
  -c examples/transform_from_unicode_digraphs.json \
  --handle-titlecase

Input:

njujork
Njujork
Ljubljana

Output:

ǌujork    (matches "nj" from config)
ǋujork    (matches generated "Nj" → "ǋ" titlecase variant)
ǈubǉana   (matches generated "Lj" → "ǈ" titlecase variant, then "lj" from config)

Note: --handle-titlecase only generates titlecase variants (first char upper, rest lower). For full uppercase support, use --case-insensitive or specify all variants explicitly in your config.

You can create your own configuration files with any character mappings you need.
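The titlecase expansion described in this section could look roughly like the sketch below. This is an illustration under stated assumptions, not the tool's implementation: the function name `add_titlecase_variants` is hypothetical, and it assumes the generated replacement is the titlecase form of the configured replacement.

```python
from typing import Dict


def add_titlecase_variants(rules: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of rules with titlecase variants of multi-character keys."""
    expanded = dict(rules)
    for key, value in rules.items():
        if len(key) < 2 or not key.islower():
            continue  # only lowercase multi-character sequences get a variant
        title_key = key[0].upper() + key[1:]
        # str.title() applies the Unicode titlecase mapping to the first
        # character, so U+01CC (nj digraph) becomes U+01CB (Nj digraph).
        title_value = value[:1].title() + value[1:]
        expanded.setdefault(title_key, title_value)  # explicit rules win
    return expanded


# "\u01cc" is the lowercase nj digraph, "\u01c9" the lowercase lj digraph.
rules = add_titlecase_variants({"nj": "\u01cc", "lj": "\u01c9", "š": "s"})
print(sorted(rules))
```

Single-character keys such as "š" are left alone here, which matches the note above that only multi-character sequences get a generated variant. Rules already present in the config are never overwritten.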

Python Library Usage

Splitting wordlists

from preparelist import split_wordlist

# Split wordlist
special_count, normal_count = split_wordlist(
    input_file='wordlist.txt',
    output_special='special.txt',
    output_normal='normal.txt',
    input_encoding='utf-8',
    output_encoding='utf-8'
)

print(f"Words with special chars: {special_count}")
print(f"Words without special chars: {normal_count}")

Transforming wordlists

from preparelist import load_transformation_config, transform_wordlist

# Load transformation rules
transformations = load_transformation_config('transform_simple.json')

# Transform wordlist
line_count = transform_wordlist(
    input_file='wordlist.txt',
    output_file='transformed.txt',
    transformations=transformations,
    input_encoding='utf-8',
    output_encoding='ascii',
    case_sensitive=False  # Apply to both cases
)

print(f"Processed {line_count} lines")

Transforming individual text

from preparelist import transform_text, load_transformation_config

# Load config
config = load_transformation_config('transform_phonetic.json')

# Transform text
original = "Željko Šarić"
transformed = transform_text(original, config, case_sensitive=False)
print(f"{original} -> {transformed}")
# Output: Željko Šarić -> Zeljko Sharich

Checking for special characters

from preparelist import has_special_chars

print(has_special_chars("hello"))     # False
print(has_special_chars("Šime"))      # True
print(has_special_chars("café"))      # True

Supported Character Encodings

Common encodings include:

  • utf-8 (default)
  • iso-8859-1 (Latin-1)
  • iso-8859-2 (Latin-2, Central European)
  • cp852 (DOS Latin-2)
  • cp1250 (Windows Central European)
  • ascii (US-ASCII, 7-bit)

For a complete list, see Python's codec documentation.
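Before choosing an --output-encoding, it can help to check whether the target encoding can represent the characters in your wordlist at all. The sketch below uses only the Python standard library; the helper name `can_encode` is hypothetical and not part of preparelist.

```python
import codecs


def can_encode(word: str, encoding: str) -> bool:
    """Return True when word survives an encode/decode round trip."""
    try:
        return word.encode(encoding).decode(encoding) == word
    except UnicodeEncodeError:
        return False


word = "Šime"
for name in ("utf-8", "iso-8859-2", "cp852", "cp1250", "iso-8859-1", "ascii"):
    codecs.lookup(name)  # raises LookupError if the codec name is unknown
    status = "ok" if can_encode(word, name) else "cannot represent this word"
    print(f"{name:>10}: {status}")
```

Latin-1 and ASCII have no code point for "Š", so words containing it need to be transformed first (for example with transform_simple.json) before they can be written in those encodings.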

Use Cases

  • Password Cracking: Transform wordlists to account for different character representations
  • Security Testing: Generate variants of wordlists for comprehensive testing
  • Data Cleaning: Normalize character encodings in text files
  • Localization: Adapt wordlists for different locales and character sets

Examples

Example 1: Processing a Croatian wordlist

# Split into special and normal
splitlist -i croatian_words.txt -s croatian_special.txt -n croatian_normal.txt

# Transform special characters to phonetic equivalents
transformlist -i croatian_special.txt -o croatian_phonetic.txt \
  -c examples/transform_phonetic.json --case-insensitive

Example 2: Converting DOS encoding to UTF-8

# Decode CP852 input, apply the transform_simple.json rules, and write UTF-8
transformlist -i dos_wordlist.txt -o utf8_wordlist.txt \
  -c examples/transform_simple.json \
  --input-encoding cp852 --output-encoding utf-8

Example 3: Library usage for batch processing

import preparelist
from pathlib import Path

# Load transformation config once
config = preparelist.load_transformation_config('transform_simple.json')

# Process multiple files
wordlists = Path('wordlists').glob('*.txt')
for wordlist in wordlists:
    output = f"transformed_{wordlist.name}"
    preparelist.transform_wordlist(
        str(wordlist),
        output,
        config,
        case_sensitive=False
    )
    print(f"Processed {wordlist.name} -> {output}")

Example 4: Filtering wordlists with --only-transformed

The --only-transformed flag is useful for extracting only entries that contain specific characters:

# Extract only entries with ASCII digraphs (NJ, Nj, nj, LJ, Lj, lj, etc.)
transformlist -i mixed_wordlist.txt -o digraph_entries.txt \
  -c examples/transform_to_unicode_digraphs.json --only-transformed

# Result: only words like "Njujork", "Ljubljana" are in output,
# words like "password", "admin" are skipped

Input (mixed_wordlist.txt):

password
Njujork
admin
Ljubljana
test123

Output (digraph_entries.txt):

ǋujork
ǈubǉana

This is particularly useful for:

  • Identifying entries with specific character patterns
  • Creating filtered wordlists for targeted testing
  • Extracting names or terms from a specific language
  • Quality control and validation
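The filtering behavior of --only-transformed can be sketched as "keep a line only if applying the rules changed it". The code below is a simplified illustration, not the tool's implementation: `transform` and `only_transformed` are hypothetical helpers, the rules assume that ASCII digraphs map to single-code-point Unicode digraph letters, and plain sequential `str.replace` stands in for whatever matching the tool really performs.

```python
from typing import Dict, Iterable, Iterator


def transform(word: str, rules: Dict[str, str]) -> str:
    """Apply each rule with plain substring replacement (longest keys first)."""
    for key in sorted(rules, key=len, reverse=True):
        word = word.replace(key, rules[key])
    return word


def only_transformed(words: Iterable[str], rules: Dict[str, str]) -> Iterator[str]:
    """Yield the transformed form of each word that at least one rule changed."""
    for word in words:
        changed = transform(word, rules)
        if changed != word:
            yield changed


# Assumed digraph rules: U+01CB/U+01CC are Nj/nj, U+01C8/U+01C9 are Lj/lj.
rules = {"Nj": "\u01cb", "nj": "\u01cc", "Lj": "\u01c8", "lj": "\u01c9"}
words = ["password", "Njujork", "admin", "Ljubljana", "test123"]
kept = list(only_transformed(words, rules))
print(kept)
```

Entries such as "password" and "admin" contain none of the rule keys, so they come back unchanged and are dropped from the output.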

Development

Running tests

pip install -e ".[dev]"
pytest

Building the package

python -m build

License

MIT License - see LICENSE file for details

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Author

kost - https://github.com/kost
