Wordlist manipulation tools for non-ASCII characters
preparelist
Wordlist manipulation tools for non-ASCII characters. Designed for security professionals and penetration testers who need to work with wordlists containing special characters from various character encodings.
Features
- splitlist: Split wordlists into files with and without special characters
- transformlist: Transform special characters according to configurable rules
- Support for multiple character encodings (UTF-8, ISO-8859-2, CP852, etc.)
- Both command-line tools and Python library API
- Flexible character transformation rules via JSON configuration
Installation
From PyPI
pip install preparelist
From source
git clone https://github.com/kost/preparelist
cd preparelist
pip install -e .
Command-Line Usage
splitlist
Split a wordlist into two files: one containing words with special characters and one without.
# Basic usage
splitlist -i wordlist.txt -s special.txt -n normal.txt
# With specific input encoding
splitlist -i wordlist.txt -s special.txt -n normal.txt --input-encoding iso-8859-2
# With verbose output
splitlist -i wordlist.txt -s special.txt -n normal.txt -v
Options:
- -i, --input: Input wordlist file (required)
- -s, --special: Output file for words with special characters (required)
- -n, --normal: Output file for words without special characters (required)
- --input-encoding: Input file character encoding (default: utf-8)
- --output-encoding: Output file character encoding (default: utf-8)
- -v, --verbose: Verbose output
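The split can be pictured as a simple partition of the input words. The sketch below is illustrative only and is not the package's implementation: it assumes that "special" means any character outside 7-bit ASCII (which matches the `has_special_chars("café")` example in the library section), and `is_non_ascii` and `partition_words` are hypothetical helper names.

```python
from typing import Iterable, List, Tuple


def is_non_ascii(word: str) -> bool:
    """Return True if any character is outside 7-bit ASCII (assumed meaning of "special")."""
    return any(ord(ch) > 127 for ch in word)


def partition_words(words: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split words into (special, normal) lists, preserving input order."""
    special: List[str] = []
    normal: List[str] = []
    for word in words:
        (special if is_non_ascii(word) else normal).append(word)
    return special, normal


special, normal = partition_words(["hello", "Šime", "café", "admin"])
print(special)  # ['Šime', 'café']
print(normal)   # ['hello', 'admin']
```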
transformlist
Transform characters in a wordlist according to a configuration file.
# Basic usage
transformlist -i wordlist.txt -o output.txt -c transform_simple.json
# Case-insensitive transformations (applies to both cases)
transformlist -i wordlist.txt -o output.txt -c transform_phonetic.json --case-insensitive
# Only output lines where transformation occurred
transformlist -i wordlist.txt -o output.txt -c transform_to_unicode_digraphs.json --only-transformed
# With specific encodings
transformlist -i wordlist.txt -o output.txt -c config.json \
--input-encoding iso-8859-2 --output-encoding ascii
# With verbose output
transformlist -i wordlist.txt -o output.txt -c config.json -v
Options:
- -i, --input: Input wordlist file (required)
- -o, --output: Output wordlist file (required)
- -c, --config: Transformation configuration file in JSON format (required)
- --input-encoding: Input file character encoding (default: utf-8)
- --output-encoding: Output file character encoding (default: utf-8)
- --case-insensitive: Apply transformations to both uppercase and lowercase
- --handle-titlecase: Generate titlecase variants for multi-character sequences (e.g., "nj" also matches "Nj")
- --only-transformed: Only output lines where transformation occurred
- -v, --verbose: Verbose output
Transformation Configuration Files
Transformation rules are defined in JSON files, each mapping a source string to its replacement. The following example configurations are provided:
transform_phonetic.json
Phonetic transformations that preserve sound:
{
"Š": "Sh",
"š": "sh",
"Đ": "Dj",
"đ": "dj",
"Č": "Ch",
"č": "ch",
"Ć": "Ch",
"ć": "ch",
"Ž": "Z",
"ž": "z",
"Dž": "Dz",
"dž": "dz"
}
transform_simple.json
Simple one-to-one character replacements:
{
"Š": "S",
"š": "s",
"Đ": "D",
"đ": "d",
"Č": "C",
"č": "c",
"Ć": "C",
"ć": "c",
"Ž": "Z",
"ž": "z",
"Dž": "Dz",
"dž": "dz"
}
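A mapping like the two above could be applied to text as follows. This is a minimal sketch under the assumption that a config is a flat JSON object of source string to replacement string; `apply_mapping` is a hypothetical helper, not the package's API. It tries longer keys first so that a multi-character key such as "Dž" is matched before a single-character key such as "ž".

```python
import json
import re
from typing import Dict


def apply_mapping(text: str, mapping: Dict[str, str]) -> str:
    """Replace every key of `mapping` found in `text`, preferring longer keys."""
    if not mapping:
        return text
    # Sort keys longest-first so multi-character keys win over their substrings.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mapping[m.group(0)], text)


# Inline JSON stands in for a config file such as transform_simple.json.
config = json.loads('{"Š": "S", "š": "s", "Ć": "C", "ć": "c", "Ž": "Z", "ž": "z"}')
print(apply_mapping("Željko Šarić", config))  # Zeljko Saric
```

Compiling a single alternation keeps the pass over the text linear and avoids one replacement feeding into another, which chained `str.replace` calls would not guarantee.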
transform_to_unicode_digraphs.json
Transform ASCII digraphs to Unicode equivalents (explicit all-case mapping):
{
"NJ": "Ǌ",
"Nj": "ǋ",
"nj": "ǌ",
"LJ": "Ǉ",
"Lj": "ǈ",
"lj": "ǉ",
"DŽ": "Ǆ",
"Dž": "ǅ",
"dž": "ǆ"
}
transform_from_unicode_digraphs.json
Simplified config for use with the --handle-titlecase flag (only lowercase specified):
{
"nj": "ǌ",
"lj": "ǉ",
"dž": "ǆ"
}
When used with --handle-titlecase, this automatically handles titlecase variants like "Nj" → "ǋ".
Titlecase Handling
The --handle-titlecase flag automatically generates titlecase variants for multi-character sequences. This is useful when you only want to specify lowercase mappings in your config file, and have the tool automatically handle titlecase forms.
Example:
# Config file contains only lowercase keys: "nj": "ǌ", "lj": "ǉ", "dž": "ǆ"
transformlist -i wordlist.txt -o output.txt \
-c examples/transform_from_unicode_digraphs.json \
--handle-titlecase
Input:
njujork
Njujork
Ljubljana
Output:
ǌujork (matches "nj" from config)
ǋujork (matches generated "Nj" → "ǋ" titlecase variant)
ǈubǉana (matches generated "Lj" → "ǈ" titlecase variant, plus "lj" from config)
Note: --handle-titlecase only generates titlecase variants (first char upper, rest lower). For full uppercase support, use --case-insensitive or specify all variants explicitly in your config.
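The variant generation described above could look roughly like this. It is a sketch under stated assumptions, not the tool's actual code: `add_titlecase_variants` is a hypothetical name, entries already present in the config are assumed to take precedence over generated ones, and the replacement values ("ny", "ly") are ASCII placeholders chosen only to keep the printed output easy to read.

```python
from typing import Dict


def add_titlecase_variants(mapping: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of `mapping` with titlecase variants of multi-character keys added."""
    expanded = dict(mapping)
    for key, value in mapping.items():
        if len(key) < 2:
            continue  # only multi-character sequences get a titlecase variant
        title_key = key[0].upper() + key[1:].lower()
        # str.title() applies the Unicode titlecase mapping to the first character.
        title_value = value[:1].title() + value[1:]
        # Explicit config entries win over generated ones (an assumption).
        expanded.setdefault(title_key, title_value)
    return expanded


print(add_titlecase_variants({"nj": "ny", "lj": "ly"}))
# {'nj': 'ny', 'lj': 'ly', 'Nj': 'Ny', 'Lj': 'Ly'}
```

Full-uppercase keys such as "NJ" are deliberately not generated here, in line with the note that --handle-titlecase covers titlecase only.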
You can create your own configuration files with any character mappings you need.
Python Library Usage
Splitting wordlists
from preparelist import split_wordlist
# Split wordlist
special_count, normal_count = split_wordlist(
    input_file='wordlist.txt',
    output_special='special.txt',
    output_normal='normal.txt',
    input_encoding='utf-8',
    output_encoding='utf-8'
)
print(f"Words with special chars: {special_count}")
print(f"Words without special chars: {normal_count}")
Transforming wordlists
from preparelist import load_transformation_config, transform_wordlist
# Load transformation rules
transformations = load_transformation_config('transform_simple.json')
# Transform wordlist
line_count = transform_wordlist(
    input_file='wordlist.txt',
    output_file='transformed.txt',
    transformations=transformations,
    input_encoding='utf-8',
    output_encoding='ascii',
    case_sensitive=False  # Apply to both cases
)
print(f"Processed {line_count} lines")
Transforming individual text
from preparelist import transform_text, load_transformation_config
# Load config
config = load_transformation_config('transform_phonetic.json')
# Transform text
original = "Željko Šarić"
transformed = transform_text(original, config, case_sensitive=False)
print(f"{original} -> {transformed}")
# Output: Željko Šarić -> Zeljko Sharich
Checking for special characters
from preparelist import has_special_chars
print(has_special_chars("hello")) # False
print(has_special_chars("Šime")) # True
print(has_special_chars("café")) # True
Supported Character Encodings
Common encodings include:
- utf-8 (default)
- iso-8859-1 (Latin-1)
- iso-8859-2 (Latin-2, Central European)
- cp852 (DOS Latin-2)
- cp1250 (Windows Central European)
- ascii (US-ASCII, 7-bit)
For a complete list, see Python's codec documentation.
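Not every encoding can represent every character, which is why a transformation step usually has to come before writing to a narrower encoding such as ascii. The sketch below uses only Python's built-in codecs to check which of the encodings listed above can hold a given word; `can_encode` is a hypothetical helper, and how the preparelist tools themselves react to unencodable characters is not covered here.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if `text` can be encoded losslessly with the given codec."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


word = "Šime"
for encoding in ("utf-8", "iso-8859-1", "iso-8859-2", "cp852", "cp1250", "ascii"):
    print(f"{encoding}: {can_encode(word, encoding)}")
```

For this word, the Central European code pages and UTF-8 succeed, while Latin-1 and ASCII do not, because neither contains "Š".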
Use Cases
- Password Cracking: Transform wordlists to account for different character representations
- Security Testing: Generate variants of wordlists for comprehensive testing
- Data Cleaning: Normalize character encodings in text files
- Localization: Adapt wordlists for different locales and character sets
Examples
Example 1: Processing a Croatian wordlist
# Split into special and normal
splitlist -i croatian_words.txt -s croatian_special.txt -n croatian_normal.txt
# Transform special characters to phonetic equivalents
transformlist -i croatian_special.txt -o croatian_phonetic.txt \
-c examples/transform_phonetic.json --case-insensitive
Example 2: Converting DOS encoding to UTF-8
# Read CP852 input, apply the simple replacements, and write UTF-8 output
transformlist -i dos_wordlist.txt -o utf8_wordlist.txt \
-c examples/transform_simple.json \
--input-encoding cp852 --output-encoding utf-8
Example 3: Library usage for batch processing
import preparelist
from pathlib import Path
# Load transformation config once
config = preparelist.load_transformation_config('transform_simple.json')
# Process multiple files
wordlists = Path('wordlists').glob('*.txt')
for wordlist in wordlists:
    output = f"transformed_{wordlist.name}"
    preparelist.transform_wordlist(
        str(wordlist),
        output,
        config,
        case_sensitive=False
    )
    print(f"Processed {wordlist.name} -> {output}")
Example 4: Filtering wordlists with --only-transformed
The --only-transformed flag is useful for extracting only entries that contain specific characters:
# Extract only entries with ASCII digraphs (NJ, Nj, nj, LJ, Lj, lj, etc.)
transformlist -i mixed_wordlist.txt -o digraph_entries.txt \
-c examples/transform_to_unicode_digraphs.json --only-transformed
# Result: only words like "Njujork", "Ljubljana" are in output,
# words like "password", "admin" are skipped
Input (mixed_wordlist.txt):
password
Njujork
admin
Ljubljana
test123
Output (digraph_entries.txt):
ǋujork
ǈubǉana
This is particularly useful for:
- Identifying entries with specific character patterns
- Creating filtered wordlists for targeted testing
- Extracting names or terms from a specific language
- Quality control and validation
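The filtering behaviour amounts to comparing each line before and after transformation and keeping only the ones that changed. The following is a minimal sketch of that idea, not the tool's implementation: `only_transformed` and `demo_transform` are hypothetical names, and the stand-in transform applies a single rule ("Nj" to the titlecase digraph U+01CB) instead of a full config.

```python
from typing import Callable, Iterable, Iterator


def only_transformed(lines: Iterable[str], transform: Callable[[str], str]) -> Iterator[str]:
    """Yield transformed lines, skipping those the transform leaves unchanged."""
    for line in lines:
        result = transform(line)
        if result != line:
            yield result


def demo_transform(line: str) -> str:
    """Stand-in for a config-driven transform: one ASCII-to-Unicode digraph rule."""
    return line.replace("Nj", "\u01cb")


words = ["password", "Njujork", "admin", "test123"]
print(list(only_transformed(words, demo_transform)))  # ['ǋujork']
```

Because the function is a generator, it can filter large wordlists line by line without holding them in memory.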
Development
Running tests
pip install -e ".[dev]"
pytest
Building the package
python -m build
License
MIT License - see LICENSE file for details
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Author
kost - https://github.com/kost