A pandas accessor that automates 80-90% of standard tabular data preprocessing tasks
Pandas Auto Prep
A powerful pandas accessor that automates 80-90% of standard tabular data preprocessing tasks with a single command.
Installation
pip install pandas-auto-prep
Quick Start
import pandas as pd
import pandas_auto_prep # This registers the .prep accessor
# Load your data
df = pd.read_csv("your_data.csv")
# Preprocess with one line!
X_train, X_test, y_train, y_test = df.prep.auto(target='target_column')
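The import matters because pandas accessors are attached at import time. The sketch below shows the general mechanism using pandas' public `register_dataframe_accessor` decorator. The accessor name `demo_prep` and the method `drop_constant` are hypothetical, chosen for illustration only; they are not part of pandas-auto-prep, and the package's own implementation may differ.

```python
import pandas as pd


# Hypothetical accessor name, chosen so it cannot clash with the real `.prep`.
@pd.api.extensions.register_dataframe_accessor("demo_prep")
class DemoPrepAccessor:
    def __init__(self, pandas_obj: pd.DataFrame) -> None:
        # pandas passes in the DataFrame the accessor was called on.
        self._df = pandas_obj

    def drop_constant(self) -> pd.DataFrame:
        """Return a copy without columns that hold a single distinct value."""
        keep = [
            col for col in self._df.columns
            if self._df[col].nunique(dropna=False) > 1
        ]
        return self._df[keep].copy()


df = pd.DataFrame({"a": [1, 2, 3], "b": [7, 7, 7]})
cleaned = df.demo_prep.drop_constant()  # column "b" is constant and is dropped
```

Once the decorated class has been defined (or its module imported), every DataFrame exposes the accessor, which is why a bare `import` is enough to enable it.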
Features
The .prep.auto() method performs the following preprocessing steps automatically:
| Step | Task | Details |
|---|---|---|
| 1 | Housekeeping | Remove duplicates, constant columns, high-missing columns (>50%) |
| 2 | Datetime Handling | Extract Year, Month, Day, DayOfWeek from datetime columns |
| 3 | Missing Value Imputation | Median for numeric, Mode for categorical |
| 4 | Skewness Handling | Log(x+1) transformation for highly skewed columns |
| 5 | Outlier Handling | Winsorization (clip to 1st-99th percentiles) |
| 6 | Categorical Encoding | One-Hot for low cardinality, Label Encoding for high cardinality |
| 7 | Scaling | StandardScaler for all numeric features |
| 8 | Train/Test Split | Stratified split (configurable) |
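To make the table concrete, here is a rough sketch of steps 1, 3, 4 and 5 in plain pandas and NumPy. It is not the package's source code. The function names are hypothetical, and details such as using the absolute skewness and skipping columns with negative values before the log transform are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd


def housekeeping(df, missing_thresh=0.5):
    """Step 1: drop duplicate rows, high-missing columns, constant columns."""
    df = df.drop_duplicates()
    df = df.loc[:, df.isna().mean() <= missing_thresh]
    df = df.loc[:, df.nunique(dropna=False) > 1]
    return df


def impute(df):
    """Step 3: median for numeric columns, mode for everything else."""
    df = df.copy()
    for col in df.columns:
        if not df[col].isna().any():
            continue
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df


def log_skewed(df, skewness_thresh=1.0):
    """Step 4: log(x + 1) for highly skewed, non-negative numeric columns."""
    df = df.copy()
    for col in df.select_dtypes(include="number").columns:
        # Assumption: negative values are left alone so the log stays defined.
        if df[col].min() >= 0 and abs(df[col].skew()) > skewness_thresh:
            df[col] = np.log1p(df[col])
    return df


def winsorize(df, lower=0.01, upper=0.99):
    """Step 5: clip numeric columns to the 1st and 99th percentiles."""
    df = df.copy()
    for col in df.select_dtypes(include="number").columns:
        low, high = df[col].quantile([lower, upper])
        df[col] = df[col].clip(low, high)
    return df


raw = pd.DataFrame({
    "income": [30.0, 32.0, 31.0, np.nan, 29.0, 1000.0, 30.0],
    "city": ["A", "B", None, "A", "A", "B", "A"],
    "const": [1, 1, 1, 1, 1, 1, 1],
    "mostly_empty": [np.nan, np.nan, np.nan, np.nan, np.nan, 1.0, np.nan],
})

clean = winsorize(log_skewed(impute(housekeeping(raw))))
```

In this toy frame the last row duplicates the first, `const` never varies, and `mostly_empty` is more than 50% missing, so only `income` and `city` survive housekeeping.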
Parameters
df.prep.auto(
    target='target_column',   # Target column name (optional)
    test_size=0.2,            # Test split ratio
    random_state=42,          # Random seed
    cardinality_thresh=10,    # One-Hot vs Label Encoding threshold
    missing_thresh=0.5,       # Drop columns with >50% missing
    skewness_thresh=1.0,      # Skewness threshold for log transform
    verbose=True              # Print progress
)
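As an illustration of how a threshold like `cardinality_thresh` can steer the encoding choice, the sketch below one-hot encodes columns with few distinct values and label-encodes the rest. The helper `encode_categoricals` is hypothetical, and whether the package compares with `<=` or `<` at the threshold is an assumption here.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def encode_categoricals(df, cardinality_thresh=10):
    """One-hot encode low-cardinality columns, label-encode the others."""
    df = df.copy()
    encoders = {}       # fitted LabelEncoder per label-encoded column
    one_hot_cols = []
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            continue  # only non-numeric columns are encoded
        # Assumption: "low cardinality" means at most `cardinality_thresh` values.
        if df[col].nunique() <= cardinality_thresh:
            one_hot_cols.append(col)
        else:
            encoder = LabelEncoder()
            df[col] = encoder.fit_transform(df[col].astype(str))
            encoders[col] = encoder
    if one_hot_cols:
        df = pd.get_dummies(df, columns=one_hot_cols)
    return df, encoders


frame = pd.DataFrame({
    "color": ["red", "green", "blue"] * 4,       # 3 distinct values
    "user_id": [f"u{i}" for i in range(12)],     # 12 distinct values
    "amount": range(12),
})

encoded, encoders = encode_categoricals(frame, cardinality_thresh=10)
```

With the threshold at 10, `color` expands into one indicator column per value, while `user_id` becomes a single integer column whose encoder is kept for later inverse transforms.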
Usage Examples
With Target (Supervised Learning)
import pandas as pd
import pandas_auto_prep
df = pd.read_csv("customer_churn.csv")
X_train, X_test, y_train, y_test = df.prep.auto(target='churn')
# Ready for model training!
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X_train, y_train)
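The four returned objects follow the usual scikit-learn split convention, so the held-out pair can be used for evaluation. The sketch below uses synthetic data and scikit-learn's own `train_test_split` with `stratify` as a stand-in for the output of `df.prep.auto`; the feature names and the synthetic target are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an already preprocessed dataset.
rng = np.random.default_rng(42)
X = pd.DataFrame({"f1": rng.normal(size=200), "f2": rng.normal(size=200)})
y = (X["f1"] + X["f2"] > 0).astype(int).rename("churn")

# Stratifying on y keeps the class ratio similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Mean accuracy on the held-out 20%.
accuracy = model.score(X_test, y_test)
```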
Without Target (Data Cleaning Only)
import pandas as pd
import pandas_auto_prep
df = pd.read_csv("messy_data.csv")
clean_df = df.prep.auto() # Returns fully preprocessed DataFrame
Access Fitted Objects
# After running .prep.auto()
label_encoders = df.prep.get_label_encoders() # For inverse transforms
scaler = df.prep.get_scaler() # Fitted StandardScaler
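Fitted encoders and scalers are mainly useful for mapping processed values back to their original form. The sketch below shows the relevant scikit-learn calls on objects fitted directly in the example. The exact return types of `get_label_encoders()` and `get_scaler()` are not documented above, so treat the connection to them as an assumption.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Label encoding and its inverse: integer codes back to the original strings.
encoder = LabelEncoder()
codes = encoder.fit_transform(["basic", "premium", "basic", "trial"])
labels = encoder.inverse_transform(codes)

# Standard scaling and its inverse: z-scores back to the original units.
values = np.array([[10.0], [20.0], [30.0]])
scaler = StandardScaler()
scaled = scaler.fit_transform(values)
restored = scaler.inverse_transform(scaled)
```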
Requirements
- Python >= 3.8
- pandas >= 1.3.0
- numpy >= 1.20.0
- scikit-learn >= 1.0.0
License
MIT License - see LICENSE for details.
Author
Aarush Gupta
File details
Details for the file pandas_auto_prep-0.1.0.tar.gz.
File metadata
- Download URL: pandas_auto_prep-0.1.0.tar.gz
- Upload date:
- Size: 9.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d5296059c89d0a5c51262f5437f9b1d1db55d4b314e7d3170e1ab3eb943d57ce |
| MD5 | f0f222403fe304bbd3c23a993a3641a2 |
| BLAKE2b-256 | f488eafe7e90eb3df37710b4363bd95a26315b8c2154f80314f5e99cd86a7f38 |
File details
Details for the file pandas_auto_prep-0.1.0-py3-none-any.whl.
File metadata
- Download URL: pandas_auto_prep-0.1.0-py3-none-any.whl
- Upload date:
- Size: 9.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 589cdfdb85f49d8ff2c55d4a43cec9e67bd8671588fddb7a17424ed5f65cdb6d |
| MD5 | 39718276fa7cabf756c717c21b50abec |
| BLAKE2b-256 | 91e646732810734196c7939f53410fdafda51b6946f488fd4f3c308c4de06598 |