Data Preprocessing fns
Installation
pip install dsfns
dsfns
Version History
1.1 - Outlier handling - IQR method / Winsorizer / Clip
1.2 - Missing value imputation - Simple Imputer / Mean, Median, Mode / Numerical data with mean and obj/cat with mode
1.3 - Outlier Counter, Columns with Outlier and high frequency column finder (single value repeating over 50%)
1.4 - Data Encoding (label, onehot) and Scaling (MinMax, Standard, Robust)
1.5 - Fixed general code issues
1.6 - Removed redundant code
1.7 - Fixed general code issues / issues with return statement
1.8 - Fixed general code issues / issues with df.sample()
1.9 - Added Outlier replacement (using Mean, Median or Mode) and Low Variance columns (cols with IQR = 0)
1.10 - Missing value Interpolation (use with Time series data, Continuous variables with trends etc)
2.0 - Visualize data with Single and Multiple Line plots
2.1 - Visualize data with Multiple Regression plots and VIF (Variance Inflation Factor)
2.2 - Compare model accuracy of multiple models
2.3 - Added function that outputs various metrics for model evaluation
2.4 - Minor code changes to accommodate new features
2.6 - Function added that can train, evaluate, and print results for any classifier, optionally using resampling.
FUNCTION DESCRIPTIONS
OUTLIER HANDLING
-
Replace Outliers using IQR, Upper Limit and Lower Limit: Identifies and handles outliers in the specified columns using the Interquartile Range (IQR) method.
Outlier_IQR(df, columns, whis=1.5)
- df: DataFrame — The input data in which outliers will be detected.
- columns: list — List of column names in which outliers need to be identified.
- whis: float, optional (default=1.5) — The multiplier to define the outlier limits as 1.5 x IQR by default.
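The IQR rule described above can be sketched in plain pandas. This is an illustration only, not the package's implementation: the helper name `cap_outliers_iqr` is hypothetical, and capping values at the limits is an assumption about how the outliers are replaced.

```python
import pandas as pd

def cap_outliers_iqr(df, columns, whis=1.5):
    """Hypothetical helper: cap values outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    out = df.copy()
    for col in columns:
        q1, q3 = out[col].quantile(0.25), out[col].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - whis * iqr, q3 + whis * iqr
        # Assumption: outliers are pulled back to the nearest limit
        out[col] = out[col].clip(lower=lower, upper=upper)
    return out

df = pd.DataFrame({"x": [1, 2, 3, 4, 100]})
capped = cap_outliers_iqr(df, ["x"])
# Q1 = 2, Q3 = 4, IQR = 2, so the limits are -1 and 7 and 100 becomes 7
```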
-
Replace outliers with Winsorizer: Applies Winsorization to cap outliers in the specified column using either IQR or other capping methods.
Outlier_Winsorizer(df, column, capping_method='iqr', fold=1.5)
- df: DataFrame — The input data to apply Winsorization.
- column: str — The name of the column to apply the Winsorization to.
- capping_method (str, default 'iqr'): Method used to define the outlier thresholds (options: 'iqr', 'std', 'quantiles' or 'mad').
- fold (float, optional, default=1.5): Multiplier applied to the chosen spread measure (for 'iqr', the IQR) to set the capping limits.
-
Clip Outliers using Clip method: Clips extreme values to a predefined threshold in the specified columns, effectively handling outliers.
Outlier_Clip(df, columns, perc=0.05)
- df: DataFrame — The input data to clip outliers from.
- columns: list — List of columns in which to clip the outliers.
- perc (float, default=0.05): The percentile used for clipping. The default of 0.05 clips values below the 5th percentile and above the 95th percentile.
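Percentile clipping of this kind can be sketched with `Series.quantile` and `Series.clip`. The helper name `clip_percentiles` is hypothetical, and the symmetric `perc` / `1 - perc` bounds are an assumption based on the parameter description.

```python
import pandas as pd

def clip_percentiles(df, columns, perc=0.05):
    """Hypothetical helper: clip each column to its [perc, 1 - perc] quantiles."""
    out = df.copy()
    for col in columns:
        lower = out[col].quantile(perc)
        upper = out[col].quantile(1 - perc)
        out[col] = out[col].clip(lower=lower, upper=upper)
    return out

df = pd.DataFrame({"x": list(range(101))})  # values 0..100
clipped = clip_percentiles(df, ["x"], perc=0.05)
# Values below the 5th percentile (5) and above the 95th (95) are clipped
```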
-
Outlier Replacement with Mean, Median or Mode: This function is designed to identify and handle outliers in the specified columns of a given DataFrame. It uses the Interquartile Range (IQR) method to determine outliers and replaces the outliers with a user-defined statistic (mean, median, or mode).
Outlier_MMM(df, columns, type='median')
- df (DataFrame): The input pandas DataFrame that contains the data to be processed.
- columns (list of str): A list of column names in the DataFrame where outlier handling should be applied.
- type (optional): It can be one of 'mean', 'median', or 'mode'. The default is 'median'.
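A minimal sketch of IQR-based replacement follows. The helper name `replace_outliers` and its `stat` and `whis` parameters are hypothetical, and computing the statistic over the whole column (outliers included) is an assumption; the package may do this differently.

```python
import pandas as pd

def replace_outliers(df, columns, stat="median", whis=1.5):
    """Hypothetical helper: replace IQR outliers with the column mean, median or mode."""
    out = df.copy()
    for col in columns:
        s = out[col].astype(float)
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        is_outlier = (s < q1 - whis * iqr) | (s > q3 + whis * iqr)
        if stat == "mean":
            fill = s.mean()
        elif stat == "median":
            fill = s.median()
        elif stat == "mode":
            fill = s.mode().iloc[0]  # first mode if there are several
        else:
            raise ValueError("stat must be 'mean', 'median' or 'mode'")
        # Keep non-outliers, substitute the statistic for outliers
        out[col] = s.mask(is_outlier, fill)
    return out

df = pd.DataFrame({"x": [1, 2, 3, 4, 100]})
fixed = replace_outliers(df, ["x"], stat="median")
# The median of the column is 3, so 100 is replaced by 3
```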
-
Outlier Columns: Returns a list of columns that contain outliers based on IQR.
outlierColumns(df, whis=1.5):
- df: DataFrame — The input data to check for outliers.
- whis: float, optional (default=1.5) — The multiplier to define the outlier limits as 1.5 x IQR by default
-
Outlier Counter: Counts the number of outliers in the specified columns.
outlierCount(df, columns)
- df: DataFrame — The input data to count outliers in.
- columns: list — List of columns to check for outliers.
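Counting outliers and listing the columns that contain them both reduce to the same IQR mask. The sketch below uses a hypothetical helper `count_outliers` and assumes the default 1.5 x IQR limits.

```python
import pandas as pd

def count_outliers(df, columns, whis=1.5):
    """Hypothetical helper: number of IQR outliers per column."""
    counts = {}
    for col in columns:
        q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        iqr = q3 - q1
        mask = (df[col] < q1 - whis * iqr) | (df[col] > q3 + whis * iqr)
        counts[col] = int(mask.sum())
    return counts

df = pd.DataFrame({"a": [1, 2, 3, 4, 100], "b": [5, 5, 6, 6, 7]})
counts = count_outliers(df, ["a", "b"])
# Columns with at least one outlier
cols_with_outliers = [col for col, n in counts.items() if n > 0]
```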
MISSING VALUE IMPUTATION
-
Fill Missing Values with Mean, Median or Mode using df.replace: Replaces missing values in the specified columns using a chosen method.
MissingVal_Repl(df, columns, type='mean')
- df: DataFrame — The input data in which missing values will be replaced.
- columns: list — List of column names where missing values need to be replaced.
- type (str, default 'mean'): The method used for replacement ('mean', 'median', or 'mode').
-
Fill Missing Values with Mean, Median or Mode with Simple Imputer: The MissingVal_Imputer function is designed to handle missing values in specified columns of a pandas DataFrame using different imputation strategies. It replaces missing values (NaN) with appropriate values based on the chosen strategy.
MissingVal_Imputer(df,columns,strategy='mean')
- df (pandas.DataFrame): The input DataFrame where missing values need to be imputed.
- columns (list): A list of column names where missing value imputation is to be applied.
- strategy (str, default='mean'): The strategy for imputing missing values. Supported values: 'mean' replaces missing values with the mean of the column; 'median' replaces them with the median of the column; 'mode' replaces them with the most frequent value in the column (converted to 'most_frequent' internally).
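The strategy mapping described above can be sketched with scikit-learn's `SimpleImputer`. The wrapper name `impute_columns` is hypothetical; only the 'mode' to 'most_frequent' translation is taken from the description.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

def impute_columns(df, columns, strategy="mean"):
    """Hypothetical wrapper around SimpleImputer for selected columns."""
    # SimpleImputer has no 'mode' strategy; its equivalent is 'most_frequent'
    sk_strategy = "most_frequent" if strategy == "mode" else strategy
    out = df.copy()
    imputer = SimpleImputer(strategy=sk_strategy)
    out[columns] = imputer.fit_transform(out[columns])
    return out

df = pd.DataFrame({"age": [20.0, np.nan, 40.0], "score": [1.0, 3.0, np.nan]})
filled = impute_columns(df, ["age", "score"], strategy="mean")
```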
-
Fill Missing Values with Mean and/or Mode: Fills missing values across the whole DataFrame, using the mean for numeric columns and the mode (the first mode, index[0]) for object columns.
MissingVal_Fillna(df)
- df (DataFrame): The input data in which missing values will be filled.
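A minimal sketch of this dtype-based fill, assuming the behaviour described above (mean for numeric columns, first mode otherwise). The helper name `fill_mean_mode` is hypothetical.

```python
import numpy as np
import pandas as pd

def fill_mean_mode(df):
    """Hypothetical helper: mean for numeric columns, first mode for the rest."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            # mode() can return several values; take the first one
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

df = pd.DataFrame({"n": [1.0, np.nan, 3.0], "c": ["a", None, "a"]})
filled = fill_mean_mode(df)
```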
-
Interpolate Missing Values: This function is designed to handle missing values in a DataFrame by applying interpolation methods to the numerical columns. NOTE 1: Interpolation only works for numeric columns. NOTE 2: Works better with continuous data
def MissingVal_Interpolate(df,type='linear')
- df (DataFrame): The input DataFrame that contains missing (NaN) values.
- type: Specifies the interpolation method to be used. Options include: 'linear' (default) for linear interpolation; 'polynomial' for polynomial interpolation with degree 2 (quadratic); 'spline' for cubic spline interpolation.
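Interpolation restricted to numeric columns can be sketched with `DataFrame.interpolate`. The helper name `interpolate_numeric` is hypothetical. In pandas, `method="polynomial"` and `method="spline"` need an `order` argument and require SciPy, so only the linear case is run here.

```python
import numpy as np
import pandas as pd

def interpolate_numeric(df, method="linear", **kwargs):
    """Hypothetical helper: interpolate NaNs in numeric columns only."""
    out = df.copy()
    num_cols = out.select_dtypes(include="number").columns
    out[num_cols] = out[num_cols].interpolate(method=method, **kwargs)
    return out

df = pd.DataFrame({
    "t": [1.0, np.nan, 3.0, np.nan, 5.0],
    "label": ["a", "b", "c", "d", "e"],  # non-numeric, left untouched
})
linear = interpolate_numeric(df)
# With SciPy installed: interpolate_numeric(df, method="polynomial", order=2)
```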
OTHER FUNCTIONS
-
High Frequency Columns: Identifies and returns columns where more than the given percentage (default 50%) of values are identical, typically used to detect low-variance or high-frequency columns.
highFrequency(df, perc=0.5):
- df: DataFrame — The input data to identify high-frequency columns.
- perc: float, default 0.5 — The percentage threshold for identifying high-frequency columns.
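The check amounts to comparing the share of the most common value with the threshold. The helper name `high_frequency_columns` is hypothetical, and counting NaN as a value is an assumption.

```python
import pandas as pd

def high_frequency_columns(df, perc=0.5):
    """Hypothetical helper: columns whose most common value exceeds the given share."""
    cols = []
    for col in df.columns:
        # value_counts sorts descending, so the first entry is the top share
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share > perc:
            cols.append(col)
    return cols

df = pd.DataFrame({"flag": [0, 0, 0, 1], "id": [1, 2, 3, 4]})
high_freq = high_frequency_columns(df)
# "flag" is 75% zeros; every "id" value has a 25% share
```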
-
Encoder: Encodes categorical columns into numeric labels for compatibility with machine learning algorithms.
Encoding(df, method='label')
- df: A Pandas DataFrame containing the dataset.
- method: 'label' for label encoding OR 'onehot' for OneHotEncoding
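To illustrate the difference between the two methods, here is a sketch using pandas only (category codes for label encoding, `get_dummies` for one-hot). This is a swapped-in technique for illustration and may not match how the package encodes columns internally.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "red"], "size": [1, 2, 3]})
cat_cols = ["color"]  # categorical columns, listed explicitly for this example

# Label encoding: one integer per category (categories sorted alphabetically)
label = df.copy()
for col in cat_cols:
    label[col] = label[col].astype("category").cat.codes

# One-hot encoding: one indicator column per category
onehot = pd.get_dummies(df, columns=cat_cols)
```

Label encoding keeps one column but implies an ordering of the categories; one-hot encoding avoids that at the cost of extra columns.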
-
Scaler: Scales numerical data for better performance during machine learning model training.
Scaler(df, method='minmax')
- df: A Pandas DataFrame containing numeric data.
- method: Specifies the scaling technique to use. Options are: 'minmax' (default): Rescales data to a range of 0 to 1. 'standard': Standardizes data to have a mean of 0 and a standard deviation of 1. 'robust': Scales data using the median and interquartile range, making it robust to outliers.
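The three options correspond naturally to scikit-learn's `MinMaxScaler`, `StandardScaler` and `RobustScaler`. The sketch below assumes that mapping; the helper name `scale` and the `SCALERS` dictionary are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Assumed mapping from method name to scikit-learn scaler
SCALERS = {"minmax": MinMaxScaler, "standard": StandardScaler, "robust": RobustScaler}

def scale(df, method="minmax"):
    """Hypothetical helper: scale all columns and keep the DataFrame labels."""
    scaler = SCALERS[method]()
    return pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)

df = pd.DataFrame({"x": [0.0, 5.0, 10.0]})
minmax = scale(df, "minmax")      # (x - min) / (max - min)
standard = scale(df, "standard")  # (x - mean) / std
robust = scale(df, "robust")      # (x - median) / IQR
```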
-
Low Variance Columns: This function detects columns in a DataFrame with very low variance (i.e., columns where the values are almost constant or do not vary much). A column is flagged as low-variance when its IQR is 0, which includes constant columns as well as columns dominated by a single value. Returns a list of column names that have low variance (IQR = 0).
LowVarianceCols(df)
- df (DataFrame): The input pandas DataFrame for which low variance columns need to be identified.
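The IQR = 0 criterion can be sketched as follows. The helper name `low_variance_columns` is hypothetical, and restricting the check to numeric columns is an assumption.

```python
import pandas as pd

def low_variance_columns(df):
    """Hypothetical helper: numeric columns whose IQR is exactly 0."""
    num = df.select_dtypes(include="number")
    iqr = num.quantile(0.75) - num.quantile(0.25)
    return iqr[iqr == 0].index.tolist()

df = pd.DataFrame({
    "const": [7, 7, 7, 7, 7],   # constant: IQR = 0
    "mostly": [0, 0, 0, 0, 9],  # non-zero variance, but Q1 = Q3 = 0
    "varied": [1, 2, 3, 4, 5],  # IQR = 2
})
low_var = low_variance_columns(df)
```

Note that "mostly" is flagged even though its variance is not zero, because both quartiles fall on the repeated value.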
-
VIF(Variance Inflation Factor): The VIF function calculates the Variance Inflation Factor (VIF) for each predictor variable in a dataset, providing insights into multicollinearity. A high VIF (usually greater than 10) indicates that the variable is highly collinear with other predictors and might need to be addressed.
def VIF(X)
- X (DataFrame): A DataFrame containing the independent variables (predictor features) of the dataset. NOTE: The dataset should not include the target variable (dependent variable).
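For reference, the VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors (with an intercept). These values are also the diagonal of the inverse correlation matrix, which is what the NumPy sketch below uses. The helper name `vif_table` is hypothetical and the data is synthetic; the package may compute VIF differently.

```python
import numpy as np
import pandas as pd

def vif_table(X):
    """Hypothetical helper: VIF per column via the inverse correlation matrix."""
    corr = np.corrcoef(X.to_numpy(dtype=float), rowvar=False)
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=X.columns, name="VIF")

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = pd.DataFrame({
    "a": a,
    "b": b,
    "a_plus_noise": a + 0.05 * rng.normal(size=200),  # nearly a copy of "a"
})
vif = vif_table(X)
# "a" and "a_plus_noise" are almost collinear, so their VIFs are large
```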
-
RowTransformer: The RowTransformer function is a utility that creates a custom transformer for scikit-learn pipelines. This transformer allows you to apply a custom transformation function (custom_transform_fn) to each row of a dataset. It can be useful when you need to perform row-wise operations, such as applying specific functions to individual rows of data in a machine learning pipeline.
def RowTransformer(custom_transform_fn)
- custom_transform_fn: A function that takes in a DataFrame or ndarray and performs custom transformations on the rows. This function is applied to each row during the transformation step.
Example:
    import numpy as np

    # Sample custom transformation function
    def custom_transform_fn(X):
        return np.log(X + 1)

    # Create the custom transformer (a new name avoids shadowing RowTransformer)
    row_transformer = RowTransformer(custom_transform_fn)
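A comparable pattern is available directly in scikit-learn through `FunctionTransformer`, which wraps a plain function so it can be used as a pipeline step. The sketch below uses it as a stand-in; whether RowTransformer is built this way is not stated in this description.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def log1p_transform(X):
    # Same transformation as in the example above
    return np.log(X + 1)

X = np.array([[0.0, 9.0], [99.0, 999.0]])

# Stand-alone use of the wrapped function
logged = FunctionTransformer(log1p_transform).fit_transform(X)

# The same step inside a pipeline, followed by scaling
pipe = Pipeline([
    ("log", FunctionTransformer(log1p_transform)),
    ("scale", StandardScaler()),
])
scaled = pipe.fit_transform(X)
```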
DATA VISUALIZATION
-
LinePlot Multiple: Creates a set of subplots where each input column (inpCol) is plotted against the output column (outCol) in individual subplots.
def Lineplot_Multi(df, inpCol, outCol, figsize=(15, 5))
- df (DataFrame): The input dataset containing the columns to plot.
- inpCol (list): A list of input columns (features) to plot against the output column.
- outCol (str): The output column (target variable) to plot against each input column.
- figsize (tuple): Tuple defining the size of the overall figure (default: (15, 5)).
-
LinePlot Single: Plots multiple input columns (inpCol) against the output column (outCol) on the same plot, using a different line for each input column, with a legend to identify them. NOTE: Scale the data first for better visualization.
def Lineplot_Single(df, inpCol, outCol)
- df (DataFrame): The input dataset containing the columns to plot.
- inpCol (list): A list of input columns (features) to plot against the output column.
- outCol (str): The output column (target variable) to plot against each input column.
-
RegressionPlot Multiple: Creates a set of subplots where each input column (inpCol) is shown in its own regression plot against the output column (outCol).
def RegressionPlot_Multiple(df, inpCol, outCol, figsize=(15, 5))
- df (DataFrame): The input dataset containing the columns to plot.
- inpCol (list): A list of input columns (features) to plot against the output column.
- outCol (str): The output column (target variable) to plot against each input column.
- figsize (tuple): Tuple defining the size of the overall figure (default: (15, 5)).
EVALUATION
-
Compare Model Accuracy: The CompareAccuracy function is a utility to evaluate and compare the training and testing accuracy of multiple machine learning models on a given dataset. It provides a simple way to benchmark different models and understand their performance.
def CompareAccuracy(models, x_train, x_test, y_train, y_test):
- models (dict): A dictionary where keys are model names (str) and values are their respective model objects. E.g.: models = {"svm": SVC(kernel='linear', random_state=42), "knn": KNeighborsClassifier(n_neighbors=5)}
- x_train: The feature set for training the models.
- x_test: The feature set for testing the models.
- y_train: The target labels for training the models.
- y_test: The target labels for testing the models.
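The comparison boils down to fitting each model and scoring it on both splits. The sketch below shows that loop with scikit-learn; the helper name `compare_accuracy`, its dictionary return value and the synthetic dataset are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def compare_accuracy(models, x_train, x_test, y_train, y_test):
    """Hypothetical helper: train/test accuracy for each named model."""
    results = {}
    for name, model in models.items():
        model.fit(x_train, y_train)
        results[name] = {
            "train": model.score(x_train, y_train),
            "test": model.score(x_test, y_test),
        }
    return results

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)
models = {
    "svm": SVC(kernel="linear", random_state=42),
    "knn": KNeighborsClassifier(n_neighbors=5),
}
results = compare_accuracy(models, x_train, x_test, y_train, y_test)
```

A large gap between the train and test accuracy of a model is a quick sign of overfitting.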
-
Metrics_Clf (Classification Metrics): The Metrics_Clf function computes the following classification metrics:
- Accuracy: The proportion of correct predictions out of all predictions.
- Precision: The proportion of true positive predictions out of all predicted positives.
- Recall: The proportion of true positive predictions out of all actual positives.
- F1 Score: The harmonic mean of precision and recall, providing a balance between them.
- ROC AUC: The area under the Receiver Operating Characteristic (ROC) curve, indicating the model's ability to distinguish between classes.
def Metrics_Clf(y_train, y_pred_train, y_test, y_pred_test):
- y_train: The true labels of the training set.
- y_pred_train: The predicted labels for the training set.
- y_test: The true labels of the testing set.
- y_pred_test: The predicted labels for the testing set.
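The metrics listed above map onto functions in `sklearn.metrics`. The small example below computes them for one set of binary labels; the label vectors are made up, and ROC AUC is computed from hard predictions here (probability scores would normally give a more informative value).

```python
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score, roc_auc_score,
)

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
# Confusion counts: TP = 2, FP = 1, FN = 1, TN = 2

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),    # (TP + TN) / total = 4/6
    "precision": precision_score(y_true, y_pred),  # TP / (TP + FP) = 2/3
    "recall": recall_score(y_true, y_pred),        # TP / (TP + FN) = 2/3
    "f1": f1_score(y_true, y_pred),                # harmonic mean = 2/3
    "roc_auc": roc_auc_score(y_true, y_pred),      # 2/3 for these labels
}
```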
-
Metrics_Reg (Regression Metrics): The Metrics_Reg function computes the following regression metrics:
- Mean Absolute Error (MAE): The average of the absolute differences between predicted and actual values.
- Mean Squared Error (MSE): The average of the squared differences between predicted and actual values.
- Root Mean Squared Error (RMSE): The square root of the Mean Squared Error, which gives an estimate of the standard deviation of the prediction error.
- R² Score (Coefficient of Determination): A measure of how well the model explains the variance in the data.
def Metrics_Reg(y_train, y_pred_train, y_test, y_pred_test):
- y_train: The true target values of the training set.
- y_pred_train: The predicted values for the training set.
- y_test: The true target values of the testing set.
- y_pred_test: The predicted values for the testing set.
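As a worked example of the regression metrics above, the sketch below evaluates them with `sklearn.metrics` on a tiny made-up set of values.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]
# Errors (pred - true): -1, 0, 1, 2

mae = mean_absolute_error(y_true, y_pred)  # (1 + 0 + 1 + 2) / 4 = 1.0
mse = mean_squared_error(y_true, y_pred)   # (1 + 0 + 1 + 4) / 4 = 1.5
rmse = np.sqrt(mse)                        # sqrt(1.5)
r2 = r2_score(y_true, y_pred)              # 1 - 6/20 = 0.7
```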
-
TrainEvaluate_Classifier: A function to train, evaluate, and print results for any classifier, optionally using resampling and preprocessing.
def TrainEvaluate_Classifier(model, x_train, y_train, x_test, y_test, preprocessor=None, resampler=None):
Parameters:
- model: Any classifier (e.g., XGBClassifier, RandomForest, etc.)
- x_train: Training features
- y_train: Training labels
- x_test: Testing features
- y_test: Testing labels
- preprocessor: Optional preprocessing step (e.g., scaling, encoding, etc.)
- resampler: Optional resampling technique (e.g., SMOTE, ADASYN, etc.)
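One way such a function could be organised is sketched below. The helper name `train_evaluate`, the order of the steps (preprocess, then resample the training split only) and the returned dictionary are all assumptions; the resampler is assumed to expose `fit_resample`, as imbalanced-learn samplers do.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def train_evaluate(model, x_train, y_train, x_test, y_test,
                   preprocessor=None, resampler=None):
    """Hypothetical helper: preprocess, optionally resample, fit and score."""
    if preprocessor is not None:
        # Fit on the training split only to avoid leaking test information
        x_train = preprocessor.fit_transform(x_train)
        x_test = preprocessor.transform(x_test)
    if resampler is not None:
        # Assumption: resampling applies to the training data only
        x_train, y_train = resampler.fit_resample(x_train, y_train)
    model.fit(x_train, y_train)
    return {
        "train_accuracy": accuracy_score(y_train, model.predict(x_train)),
        "test_accuracy": accuracy_score(y_test, model.predict(x_test)),
    }

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)
scores = train_evaluate(LogisticRegression(), x_train, y_train, x_test, y_test,
                        preprocessor=StandardScaler())
```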
File details
Details for the file dsfns-2.6.tar.gz.
File metadata
- Download URL: dsfns-2.6.tar.gz
- Upload date:
- Size: 8.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 404969fe4a878fc004ce2eca15c43fcfc37a417385f8b177b024451b5e425bfa |
| MD5 | ba8462f782f2fd777f02d83c26ae357e |
| BLAKE2b-256 | 6952951f4f30990598c2c94c3e8642e6ca903ee7f2a3444af816d2254496858e |
File details
Details for the file dsfns-2.6-py3-none-any.whl.
File metadata
- Download URL: dsfns-2.6-py3-none-any.whl
- Upload date:
- Size: 9.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | edc2667a21b2c997076ae3c4fd9b7862c84b5955b6e95c32781a9a8209c06c53 |
| MD5 | bf4506f7f8990c3d3a73d67c76f41bec |
| BLAKE2b-256 | 82d217fa138cdf08933fe00fe9ce1b606e37e5357d90777bf998168f91c928d9 |