Skip to main content

Data Preprocessing fns

Project description

Installation

pip install dsfns

dsfns

Version History

1.1  - Outlier handling - IQR method / Winsorizer / Clip
1.2  - Missing value imputation - Simple Imputer / Mean, Median, Mode / Numerical data with mean and obj/cat with mode
1.3  - Outlier Counter, Columns with Outlier and high frequency column finder (single value repeating over 50%)
1.4  - Data Encoding (label, onehot) and Scaling (MinMax, Standard, Robust)
1.5  - Fixed general code issues
1.6  - Redundant Codes removed
1.7  - Fixed general code issues / issues with return statement
1.8  - Fixed general code issues / issues with df.sample()
1.9  - Added Outlier replacement (using Mean, Median or Mode) and Low Variance columns (cols with IQR = 0)
1.10 - Missing value Interpolation (use with Time series data, Continuous variables with trends etc)
2.0  - Visualize data with Single and Multiple Line plots
2.1  - Visualize data with Multiple Regression plots and VIF (Variance Inflation Factor)
2.2  - Compare model accuracy of multiple models
2.3  - Added function that outputs various metrics for model evaluation
2.4  - Minor code changes to accomodate new features
2.6  - Function added that can train, evaluate, and print results for any classifier, optionally using resampling.

FUNCTION DESCRIPTIONS

OUTLIER HANDLING

  1. Replace Outliers using IQR, Upper Limit and Lower Limit: Identifies and handles outliers in the specified columns using the Interquartile Range (IQR) method.

    Outlier_IQR(df, columns, whis=1.5)

    • df: DataFrame — The input data in which outliers will be detected.
    • columns: list — List of column names in which outliers need to be identified.
    • whis: float, optional (default=1.5) — The multiplier to define the outlier limits as 1.5 x IQR by default.
  2. Replace outliers with Winsorizer: Applies Winsorization to cap outliers in the specified column using either IQR or other capping methods.

    Outlier_Winsorizer(df, column, capping_method='iqr', fold=1.5)

    • df: DataFrame — The input data to apply Winsorization.
    • column: str — The name of the column to apply the Winsorization to.
    • capping_method: str, default 'iqr' — Method used to define the outlier thresholds (options: 'iqr' , std, 'quantiles' or 'mad').
    • fold: float, optional (default=1.5) — The multiplier to define the fold.
  3. Clip Outliers using Clip method: Clips extreme values to a predefined threshold in the specified columns, effectively handling outliers.

    Outlier_Clip(df, columns, perc=0.05)

    • df: DataFrame — The input data to clip outliers from.
    • columns: list — List of columns in which to clip the outliers.
    • perc: float, (default=0.05) — The percentile to clip outliers. Default value of 0.05 clips top 0.95 and bottom 0.05
  4. Outlier Replacement with Mean, Median or Mode: This function is designed to identify and handle outliers in the specified columns of a given DataFrame. It uses the Interquartile Range (IQR) method to determine outliers and replaces the outliers with a user-defined statistic (mean, median, or mode).

    Outlier_MMM(df, columns, type='median')

    • df (DataFrame): The input pandas DataFrame that contains the data to be processed. columns (list of str): A list of column names in the DataFrame where outlier handling should be applied.
    • type (optional): It can be one of 'mean', 'median', or 'mode'. The default is 'median'.
  5. Outlier Columns: Returns a list of columns that contain outliers based on IQR.

    outlierColumns(df, whis=1.5):

    • df: DataFrame — The input data to check for outliers.
    • whis: float, optional (default=1.5) — The multiplier to define the outlier limits as 1.5 x IQR by default
  6. Outlier Counter: Counts the number of outliers in the specified columns.

    outlierCount(df, columns)

    • df: DataFrame — The input data to count outliers in.
    • columns: list — List of columns to check for outliers.

MISSING VALUE IMPUTATION

  1. Fill Missing Values with Mean, Median or Mode using df.replace: Replaces missing values in the specified columns using a chosen method.

    MissingVal_Repl(df, columns, type='mean')

    • df: DataFrame — The input data in which missing values will be replaced.
    • columns: list — List of column names where missing values need to be replaced.
    • type: str, default 'mean' — The method used for replacement ('mean', 'median', or mode)
  2. Fill Missing Values with Mean, Median or Mode with Simple Imputer: The MissingVal_Imputer function is designed to handle missing values in specified columns of a pandas DataFrame using different imputation strategies. It replaces missing values (NaN) with appropriate values based on the chosen strategy.

    MissingVal_Imputer(df,columns,strategy='mean')

    • df (pandas.DataFrame): The input DataFrame where missing values need to be imputed.
    • columns (list): A list of column names where missing value imputation is to be applied.
    • strategy (str, default='mean'):The strategy for imputing missing values. Supported values: 'mean': Replaces missing values with the mean of the column. 'median': Replaces missing values with the median of the column. 'mode': Replaces missing values with the most frequent value in the column (converted to 'most_frequent' internally).
  3. Fill Missing Values with Mean and/or Mode: Identifies and returns all rows in the DataFrame that contain missing values with mean for numeric columns and mode (with index[0]) for object.

    MissingVal_Fillna(df)

    • df: DataFrame — The input data to check for missing values.
  4. Interpolate Missing Values: This function is designed to handle missing values in a DataFrame by applying interpolation methods to the numerical columns. NOTE 1: Interpolation only works for numeric columns. NOTE 2: Works better with continuous data

    def MissingVal_Interpolate(df,type='linear')

    • df (DataFrame): The input DataFrame that contains missing (NaN) values.
    • type: Specifies the interpolation method to be used. Options include: 'linear': Uses linear interpolation (default). 'polynomial': Uses polynomial interpolation with degree 2 (quadratic). 'spline': Uses cubic spline interpolation.

OTHER FUNCTIONS

  1. High Frequency Columns: Identifies and returns columns where more than the given percentage (default 50%) of values are identical, typically used to detect low-variance or high-frequency columns.

    highFrequency(df, perc=0.5):

    • df: DataFrame — The input data to identify high-frequency columns.
    • perc: float, default 0.5 — The percentage threshold for identifying high-frequency columns.
  2. Encoder: Encodes categorical columns into numeric labels for compatibility with machine learning algorithms.

    Encoding(df, method='label')

    • df: A Pandas DataFrame containing the dataset.
    • method: 'label' for label encoding OR 'onehot' for OneHotEncoding
  3. Scaler: Scales numerical data for better performance during machine learning model training.

    Scaler(df, method='minmax')

    • df: A Pandas DataFrame containing numeric data.
    • method: Specifies the scaling technique to use. Options are: 'minmax' (default): Rescales data to a range of 0 to 1. 'standard': Standardizes data to have a mean of 0 and a standard deviation of 1. 'robust': Scales data using the median and interquartile range, making it robust to outliers.
  4. Low Variance Columns: This function detects columns in a DataFrame with very low variance (i.e., columns where the values are almost constant or do not vary much). Columns with zero variance are identified as low-variance columns. Returns a list of column names that have low variance (IQR = 0)

    LowVarianceCols(df)

    • df (DataFrame): The input pandas DataFrame for which low variance columns need to be identified.
  5. VIF(Variance Inflation Factor): The VIF function calculates the Variance Inflation Factor (VIF) for each predictor variable in a dataset, providing insights into multicollinearity. A high VIF (usually greater than 10) indicates that the variable is highly collinear with other predictors and might need to be addressed.

    def VIF(X)

    • X (DataFrame): A DataFrame containing the independent variables (predictor features) of the dataset. NOTE: The dataset should not include the target variable (dependent variable).
  6. RowTransformer: The RowTransformer function is a utility that creates a custom transformer for scikit-learn pipelines. This transformer allows you to apply a custom transformation function (custom_transform_fn) to each row of a dataset. It can be useful when you need to perform row-wise operations, such as applying specific functions to individual rows of data in a machine learning pipeline

    def RowTransformer(custom_transform_fn)
    
    -   custom_transform_fn: A function that takes in a DataFrame or ndarray and performs custom transformations on the rows. This function is applied to each row during the transformation step.
    
    Example:
    # Sample custom transformation function
        def custom_transform_fn(X):
        return np.log(X + 1)
    
    #Creating the custom transformer using RowTransformer
    RowTransformer = RowTransformer(custom_transform_fn)
    

DATA VISUALIZATION

  1. LinePlot Multiple: Creates a set of subplots where each input column (inpCol) is plotted against the output column (outCol) in individual subplots.

    def Lineplot_Multi(df, inpCol, outCol, figsize=(15, 5))

    • df (DataFrame): The input dataset containing the columns to plot.
    • inpCol (list): A list of input columns (features) to plot against the output column.
    • outCol (str): The output column (target variable) to plot against each input column.
    • figsize (tuple): Tuple defining the size of the overall figure (default: (15, 5)).
  2. LinePlot Single: Plots multiple input columns (inpCol) against the output column (outCol) on the same plot, using different lines for each input column, with a legend to identify them. NOTE: SCALE THE DATA FOR BETTER VISUALIZATION

    def Lineplot_Single(df, inpCol, outCol)

    • df (DataFrame): The input dataset containing the columns to plot.
    • inpCol (list): A list of input columns (features) to plot against the output column.
    • outCol (str): The output column (target variable) to plot against each input column.
  3. RegressionPlot Multiple: Creates a set of subplots where each input column (inpCol) is plotted against the output column (outCol) in individual subplots.

    def RegressionPlot_Multiple(df, inpCol, outCol, figsize=(15, 5))

    • df (DataFrame): The input dataset containing the columns to plot.
    • inpCol (list): A list of input columns (features) to plot against the output column.
    • outCol (str): The output column (target variable) to plot against each input column.
    • figsize (tuple): Tuple defining the size of the overall figure (default: (15, 5)).

EVALUATION

  1. Compare Model Accuracy: The CompareAccuracy function is a utility to evaluate and compare the training and testing accuracy of multiple machine learning models on a given dataset. It provides a simple way to benchmark different models and understand their performance.

    def CompareAccuracy(models, x_train, x_test, y_train, y_test):

    • models (dict): A dictionary where keys are model names (str) and values are their respective model objects. Eg: models = { "svm": SVC(kernel='linear', random_state=42), "knn": KNeighborsClassifier(n_neighbors=5) }

    • x_train: The feature set for training the models.

    • x_test: The feature set for testing the models.

    • y_train: The target labels for training the models.

    • y_test: The target labels for testing the models.

  2. Metrics_Clf (Classification Metrics): The Metrics_Clf function computes the following classification metrics:

    1. Accuracy: The proportion of correct predictions out of all predictions.
    2. Precision: The proportion of true positive predictions out of all predicted positives.
    3. Recall: The proportion of true positive predictions out of all actual positives.
    4. F1 Score: The harmonic mean of precision and recall, providing a balance between them.
    5. ROC AUC: The area under the Receiver Operating Characteristic (ROC) curve, indicating the model's ability to distinguish between classes.

    def Metrics_Clf(y_train, y_pred_train, y_test, y_pred_test):

    • y_train: The true labels of the training set.
    • y_pred_train: The predicted labels for the training set.
    • y_test: The true labels of the testing set.
    • y_pred_test: The predicted labels for the testing set.
  3. Metrics_Reg (Regression Metrics): The Metrics_Reg function computes the following regression metrics:

    1. Mean Absolute Error (MAE): The average of the absolute differences between predicted and actual values.
    2. Mean Squared Error (MSE): The average of the squared differences between predicted and actual values.
    3. Root Mean Squared Error (RMSE): The square root of the Mean Squared Error, which gives an estimate of the standard deviation of the prediction error.
    4. R² Score (Coefficient of Determination): A measure of how well the model explains the variance in the data.

    def Metrics_Reg(y_train, y_pred_train, y_test, y_pred_test):

    • y_train: The true labels of the training set.
    • y_pred_train: The predicted labels for the training set.
    • y_test: The true labels of the testing set.
    • y_pred_test: The predicted labels for the testing set.
  4. TrainEvaluate_Classifier: A function to train, evaluate, and print results for any classifier, optionally using resampling and preprocessing.

    def TrainEvaluate_Classifier(model, x_train, y_train, x_test, y_test, preprocessor=None, resampler=None): Parameters:

    • model: Any classifier (e.g., XGBClassifier, RandomForest, etc.)
    • x_train: Training features
    • y_train: Training labels
    • x_test: Testing features
    • y_test: Testing labels
    • preprocessor: Any preprocessing steps (like scaling, encoding, etc.)
    • resampler: Optional resampling technique (e.g., SMOTE, ADASYN, etc.)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dsfns-2.6.tar.gz (8.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

dsfns-2.6-py3-none-any.whl (9.0 kB view details)

Uploaded Python 3

File details

Details for the file dsfns-2.6.tar.gz.

File metadata

  • Download URL: dsfns-2.6.tar.gz
  • Upload date:
  • Size: 8.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.12.3

File hashes

Hashes for dsfns-2.6.tar.gz
Algorithm Hash digest
SHA256 404969fe4a878fc004ce2eca15c43fcfc37a417385f8b177b024451b5e425bfa
MD5 ba8462f782f2fd777f02d83c26ae357e
BLAKE2b-256 6952951f4f30990598c2c94c3e8642e6ca903ee7f2a3444af816d2254496858e

See more details on using hashes here.

File details

Details for the file dsfns-2.6-py3-none-any.whl.

File metadata

  • Download URL: dsfns-2.6-py3-none-any.whl
  • Upload date:
  • Size: 9.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.12.3

File hashes

Hashes for dsfns-2.6-py3-none-any.whl
Algorithm Hash digest
SHA256 edc2667a21b2c997076ae3c4fd9b7862c84b5955b6e95c32781a9a8209c06c53
MD5 bf4506f7f8990c3d3a73d67c76f41bec
BLAKE2b-256 82d217fa138cdf08933fe00fe9ce1b606e37e5357d90777bf998168f91c928d9

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page