import pandas as pd
# Sample data
data = {
'Feature1': [100, 200, 300, 400, 500],
'Feature2': [1, 2, 3, 4, 5]
}
df = pd.DataFrame(data)
# Display original data
print("Original Data:")
print(df)
Original Data:
   Feature1  Feature2
0       100         1
1       200         2
2       300         3
3       400         4
4       500         5
⚖️ Standardization
Standardization (Z-score Normalization) is a scaling technique where the values are centered around the mean with a unit standard deviation.
$$ z = \frac{x - \mu}{\sigma} $$
- $x$ is the original feature value,
- $\mu$ is the mean of the feature,
- $\sigma$ is the standard deviation of the feature.
Why Use Standardization?
- It ensures the data has a mean of 0 and a standard deviation of 1, which is crucial for algorithms sensitive to scale, such as Support Vector Machines (SVM) and Principal Component Analysis (PCA).
- Standardization is particularly useful when features have different units or scales.
When to Use?
- It copes with outliers better than Min-Max Scaling, since it does not squash values into a fixed range; note that extreme outliers still shift the mean and inflate the standard deviation.
- Use it when your machine learning algorithm assumes normally distributed data.
# Importing necessary libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Standardization using StandardScaler
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)
# Convert the scaled data back to a DataFrame
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)
# Display standardized data
print("Standardized Data:")
print(scaled_df)
Standardized Data:
   Feature1  Feature2
0 -1.414214 -1.414214
1 -0.707107 -0.707107
2  0.000000  0.000000
3  0.707107  0.707107
4  1.414214  1.414214
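For reference, the same z-scores can be computed directly from the formula above; a minimal pandas sketch, assuming the df defined earlier. Note that StandardScaler divides by the population standard deviation (ddof=0), while pandas' .std() defaults to the sample version.
# Manual standardization: subtract the mean, divide by the population standard deviation
manual_standardized_df = (df - df.mean()) / df.std(ddof=0)
print(manual_standardized_df)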
📏 Normalization
Normalization (Min-Max Scaling) scales data to a fixed range, typically [0, 1], ensuring all features contribute equally to the model.
$$ x_{\text{scaled}} = \frac{x - x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}} $$
- $x_{\text{min}}$: Minimum value of the feature,
- $x_{\text{max}}$: Maximum value of the feature.
Advantages:
- Preserves the relative relationships between values.
- Suitable for algorithms sensitive to feature magnitudes, like k-NN and neural networks.
When to Use:
- Use when features have varying scales and you want to scale them to a consistent range.
from sklearn.preprocessing import MinMaxScaler
# Create a MinMaxScaler instance
scaler = MinMaxScaler()
# Apply Min-Max Normalization
normalized_data = scaler.fit_transform(df)
# Convert the normalized data back to a DataFrame
normalized_df = pd.DataFrame(normalized_data, columns=df.columns)
# Display normalized data
print("Normalized Data (Min-Max Scaling):")
print(normalized_df)
# min-max scaler
# df[column_name] = (df[column_name] - np.min(df[column_name])) / (np.max(df[column_name]) - np.min(df[column_name]))
Normalized Data (Min-Max Scaling):
   Feature1  Feature2
0      0.00      0.00
1      0.25      0.25
2      0.50      0.50
3      0.75      0.75
4      1.00      1.00
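In practice, a scaler is usually fit on the training split only and then reused on unseen data, so that information from the test set does not leak into the learned min and max. A minimal sketch under that assumption, using an illustrative split of the df defined earlier (train_df and test_df are hypothetical names):
# Fit on the training rows only, then reuse the learned min/max on held-out rows
train_df, test_df = df.iloc[:4], df.iloc[4:]          # illustrative split of the sample data
split_scaler = MinMaxScaler()
train_scaled = split_scaler.fit_transform(train_df)   # learns min/max from the training rows only
test_scaled = split_scaler.transform(test_df)         # reuses those statistics; values can fall outside [0, 1]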
🛡️ Robust Scaling
Robust Scaling scales data using the median and Interquartile Range (IQR), making it less sensitive to outliers.
$$ x_{\text{scaled}} = \frac{x - \text{median}}{\text{IQR}} $$
Where:
- Median: 50th percentile,
- IQR: 75th percentile (Q3) - 25th percentile (Q1).
Advantages:
- Outlier Robust: Handles extreme values effectively.
- Suitable for datasets with outliers.
When to Use:
- Ideal when your data contains outliers that distort scaling.
from sklearn.preprocessing import RobustScaler
# Create a RobustScaler instance
scaler = RobustScaler()
# Apply Robust Scaling
robust_scaled_data = scaler.fit_transform(df)
# Convert the scaled data back to a DataFrame
robust_scaled_df = pd.DataFrame(robust_scaled_data, columns=df.columns)
# Display robust scaled data
print("Robust Scaled Data:")
print(robust_scaled_df)
Robust Scaled Data:
   Feature1  Feature2
0      -1.0      -1.0
1      -0.5      -0.5
2       0.0       0.0
3       0.5       0.5
4       1.0       1.0
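The same result can be reproduced directly from the median and IQR; a minimal pandas sketch, assuming the df defined earlier:
# Manual robust scaling: subtract the median, divide by the interquartile range
median = df.median()
iqr = df.quantile(0.75) - df.quantile(0.25)
manual_robust_df = (df - median) / iqr
print(manual_robust_df)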
📶 MaxAbs Scaling
MaxAbs Scaling scales each feature by its maximum absolute value, transforming the data to a range of [-1, 1]. It preserves the sparsity of the data, making it suitable for sparse datasets.
$$ x_{\text{scaled}} = \frac{x}{\text{max}(|x|)} $$
- $x$ is the original feature value,
- $\text{max}(|x|)$ is the maximum absolute value of the feature.
Advantages:
- Sparse Data Friendly: Maintains the sparsity structure of the dataset.
- Scales data without shifting the mean.
When to Use:
- Ideal for datasets with sparse features or when values are already centered at zero.
- Useful for models like SVMs or logistic regression with sparse inputs.
from sklearn.preprocessing import MaxAbsScaler
# Create a MaxAbsScaler instance
scaler = MaxAbsScaler()
# Apply MaxAbs Scaling
maxabs_scaled_data = scaler.fit_transform(df)
# Convert the scaled data back to a DataFrame
maxabs_scaled_df = pd.DataFrame(maxabs_scaled_data, columns=df.columns)
# Display MaxAbs scaled data
print("MaxAbs Scaled Data:")
print(maxabs_scaled_df)
MaxAbs Scaled Data:
   Feature1  Feature2
0       0.2       0.2
1       0.4       0.4
2       0.6       0.6
3       0.8       0.8
4       1.0       1.0
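Written out by hand, the transformation is simply a division by each column's maximum absolute value; a minimal sketch, assuming the df defined earlier:
# Manual MaxAbs scaling: divide each column by its maximum absolute value
manual_maxabs_df = df / df.abs().max()
print(manual_maxabs_df)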
📐 L1 Normalization
L1 Normalization (Least Absolute Deviations) scales each sample (row) individually such that the sum of the absolute values of all features in a sample equals 1. It emphasizes the proportions between feature values rather than their magnitudes.
$$ x_{\text{normalized}} = \frac{x}{\sum |x|} $$
- $x$ is the original feature value,
- $\sum |x|$ is the sum of the absolute values of all features in the sample.
Advantages:
- Highlights the relative importance of features within each sample.
- Ensures all rows (samples) have the same total magnitude.
When to Use:
- Useful when feature proportions matter more than absolute values, such as in text classification (e.g., term frequency) or histogram-based features.
from sklearn.preprocessing import Normalizer
# Create a Normalizer instance with L1 norm
normalizer = Normalizer(norm='l1')
# Apply L1 Normalization
l1_normalized_data = normalizer.fit_transform(df)
# Convert the normalized data back to a DataFrame
l1_normalized_df = pd.DataFrame(l1_normalized_data, columns=df.columns)
# Display L1 normalized data
print("L1 Normalized Data:")
print(l1_normalized_df)
# scaling - l1 normalization (column-wise variant; note that Normalizer above works row-wise)
# df[column_name] = df[column_name] / df[column_name].abs().sum()
L1 Normalized Data:
   Feature1  Feature2
0  0.990099  0.009901
1  0.990099  0.009901
2  0.990099  0.009901
3  0.990099  0.009901
4  0.990099  0.009901
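Because Normalizer works row by row, and every row here has the same 100:1 ratio between its two features, all normalized rows are identical. A minimal row-wise sketch of the same computation, assuming the df defined earlier:
# Manual row-wise L1 normalization: divide each row by the sum of its absolute values
manual_l1_df = df.div(df.abs().sum(axis=1), axis=0)
print(manual_l1_df)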
📏 L2 Normalization
L2 Normalization (Euclidean Norm) scales each sample (row) individually so that the sum of the squared values of all features in a sample equals 1. It ensures that the magnitude (Euclidean norm) of each sample is 1.
$$ x_{\text{normalized}} = \frac{x}{\sqrt{\sum x^2}} $$
- $x$ is the original feature value,
- $\sqrt{\sum x^2}$ is the Euclidean norm (L2 norm) of the sample.
Advantages:
- Preserves the direction of data while normalizing its magnitude.
- Useful for distance- and dot-product-based models such as SVMs and k-NN, where the direction of the feature vector should matter more than its length.
When to Use:
- Ideal for datasets where the magnitude of feature vectors should be normalized while retaining their directional information, such as in text data or image features.
from sklearn.preprocessing import Normalizer
# Create a Normalizer instance with L2 norm
normalizer = Normalizer(norm='l2')
# Apply L2 Normalization
l2_normalized_data = normalizer.fit_transform(df)
# Convert the normalized data back to a DataFrame
l2_normalized_df = pd.DataFrame(l2_normalized_data, columns=df.columns)
# Display L2 normalized data
print("L2 Normalized Data:")
print(l2_normalized_df)
# scaling - l2 normalization (column-wise variant; note that Normalizer above works row-wise)
# df[column_name] = df[column_name] / np.sqrt((df[column_name]**2).sum())
L2 Normalized Data:
   Feature1  Feature2
0   0.99995      0.01
1   0.99995      0.01
2   0.99995      0.01
3   0.99995      0.01
4   0.99995      0.01
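As with L1, the row-wise computation can be written out by hand; a minimal sketch, assuming the df defined earlier:
# Manual row-wise L2 normalization: divide each row by its Euclidean norm
manual_l2_df = df.div(np.sqrt((df ** 2).sum(axis=1)), axis=0)
print(manual_l2_df)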
🔢 Log Transformation
Log Transformation reduces the impact of large outliers and makes skewed data more normal-like. This implementation provides flexibility to:
- Apply log transformation to specific columns.
- Choose a custom logarithmic base (e.g., natural log, base 2, base 10).
- Add a custom value to the data before applying the log transformation.
$$ x_{\text{transformed}} = \frac{\log(x + \text{add})}{\log(\text{base})} $$
- Columns: Specify columns for transformation (optional, applies to all columns by default).
- Base: Set the logarithmic base (e.g., natural log, base 10, or base 2).
- Add: Add a custom value to the data before applying the logarithm (default is 0).
Advantages:
- Handles zeros or negative values by adding a custom constant.
- Adaptable to different logarithmic bases for varied use cases.
When to Use:
- Effective for datasets with skewed distributions, exponential growth, or large outliers.
- Adjust the add parameter for flexibility with zero or negative data points.
import numpy as np
def apply_log_transformation(dataframe, columns=None, base=np.e, add=0):
    """
    Apply log transformation to specific columns of a DataFrame.

    Parameters:
        dataframe (pd.DataFrame): Input DataFrame.
        columns (list): List of columns to apply log transformation. If None, apply to all columns.
        base (float): Base of the logarithm. Default is natural log (e).
        add (float): A value to add to the data before applying log. Default is 0.

    Returns:
        pd.DataFrame: DataFrame with log-transformed values.
    """
    # Select columns for transformation
    if columns is None:
        columns = dataframe.columns
    transformed_data = dataframe.copy()
    for col in columns:
        # Change of base: log_base(x + add) = ln(x + add) / ln(base)
        transformed_data[col] = np.log(transformed_data[col] + add) / np.log(base)
    return transformed_data
# Example usage
# Assuming `df` is the DataFrame to be transformed
log_transformed_df = apply_log_transformation(df, columns=['Feature1', 'Feature2'], base=10, add=1)
# Display the transformed DataFrame
print("Generalized Log Transformed Data:")
print(log_transformed_df)
# scaling - log transform to the base e
# df[column_name] = np.log(df[column_name]) / np.log(np.e)  # np.log(np.e) == 1, so this is simply np.log(df[column_name])
Generalized Log Transformed Data:
   Feature1  Feature2
0  2.004321  0.301030
1  2.303196  0.477121
2  2.478566  0.602060
3  2.603144  0.698970
4  2.699838  0.778151
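For the common natural-log case with add=1, NumPy's log1p computes log(1 + x) directly and is numerically more accurate for values near zero; a minimal sketch, assuming the df defined earlier:
# np.log1p(x) = log(1 + x), equivalent to base=np.e, add=1 in the function above
log1p_df = np.log1p(df)
print(log1p_df)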
Power Transformation
Power Transformation is used to stabilize variance, reduce skewness, and make data more Gaussian-like. It is particularly useful for skewed datasets.
- Box-Cox:
- Requires all input data to be positive.
- Suitable for strictly positive data.
- Yeo-Johnson:
- Handles both positive and negative values.
Parameters:
- method: 'box-cox' or 'yeo-johnson'.
- standardize: Whether to center and scale the transformed data (default is True).
Advantages:
- Reduces skewness and stabilizes variance.
- Useful for linear regression and other algorithms sensitive to data distribution.
When to Use:
- Use when features are highly skewed or non-Gaussian.
from sklearn.preprocessing import PowerTransformer
# Create a PowerTransformer instance (default is Yeo-Johnson)
power_transformer = PowerTransformer(method='yeo-johnson', standardize=True)
# Apply Power Transformation
power_transformed_data = power_transformer.fit_transform(df)
# Convert the transformed data back to a DataFrame
power_transformed_df = pd.DataFrame(power_transformed_data, columns=df.columns)
# Display Power Transformed Data
print("Power Transformed Data:")
print(power_transformed_df)
Power Transformed Data:
   Feature1  Feature2
0 -1.500778 -1.472976
1 -0.647010 -0.669761
2  0.078865  0.055343
3  0.732301  0.727399
4  1.336622  1.359996
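Since both features here are strictly positive, the Box-Cox variant described above can be applied as well; a minimal sketch, assuming the df defined earlier:
# Box-Cox requires strictly positive input, which holds for this sample data
boxcox_transformer = PowerTransformer(method='box-cox', standardize=True)
boxcox_df = pd.DataFrame(boxcox_transformer.fit_transform(df), columns=df.columns)
print("Box-Cox Transformed Data:")
print(boxcox_df)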
Quantile Transformation (Rank-Based Scaling)
Quantile Transformation maps data to a uniform or normal distribution by ranking and scaling the values. It ensures all features follow a specified distribution.
- output_distribution: Distribution to map to ('uniform' or 'normal').
- random_state: Ensures reproducibility.
Advantages:
- Handles skewed or non-linear data effectively.
- Ensures uniform or normal distribution for each feature.
When to Use:
- Use for algorithms sensitive to feature distributions or to mitigate skewness.
from sklearn.preprocessing import QuantileTransformer
# Create a QuantileTransformer instance
quantile_transformer = QuantileTransformer(output_distribution='uniform', random_state=42)
# Apply Quantile Transformation
quantile_transformed_data = quantile_transformer.fit_transform(df)
# Convert the transformed data back to a DataFrame
quantile_transformed_df = pd.DataFrame(quantile_transformed_data, columns=df.columns)
# Display Quantile Transformed Data
print("Quantile Transformed Data:")
print(quantile_transformed_df)
Quantile Transformed Data:
   Feature1  Feature2
0      0.00      0.00
1      0.25      0.25
2      0.50      0.50
3      0.75      0.75
4      1.00      1.00
UserWarning: n_quantiles (1000) is greater than the total number of samples (5). n_quantiles is set to n_samples.
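The warning above appears because the default n_quantiles (1000) exceeds the five samples in df; passing an explicit n_quantiles no larger than the sample count avoids it. A minimal sketch:
# Setting n_quantiles <= number of samples avoids the UserWarning on small datasets
quantile_transformer = QuantileTransformer(n_quantiles=5, output_distribution='uniform', random_state=42)
quantile_transformed_df = pd.DataFrame(quantile_transformer.fit_transform(df), columns=df.columns)
print(quantile_transformed_df)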
✂️ Clipping
Clipping is a preprocessing technique that restricts data values within a specified range by capping them at a minimum and/or maximum value. It removes extreme values (outliers) by replacing them with boundary values.
For a value $x$: $$ x_{\text{clipped}} = \begin{cases} \text{min\_val}, & \text{if } x < \text{min\_val}, \\ x, & \text{if } \text{min\_val} \leq x \leq \text{max\_val}, \\ \text{max\_val}, & \text{if } x > \text{max\_val}. \end{cases} $$
- min_val: The minimum threshold. Values below this are set to min_val.
- max_val: The maximum threshold. Values above this are set to max_val.
Advantages:
- Handles extreme outliers by limiting their impact.
- Keeps data within a meaningful range.
When to Use:
- Use when outliers might skew results or are invalid.
- Common for datasets with natural limits (e.g., sensor readings).
Example: If a dataset has values outside the range [0, 100], clipping will set all values below 0 to 0 and above 100 to 100.
import numpy as np
# Define thresholds
min_val = 0
max_val = 100
# Apply clipping to the DataFrame
clipped_df = df.clip(lower=min_val, upper=max_val)
# Display clipped data
print("Clipped Data:")
print(clipped_df)
Clipped Data:
   Feature1  Feature2
0       100         1
1       100         2
2       100         3
3       100         4
4       100         5
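A common variant caps values at data-driven percentiles instead of fixed thresholds (often called winsorizing); a minimal sketch, assuming the df defined earlier, with the 5th/95th percentiles as illustrative bounds:
# Quantile-based clipping: cap each column at its own 5th and 95th percentiles
lower_bounds = df.quantile(0.05)
upper_bounds = df.quantile(0.95)
winsorized_df = df.clip(lower=lower_bounds, upper=upper_bounds, axis=1)
print(winsorized_df)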
📊 Scaling by Range Adjustment
Scaling by range adjustment rescales data to fit within a specified range, typically [0, 1]. Each feature is adjusted proportionally to its range.
For a value $x$: $$ x_{\text{scaled}} = \left( \frac{x - x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}} \right) \times (\text{desired\_max} - \text{desired\_min}) + \text{desired\_min} $$
Where:
- $x_{\text{min}}$ and $x_{\text{max}}$ are the minimum and maximum values of the feature.
Advantages:
- Scales all values into a consistent range.
- Preserves the relative distribution of data.
When to Use:
- Use when features have varying scales, and a consistent range is needed (e.g., neural networks).
# Define the desired range
desired_min = 0
desired_max = 1
# Apply range adjustment
scaled_by_range_df = (df - df.min()) / (df.max() - df.min()) * (desired_max - desired_min) + desired_min
# Display the scaled DataFrame
print("Scaled by Range Adjustment Data:")
print(scaled_by_range_df)
Scaled by Range Adjustment Data:
   Feature1  Feature2
0      0.00      0.00
1      0.25      0.25
2      0.50      0.50
3      0.75      0.75
4      1.00      1.00
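The same adjustment to an arbitrary range can also be done with MinMaxScaler's feature_range parameter; a minimal sketch, assuming the df defined earlier and an illustrative target range of [-1, 1]:
# MinMaxScaler with a custom feature_range performs the same range adjustment
range_scaler = MinMaxScaler(feature_range=(-1, 1))
range_scaled_df = pd.DataFrame(range_scaler.fit_transform(df), columns=df.columns)
print(range_scaled_df)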