
📖 Scaling Features¶

  • ⚖️ Standardization
  • 📏 Normalization
  • 🛡️ Robust Scaling
  • 📶 MaxAbs Scaling
  • 📐 L1 Normalization
  • 📏 L2 Normalization
  • 🔢 Log Transformation
  • 🧮 Power Transformation
  • Quantile Transformation
  • ✂️ Clipping
  • 📊 Scaling by Range Adjustment
In [1]:
import pandas as pd

# Sample data
data = {
    'Feature1': [100, 200, 300, 400, 500],
    'Feature2': [1, 2, 3, 4, 5]
}
df = pd.DataFrame(data)

# Display original data
print("Original Data:")
print(df)
Original Data:
   Feature1  Feature2
0       100         1
1       200         2
2       300         3
3       400         4
4       500         5

Back to the top

⚖️ Standardization¶

Standardization (Z-score Normalization) is a scaling technique that centers values on the mean and rescales them to unit standard deviation.

$$ z = \frac{x - \mu}{\sigma} $$

  • $x$ is the original feature value,
  • $\mu$ is the mean of the feature,
  • $\sigma$ is the standard deviation of the feature.

Why Use Standardization?

  • It ensures the data has a mean of 0 and a standard deviation of 1, which is crucial for algorithms sensitive to scale, such as Support Vector Machines (SVM) and Principal Component Analysis (PCA).
  • Standardization is particularly useful when features have different units or scales.

When to Use?

  • It handles outliers better than Min-Max Scaling because it does not squash values into a fixed range, although extreme values still influence the mean and standard deviation.
  • Use it when your machine learning algorithm assumes normally distributed data.
In [2]:
# Importing necessary libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Standardization using StandardScaler
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)

# Convert the scaled data back to a DataFrame
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)

# Display standardized data
print("Standardized Data:")
print(scaled_df)
Standardized Data:
   Feature1  Feature2
0 -1.414214 -1.414214
1 -0.707107 -0.707107
2  0.000000  0.000000
3  0.707107  0.707107
4  1.414214  1.414214
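
As a quick sanity check, the same values can be reproduced directly in pandas (a minimal sketch; note that StandardScaler divides by the population standard deviation, i.e. ddof=0):

# Manual z-score computation; should match the StandardScaler output above
manual_standardized = (df - df.mean()) / df.std(ddof=0)
print(manual_standardized)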

Back to the top

📏 Normalization¶

Normalization (Min-Max Scaling) scales data to a fixed range, typically [0, 1], ensuring all features contribute equally to the model.

$$ x_{\text{scaled}} = \frac{x - x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}} $$

  • $x_{\text{min}}$: Minimum value of the feature,
  • $x_{\text{max}}$: Maximum value of the feature.

Advantages:

  • Preserves the relative relationships between values.
  • Suitable for algorithms sensitive to feature magnitudes, like k-NN and neural networks.

When to Use:

  • Use when features have varying scales and you want to scale them to a consistent range.
In [3]:
from sklearn.preprocessing import MinMaxScaler

# Create a MinMaxScaler instance
scaler = MinMaxScaler()

# Apply Min-Max Normalization
normalized_data = scaler.fit_transform(df)

# Convert the normalized data back to a DataFrame
normalized_df = pd.DataFrame(normalized_data, columns=df.columns)

# Display normalized data
print("Normalized Data (Min-Max Scaling):")
print(normalized_df)


# min-max scaler
# df[column_name] = (df[column_name] - np.min(df[column_name])) / (np.max(df[column_name]) - np.min(df[column_name]))
Normalized Data (Min-Max Scaling):
   Feature1  Feature2
0      0.00      0.00
1      0.25      0.25
2      0.50      0.50
3      0.75      0.75
4      1.00      1.00
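
If the original values are needed later, the fitted scaler can undo the transformation (a minimal sketch reusing the scaler fitted in the cell above):

# Map the scaled values back to the original units
restored_df = pd.DataFrame(scaler.inverse_transform(normalized_data), columns=df.columns)
print(restored_df)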

Back to the top

🛡️ Robust Scaling¶

Robust Scaling scales data using the median and Interquartile Range (IQR), making it less sensitive to outliers.

$$ x_{\text{scaled}} = \frac{x - \text{median}}{\text{IQR}} $$

Where:

  • Median: 50th percentile,
  • IQR: 75th percentile (Q3) - 25th percentile (Q1).

Advantages:

  • Outlier Robust: Handles extreme values effectively.
  • Suitable for datasets with outliers.

When to Use:

  • Ideal when your data contains outliers that distort scaling.
In [4]:
from sklearn.preprocessing import RobustScaler

# Create a RobustScaler instance
scaler = RobustScaler()

# Apply Robust Scaling
robust_scaled_data = scaler.fit_transform(df)

# Convert the scaled data back to a DataFrame
robust_scaled_df = pd.DataFrame(robust_scaled_data, columns=df.columns)

# Display robust scaled data
print("Robust Scaled Data:")
print(robust_scaled_df)
Robust Scaled Data:
   Feature1  Feature2
0      -1.0      -1.0
1      -0.5      -0.5
2       0.0       0.0
3       0.5       0.5
4       1.0       1.0
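
For reference, the same computation in plain pandas (a minimal sketch assuming RobustScaler's default 25–75% quantile range):

# Manual robust scaling: (x - median) / IQR per column
iqr = df.quantile(0.75) - df.quantile(0.25)
manual_robust = (df - df.median()) / iqr
print(manual_robust)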

Back to the top

📶 MaxAbs Scaling¶

MaxAbs Scaling scales each feature by its maximum absolute value, transforming the data to a range of [-1, 1]. It preserves the sparsity of the data, making it suitable for sparse datasets.

$$ x_{\text{scaled}} = \frac{x}{\text{max}(|x|)} $$

  • $x$ is the original feature value,
  • $\text{max}(|x|)$ is the maximum absolute value of the feature.

Advantages:

  • Sparse Data Friendly: Maintains the sparsity structure of the dataset.
  • Scales data without shifting the mean.

When to Use:

  • Ideal for datasets with sparse features or when values are already centered at zero.
  • Useful for models like SVMs or logistic regression with sparse inputs.
In [5]:
from sklearn.preprocessing import MaxAbsScaler

# Create a MaxAbsScaler instance
scaler = MaxAbsScaler()

# Apply MaxAbs Scaling
maxabs_scaled_data = scaler.fit_transform(df)

# Convert the scaled data back to a DataFrame
maxabs_scaled_df = pd.DataFrame(maxabs_scaled_data, columns=df.columns)

# Display MaxAbs scaled data
print("MaxAbs Scaled Data:")
print(maxabs_scaled_df)
MaxAbs Scaled Data:
   Feature1  Feature2
0       0.2       0.2
1       0.4       0.4
2       0.6       0.6
3       0.8       0.8
4       1.0       1.0
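
The same scaling is a one-liner in pandas (a minimal sketch; it should reproduce the MaxAbsScaler output above):

# Manual MaxAbs scaling: divide each column by its maximum absolute value
manual_maxabs = df / df.abs().max()
print(manual_maxabs)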

Back to the top

📐 L1 Normalization¶

L1 Normalization (Least Absolute Deviations) scales each sample (row) individually such that the sum of the absolute values of all features in a sample equals 1. It emphasizes the proportions between feature values rather than their magnitudes.

$$ x_{\text{normalized}} = \frac{x}{\sum |x|} $$

  • $x$ is the original feature value,
  • $\sum |x|$ is the sum of the absolute values of all features in the sample.

Advantages:

  • Highlights the relative importance of features within each sample.
  • Ensures all rows (samples) have the same total magnitude.

When to Use:

  • Useful when feature proportions matter more than absolute values, such as in text classification (e.g., term frequency) or histogram-based features.
In [6]:
from sklearn.preprocessing import Normalizer

# Create a Normalizer instance with L1 norm
normalizer = Normalizer(norm='l1')

# Apply L1 Normalization
l1_normalized_data = normalizer.fit_transform(df)

# Convert the normalized data back to a DataFrame
l1_normalized_df = pd.DataFrame(l1_normalized_data, columns=df.columns)

# Display L1 normalized data
print("L1 Normalized Data:")
print(l1_normalized_df)

# scaling - l1 normalization (column-wise variant; note that Normalizer above normalizes each row instead)
# df[column_name] = df[column_name] / df[column_name].abs().sum()
L1 Normalized Data:
   Feature1  Feature2
0  0.990099  0.009901
1  0.990099  0.009901
2  0.990099  0.009901
3  0.990099  0.009901
4  0.990099  0.009901
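
Because Normalizer works row by row, the pandas equivalent divides each row by the sum of its absolute values (a minimal sketch):

# Manual row-wise L1 normalization; should match the Normalizer output above
manual_l1 = df.div(df.abs().sum(axis=1), axis=0)
print(manual_l1)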

Back to the top

📏 L2 Normalization¶

L2 Normalization (Euclidean Norm) scales each sample (row) individually so that the sum of the squared values of all features in a sample equals 1. It ensures that the magnitude (Euclidean norm) of each sample is 1.

$$ x_{\text{normalized}} = \frac{x}{\sqrt{\sum x^2}} $$

  • $x$ is the original feature value,
  • $\sqrt{\sum x^2}$ is the Euclidean norm (L2 norm) of the sample.

Advantages:

  • Preserves the direction of data while normalizing its magnitude.
  • Useful for machine learning models that rely on feature magnitude and direction, such as SVMs and k-NN.

When to Use:

  • Ideal for datasets where the magnitude of feature vectors should be normalized while retaining their directional information, such as in text data or image features.
In [7]:
from sklearn.preprocessing import Normalizer

# Create a Normalizer instance with L2 norm
normalizer = Normalizer(norm='l2')

# Apply L2 Normalization
l2_normalized_data = normalizer.fit_transform(df)

# Convert the normalized data back to a DataFrame
l2_normalized_df = pd.DataFrame(l2_normalized_data, columns=df.columns)

# Display L2 normalized data
print("L2 Normalized Data:")
print(l2_normalized_df)


# scaling - l2 normalization (column-wise variant; note that Normalizer above normalizes each row instead)
# df[column_name] = df[column_name] / np.sqrt((df[column_name]**2).sum())
L2 Normalized Data:
   Feature1  Feature2
0   0.99995      0.01
1   0.99995      0.01
2   0.99995      0.01
3   0.99995      0.01
4   0.99995      0.01
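
Likewise, the row-wise pandas equivalent for the L2 norm (a minimal sketch):

import numpy as np

# Manual row-wise L2 normalization: divide each row by its Euclidean norm
manual_l2 = df.div(np.sqrt((df ** 2).sum(axis=1)), axis=0)
print(manual_l2)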

Back to the top

🔢 Log Transformation¶

Log Transformation reduces the impact of large outliers and makes skewed data more normal-like. This implementation provides flexibility to:

  • Apply log transformation to specific columns.
  • Choose a custom logarithmic base (e.g., natural log, base 2, base 10).
  • Add a custom value to the data before applying the log transformation.

$$ x_{\text{transformed}} = \frac{\log(x + \text{add})}{\log(\text{base})} $$

  • columns: Columns to transform (optional; applies to all columns by default).
  • base: Logarithmic base (e.g., natural log, base 10, or base 2).
  • add: Value added to the data before taking the logarithm (default is 0).

Advantages:

  • Handles zeros or negative values by adding a custom constant.
  • Adaptable to different logarithmic bases for varied use cases.

When to Use:

  • Effective for datasets with skewed distributions, exponential growth, or large outliers.
  • Adjust the add parameter for flexibility with zero or negative data points.
In [8]:
import numpy as np

def apply_log_transformation(dataframe, columns=None, base=np.e, add=0):
    """
    Apply log transformation to specific columns of a DataFrame.
    
    Parameters:
        dataframe (pd.DataFrame): Input DataFrame.
        columns (list): List of columns to apply log transformation. If None, apply to all columns.
        base (float): Base of the logarithm. Default is natural log (e).
        add (float): A value to add to the data before applying log. Default is 0.
    
    Returns:
        pd.DataFrame: DataFrame with log-transformed values.
    """
    # Select columns for transformation
    if columns is None:
        columns = dataframe.columns

    transformed_data = dataframe.copy()
    for col in columns:
        transformed_data[col] = np.log(transformed_data[col] + add) / np.log(base)  # Adjust for log base

    return transformed_data

# Example usage
# Assuming `df` is the DataFrame to be transformed
log_transformed_df = apply_log_transformation(df, columns=['Feature1', 'Feature2'], base=10, add=1)

# Display the transformed DataFrame
print("Generalized Log Transformed Data:")
print(log_transformed_df)

# scaling - log transform to the base e
# df[column_name] = np.log(df[column_name]) / np.log(math.e)
Generalized Log Transformed Data:
   Feature1  Feature2
0  2.004321  0.301030
1  2.303196  0.477121
2  2.478566  0.602060
3  2.603144  0.698970
4  2.699838  0.778151
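
For the common case of base=10 with add=1, numpy's log1p offers a compact equivalent (a minimal sketch; log1p(x) computes log(1 + x) and is numerically safer for small x):

import numpy as np

# Base-10 log of (x + 1); equivalent to apply_log_transformation(df, base=10, add=1)
log10_shift_df = np.log1p(df) / np.log(10)
print(log10_shift_df)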

Back to the top

🧮 Power Transformation¶

(Box-Cox and Yeo-Johnson)

Power Transformation is used to stabilize variance, reduce skewness, and make data more Gaussian-like. It is particularly useful for skewed datasets.

  1. Box-Cox:
    • Requires all input data to be positive.
    • Suitable for strictly positive data.
  2. Yeo-Johnson:
    • Handles both positive and negative values.

Parameters:

  • method: 'box-cox' or 'yeo-johnson'.
  • standardize: Whether to center and scale the transformed data (default is True).

Advantages:

  • Reduces skewness and stabilizes variance.
  • Useful for linear regression and other algorithms sensitive to data distribution.

When to Use:

  • Use when features are highly skewed or non-Gaussian.
In [9]:
from sklearn.preprocessing import PowerTransformer

# Create a PowerTransformer instance (default is Yeo-Johnson)
power_transformer = PowerTransformer(method='yeo-johnson', standardize=True)

# Apply Power Transformation
power_transformed_data = power_transformer.fit_transform(df)

# Convert the transformed data back to a DataFrame
power_transformed_df = pd.DataFrame(power_transformed_data, columns=df.columns)

# Display Power Transformed Data
print("Power Transformed Data:")
print(power_transformed_df)
Power Transformed Data:
   Feature1  Feature2
0 -1.500778 -1.472976
1 -0.647010 -0.669761
2  0.078865  0.055343
3  0.732301  0.727399
4  1.336622  1.359996
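
Since the sample data is strictly positive, the Box-Cox variant can be applied as well (a minimal sketch; the fitted lambdas_ attribute holds the estimated exponent per feature):

# Box-Cox requires strictly positive input
boxcox_transformer = PowerTransformer(method='box-cox', standardize=True)
boxcox_df = pd.DataFrame(boxcox_transformer.fit_transform(df), columns=df.columns)
print(boxcox_df)
print("Fitted lambdas:", boxcox_transformer.lambdas_)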

Back to the top

Quantile Transformation¶

(Rank-Based Scaling)

Quantile Transformation maps data to a uniform or normal distribution by ranking and scaling the values. It ensures all features follow a specified distribution.

Parameters:

  • output_distribution: Distribution to map to ('uniform' or 'normal').
  • random_state: Ensures reproducibility.

Advantages:

  • Handles skewed or non-linear data effectively.
  • Ensures uniform or normal distribution for each feature.

When to Use:

  • Use for algorithms sensitive to feature distributions or to mitigate skewness.
In [10]:
from sklearn.preprocessing import QuantileTransformer

# Create a QuantileTransformer instance
quantile_transformer = QuantileTransformer(n_quantiles=len(df), output_distribution='uniform', random_state=42)  # n_quantiles must not exceed the number of samples

# Apply Quantile Transformation
quantile_transformed_data = quantile_transformer.fit_transform(df)

# Convert the transformed data back to a DataFrame
quantile_transformed_df = pd.DataFrame(quantile_transformed_data, columns=df.columns)

# Display Quantile Transformed Data
print("Quantile Transformed Data:")
print(quantile_transformed_df)
Quantile Transformed Data:
   Feature1  Feature2
0      0.00      0.00
1      0.25      0.25
2      0.50      0.50
3      0.75      0.75
4      1.00      1.00
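
To map the features onto a standard normal distribution instead, set output_distribution='normal' (a minimal sketch):

# Rank-based mapping to a standard normal distribution
normal_qt = QuantileTransformer(n_quantiles=len(df), output_distribution='normal', random_state=42)
normal_df = pd.DataFrame(normal_qt.fit_transform(df), columns=df.columns)
print(normal_df)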

Back to the top

✂️ Clipping¶

Clipping is a preprocessing technique that restricts data values within a specified range by capping them at a minimum and/or maximum value. It removes extreme values (outliers) by replacing them with boundary values.

For a value $x$: $$ x_{\text{clipped}} = \begin{cases} \text{min\_val}, & \text{if } x < \text{min\_val}, \\ x, & \text{if } \text{min\_val} \leq x \leq \text{max\_val}, \\ \text{max\_val}, & \text{if } x > \text{max\_val}. \end{cases} $$

  • min_val: The minimum threshold. Values below this are set to min_val.
  • max_val: The maximum threshold. Values above this are set to max_val.

Advantages:

  • Handles extreme outliers by limiting their impact.
  • Keeps data within a meaningful range.

When to Use:

  • Use when outliers might skew results or are invalid.
  • Common for datasets with natural limits (e.g., sensor readings).

Example: If a dataset has values outside the range [0, 100], clipping will set all values below 0 to 0 and above 100 to 100.

In [11]:
import numpy as np

# Define thresholds
min_val = 0
max_val = 100

# Apply clipping to the DataFrame
clipped_df = df.clip(lower=min_val, upper=max_val)

# Display clipped data
print("Clipped Data:")
print(clipped_df)
Clipped Data:
   Feature1  Feature2
0       100         1
1       100         2
2       100         3
3       100         4
4       100         5
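
When hard limits are not known in advance, a common variant clips at percentiles instead of fixed thresholds (a minimal winsorizing sketch; the 5th/95th percentile bounds are an illustrative choice):

# Cap each column at its own 5th and 95th percentiles
clipped_pct_df = df.clip(lower=df.quantile(0.05), upper=df.quantile(0.95), axis=1)
print(clipped_pct_df)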

Back to the top

📊 Scaling by Range Adjustment¶

Scaling by range adjustment rescales data to fit within a specified range, typically [0, 1]. Each feature is adjusted proportionally to its range.

For a value $x$: $$ x_{\text{scaled}} = \left( \frac{x - x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}} \right) \times (\text{desired\_max} - \text{desired\_min}) + \text{desired\_min} $$

Where:

  • $x_{\text{min}}$ and $x_{\text{max}}$ are the minimum and maximum values of the feature.

Advantages:

  • Scales all values into a consistent range.
  • Preserves the relative distribution of data.

When to Use:

  • Use when features have varying scales, and a consistent range is needed (e.g., neural networks).
In [12]:
# Define the desired range
desired_min = 0
desired_max = 1

# Apply range adjustment
scaled_by_range_df = (df - df.min()) / (df.max() - df.min()) * (desired_max - desired_min) + desired_min

# Display the scaled DataFrame
print("Scaled by Range Adjustment Data:")
print(scaled_by_range_df)
Scaled by Range Adjustment Data:
   Feature1  Feature2
0      0.00      0.00
1      0.25      0.25
2      0.50      0.50
3      0.75      0.75
4      1.00      1.00
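
The same result can also be obtained with scikit-learn's MinMaxScaler by passing a custom feature_range (a minimal sketch):

from sklearn.preprocessing import MinMaxScaler

# Equivalent rescaling into [desired_min, desired_max]
range_scaler = MinMaxScaler(feature_range=(desired_min, desired_max))
scaled_sklearn_df = pd.DataFrame(range_scaler.fit_transform(df), columns=df.columns)
print(scaled_sklearn_df)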

Back to the top