
📖 AB Testing

🗂️ Data Setup

  • 🧾 Sample Data
  • 🛠️ Experiment Setup
  • ⚙️ Other Experiment Parameters
  • 🔧 Central Control Panel

🔀 Randomization Methods

  • 🔄 Simple Randomization
  • 🧬 Stratified Sampling
  • 🔁 Block Randomization
  • 🧯 Matched Pair Randomization
  • 🗃️ Cluster Randomization
  • 📉 CUPED
  • 🕸️ Network Effects

📈 EDA

  • 🔍 Normality
  • 🧪 Variance Homogeneity Check
  • 🧬 Test Family

🧪 AA Testing

  • 🧬 Outcome Similarity Test
  • ⚖️ Sample Ratio Mismatch
  • 📊 AA Test Visualization
  • 🎲 Type I Error Simulation

⚡ Power Analysis

  • ⚙️ Setup Inputs + Config
  • 📊 Baseline Estimation from Data
  • 📈 Minimum Detectable Effect
  • 📏 Required Sample Size
  • 📊 Power Analysis Summary

🧪 AB Testing

📉 Results

  • 🧾 Summaries
  • 📊 Visualization
  • 🎯 95% Confidence Intervals
  • 📈 Lift Analysis
  • ✅ Final Conclusion

⏱️ How Long?

  • 🧭 Monitoring Dashboard Components

🔍 Post Hoc Analysis

  • 🧩 Segmented Lift
  • 🚦 Guardrail Metrics
  • 🧠 Correcting for Multiple Comparisons
  • 🪄 Novelty Effects
  • 🎯 Primacy Effect
  • 📦 Rollout Simulation
  • 🧪 A/B Test Holdouts
  • 🚫 AB Limits

๐Ÿ—‚๏ธ Data Setupยถ

๐Ÿงพ Sample dataยถ

Inย [1]:
# Display Settings
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
from IPython.display import display, HTML

# Set Seed 
my_seed=1995

# Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import json
from scipy import stats
from scipy.stats import (
    ttest_ind,
    ttest_rel,
    chi2_contingency,
    mannwhitneyu,
    levene,
    shapiro
)
import statsmodels.api as sm
from statsmodels.stats.power import (
    TTestIndPower,
    TTestPower,
    FTestAnovaPower,
    NormalIndPower
)
from sklearn.model_selection import train_test_split
In [2]:
observations_count = 1000

np.random.seed(my_seed) # For reproducibility
users = pd.DataFrame({
    # identifiers
    'user_id': range(1, observations_count+1),

    # segmentation features
    'platform': np.random.choice(['iOS', 'Android'], size=observations_count, p=[0.6, 0.4]), # 60% iOS, 40% Android
    'device_type': np.random.choice(['mobile', 'desktop'], size=observations_count, p=[0.7, 0.3]),
    'user_tier': np.random.choice(['new', 'returning'], size=observations_count, p=[0.4, 0.6]),
    'region': np.random.choice(['North', 'South', 'East', 'West'], size=observations_count, p=[0.25, 0.25, 0.25, 0.25]),
    'plan_type': np.random.choice(['basic', 'premium', 'pro'], size=observations_count, p=[0.6, 0.3, 0.1]), # 60% basic, 30% premium, 10% pro
    'city': np.random.choice(['ny', 'sf', 'chicago', 'austin'], size=observations_count),

    # outcome metrics
    'engagement_score': np.random.normal(50, 15, observations_count), # Simulated user engagement scores
    'converted': np.random.binomial(n=1, p=0.1, size=observations_count), # Simulated binary conversion: ~10% baseline
    'past_purchase_count': np.random.normal(loc=50, scale=10, size=observations_count), # pre_experiment_metric for CUPED randomization
    'bounce_rate': np.nan # will be simulated later
})

# Simulate  a guardrail metric (bounce_rate)
np.random.seed(my_seed)
users['bounce_rate'] = np.where(
    users['converted'] == 1,
    np.random.normal(loc=0.2, scale=0.05, size=observations_count),
    np.random.normal(loc=0.6, scale=0.10, size=observations_count)
)
users['bounce_rate'] = users['bounce_rate'].clip(0, 1) # Bound bounce_rate between 0 and 1

users
Out[2]:
user_id platform device_type user_tier region plan_type city engagement_score converted past_purchase_count bounce_rate
0 1 iOS mobile new North premium austin 53.437537 0 50.653869 0.660153
1 2 iOS mobile returning North basic ny 48.924673 1 26.451597 0.126471
2 3 iOS desktop returning South premium austin 80.179294 0 43.112520 0.552955
3 4 iOS mobile new North premium austin 40.441478 0 48.339368 0.665883
4 5 Android mobile returning West basic chicago 54.171571 0 56.578205 0.503212
... ... ... ... ... ... ... ... ... ... ... ...
995 996 iOS desktop returning North basic sf 35.814776 0 55.156336 0.722611
996 997 iOS mobile new South pro ny 35.693639 0 48.434500 0.522644
997 998 Android desktop returning South premium chicago 33.913119 0 30.591967 0.644316
998 999 iOS mobile new West premium sf 40.789684 0 44.760253 0.540618
999 1000 iOS mobile returning South basic austin 71.341543 0 30.985932 0.542153

1000 rows × 11 columns

๐Ÿ› ๏ธ Experiment Setupยถ

Inย [3]:
# 1. Main outcome variable you're testing
outcome_metric_col = 'engagement_score'

# 2. Metric type: 'binary', 'continuous', or 'categorical'
outcome_metric_datatype = 'continuous'

# 3. Group assignment (to be generated)
group_labels = ('control', 'treatment')

# 4. Experimental design variant: independent or paired
variant = 'independent'  # Options: 'independent' (supported), 'paired' (not supported yet)

# 5. Optional: Unique identifier for each observation (can be user_id, session_id, etc.)
observation_id_col = 'user_id'

# 6. Optional: Pre-experiment metric for CUPED, if used
pre_experiment_metric = 'past_purchase_count'  # Can be None

โš™๏ธ Other Experiment Parametersยถ

Inย [4]:
# Number of groups in the experiment (e.g., 2 for A/B test, 3 for A/B/C test)
group_count = len(group_labels)

# Column name used to store assigned group after randomization
group_col = 'group'

# Randomization method to assign users to groups
# Options: 'simple', 'stratified', 'block', 'matched_pair', 'cluster', 'cuped'
randomization_method = "simple"

🔧 Central Control Panel

In [5]:
test_config = {
    # Core experiment setup
    'outcome_metric_col'     : outcome_metric_col,         # Main metric to analyze (e.g., 'engagement_score')
    'outcome_metric_datatype': outcome_metric_datatype,    # One of: 'binary', 'continuous', 'categorical'
    'group_labels'           : group_labels,               # Tuple of (control, treatment) group names
    'group_count'            : group_count,                # Number of groups (usually 2 for A/B tests)
    'variant'                : variant,                    # 'independent' or 'paired'
    'observation_id_col'     : observation_id_col,         # Unique identifier for each observation
    'pre_experiment_metric'  : pre_experiment_metric,      # Used for CUPED adjustment (if any)

    # Diagnostic results - filled in after the EDA/assumption checks
    'normality'              : None,  # Will be set based on Shapiro-Wilk or visual tests
    'equal_variance'         : None,  # Will be set using Levene's/Bartlett's test
    'family'                 : None   # Test family → 'z_test', 't_test', 'anova', 'chi_square', etc.
}

from IPython.display import HTML
display(HTML(f"<pre style='color:teal; font-size:14px;'>{json.dumps(test_config, indent=4)}</pre>"))
{
    "outcome_metric_col": "engagement_score",
    "outcome_metric_datatype": "continuous",
    "group_labels": [
        "control",
        "treatment"
    ],
    "group_count": 2,
    "variant": "independent",
    "observation_id_col": "user_id",
    "pre_experiment_metric": "past_purchase_count",
    "normality": null,
    "equal_variance": null,
    "family": null
}


🔀 Randomization Methods

📖 Click to Expand

Randomization is used to ensure that observed differences in outcome metrics are due to the experiment, not pre-existing differences.

  • Prevents selection bias (e.g., users self-selecting into groups)
  • Balances confounding factors like platform, region, or past behavior
  • Enables valid inference through statistical testing

🔄 Simple Randomization

📖 Click to Expand

Each user is assigned to control or treatment with equal probability, independent of any characteristics.

✅ When to Use:

  • Sample size is large enough to ensure natural balance
  • No strong concern about confounding variables
  • Need a quick, default assignment strategy

🛠️ How It Works:

  • Assign each user randomly (e.g., 50/50 split)
  • No grouping, segmentation, or blocking involved
  • Groups are expected to balance out on average
In [6]:
def apply_simple_randomization(df, group_labels=group_labels, group_col=group_col, seed=my_seed):
    """
    Randomly assigns each row to one of the specified groups.

    Parameters:
    - df: pandas DataFrame containing observations
    - group_labels: tuple of group names (default = ('control', 'treatment'))
    - group_col: name of the column to store group assignments
    - seed: random seed for reproducibility

    Returns:
    - DataFrame with an added group assignment column
    """
    np.random.seed(seed)
    df[group_col] = np.random.choice(group_labels, size=len(df), replace=True)
    return df
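
A quick sanity check on the observed split, as a minimal usage sketch of the function above (run on a copy so it does not overwrite the assignment made later in this notebook; `simple` is an illustrative variable name):

# Sketch: observed split after simple randomization (should be roughly 50/50 on average)
simple = apply_simple_randomization(users.copy())
print(simple[group_col].value_counts(normalize=True).round(3))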

🔄 Stratified Sampling

📖 Click to Expand

Ensures that key segments (e.g., platform, region) are evenly represented across control and treatment.

When to Use
  • User base is naturally skewed (e.g., 70% mobile, 30% desktop)
  • Important to control for known confounders like geography or device
  • You want balance within subgroups, not just overall
How It Works
  • Pick a stratification variable (e.g., platform)
  • Split the population into strata (groups)
  • Randomly assign users within each stratum
In [7]:
def apply_stratified_randomization(df, stratify_col, group_labels=group_labels, group_col=group_col, seed=my_seed):
    """
    Performs stratified randomization to assign rows into multiple groups while maintaining balance across strata.

    Parameters:
    - df: pandas DataFrame to assign groups to
    - stratify_col: column to balance across (e.g., platform, region)
    - group_labels: list or tuple of group names
    - group_col: name of the column to store group assignments
    - seed: random seed for reproducibility

    Returns:
    - DataFrame with a new group assignment column
    """
    np.random.seed(seed)
    df[group_col] = None
    n_groups = len(group_labels)

    # Stratify and assign
    for stratum_value, stratum_df in df.groupby(stratify_col):
        shuffled = stratum_df.sample(frac=1, random_state=seed)
        group_assignments = np.tile(group_labels, int(np.ceil(len(shuffled) / n_groups)))[:len(shuffled)]
        df.loc[shuffled.index, group_col] = group_assignments

    return df
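
To confirm the stratification worked, one can compare group shares within each stratum; a minimal sketch using the function above on a copy of the data, with `platform` as the stratification variable:

# Sketch: group shares within each platform stratum should be close to 50/50
strat = apply_stratified_randomization(users.copy(), stratify_col='platform')
print(pd.crosstab(strat['platform'], strat[group_col], normalize='index').round(3))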

🔄 Block Randomization

📖 Click to Expand

Groups users into fixed-size blocks and randomly assigns groups within each block.

When to Use
  • Users arrive in time-based batches (e.g., daily cohorts)
  • Sample size is small and needs enforced balance
  • You want to minimize temporal or ordering effects
How It Works
  • Create blocks based on order or ID (e.g., every 10 users)
  • Randomize assignments within each block
  • Ensures near-equal split in every batch
In [8]:
def apply_block_randomization(df, observation_id_col, group_col=group_col, block_size=10, group_labels=group_labels, seed=my_seed):
    """
    Assigns group labels using block randomization to ensure balance within fixed-size blocks.

    Parameters:
    - df: DataFrame to assign groups
    - observation_id_col: Unique ID to sort and block on (e.g., user_id)
    - group_col: Name of column to store assigned group labels
    - block_size: Number of observations in each block
    - group_labels: Tuple or list of group names (e.g., ('control', 'treatment', 'variant_B'))
    - seed: Random seed for reproducibility

    Returns:
    - DataFrame with a new column [group_col] indicating assigned group
    """
    np.random.seed(seed)
    df = df.sort_values(observation_id_col).reset_index(drop=True).copy()
    n_groups = len(group_labels)

    # Create block ID per row
    df['_block'] = df.index // block_size

    # Assign groups within each block
    group_assignments = []
    for _, block_df in df.groupby('_block'):
        block_n = len(block_df)
        reps = int(np.ceil(block_n / n_groups))
        candidates = np.tile(group_labels, reps)[:block_n]
        np.random.shuffle(candidates)
        group_assignments.extend(candidates)

    df[group_col] = group_assignments
    df = df.drop(columns=['_block'])

    return df
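
A usage sketch of the function above: the `_block` helper column is dropped inside the function, so it is rebuilt here from the sorted index to confirm each block of 10 gets a near-even split.

# Sketch: counts per group within the first few blocks (block size 10)
blocked = apply_block_randomization(users.copy(), observation_id_col='user_id', block_size=10)
blocked['_block'] = blocked.index // 10  # same block definition the function used internally
print(pd.crosstab(blocked['_block'], blocked[group_col]).head())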

🔄 Matched Pair Randomization

📖 Click to Expand

Participants are paired based on similar characteristics before random group assignment. This reduces variance and improves statistical power by ensuring balance on key covariates.

When to Use
  • Small sample size with high risk of confounding
  • Outcomes influenced by user traits (e.g., age, income, tenure)
  • Need to minimize variance across groups
How It Works
  1. Identify important covariates (e.g., age, purchase history)
  2. Sort users by those variables
  3. Create matched pairs (or small groups)
  4. Randomly assign one to control, the other to treatment
In [9]:
def apply_matched_pair_randomization(df, sort_col, group_col=group_col, group_labels=group_labels, seed=my_seed):
    """
    Assigns groups using matched-pair randomization based on a sorting variable.

    Parameters:
    - df: pandas DataFrame to assign groups to
    - sort_col: column used to sort users before pairing (e.g., engagement score)
    - group_col: name of the column to store group assignments
    - group_labels: tuple of group names (e.g., ('control', 'treatment'))
    - seed: accepted for interface consistency with the other randomizers (assignment here is deterministic)

    Returns:
    - DataFrame with alternating group assignments within sorted pairs
    """
    # Sort by matching variable so similar users are adjacent
    df = df.sort_values(by=sort_col).reset_index(drop=True)

    # Cycle through group labels for each row
    df[group_col] = [group_labels[i % len(group_labels)] for i in range(len(df))]

    return df
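
A small usage sketch of the function above: after pairing on the sort variable, its per-group means should be nearly identical.

# Sketch: matched-pair assignment should balance the sorting variable across groups
paired = apply_matched_pair_randomization(users.copy(), sort_col='engagement_score')
print(paired.groupby(group_col)['engagement_score'].mean().round(2))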

🔄 Cluster Randomization

📖 Click to Expand

Entire groups or clusters (e.g., cities, stores, schools) are assigned to control or treatment. Used when it's impractical or risky to randomize individuals within a cluster.

When to Use
  • Users naturally exist in groups (e.g., teams, locations, devices)
  • There's a risk of interference between users (e.g., word-of-mouth)
  • Operational or tech constraints prevent individual-level randomization
How It Works
  1. Define the cluster unit (e.g., store, city)
  2. Randomly assign each cluster to control or treatment
  3. All users within the cluster inherit the group assignment
In [10]:
def apply_cluster_randomization(df, cluster_col, group_col=group_col, group_labels=group_labels, seed=my_seed):
    """
    Assigns groups using cluster-level randomization: all observations in a cluster
    receive the same group assignment.

    Parameters:
    - df: pandas DataFrame to assign groups to
    - cluster_col: column representing the cluster unit (e.g., city, store)
    - group_col: name of the column where group labels will be stored
    - group_labels: tuple of group names to randomly assign (e.g., ('control', 'treatment'))
    - seed: random seed for reproducibility

    Returns:
    - DataFrame with assigned groups at the cluster level
    """
    np.random.seed(seed)

    # Unique clusters (e.g., unique city/store values)
    unique_clusters = df[cluster_col].unique()

    # Randomly assign each cluster to a group
    cluster_assignments = dict(
        zip(unique_clusters, np.random.choice(group_labels, size=len(unique_clusters)))
    )

    # Map group assignments to full DataFrame
    df[group_col] = df[cluster_col].map(cluster_assignments)

    return df
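
A usage sketch of the function above, confirming that every user in a given city inherits a single group label:

# Sketch: each cluster (city) should map to exactly one group
clustered = apply_cluster_randomization(users.copy(), cluster_col='city')
print(clustered.groupby('city')[group_col].agg(['nunique', 'first']))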

🔄 CUPED

📖 Click to Expand

CUPED (Controlled-experiment Using Pre-Experiment Data) is a statistical adjustment that uses pre-experiment behavior to reduce variance and improve power. It helps detect smaller effects without increasing sample size.

When to Use
  • You have reliable pre-experiment metrics (e.g., past spend, engagement)
  • You want to reduce variance and improve test sensitivity
  • You're dealing with small lifts or costly sample sizes
How It Works
  1. Identify a pre-period metric correlated with your outcome
  2. Use regression to compute an adjustment factor (theta)
  3. Subtract the correlated component from your outcome metric
  4. Analyze the adjusted metric instead of the raw one
In [11]:
def apply_cuped(
    df,
    pre_metric,
    outcome_metric_col,  # observed outcome column (e.g., engagement_score)
    outcome_col=None,
    group_col=group_col,
    group_labels=group_labels,
    seed=my_seed
):
    """
    Applies CUPED (Controlled-experiment Using Pre-Experiment Data) adjustment to reduce variance
    in the outcome metric using a pre-experiment covariate.

    CUPED is a post-randomization technique that reduces variance by adjusting the 
    observed outcome using a baseline (pre-metric) variable that is correlated 
    with the outcome.

    Parameters:
    ----------
    df : pandas.DataFrame
        Input DataFrame containing experiment data.
    pre_metric : str
        Column name of the pre-experiment covariate (e.g., 'past_purchase_count').
        This is the variable used to compute the adjustment factor (theta).
    outcome_metric_col : str
        Column name of the original observed outcome (e.g., 'engagement_score') 
        that you are comparing across groups.
    outcome_col : str, default=None
        Name of the new column where the adjusted outcome will be stored.
    group_col : str
        Column indicating the experiment group assignment (e.g., 'control' vs 'treatment').
    group_labels : tuple
        Tuple containing the names of the experiment groups.
    seed : int
        Random seed for reproducibility (used only if randomness is introduced later).

    Returns:
    -------
    df : pandas.DataFrame
        DataFrame with an additional column [outcome_col] containing the CUPED-adjusted outcome.
    """
    np.random.seed(seed)

    # Step 1: Use actual observed experiment outcome
    y = df[outcome_metric_col].values

    # Step 2: Regress outcome on pre-metric to estimate correction factor (theta)
    X = sm.add_constant(df[[pre_metric]])
    theta = sm.OLS(y, X).fit().params[pre_metric]

    # Step 3: Apply CUPED adjustment and save in new column
    if outcome_col is None:
        outcome_col = f'{outcome_metric_col}_cuped_adjusted'
    df[outcome_col] = y - theta * df[pre_metric]

    return df
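
The textbook formulation computes theta as cov(Y, X) / var(X) and centers the covariate, which keeps the adjusted metric on the original scale; the OLS slope above is numerically the same. A minimal sketch of that variant (illustrative only, not used by the dispatch cell below; the adjusted column name is hypothetical):

# Sketch: covariance-based CUPED adjustment (same slope as the OLS fit above)
def cuped_adjust(y, x):
    theta = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)  # cov(Y, X) / var(X)
    return y - theta * (x - x.mean())                        # centering preserves the overall mean

# Example (hypothetical column name):
# users['engagement_score_cuped'] = cuped_adjust(
#     users['engagement_score'].to_numpy(), users['past_purchase_count'].to_numpy()
# )
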
In [12]:
# Apply randomization method
if randomization_method == "simple":
    users = apply_simple_randomization(users, group_col=group_col, seed=my_seed)

elif randomization_method == "stratified":
    users = apply_stratified_randomization(users, stratify_col='platform', group_col=group_col, seed=my_seed)

elif randomization_method == "block":
    users = apply_block_randomization(users, observation_id_col='user_id', group_col=group_col, block_size=10, seed=my_seed)

elif randomization_method == "matched_pair":
    users = apply_matched_pair_randomization(users, sort_col=outcome_metric_col, group_col=group_col, seed=my_seed)

elif randomization_method == "cluster":
    users = apply_cluster_randomization(users, cluster_col='city', group_col=group_col, seed=my_seed)

elif randomization_method == "cuped":
    users = apply_cuped(users, pre_metric='past_purchase_count', outcome_metric_col=outcome_metric_col, group_col=group_col, group_labels=group_labels, seed=my_seed)
    # Update global outcome to CUPED-adjusted version
    outcome_metric_col = f"{outcome_metric_col}_cuped_adjusted"
else:
    raise ValueError(f"❌ Unsupported randomization method: {randomization_method}")

users
Out[12]:
user_id platform device_type user_tier region plan_type city engagement_score converted past_purchase_count bounce_rate group
0 1 iOS mobile new North premium austin 53.437537 0 50.653869 0.660153 control
1 2 iOS mobile returning North basic ny 48.924673 1 26.451597 0.126471 control
2 3 iOS desktop returning South premium austin 80.179294 0 43.112520 0.552955 control
3 4 iOS mobile new North premium austin 40.441478 0 48.339368 0.665883 treatment
4 5 Android mobile returning West basic chicago 54.171571 0 56.578205 0.503212 treatment
... ... ... ... ... ... ... ... ... ... ... ... ...
995 996 iOS desktop returning North basic sf 35.814776 0 55.156336 0.722611 treatment
996 997 iOS mobile new South pro ny 35.693639 0 48.434500 0.522644 control
997 998 Android desktop returning South premium chicago 33.913119 0 30.591967 0.644316 control
998 999 iOS mobile new West premium sf 40.789684 0 44.760253 0.540618 treatment
999 1000 iOS mobile returning South basic austin 71.341543 0 30.985932 0.542153 control

1000 rows × 12 columns

๐Ÿ•ธ๏ธ Network Effects & SUTVA Violations

๐Ÿ“– When Randomization Assumptions Break (Click to Expand)

Most A/B tests assume the Stable Unit Treatment Value Assumption (SUTVA) โ€” meaning:

  • A user's outcome depends only on their own treatment assignment.
  • One unit's treatment does not influence another unitโ€™s outcome.
๐Ÿงช Why It Mattersยถ

If users in different groups interact:

  • Control group behavior may be influenced by treatment group exposure.
  • This biases your estimates and dilutes treatment effect.
  • Standard tests may incorrectly accept the null hypothesis due to spillover.

This assumption breaks down in experiments involving social behavior, multi-user platforms, or ecosystem effects.

โš ๏ธ Common Violation Scenariosยถ
  • ๐Ÿ›๏ธ Marketplace platforms (e.g., sellers and buyers interact)
  • ๐Ÿง‘โ€๐Ÿคโ€๐Ÿง‘ Social features (e.g., follows, likes, comments, feeds)
  • ๐Ÿ“ฒ Referrals / network effects (e.g., invites, rewards)
  • ๐Ÿ’ฌ Chat and collaboration tools (e.g., Slack, Teams)
๐Ÿงฉ Solutions (If You Suspect Interference)ยถ
Strategy Description
Cluster Randomization Randomize at group level (e.g., friend group, region, org ID)
Isolation Experiments Only roll out to fully disconnected segments (e.g., one region only)
Network-Based Metrics Include network centrality / exposure as covariates
Post-Experiment Checks Monitor if control group was exposed indirectly (e.g., referrals, shared UIs)
Simulation-Based Designs Use agent-based or graph simulations to estimate contamination risk
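
A minimal post-experiment spillover check, assuming a hypothetical `interactions` DataFrame with columns ['user_id', 'friend_id'] describing who interacts with whom (this table is not part of the simulated data above):

# Sketch: estimate, for each control user, the share of their contacts assigned to treatment
def estimate_control_exposure(users_df, interactions, group_col='group',
                              id_col='user_id', treatment_label='treatment'):
    assignment = users_df.set_index(id_col)[group_col]

    edges = interactions.copy()
    edges['friend_group'] = edges['friend_id'].map(assignment)
    edges['exposed'] = (edges['friend_group'] == treatment_label).astype(int)

    exposure = edges.groupby('user_id')['exposed'].mean().rename('treatment_exposure')

    control_ids = users_df.loc[users_df[group_col] != treatment_label, id_col]
    return exposure.reindex(control_ids).fillna(0.0)

# Usage (hypothetical): a high share of heavily exposed control users hints at contamination.
# exposure = estimate_control_exposure(users, interactions)
# print(f"{(exposure > 0.5).mean():.1%} of control users have mostly-treated contacts")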


📈 EDA

Exploratory Data Analysis validates core statistical assumptions before testing begins.

🔍 Normality

📖 Click to Expand

Checks whether your outcome metric follows a normal distribution, which is a key assumption for parametric tests like the t-test or ANOVA.

  • Use the Shapiro-Wilk test or visual tools (histograms, Q-Q plots) - see the Q-Q plot sketch after the test output below.
  • Helps determine whether to use parametric or non-parametric tests.
  • If data is non-normal, switch to Mann-Whitney U or Wilcoxon.
In [13]:
def test_normality(df, outcome_metric_col, group_col, group_labels):
    results = {}
    for group in group_labels:
        group_data = df[df[group_col] == group][outcome_metric_col]
        stat, p = shapiro(group_data)
        results[group] = {'statistic': stat, 'p_value': p, 'normal': p > 0.05}
    return results
In [14]:
normality_results = test_normality(users, outcome_metric_col=outcome_metric_col, group_col='group', group_labels=group_labels)

print("Normality test (Shapiro-Wilk) results:")
for group, result in normality_results.items():
    print(f"{group}: p = {result['p_value']:.4f} → {'Normal' if result['normal'] else 'Non-normal'}")
Normality test (Shapiro-Wilk) results:
control: p = 0.2230 → Normal
treatment: p = 0.6053 → Normal
In [15]:
# Assume both groups must be normal to proceed with parametric tests
test_config['normality'] = all(result['normal'] for result in normality_results.values())
test_config
Out[15]:
{'outcome_metric_col': 'engagement_score',
 'outcome_metric_datatype': 'continuous',
 'group_labels': ('control', 'treatment'),
 'group_count': 2,
 'variant': 'independent',
 'observation_id_col': 'user_id',
 'pre_experiment_metric': 'past_purchase_count',
 'normality': True,
 'equal_variance': None,
 'family': None}
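
As a visual complement to Shapiro-Wilk, a minimal Q-Q plot sketch per group (assumes `users`, `outcome_metric_col`, and `group_labels` as defined above):

# Sketch: Q-Q plots per group; points hugging the diagonal suggest approximate normality
fig, axes = plt.subplots(1, len(group_labels), figsize=(10, 4))
for ax, group in zip(axes, group_labels):
    sample = users.loc[users['group'] == group, outcome_metric_col]
    stats.probplot(sample, dist="norm", plot=ax)
    ax.set_title(f"Q-Q plot: {group}")
plt.tight_layout()
plt.show()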

๐Ÿ” Variance Homogeneity Check

๐Ÿ“– Click to Expand

Tests whether the variances between groups are equal, which affects the validity of t-tests and ANOVA.

  • Performed using Leveneโ€™s test or Bartlettโ€™s test.
  • If variances are unequal, use Welch's t-test instead.
  • Unequal variances do not invalidate analysis but change the test used.
Inย [16]:
def test_equal_variance(df, outcome_metric_col, group_col, group_labels):
    group_data = [df[df[group_col] == label][outcome_metric_col] for label in group_labels]
    stat, p = levene(*group_data)
    return {'statistic': stat, 'p_value': p, 'equal_variance': p > 0.05}
In [17]:
variance_result = test_equal_variance(users, outcome_metric_col=outcome_metric_col, group_col='group', group_labels=group_labels)
variance_result
Out[17]:
{'statistic': 0.08918799756611763,
 'p_value': 0.7652741675085144,
 'equal_variance': True}
In [18]:
print(f"Levene's test: p = {variance_result['p_value']:.4f} → {'Equal variances' if variance_result['equal_variance'] else 'Unequal variances'}")
test_config['equal_variance'] = variance_result['equal_variance']
test_config
Levene's test: p = 0.7653 → Equal variances
Out[18]:
{'outcome_metric_col': 'engagement_score',
 'outcome_metric_datatype': 'continuous',
 'group_labels': ('control', 'treatment'),
 'group_count': 2,
 'variant': 'independent',
 'observation_id_col': 'user_id',
 'pre_experiment_metric': 'past_purchase_count',
 'normality': True,
 'equal_variance': True,
 'family': None}

๐Ÿ” Test Family

๐Ÿ“– Click to Expand

Selects the appropriate statistical test based on:

  • Outcome data type (binary, continuous, categorical)
  • Distributional assumptions (normality, variance)
  • Number of groups and experiment structure (independent vs paired)

This step automatically maps to the correct test (e.g., t-test, z-test, chi-square, ANOVA).

๐Ÿงช Experiment Type โ†’ Test Family Mapping
Outcome MetricNormalityGroup CountSelected Test Family
binaryโ€”2z_test
binaryโ€”3+chi_square
continuousโœ…2t_test
continuousโœ…3+anova
continuousโŒ2non_parametric (Mann-Whitney U)
continuousโŒ3+non_parametric (Kruskal-Wallis)
categoricalโ€”2chi_square
categoricalโ€”3+chi_square
Inย [19]:
def determine_test_family(test_config):
    """
    Decide which family of statistical test to use based on:
    - outcome data type: binary / continuous / categorical
    - group count: 2 or 3+
    - variant: independent or paired (optional for family level)
    - normality assumption: passed or not
    """

    data_type = test_config['outcome_metric_datatype']
    group_count = test_config['group_count']
    variant = test_config['variant']
    normality = test_config['normality']

    # Binary outcome → Z-test for 2 groups, Chi-square for 3+ groups
    if data_type == 'binary':
        if group_count == 2:
            return 'z_test'           # Compare proportions across 2 groups
        else:
            return 'chi_square'      # 2x3+ contingency test

    # Continuous outcome → check for normality and group count
    elif data_type == 'continuous':
        if not normality:
            return 'non_parametric'  # Mann-Whitney U or Kruskal-Wallis
        if group_count == 2:
            return 't_test'          # Independent or paired t-test
        else:
            return 'anova'           # One-way ANOVA

    # Categorical outcome → Chi-square always
    elif data_type == 'categorical':
        return 'chi_square'

    else:
        raise ValueError(f"Unsupported outcome_metric_datatype: {data_type}")
In [20]:
test_config['family'] = determine_test_family(test_config)
test_config

print(f"✅ Selected test family: {test_config['family']}")
Out[20]:
{'outcome_metric_col': 'engagement_score',
 'outcome_metric_datatype': 'continuous',
 'group_labels': ('control', 'treatment'),
 'group_count': 2,
 'variant': 'independent',
 'observation_id_col': 'user_id',
 'pre_experiment_metric': 'past_purchase_count',
 'normality': True,
 'equal_variance': True,
 'family': 't_test'}
✅ Selected test family: t_test


🧪 AA Testing

📖 Click to Expand

A/A testing is a preliminary experiment where both groups (e.g., "control" and "treatment") receive the exact same experience. It's used to validate the experimental setup before running an actual A/B test.

What Are We Checking?

  • Are users being assigned fairly and randomly?
  • Are key outcome metrics statistically similar across groups?
  • Can we trust the experimental framework?

Why A/A Testing Matters

  • Validates Randomization - confirms the groups are balanced at baseline (no bias or leakage)
  • Detects SRM (Sample Ratio Mismatch) - ensures the actual split (e.g., 50/50) matches what was intended
  • Estimates Variability - helps calibrate variance for accurate power calculations later
  • Trust Check - catches bugs in assignment logic, event tracking, or instrumentation

A/A Test Process

  1. Randomly assign users into two equal groups - just like you would for an A/B test (e.g., control vs treatment)
  2. Measure the key outcome - this depends on your experiment type:
    • binary → conversion rate
    • continuous → avg. revenue, time spent
    • categorical → feature adoption, plan selected
  3. Run a statistical test:
    • binary → Z-test or Chi-square
    • continuous → t-test
    • categorical → Chi-square test
  4. Check SRM - use a chi-square goodness-of-fit test to detect assignment imbalances

Possible Outcomes

Result                               | Interpretation
No significant difference            | ✅ Randomization looks good. Test setup is sound.
Statistically significant difference | ⚠️ Something's off - check assignment logic, instrumentation, or sample leakage

Run A/A tests whenever you launch a new experiment framework, roll out a new randomizer, or need to build stakeholder trust.

🧬 Outcome Similarity Test

📖 Click to Expand

Compares the outcome metric across groups to ensure no significant differences exist when there shouldn't be any - usually used during A/A testing or pre-experiment validation.

  • Helps detect setup issues like biased group assignment or data leakage.
  • Null Hypothesis: No difference in outcomes between control and treatment.
  • Uses the same statistical test as the main A/B test (e.g., t-test, z-test, chi-square).
In [21]:
def run_outcome_similarity_test(
    df,
    group_col,
    metric_col,
    test_family,
    variant=None,
    group_labels=('control', 'treatment'),
    alpha=0.05,
    verbose=True
):
    """
    Runs a similarity test between two groups based on test_family and variant.

    Parameters:
    - df: pandas DataFrame
    - group_col: column with group assignment
    - metric_col: outcome metric
    - test_family: one of ['z_test', 't_test', 'chi_square', 'anova', 'non_parametric']
    - variant: 'independent' or 'paired' (required for t-test)
    - group_labels: tuple of (control, treatment)
    - alpha: significance threshold
    - verbose: print detailed interpretation
    """

    if verbose:
        print("📝 Outcome Similarity Check\n")

    group1 = df[df[group_col] == group_labels[0]][metric_col]
    group2 = df[df[group_col] == group_labels[1]][metric_col]

    # --- Run appropriate test ---
    if test_family == 'z_test':
        conv1, conv2 = group1.mean(), group2.mean()
        n1, n2 = len(group1), len(group2)
        pooled_prob = (group1.sum() + group2.sum()) / (n1 + n2)
        se = np.sqrt(pooled_prob * (1 - pooled_prob) * (1/n1 + 1/n2))
        z_score = (conv2 - conv1) / se
        p_value = 2 * (1 - stats.norm.cdf(abs(z_score)))
        test_name = "z-test for proportions"

    elif test_family == 't_test':
        if variant == 'independent':
            t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=False)
            test_name = "independent t-test"
        elif variant == 'paired':
            if len(group1) != len(group2):
                print("❌ Paired t-test requires equal-length samples.")
                return None
            t_stat, p_value = stats.ttest_rel(group1, group2)
            test_name = "paired t-test"
        else:
            raise ValueError("Missing or invalid variant for t-test.")

    elif test_family == 'chi_square':
        contingency = pd.crosstab(df[group_col], df[metric_col])
        chi2_stat, p_value, _, _ = stats.chi2_contingency(contingency)
        test_name = "chi-square test"

    elif test_family == 'anova':
        f_stat, p_value = stats.f_oneway(group1, group2)
        test_name = "one-way ANOVA"

    elif test_family == 'non_parametric':
        u_stat, p_value = stats.mannwhitneyu(group1, group2, alternative='two-sided')
        test_name = "Mann-Whitney U test"

    else:
        raise ValueError(f"❌ Unsupported test family: {test_family}")

    # --- Detailed Interpretation ---
    if verbose:
        print("\n🧠 Interpretation:")

        if test_family == 'z_test':
            print(f"Used a {test_name} to compare conversion rates between groups.")
            print("Null Hypothesis: Conversion rates are equal across groups.")

        elif test_family == 't_test':
            if variant == 'independent':
                print(f"Used an {test_name} to compare means of '{metric_col}' across independent groups.")
                print("Null Hypothesis: Group means are equal.")
            elif variant == 'paired':
                print(f"Used a {test_name} to compare within-user differences in '{metric_col}'.")
                print("Null Hypothesis: Mean difference between pairs is zero.")

        elif test_family == 'chi_square':
            print(f"Used a {test_name} to test whether '{metric_col}' distribution depends on group.")
            print("Null Hypothesis: No association between group and category.")

        elif test_family == 'anova':
            print(f"Used a {test_name} to compare group means of '{metric_col}' across 3+ groups.")
            print("Null Hypothesis: All group means are equal.")

        elif test_family == 'non_parametric':
            print(f"Used a {test_name} to compare medians of '{metric_col}' across groups (non-parametric).")
            print("Null Hypothesis: Distributions are identical across groups.")

        print(f"\nWe use α = {alpha:.2f}")
        if p_value < alpha:
            print(f"➡️ p = {p_value:.4f} < α → Reject null hypothesis. Statistically significant difference.")
        else:
            print(f"➡️ p = {p_value:.4f} ≥ α → Fail to reject null. No statistically significant difference.")

    return p_value

🧬 Sample Ratio Mismatch

📖 Click to Expand

Is group assignment balanced?

  • SRM (Sample Ratio Mismatch) checks whether the observed group sizes match the expected ratio.
  • In a perfect world, random assignment to 'A1' and 'A2' should give a ~50/50 split.
  • SRM helps catch bugs in randomization, data logging, or user eligibility filtering.

Real-World Experiment Split Ratios

Scenario                       | Split                    | Why
Default A/B                    | 50 / 50                  | Maximizes power and ensures fairness
Risky feature                  | 10 / 90 or 20 / 80       | Limits user exposure to minimize risk
Ramp-up                        | Step-wise (1-5-25-50…)   | Gradual rollout to catch issues early
A/B/C Test                     | 33 / 33 / 33 or weighted | Compare multiple variants fairly or with intentional weighting
High control confidence needed | 70 / 30 or 60 / 40       | More stability in baseline comparisons
In [22]:
def run_aa_testing_generalized(
    df,
    group_col,
    metric_col,
    group_labels,
    test_family,
    variant=None,
    alpha=0.05,
    visualize=True
):
    """
    Runs A/A test: SRM check + similarity test + optional visualization.
    All logic routed by test_family + variant (no experiment_type).
    """
    print(f"\n📊 A/A Test Summary for metric: '{metric_col}' [{test_family}, {variant}]\n")

    check_sample_ratio_mismatch(df, group_col, group_labels, alpha=alpha, expected_ratios=[0.5, 0.5])

    group1 = df[df[group_col] == group_labels[0]][metric_col]
    group2 = df[df[group_col] == group_labels[1]][metric_col]

    p_value = run_outcome_similarity_test(
        df=df,
        group_col=group_col,
        metric_col=metric_col,
        test_family=test_family,
        variant=variant,
        group_labels=group_labels,
        alpha=alpha
    )

    if visualize and p_value is not None:
        visualize_aa_distribution(
            df, group1, group2,
            group_col=group_col,
            metric_col=metric_col,
            test_family=test_family,
            variant=variant,
            group_labels=group_labels
        )
In [23]:
def check_sample_ratio_mismatch(df, group_col, group_labels, expected_ratios=None, alpha=0.05):
    """
    Checks for Sample Ratio Mismatch (SRM) using a Chi-Square test.

    Parameters:
    - df: DataFrame with group assignments
    - group_col: Column containing group assignment
    - group_labels: List or tuple of group names (e.g., ['control', 'treatment'])
    - expected_ratios: Expected proportions per group (e.g., [0.5, 0.5])
    - alpha: Significance level

    Prints observed vs expected distribution and test results.
    """
    print("🔍 Sample Ratio Mismatch (SRM) Check")

    observed_counts = df[group_col].value_counts().reindex(group_labels, fill_value=0)

    if expected_ratios is None:
        expected_ratios = [1 / len(group_labels)] * len(group_labels)
    else:
        total = sum(expected_ratios)
        expected_ratios = [r / total for r in expected_ratios]  # normalize to sum to 1

    expected_counts = [len(df) * ratio for ratio in expected_ratios]

    # Print group-wise summary
    for grp, expected in zip(group_labels, expected_counts):
        observed = observed_counts.get(grp, 0)
        pct = observed / len(df) * 100
        print(f"Group {grp}: {observed} users ({pct:.2f}%) - Expected: {expected:.1f}")

    # Run Chi-square test
    chi2_stat, chi2_p = stats.chisquare(f_obs=observed_counts, f_exp=expected_counts)
    print(f"\nChi2 Statistic: {chi2_stat:.4f}")
    print(f"P-value       : {chi2_p:.4f}")

    if chi2_p < alpha:
        print("⚠️ SRM Detected - group assignment might be biased.\n")
    else:
        print("✅ No SRM - group sizes look balanced.\n")
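
The same check works for deliberately uneven splits such as a ramp-up; a minimal sketch, assuming a hypothetical 20/80 assignment stored in a `ramp_group` column:

# Sketch: SRM check for an intentional 20/80 ramp-up split
ramp_df = users.copy()
np.random.seed(my_seed)
ramp_df['ramp_group'] = np.random.choice(group_labels, size=len(ramp_df), p=[0.2, 0.8])

check_sample_ratio_mismatch(
    ramp_df,
    group_col='ramp_group',
    group_labels=list(group_labels),
    expected_ratios=[0.2, 0.8],  # compare observed counts against the intended 20/80 split
)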

📊 AA Test Visualization

In [24]:
def visualize_aa_distribution(df, group1, group2, group_col, metric_col, test_family, variant=None, group_labels=('control', 'treatment')):
    if test_family in ['t_test', 'anova', 'non_parametric']:
        plt.hist(group1, bins=30, alpha=0.5, label=group_labels[0])
        plt.hist(group2, bins=30, alpha=0.5, label=group_labels[1])
        plt.title(f"A/A Test: {metric_col} Distribution")
        plt.xlabel(metric_col)
        plt.ylabel("Frequency")
        plt.legend()
        plt.show()

    elif test_family == 'z_test':
        rates = [group1.mean(), group2.mean()]
        plt.bar(group_labels, rates)
        for i, rate in enumerate(rates):
            plt.text(i, rate + 0.01, f"{rate:.2%}", ha='center')
        plt.title("A/A Test: Conversion Rate by Group")
        plt.ylabel("Conversion Rate")
        plt.ylim(0, 1)
        plt.grid(axis='y', linestyle='--', alpha=0.7)
        plt.show()

    elif test_family == 'chi_square':
        contingency = pd.crosstab(df[group_col], df[metric_col], normalize='index')
        contingency.plot(kind='bar', stacked=True)
        plt.title(f"A/A Test: {metric_col} Distribution by Group")
        plt.ylabel("Proportion")
        plt.xlabel(group_col)
        plt.legend(title=metric_col)
        plt.grid(axis='y', linestyle='--', alpha=0.7)
        plt.show()
In [25]:
run_aa_testing_generalized(
    df=users,
    group_col='group',
    metric_col=test_config['outcome_metric_col'],
    group_labels=test_config['group_labels'],
    test_family=test_config['family'],
    variant=test_config.get('variant'),
    alpha=0.05
)
📊 A/A Test Summary for metric: 'engagement_score' [t_test, independent]

🔍 Sample Ratio Mismatch (SRM) Check
Group control: 510 users (51.00%) - Expected: 500.0
Group treatment: 490 users (49.00%) - Expected: 500.0

Chi2 Statistic: 0.4000
P-value       : 0.5271
✅ No SRM - group sizes look balanced.

📝 Outcome Similarity Check


🧠 Interpretation:
Used an independent t-test to compare means of 'engagement_score' across independent groups.
Null Hypothesis: Group means are equal.

We use α = 0.05
➡️ p = 0.4657 ≥ α → Fail to reject null. No statistically significant difference.
[Figure: histogram of engagement_score for control vs treatment]

🎲 Type I Error Simulation

📖 Click to Expand
🔁 Repeated A/A Tests

While a single A/A test helps detect obvious flaws in group assignment (like SRM or data leakage), it's still a one-off check. To gain confidence in your randomization method, we simulate multiple A/A tests using the same logic:

  • Each run reassigns users randomly into control and treatment (with no actual change)
  • We then run the statistical test between groups for each simulation
  • We track how often the test reports a false positive (p < α), which estimates the Type I error rate
In theory, if your setup is unbiased and α = 0.05, you'd expect about 5% of simulations to return a significant result - this validates that your A/B framework isn't "trigger-happy."
📊 What this tells you:
  • Too many significant p-values → your framework is too noisy (bad randomization, poor test choice)
  • Near 5% = healthy noise level, expected by design

This step is optional but highly recommended when you're:

  • Trying out a new randomization strategy
  • Validating an internal experimentation framework
  • Stress-testing your end-to-end pipeline
In [26]:
def simulate_aa_type1_error_rate(
    df,
    metric_col,
    group_labels,
    test_family,
    variant=None,
    runs=100,
    alpha=0.05,
    seed=42,
    verbose=False
):
    """
    Simulates repeated A/A tests to estimate empirical Type I error rate.

    Returns:
    - p_values: list of p-values from each simulation
    """
    np.random.seed(seed)
    p_values = []

    for i in range(runs):
        shuffled_df = df.copy()
        shuffled_df['group'] = np.random.choice(group_labels, size=len(df), replace=True)

        p = run_outcome_similarity_test(
            df=shuffled_df,
            group_col='group',
            metric_col=metric_col,
            test_family=test_family,
            variant=variant,
            group_labels=group_labels,
            alpha=alpha,
            verbose=False
        )

        if p is not None:
            p_values.append(p)

        if verbose:
            print(f"Run {i+1}: p = {p:.4f}")

    significant = sum(p < alpha for p in p_values)
    error_rate = significant / runs

    print(f"\n📈 Type I Error Rate Estimate: {significant}/{runs} = {error_rate:.2%}")

    # Interpretation Block
    print(f"""
            🧠 Summary Interpretation:
            We simulated {runs} A/A experiments using random group assignment (no actual treatment).

            Test: {test_family.upper()}{' (' + variant + ')' if variant else ''}
            Metric: {metric_col}
            Alpha: {alpha}

            False positives (p < α): {significant} / {runs}
            → Estimated Type I Error Rate: {error_rate:.2%}

            This is within expected range for α = {alpha}.
            → ✅ Test framework is behaving correctly - no bias or sensitivity inflation.
            """)

    plot_p_value_distribution(p_values, alpha=alpha)

    return p_values
In [27]:
def plot_p_value_distribution(p_values, alpha=0.05):
    plt.figure(figsize=(8, 4))
    plt.hist(p_values, bins=20, edgecolor='black', alpha=0.7)
    plt.axvline(x=alpha, color='red', linestyle='--', label=f"α = {alpha}")
    plt.title("P-value Distribution Across A/A Tests")
    plt.xlabel("P-value")
    plt.ylabel("Frequency")
    plt.legend()
    plt.grid(axis='y', linestyle='--', alpha=0.6)
    plt.show()
In [28]:
_ = simulate_aa_type1_error_rate(
    df=users,
    metric_col=test_config['outcome_metric_col'],
    group_labels=test_config['group_labels'],
    test_family=test_config['family'],
    variant=test_config.get('variant'),
    runs=100,
    alpha=0.05
)
📈 Type I Error Rate Estimate: 4/100 = 4.00%

            🧠 Summary Interpretation:
            We simulated 100 A/A experiments using random group assignment (no actual treatment).

            Test: T_TEST (independent)
            Metric: engagement_score
            Alpha: 0.05

            False positives (p < α): 4 / 100
            → Estimated Type I Error Rate: 4.00%

            This is within expected range for α = 0.05.
            → ✅ Test framework is behaving correctly - no bias or sensitivity inflation.

[Figure: distribution of p-values across the simulated A/A tests]


⚡ Power Analysis

📖 Click to Expand

Power analysis helps determine the minimum sample size required to detect a true effect with statistical confidence.

Why It Matters:
  • Avoids underpowered tests (risk of missing real effects)
  • Balances the tradeoffs between sample size, Minimum Detectable Effect (MDE), significance level (α), and statistical power (1 - β)
Key Inputs:
Parameter     | Meaning
alpha (α)     | Significance level (probability of a false positive), e.g. 0.05
Power (1 - β) | Probability of detecting a true effect, e.g. 0.80 or 0.90
Baseline      | Current outcome (e.g., 10% conversion, $50 revenue)
MDE           | Minimum detectable effect - the smallest meaningful lift (e.g., +2% or +$5)
Std Dev       | Standard deviation of the metric (for continuous outcomes)
Effect Size   | Optional: Cohen's d (for t-tests) or f (for ANOVA)
Groups        | Number of groups (relevant for ANOVA)

This notebook automatically selects the correct formula based on the test_family and variant settings in test_config.

⚙️ Setup Inputs + Config Values

📖 Click to Expand

These are the core experiment design parameters required for power analysis and statistical testing.

  • alpha: Significance level - the tolerance for false positives (commonly set at 0.05).
  • power: Probability of detecting a true effect - typically 0.80 or 0.90.
  • group_labels: The names of the experimental groups (e.g., 'control', 'treatment').
  • metric_col: Outcome metric column you're analyzing.
  • test_family: Chosen statistical test (e.g., 't_test', 'z_test', 'chi_square') based on assumptions.
  • variant: Experimental design structure - 'independent' or 'paired'.

These inputs drive sample size estimation, test choice, and downstream analysis logic.

In [29]:
# Define Core Inputs

# Use values from your config or plug in manually
alpha = 0.05  # False positive tolerance (Type I error)
power = 0.80  # Statistical power (1 - Type II error)
group_labels = test_config['group_labels']
metric_col = test_config['outcome_metric_col']
test_family = test_config['family']
variant = test_config.get('variant')

📈 Baseline Estimation from Data

📖 Click to Expand

Before we calculate the required sample size, we need a baseline value from historical or current data.

  • For binary metrics (e.g., conversion), the baseline is the current conversion rate.
  • For continuous metrics (e.g., revenue, engagement), we estimate the mean and standard deviation from the control group.
  • These values help translate the Minimum Detectable Effect (MDE) into a usable effect size.
⚠️ Be cautious with outliers or extreme skew when computing baselines - they directly influence sample size estimates.
In [30]:
# 🧮 Data-Driven Baseline Metric

if test_family == 'z_test':
    # For binary outcome (e.g., conversion): baseline = conversion rate in data
    baseline_rate = users[metric_col].mean()
    print(f"📊 Baseline conversion rate: {baseline_rate:.2%}")

elif test_family in ['t_test', 'anova', 'non_parametric']:
    # For continuous metrics (e.g., revenue, engagement)
    control_data = users[users['group'] == group_labels[0]][metric_col]
    baseline_mean = control_data.mean()
    std_dev = control_data.std()
    print(f"📊 Control group mean: {baseline_mean:.2f}")
    print(f"📏 Control group std dev: {std_dev:.2f}")

else:
    baseline_rate = None
    std_dev = None
📊 Control group mean: 51.03
📏 Control group std dev: 15.61

📈 Minimum Detectable Effect

📖 Click to Expand

🎯 Minimum Detectable Effect (MDE) is the smallest business-relevant difference you want your test to catch.

  • It reflects what matters - not what the data happens to show
  • Drives required sample size:
    • Smaller MDE → larger sample
    • Larger MDE → smaller sample

🧠 Choose an MDE based on:

  • What level of uplift would justify launching the feature?
  • What's a meaningful change in your metric - not just statistical noise?
In [31]:
# Minimum Detectable Effect (MDE)
# This is NOT data-driven - it reflects the minimum improvement you care about detecting.
# It should be small enough to catch valuable changes, but large enough to avoid inflating sample size.

# Examples by Metric Type:
# - Binary       : 0.02 → detect a 2% lift in conversion rate (e.g., from 10% to 12%)
# - Categorical  : 0.05 → detect a 5% shift in plan preference (e.g., more users choosing 'premium' over 'basic')
# - Continuous   : 3.0  → detect a 3-point gain in engagement score (e.g., from 50 to 53 avg. score)

mde = 5  # Change this based on business relevance

๐Ÿ“ Required Sample Sizeยถ

Inย [32]:
def calculate_power_sample_size(
    test_family,
    variant=None,
    alpha=0.05,
    power=0.80,
    baseline_rate=None,  # required for z-test
    mde=None,
    std_dev=None,
    effect_size=None,
    num_groups=2  # placeholder for future ANOVA support
):
    """
    Calculate required sample size per group based on test type and assumptions.

    Supported families:
    - 'z_test'              : Binary outcomes (proportions)
    - 't_test'              : Continuous outcomes (independent or paired)
    - 'non_parametric'      : Mann-Whitney (approximated as t-test)
    - 'anova'               : Not implemented (default to t-test)
    - 'chi_square'          : Categorical outcomes (not used in this version)
    """
    # -- Z-Test for Binary Proportions --
    if test_family == 'z_test':
        if baseline_rate is None or mde is None:
            raise ValueError("baseline_rate and mde are required for z-test (binary outcome).")

        z_alpha = stats.norm.ppf(1 - alpha / 2)
        z_beta = stats.norm.ppf(power)
        p1 = baseline_rate
        p2 = p1 + mde
        pooled_std = np.sqrt(2 * p1 * (1 - p1))

        n = ((z_alpha + z_beta) ** 2 * pooled_std ** 2) / (mde ** 2)
        return int(np.ceil(n))

    # -- T-Test for Continuous (Independent or Paired) --
    elif test_family in ['t_test', 'non_parametric', 'anova']:
        if effect_size is None:
            if std_dev is None or mde is None:
                raise ValueError("For continuous outcomes, provide either effect_size or both std_dev and mde.")
            effect_size = mde / std_dev  # Cohen's d

        if variant == 'independent':
            analysis = TTestIndPower()
        elif variant == 'paired':
            analysis = TTestPower()
        else:
            raise ValueError("variant must be 'independent' or 'paired' for t-test.")

        n = analysis.solve_power(effect_size=effect_size, power=power, alpha=alpha)
        return int(np.ceil(n))

    else:
        raise ValueError(f"❌ Unsupported test family: {test_family}")
In [33]:
required_sample_size = calculate_power_sample_size(
    test_family=test_family,
    variant=variant,
    alpha=alpha,
    power=power,
    baseline_rate=baseline_rate if test_family == 'z_test' else None,
    mde=mde,
    std_dev=std_dev if test_family in ['t_test', 'anova', 'non_parametric'] else None,
    effect_size=None,  # Let it compute internally via mde/std
    num_groups=2
)

test_config['required_sample_size'] = required_sample_size
print(f"✅ Required sample size per group: {required_sample_size}")
print(f"👥 Total sample size: {required_sample_size * 2}")
✅ Required sample size per group: 154
👥 Total sample size: 308

📊 Power Analysis Summary

In [34]:
def print_power_summary(
    test_family,
    variant,
    alpha,
    power,
    baseline_rate=None,
    mde=None,
    std_dev=None,
    required_sample_size=None
):
    print("📈 Power Analysis Summary")
    print(f"- Test: {test_family.upper()}{' (' + variant + ')' if variant else ''}")
    print(f"- Significance level (α): {alpha}")
    print(f"- Statistical power (1 - β): {power}")

    if test_family == 'z_test':
        print(f"- Baseline conversion rate: {baseline_rate:.2%}")
        print(f"- MDE: {mde:.2%}")
        print(f"\n✅ To detect a lift from {baseline_rate:.2%} to {(baseline_rate + mde):.2%},")
        print(f"you need {required_sample_size} users per group → total {required_sample_size * 2} users.")

    elif test_family == 't_test':
        print(f"- Std Dev (control group): {std_dev:.2f}")
        print(f"- MDE (mean difference): {mde}")
        print(f"- Cohen's d: {mde / std_dev:.2f}")
        print(f"\n✅ To detect a {mde}-unit lift in mean outcome,")
        print(f"you need {required_sample_size} users per group → total {required_sample_size * 2} users.")

    else:
        print("⚠️ Unsupported family for summary.")

print_power_summary(
    test_family=test_family,
    variant=variant,
    alpha=alpha,
    power=power,
    baseline_rate=baseline_rate if test_family == 'z_test' else None,
    mde=mde,
    std_dev=std_dev if test_family == 't_test' else None,
    required_sample_size=required_sample_size
)
📈 Power Analysis Summary
- Test: T_TEST (independent)
- Significance level (α): 0.05
- Statistical power (1 - β): 0.8
- Std Dev (control group): 15.61
- MDE (mean difference): 5
- Cohen's d: 0.32

✅ To detect a 5-unit lift in mean outcome,
you need 154 users per group → total 308 users.
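
To see how these numbers move with the assumptions, a power-curve sketch using statsmodels (the effect sizes are illustrative Cohen's d values; 0.32 corresponds to the mde/std_dev used above):

# Sketch: statistical power vs. sample size per group, for several candidate effect sizes
from statsmodels.stats.power import TTestIndPower

effect_sizes = np.array([0.1, 0.2, 0.32, 0.5])   # Cohen's d candidates
sample_sizes = np.arange(20, 500, 10)            # per-group sample sizes to evaluate

TTestIndPower().plot_power(dep_var='nobs', nobs=sample_sizes,
                           effect_size=effect_sizes, alpha=alpha)
plt.axhline(0.8, color='grey', linestyle='--')   # target power used in this notebook
plt.title("Power vs. sample size per group (independent t-test)")
plt.show()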


🧪 A/B Testing

🔗 For test selection (e.g., Z-test, t-test), refer to the 📖 Hypothesis Testing Notebook

📖 Click to Expand
🧪 A/B Testing - Outcome Comparison

This section compares the outcome metric between control and treatment groups using the appropriate statistical test for the configured test family and variant.

📌 Metric Tracked:
  • Primary metric: Depends on use case:
    • Binary: Conversion rate (clicked or not)
    • Continuous: Average engagement, revenue, time spent
    • Categorical: Plan type, user tier, etc.
  • Unit of analysis: Unique user or unique observation
🔬 Outcome Analysis Steps:
  • Choose the right statistical test based on test_family and variant:
    • 'z_test' → Z-test for proportions (binary outcomes)
    • 't_test' + 'independent' → Two-sample t-test
    • 't_test' + 'paired' → Paired t-test
    • 'non_parametric' → Mann-Whitney U test
    • 'chi_square' → Chi-square test of independence (categorical outcomes)
  • Calculate test statistics, p-values, and confidence intervals
  • Visualize the comparison to aid interpretation
In [35]:
def run_ab_test(
    df,
    group_col,
    metric_col,
    group_labels,
    test_family,
    variant=None,
    alpha=0.05
):
    """
    Runs the correct statistical test based on test_family + variant combo.

    Returns:
    - result dict with summary stats, test used, p-value, and test-specific values
    """
    group1, group2 = group_labels
    data1 = df[df[group_col] == group1][metric_col]
    data2 = df[df[group_col] == group2][metric_col]

    result = {
        'test_family': test_family,
        'variant': variant,
        'group_labels': group_labels,
        'alpha': alpha,
        'summary': {}
    }

    # --- Summary Stats ---
    result['summary'][group1] = {
        'n': len(data1),
        'mean': data1.mean(),
        'std': data1.std() if test_family in ['t_test', 'non_parametric'] else None,
        'sum': data1.sum() if test_family == 'z_test' else None
    }
    result['summary'][group2] = {
        'n': len(data2),
        'mean': data2.mean(),
        'std': data2.std() if test_family in ['t_test', 'non_parametric'] else None,
        'sum': data2.sum() if test_family == 'z_test' else None
    }

    # --- Binary Proportions (Z-Test) ---
    if test_family == 'z_test':
        x1, n1 = data1.sum(), len(data1)
        x2, n2 = data2.sum(), len(data2)
        p_pooled = (x1 + x2) / (n1 + n2)
        se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n1 + 1/n2))
        z_stat = (x2/n2 - x1/n1) / se
        p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))
        result.update({'test': 'z-test for proportions', 'z_stat': z_stat, 'p_value': p_value})

    # --- Continuous (T-Test) ---
    elif test_family == 't_test':
        if variant == 'independent':
            t_stat, p_value = stats.ttest_ind(data1, data2, equal_var=False)
            result.update({'test': 'independent t-test', 't_stat': t_stat, 'p_value': p_value})
        elif variant == 'paired':
            if len(data1) != len(data2):
                raise ValueError("Paired test requires equal-length matching samples.")
            t_stat, p_value = stats.ttest_rel(data1, data2)
            result.update({'test': 'paired t-test', 't_stat': t_stat, 'p_value': p_value})
        else:
            raise ValueError("Missing or invalid variant for t-test.")

    # --- Continuous (Non-parametric) ---
    elif test_family == 'non_parametric':
        u_stat, p_value = stats.mannwhitneyu(data1, data2, alternative='two-sided')
        result.update({'test': 'Mann-Whitney U Test', 'u_stat': u_stat, 'p_value': p_value})

    # --- Categorical (Chi-square) ---
    elif test_family == 'chi_square':
        contingency = pd.crosstab(df[group_col], df[metric_col])
        chi2, p_value, _, _ = stats.chi2_contingency(contingency)
        result.update({'test': 'chi-square test', 'chi2_stat': chi2, 'p_value': p_value})

    else:
        raise ValueError(f"โŒ Unsupported test_family: {test_family}")

    return result
Inย [36]:
result = run_ab_test(
    df=users,
    group_col='group',
    metric_col=test_config['outcome_metric_col'],
    group_labels=test_config['group_labels'],
    test_family=test_config['family'],
    variant=test_config.get('variant'),
    alpha=0.05
)
result
Out[36]:
{'test_family': 't_test',
 'variant': 'independent',
 'group_labels': ('control', 'treatment'),
 'alpha': 0.05,
 'summary': {'control': {'n': 510,
   'mean': 51.02515924585304,
   'std': 15.60949221088591,
   'sum': None},
  'treatment': {'n': 490,
   'mean': 50.31059463655223,
   'std': 15.354069556161138,
   'sum': None}},
 'test': 'independent t-test',
 't_stat': 0.7297273039756266,
 'p_value': 0.4657282187278262}

Back to the top ___

๐Ÿ“‰ Resultsยถ

๐Ÿงพ Summariesยถ

Inย [37]:
def summarize_ab_test_result(result):
    """
    Prints A/B test results summary with statistical test outputs and lift analysis.
    """
    test_family = result['test_family']
    variant = result.get('variant')
    group1, group2 = result['group_labels']
    p_value = result.get('p_value')
    alpha = result.get('alpha', 0.05)

    print("\n" + "="*45)
    print(f"๐Ÿงช A/B Test Result Summary [{test_family.upper()}]")
    print("="*45)

    # ---- Hypothesis Test Output ----
    print("\n๐Ÿ“Š Hypothesis Test Result")
    print(f"Test used: {result.get('test', 'N/A')}")
    if 'z_stat' in result:
        print(f"Z-statistic: {result['z_stat']:.4f}")
    elif 't_stat' in result:
        print(f"T-statistic: {result['t_stat']:.4f}")
    elif 'chi2_stat' in result:
        print(f"Chi2-statistic: {result['chi2_stat']:.4f}")
    elif 'u_stat' in result:
        print(f"U-statistic: {result['u_stat']:.4f}")

    if p_value is not None:
        print(f"P-value    : {p_value:.4f}")
        print("โœ… Statistically significant difference detected." if p_value < alpha else "๐Ÿšซ No significant difference detected.")
    else:
        print("โš ๏ธ P-value not found.")

    # ---- Summary Table ----
    print("\n๐Ÿ“‹ Group Summary:\n")
    display(pd.DataFrame(result['summary']).T)

    # ---- Lift Analysis (for Z-test or T-test (independent)) ----
    if test_family in ['z_test', 't_test'] and (variant == 'independent' or test_family == 'z_test'):
        group1_mean = result['summary'][group1]['mean']
        group2_mean = result['summary'][group2]['mean']
        lift = group2_mean - group1_mean
        pct_lift = lift / group1_mean if group1_mean else np.nan

        print("\n๐Ÿ“ˆ Lift Analysis")
        print(f"- Absolute Lift   : {lift:.4f}")
        print(f"- Percentage Lift : {pct_lift:.2%}")

        try:
            n1 = result['summary'][group1]['n']
            n2 = result['summary'][group2]['n']

            if test_family == 'z_test':
                se = np.sqrt(group1_mean * (1 - group1_mean) / n1 + group2_mean * (1 - group2_mean) / n2)
            else:
                sd1 = result['summary'][group1].get('std')
                sd2 = result['summary'][group2].get('std')
                se = np.sqrt((sd1 ** 2) / n1 + (sd2 ** 2) / n2)

            z = 1.96
            ci_low = lift - z * se
            ci_high = lift + z * se
            print(f"- 95% CI for Lift : [{ci_low:.4f}, {ci_high:.4f}]")
        except Exception as e:
            print(f"โš ๏ธ Could not compute confidence interval: {e}")

    print("="*45 + "\n")
Inย [38]:
summarize_ab_test_result(result)
=============================================
๐Ÿงช A/B Test Result Summary [T_TEST]
=============================================

๐Ÿ“Š Hypothesis Test Result
Test used: independent t-test
T-statistic: 0.7297
P-value    : 0.4657
๐Ÿšซ No significant difference detected.

๐Ÿ“‹ Group Summary:

n mean std sum
control 510.0 51.025159 15.609492 NaN
treatment 490.0 50.310595 15.354070 NaN
๐Ÿ“ˆ Lift Analysis
- Absolute Lift   : -0.7146
- Percentage Lift : -1.40%
- 95% CI for Lift : [-2.6338, 1.2047]
=============================================


๐Ÿ“Š Visualizationยถ

Inย [39]:
def plot_ab_test_results(result):
    """
    Plots A/B test results by group mean or distribution depending on test family.
    """
    test_family = result['test_family']
    variant = result.get('variant')
    group1, group2 = result['group_labels']

    print("\n๐Ÿ“Š Visualization:")

    if test_family in ['z_test', 't_test', 'non_parametric']:
        labels = [group1, group2]
        values = [result['summary'][group1]['mean'], result['summary'][group2]['mean']]
        plt.bar(labels, values, color=['gray', 'skyblue'])

        for i, val in enumerate(values):
            label = f"{val:.2%}" if test_family == 'z_test' else f"{val:.2f}"
            plt.text(i, val + 0.01, label, ha='center')

        ylabel = "Conversion Rate" if test_family == 'z_test' else "Average Value"
        plt.ylabel(ylabel)
        plt.title(f"{ylabel} by Group")
        plt.ylim(0, max(values) * 1.2)
        plt.grid(axis='y', linestyle='--', alpha=0.6)
        plt.show()

    elif test_family == 'chi_square':
        dist = pd.DataFrame(result['summary'])
        dist.T.plot(kind='bar', stacked=True)
        plt.title(f"Categorical Distribution by Group")
        plt.ylabel("Proportion")
        plt.xlabel("Group")
        plt.grid(axis='y', linestyle='--', alpha=0.6)
        plt.show()
Inย [40]:
plot_ab_test_results(result)
๐Ÿ“Š Visualization:
[Figure: Average Value by Group]

๐ŸŽฏ 95% Confidence Intervals
for the outcome in each group

๐Ÿ“– Click to Expand
  • The 95% confidence interval gives a range in which we expect the true conversion rate to fall for each group.
  • If the confidence intervals do not overlap, it's strong evidence that the difference is statistically significant.
  • If they do overlap, it doesn't guarantee insignificance โ€” you still need the p-value to decide โ€” but it suggests caution when interpreting lift.
Inย [41]:
def plot_confidence_intervals(result, z=1.96):
    """
    Plot 95% confidence intervals for group means (conversion rate or continuous).
    """
    test_family = result['test_family']
    variant = result.get('variant')
    group1, group2 = result['group_labels']
    summary = result['summary']

    if test_family not in ['z_test', 't_test']:
        print(f"โš ๏ธ CI plotting not supported for test family: {test_family}")
        return
    if test_family == 't_test' and variant != 'independent':
        print(f"โš ๏ธ CI plotting only supported for independent t-tests.")
        return

    p1, p2 = summary[group1]['mean'], summary[group2]['mean']
    n1, n2 = summary[group1]['n'], summary[group2]['n']

    if test_family == 'z_test':
        se1 = np.sqrt(p1 * (1 - p1) / n1)
        se2 = np.sqrt(p2 * (1 - p2) / n2)
        ylabel = "Conversion Rate"
    else:
        sd1 = summary[group1]['std']
        sd2 = summary[group2]['std']
        se1 = sd1 / np.sqrt(n1)
        se2 = sd2 / np.sqrt(n2)
        ylabel = "Mean Outcome"

    ci1 = (p1 - z * se1, p1 + z * se1)
    ci2 = (p2 - z * se2, p2 + z * se2)

    plt.errorbar([group1, group2],
                 [p1, p2],
                 yerr=[[p1 - ci1[0], p2 - ci2[0]], [ci1[1] - p1, ci2[1] - p2]],
                 fmt='o', capsize=10, color='black')
    plt.ylabel(ylabel)
    plt.title(f"{ylabel} with 95% Confidence Intervals")
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.show()
Inย [42]:
plot_confidence_intervals(result)
[Figure: Mean Outcome with 95% Confidence Intervals]

๐Ÿ“ˆ Lift Analysis
AKA the 95% confidence interval for the difference in outcomes

๐Ÿ“– Click to Expand

This confidence interval helps quantify uncertainty around the observed lift between treatment and control groups. It answers:

  • How large is the difference between groups?
  • How confident are we in this lift estimate?

We compute a 95% CI for the difference in means (or proportions), not just for each group. If this interval does not include 0, we can reasonably trust there's a true difference. If it does include 0, the observed difference might be due to random chance.

This complements the p-value โ€” while p-values tell us if the difference is significant, CIs tell us how big the effect is, and how uncertain we are.

Inย [43]:
def compute_lift_confidence_interval(result):
    """
    Compute CI for lift in binary or continuous-independent tests.
    """
    test_family = result['test_family']
    variant = result.get('variant')
    group1, group2 = result['group_labels']
    alpha = result.get('alpha', 0.05)
    z = 1.96

    print("\n" + "="*45)
    print(f"๐Ÿ“ˆ 95% CI for Difference in Outcome [{test_family}]")
    print("="*45)

    if test_family == 'z_test' or (test_family == 't_test' and variant == 'independent'):
        m1 = result['summary'][group1]['mean']
        m2 = result['summary'][group2]['mean']
        lift = m2 - m1
        n1 = result['summary'][group1]['n']
        n2 = result['summary'][group2]['n']

        if test_family == 'z_test':
            se = np.sqrt(m1 * (1 - m1) / n1 + m2 * (1 - m2) / n2)
        else:
            sd1 = result['summary'][group1]['std']
            sd2 = result['summary'][group2]['std']
            se = np.sqrt((sd1 ** 2) / n1 + (sd2 ** 2) / n2)

        ci_low = lift - z * se
        ci_high = lift + z * se

        print(f"- Absolute Lift         : {lift:.4f}")
        print(f"- 95% Confidence Interval: [{ci_low:.4f}, {ci_high:.4f}]")

        if ci_low > 0:
            print("โœ… Likely positive impact (CI > 0)")
        elif ci_high < 0:
            print("๐Ÿšซ Likely negative impact (CI < 0)")
        else:
            print("๐Ÿคท CI includes 0 โ€” not statistically significant.")

    elif test_family == 't_test' and variant == 'paired':
        print("- Paired test: CI already accounted for in test logic.")

    elif test_family == 'chi_square':
        print("- Categorical test: per-category lift analysis required (not implemented).")

    print("="*45 + "\n")
Inย [44]:
compute_lift_confidence_interval(result)
=============================================
๐Ÿ“ˆ 95% CI for Difference in Outcome [t_test]
=============================================
- Absolute Lift         : -0.7146
- 95% Confidence Interval: [-2.6338, 1.2047]
๐Ÿคท CI includes 0 โ€” not statistically significant.
=============================================


โœ… Final Conclusionยถ

Inย [45]:
def print_final_ab_test_summary(result):
    """
    Final wrap-up of results with summary stats and verdict.
    """
    test_family = result['test_family']
    variant = result.get('variant')
    group1, group2 = result['group_labels']
    p_value = result.get('p_value')
    alpha = result.get('alpha', 0.05)

    print("="*40)
    print("          ๐Ÿ“Š FINAL A/B TEST SUMMARY")
    print("="*40)

    if test_family == 'z_test' or (test_family == 't_test' and variant == 'independent'):
        mean1 = result['summary'][group1]['mean']
        mean2 = result['summary'][group2]['mean']
        lift = mean2 - mean1
        pct_lift = lift / mean1 if mean1 else np.nan

        label = "Conversion rate" if test_family == 'z_test' else "Avg outcome"
        test_name = result.get("test", "A/B test")

        print(f"๐Ÿ‘ฅ  {group1.capitalize()} {label:<20}:  {mean1:.4f}")
        print(f"๐Ÿงช  {group2.capitalize()} {label:<20}:  {mean2:.4f}")
        print(f"๐Ÿ“ˆ  Absolute lift              :  {lift:.4f}")
        print(f"๐Ÿ“Š  Percentage lift            :  {pct_lift:.2%}")
        print(f"๐Ÿงช  P-value (from {test_name}) :  {p_value:.4f}")

    elif test_family == 't_test' and variant == 'paired':
        print("๐Ÿงช Paired T-Test was used to compare within-user outcomes.")
        print(f"๐Ÿงช P-value: {p_value:.4f}")

    elif test_family == 'chi_square':
        print("๐Ÿงช Chi-square test was used to compare categorical distributions.")
        print(f"๐Ÿงช P-value: {p_value:.4f}")

    else:
        print("โš ๏ธ Unsupported test type.")

    print("-" * 40)

    if p_value is not None:
        if p_value < alpha:
            print("โœ… RESULT: Statistically significant difference detected.")
        else:
            print("โŒ RESULT: No statistically significant difference detected.")
    else:
        print("โš ๏ธ No p-value available.")

    print("="*40 + "\n")
Inย [46]:
print_final_ab_test_summary(result)
========================================
          ๐Ÿ“Š FINAL A/B TEST SUMMARY
========================================
๐Ÿ‘ฅ  Control Avg outcome         :  51.0252
๐Ÿงช  Treatment Avg outcome         :  50.3106
๐Ÿ“ˆ  Absolute lift              :  -0.7146
๐Ÿ“Š  Percentage lift            :  -1.40%
๐Ÿงช  P-value (from independent t-test) :  0.4657
----------------------------------------
โŒ RESULT: No statistically significant difference detected.
========================================

Back to the top ___

โฑ๏ธ How Long

to run the test?

๐Ÿ“– Click to Expand

The duration of an A/B test depends on how quickly you reach the required sample size per group, as estimated during your power analysis.

โœ… Key Inputs
  • Daily volume of eligible observations (users, sessions, or orders โ€” depends on your unit of analysis)
  • Required sample size per group (from power analysis)
  • Traffic split ratio (e.g., 50/50, 10/90, 33/33/33)
๐Ÿงฎ Formula
Test Duration (in days) =
Required Sample Size per Group รท (Daily Eligible Observations ร— Group Split Proportion)

This ensures the experiment runs long enough to detect the expected effect with the desired confidence and power.
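
For example, using this notebook's own numbers (154 required users per group, 1,000 eligible users per day, a 50/50 split): 154 ÷ (1,000 × 0.5) = 0.31, which rounds up to 1 day of data collection before any ramp-up or buffer days are added.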

๐Ÿ’ก Planning Tips
  1. Estimate required sample size using power analysis (based on effect size, baseline, alpha, and power)
  2. Understand your traffic:
    • Whatโ€™s your average daily eligible traffic?
    • What unit of analysis is used (user, session, impression)?
  3. Apply group split:
    • e.g., for a 50/50 A/B test, each group gets 50% of traffic
  4. Estimate days using the formula above.
๐Ÿง  Real-World Considerations
  • โœ… Ramp-Up Period
    Gradually increase traffic exposure: 5% โ†’ 25% โ†’ 50% โ†’ full traffic.
    Helps catch bugs, stability issues, and confounding edge cases early.
  • โœ… Cool-Down Buffer
    Avoid ending tests on weekends, holidays, or during unusual traffic spikes.
    Add buffer days so your conclusions arenโ€™t skewed by anomalies.
  • โœ… Trust Checks Before Analysis
    • A/A testing to verify setup
    • SRM checks to confirm user distribution
    • Monitor guardrail metrics (e.g., bounce rate, latency, load time)
๐Ÿ—ฃ๏ธ Common Practitioner Advice
โ€œWe calculate sample size using power analysis, then divide by daily traffic per group. But we always factor in buffer days โ€” for ramp-up, trust checks, and stability. Better safe than sorry.โ€

โ€œPower analysis is the starting point. But we donโ€™t blindly stop when we hit N. We monitor confidence intervals, metric stability, and coverage to make sure weโ€™re making decisions the business can trust.โ€
Inย [47]:
def estimate_test_duration(
    required_sample_size_per_group,
    daily_eligible_users,
    allocation_ratios=(0.5, 0.5),
    buffer_days=2,
    test_family=None  # renamed from experiment_type
):
    """
    Estimate test duration based on sample size, traffic, and allocation.

    Parameters:
    - required_sample_size_per_group: int
    - daily_eligible_users: int โ€” total incoming traffic per day
    - allocation_ratios: tuple โ€” traffic share per group (e.g., 50/50)
    - buffer_days: int โ€” extra time for ramp-up or anomalies
    - test_family: str โ€” optional metadata for clarity

    Returns:
    - dict with group durations and total estimated runtime
    """
    group_durations = []
    for alloc in allocation_ratios:
        users_per_day = daily_eligible_users * alloc
        days = required_sample_size_per_group / users_per_day if users_per_day else float('inf')
        group_durations.append(np.ceil(days))

    longest_group_runtime = int(max(group_durations))
    total_with_buffer = longest_group_runtime + buffer_days

    print("\n๐Ÿงฎ Estimated Test Duration")
    if test_family:
        print(f"- Test family               : {test_family}")
    print(f"- Required sample per group : {required_sample_size_per_group}")
    print(f"- Daily eligible traffic    : {daily_eligible_users}")
    print(f"- Allocation ratio          : {allocation_ratios}")
    print(f"- Longest group runtime     : {longest_group_runtime} days")
    print(f"- Buffer days               : {buffer_days}")
    print(f"โœ… Total estimated duration : {total_with_buffer} days\n")

    return {
        'test_family': test_family,
        'per_group_days': group_durations,
        'longest_group_runtime': longest_group_runtime,
        'recommended_total_duration': total_with_buffer
    }
Inย [48]:
daily_eligible_users = 1000
allocation_ratios = (0.5, 0.5)
buffer_days = 2

test_duration_result = estimate_test_duration(
    required_sample_size_per_group=test_config['required_sample_size'],
    daily_eligible_users=daily_eligible_users,
    allocation_ratios=allocation_ratios,
    buffer_days=buffer_days,
    test_family=test_config['family']
)
test_duration_result
๐Ÿงฎ Estimated Test Duration
- Test family               : t_test
- Required sample per group : 154
- Daily eligible traffic    : 1000
- Allocation ratio          : (0.5, 0.5)
- Longest group runtime     : 1 days
- Buffer days               : 2
โœ… Total estimated duration : 3 days

Out[48]:
{'test_family': 't_test',
 'per_group_days': [1.0, 1.0],
 'longest_group_runtime': 1,
 'recommended_total_duration': 3}

๐Ÿงญ Monitoring Dashboard Components

๐Ÿ“– Click to Expand
  • Overall Test Health
    • Start/end date, traffic ramp-up %, time remaining
    • SRM (Sample Ratio Mismatch) indicator
    • P-value and effect size summary (updated daily)
  • Primary Metric Tracking
    • Daily trends for primary outcome (conversion, revenue, etc.)
    • Cumulative lift + confidence intervals
    • Statistical significance tracker (p-value, test stat)
  • Guardrail Metrics
    • Bounce rate, load time, checkout errors, etc.
    • Alert thresholds (e.g., +10% increase in latency)
    • Trend vs baseline and prior experiments
  • Segment Drilldowns
    • Platform (iOS vs Android), geography, user tier
    • Detect heterogeneous treatment effects
    • Option to toggle test results per segment
  • Cohort Coverage
    • Total users assigned vs eligible
    • Daily inclusion and exclusion trends
    • Debugging filters (e.g., why user X didnโ€™t get assigned)
  • Variance & Stability Checks
    • Volatility of key metrics
    • Pre vs post baseline comparisons
    • Funnel conversion variance analysis
  • Notes & Annotations
    • Manual tagging of major incidents (e.g., bug fix deployed, pricing change)
    • Timeline of changes affecting experiment interpretation
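
To make one of these components concrete, here is a minimal sketch of a cumulative-lift tracker. It assumes a hypothetical day column (day of exposure) that the sample data in this notebook does not include; the group and metric conventions are the same as above.

import numpy as np
import pandas as pd

def cumulative_lift_by_day(df, day_col='day', group_col='group',
                           metric_col='engagement_score',
                           group_labels=('control', 'treatment'), z=1.96):
    """Cumulative lift (treatment - control) with a 95% CI, recomputed daily."""
    rows = []
    for d in sorted(df[day_col].unique()):
        upto = df[df[day_col] <= d]  # all observations collected up to day d
        g1 = upto[upto[group_col] == group_labels[0]][metric_col]
        g2 = upto[upto[group_col] == group_labels[1]][metric_col]
        lift = g2.mean() - g1.mean()
        se = np.sqrt(g1.var(ddof=1) / len(g1) + g2.var(ddof=1) / len(g2))
        rows.append({'day': d, 'cumulative_lift': lift,
                     'ci_low': lift - z * se, 'ci_high': lift + z * se})
    return pd.DataFrame(rows)

# Usage (hypothetical): plot cumulative_lift_by_day(users_with_day) to watch the
# lift estimate stabilize as the test accumulates traffic.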

Back to the top ___

๐Ÿ” Post Hoc Analysis

๐Ÿ“– Click to Expand
After statistical significance, post-hoc analysis helps connect results to business confidence.
It's not just did it work โ€” but how, for whom, and at what cost or benefit?

๐Ÿง  Why Post Hoc Analysis Matters

  • Segments may respond differently โ€” average lift may hide underperformance in subgroups
  • Guardrails may show collateral damage (e.g., slower load time, higher churn)
  • Stakeholders need impact translation โ€” what does this mean in revenue, retention, or strategy?

๐Ÿ”Ž Typical Post Hoc Questions

  • Segment Lift
    • Did certain platforms, geos, cohorts, or user types benefit more?
    • Any negative lift in high-value user segments?
  • Guardrail Checks
    • Did the treatment impact non-primary metrics (e.g., latency, engagement, bounce rate)?
    • Were alert thresholds breached?
  • Business Impact Simulation
    • How does the observed lift scale to 100% of eligible users?
    • Whatโ€™s the projected change in conversions, revenue, or user satisfaction?
  • Edge Case Discovery
    • Any bugs, instrumentation gaps, or unexpected usage patterns?
    • Did any user types get excluded disproportionately?

๐Ÿ“Š What to Report

  • Segment Analysis: table or chart showing lift per segment, sorted by effect size or risk
  • Guardrail Metrics: summary table of guardrails vs baseline, with thresholds or annotations
  • Revenue Simulation: projected uplift × traffic volume × conversion = business impact
  • Confidence Range: 95% CI for key metrics per segment (wherever possible)
  • Rollout Readiness: any blockers, mitigations, or next steps if full rollout is considered

๐Ÿ’ก Pro Tip
Even if your p-value says โ€œyes,โ€ business rollout is a risk-based decision.
Post-hoc analysis is where statistical rigor meets product judgment.

๐Ÿงฉ Segmented Lift

๐Ÿ“– Click to Expand

Segmented lift tells us how different user segments responded to the treatment.

Why It Matters:

  • Uncovers hidden heterogeneity โ€” The overall average might mask variation across platforms, geographies, or user tiers.
  • Identifies high-risk or high-reward cohorts โ€” Some segments might benefit more, while others could be negatively impacted.
  • Guides rollout and targeting decisions โ€” Helps decide where to prioritize feature exposure, or where to mitigate risk.

Typical Segments:

  • Device type (e.g., mobile vs desktop)
  • Region (e.g., North vs South)
  • User lifecycle (e.g., new vs returning)
  • Platform (e.g., iOS vs Android)
"Segmentation answers who is benefiting (or suffering) โ€” not just whether it worked on average."
Inย [49]:
def visualize_segment_lift(df_segment, segment_col):
    """
    Plots horizontal bar chart of mean lift per segment (Treatment - Control).
    """
    df_viz = df_segment.dropna(subset=['lift']).sort_values(by='lift', ascending=False)
    if df_viz.empty:
        print(f"โš ๏ธ No lift data to visualize for '{segment_col}'\n")
        return

    plt.figure(figsize=(8, 0.4 * len(df_viz) + 2))
    bars = plt.barh(df_viz[segment_col], df_viz['lift'], color='skyblue')
    for bar, val in zip(bars, df_viz['lift']):
        plt.text(bar.get_width(), bar.get_y() + bar.get_height()/2, f"{val:.2f}", va='center', ha='left', fontsize=9)
    plt.axvline(0, color='gray', linestyle='--')
    plt.title(f"Lift from Control to Treatment by {segment_col}")
    plt.xlabel("Mean Difference (Treatment โ€“ Control)")
    plt.grid(axis='x', linestyle='--', alpha=0.6)
    plt.tight_layout()
    plt.show()
Inย [50]:
def analyze_segment_lift(
    df,
    test_config,
    segment_cols=['platform', 'device_type', 'user_tier', 'region'],
    min_count_per_group=30,
    visualize=True
):
    """
    Post-hoc lift analysis per segment (e.g., by platform or region).
    """

    group_col = 'group'
    group1, group2 = test_config['group_labels']
    metric_col = test_config['outcome_metric_col']
    outcome_type = test_config['outcome_metric_datatype']
    variant = test_config['variant']
    test_family = test_config['family']

    for segment in segment_cols:
        print(f"\n๐Ÿ”Ž Segmenting by: {segment}")
        seg_data = []

        for val in df[segment].dropna().unique():
            subset = df[df[segment] == val]
            g1 = subset[subset[group_col] == group1][metric_col]
            g2 = subset[subset[group_col] == group2][metric_col]

            if len(g1) < min_count_per_group or len(g2) < min_count_per_group:
                print(f"โš ๏ธ Skipping '{val}' under '{segment}' โ€” too few users.")
                continue

            lift = g2.mean() - g1.mean()
            p_value = None

            if test_family == 'z_test':
                # Binary: z-test on proportions
                p1, n1 = g1.mean(), len(g1)
                p2, n2 = g2.mean(), len(g2)
                pooled_p = (g1.sum() + g2.sum()) / (n1 + n2)
                se = np.sqrt(pooled_p * (1 - pooled_p) * (1/n1 + 1/n2))
                p_value = 2 * (1 - stats.norm.cdf(abs((p2 - p1) / se)))

            elif test_family == 't_test':
                if variant == 'independent':
                    _, p_value = stats.ttest_ind(g1, g2)
                elif variant == 'paired':
                    print(f"โš ๏ธ Paired test not supported in segmented lift โ€” skipped '{val}' under '{segment}'.")
                    lift, p_value = np.nan, None

            elif test_family == 'chi_square':
                print(f"โš ๏ธ Categorical data โ€” lift not defined for '{val}' in '{segment}'.")
                lift, p_value = np.nan, None

            seg_data.append({
                segment: val,
                'count_control': len(g1),
                'count_treatment': len(g2),
                'mean_control': g1.mean(),
                'mean_treatment': g2.mean(),
                'std_control': g1.std(),
                'std_treatment': g2.std(),
                'lift': lift,
                'p_value_lift': p_value
            })

        df_segment = pd.DataFrame(seg_data)
        display(df_segment)

        if visualize:
            visualize_segment_lift(df_segment, segment)
Inย [51]:
analyze_segment_lift(
    df=users,
    test_config=test_config,
    segment_cols=['platform', 'device_type', 'user_tier', 'region'],
    min_count_per_group=30,
    visualize=True
)
๐Ÿ”Ž Segmenting by: platform
platform count_control count_treatment mean_control mean_treatment std_control std_treatment lift p_value_lift
0 iOS 306 283 51.473511 51.038554 15.720604 15.505342 -0.434958 0.735711
1 Android 204 207 50.352631 49.315366 15.455377 15.125414 -1.037265 0.492071
[Figure: Lift from Control to Treatment by platform]
๐Ÿ”Ž Segmenting by: device_type
device_type count_control count_treatment mean_control mean_treatment std_control std_treatment lift p_value_lift
0 mobile 343 346 50.996739 50.737705 15.483445 15.166347 -0.259034 0.824508
1 desktop 167 144 51.083532 49.284344 15.912052 15.802427 -1.799188 0.319324
[Figure: Lift from Control to Treatment by device_type]
๐Ÿ”Ž Segmenting by: user_tier
user_tier count_control count_treatment mean_control mean_treatment std_control std_treatment lift p_value_lift
0 new 188 192 50.270166 52.095652 15.670024 15.297974 1.825486 0.251242
1 returning 322 298 51.465963 49.160490 15.581513 15.305550 -2.305472 0.063864
[Figure: Lift from Control to Treatment by user_tier]
๐Ÿ”Ž Segmenting by: region
region count_control count_treatment mean_control mean_treatment std_control std_treatment lift p_value_lift
0 North 135 120 51.205860 51.554078 15.118047 15.206938 0.348218 0.854882
1 South 147 113 50.736052 50.185396 16.563319 15.340048 -0.550655 0.784045
2 West 106 126 50.974336 50.275715 13.928319 15.292805 -0.698620 0.718471
3 East 122 131 51.217713 49.313070 16.501732 15.653598 -1.904643 0.347038
[Figure: Lift from Control to Treatment by region]

๐Ÿšฆ Guardrail Metrics

๐Ÿ“– Click to Expand

Guardrail metrics are non-primary metrics tracked during an experiment to ensure the feature doesn't create unintended negative consequences.

We monitor them alongside the main success metric to:

  • ๐Ÿ“‰ Catch regressions in user behavior or system performance
  • ๐Ÿ” Detect trade-offs (e.g., conversion โ†‘ but bounce rate โ†‘ too)
  • ๐Ÿ›‘ Block rollouts if a feature does more harm than good
๐Ÿงช How We Check
  • Run statistical tests on each guardrail metric just like we do for the primary metric
  • Use the same experiment type (binary, continuous, etc.) for evaluation
  • Report p-values and lift to assess significance and direction
  • Focus more on risk detection than optimization
๐Ÿ“Š Common Guardrail Metrics
  • UX Health: Bounce Rate, Session Length, Engagement
  • Performance: Page Load Time, API Latency, CPU Usage
  • Reliability: Error Rate, Crash Rate, Timeout Errors
  • Behavioral: Scroll Depth, Page Views per Session
โœ… When to Act
  • If the treatment significantly worsens a guardrail metric โ†’ investigate
  • If the primary metric improves but guardrails suffer, assess trade-offs
  • Use p-values, lift, and domain context to guide decision-making
๐Ÿง  Why Guardrails Matter
โ€œWe donโ€™t just care if a metric moves โ€” we care what else it moved. Guardrails give us confidence that improvements arenโ€™t hiding regressions elsewhere.โ€
Inย [52]:
# Quick average check by group
guardrail_avg = users.groupby('group')['bounce_rate'].mean()

print("๐Ÿšฆ Average Bounce Rate by Group:")
for grp, val in guardrail_avg.items():
    print(f"- {grp}: {val:.4f}")
๐Ÿšฆ Average Bounce Rate by Group:
- control: 0.5464
- treatment: 0.5679
Inย [53]:
def evaluate_guardrail_metric(
    df,
    test_config,
    guardrail_metric_col='bounce_rate',
    alpha=0.05
):
    """
    Checks for statistically significant changes in guardrail metric (e.g., bounce rate).

    Parameters:
    - df : pd.DataFrame โ€” experiment dataset
    - test_config : dict โ€” contains group info, variant, etc.
    - guardrail_metric_col : str โ€” column name of guardrail metric
    - alpha : float โ€” significance level (default 0.05)

    Returns:
    - None (prints result)
    """

    group_col = 'group'
    control, treatment = test_config['group_labels']

    control_vals = df[df[group_col] == control][guardrail_metric_col]
    treatment_vals = df[df[group_col] == treatment][guardrail_metric_col]

    mean_control = control_vals.mean()
    mean_treatment = treatment_vals.mean()
    diff = mean_treatment - mean_control

    t_stat, p_val = ttest_ind(treatment_vals, control_vals)

    print(f"\n๐Ÿšฆ Guardrail Metric Check โ†’ '{guardrail_metric_col}'\n")
    print(f"- {control:10}: {mean_control:.4f}")
    print(f"- {treatment:10}: {mean_treatment:.4f}")
    print(f"- Difference   : {diff:+.4f}")
    print(f"- P-value (t-test): {p_val:.4f}")

    if p_val < alpha:
        if diff > 0:
            print("โŒ Significant *increase* โ€” potential negative impact on guardrail.")
        else:
            print("โœ… Significant *decrease* โ€” potential positive impact.")
    else:
        print("๐ŸŸก No statistically significant change โ€” guardrail looks stable.")
Inย [54]:
evaluate_guardrail_metric(
    df=users,
    test_config=test_config,
    guardrail_metric_col='bounce_rate',
    alpha=0.05
)
๐Ÿšฆ Guardrail Metric Check โ†’ 'bounce_rate'

- control   : 0.5464
- treatment : 0.5679
- Difference   : +0.0215
- P-value (t-test): 0.0325
โŒ Significant *increase* โ€” potential negative impact on guardrail.

๐Ÿง  Correcting for Multiple Comparisonsยถ

๐Ÿ“– Why p-values can't always be trusted

When we test multiple segments, multiple metrics or multiple variants, we increase the risk of false positives (Type I errors). This is known as the Multiple Comparisons Problem โ€” and itโ€™s dangerous in data-driven decision-making.

๐Ÿ“‰ Example Scenario:ยถ

We run A/B tests on:

  • Overall population โœ…
  • By platform โœ…
  • By user tier โœ…
  • By region โœ…

If we test 10 hypotheses at the 0.05 significance level, the chance of at least one false positive is 1 − 0.95^10 ≈ 40%.
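
A quick sanity check of that number (a minimal sketch; it assumes the 10 tests are independent and every null hypothesis is true):

# Probability of at least one false positive across m independent tests
alpha, m = 0.05, 10
p_at_least_one = 1 - (1 - alpha) ** m
print(f"{p_at_least_one:.1%}")  # ~40.1%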

โœ… Correction Methodsยถ
  • Bonferroni: very strict; controls the Family-Wise Error Rate (FWER) (❄️ conservative)
  • Benjamini-Hochberg: controls the False Discovery Rate (FDR) (🔥 balanced)
๐Ÿง  In Practice:ยถ

We calculate raw p-values for each segment, and then apply corrections to get adjusted p-values.

If even the adjusted p-values are significant โ†’ result is robust.

โ„๏ธ Bonferroni Correctionยถ
๐Ÿ“– FWER Control (Click to Expand) Bonferroni is the most **conservative** correction method. It adjusts the p-value threshold by dividing it by the number of comparisons.
  • Formula: adjusted_alpha = alpha / num_tests
  • Equivalently: adjusted_p = p * num_tests (capped at 1)
  • If an adjusted p-value is still below 0.05, the effect is very likely real

๐Ÿ“Œ Best for: High-risk decisions (e.g., medical trials, irreversible launches)
โš ๏ธ Drawback: May miss true positives (higher Type II error)

๐Ÿ”ฌ Benjamini-Hochberg (BH) Procedureยถ
๐Ÿ“– FDR Control (Click to Expand)

BH controls the expected proportion of false discoveries (i.e., false positives among all positives). It:

  • Ranks p-values from smallest to largest
  • Compares each to (i/m) * alpha, where:
    • i = rank
    • m = total number of tests

๐Ÿง  Important: After adjustment, BH enforces monotonicity by capping earlier (smaller) ranks to not exceed later ones.

In simple terms: adjusted p-values can only decrease as rank increases.

The largest p-value that satisfies this inequality becomes the threshold โ€” all smaller p-values are considered significant.

๐Ÿ“Œ Best for: Exploratory research, product experiments with many segments
๐Ÿ’ก Advantage: More power than Bonferroni, still controls errors

Inย [55]:
import pandas as pd
from statsmodels.stats.multitest import multipletests

# Original inputs
segment_names = ['North', 'South', 'East', 'West']
p_vals = [0.03, 0.06, 0.02, 0.10]

# Create DataFrame and sort by raw p-values BEFORE correction
df = pd.DataFrame({
    'Segment': segment_names,
    'Raw_pValue': p_vals
}).sort_values('Raw_pValue').reset_index(drop=True)

# Apply corrections to the sorted p-values
_, bonf, _, _ = multipletests(df['Raw_pValue'], alpha=0.05, method='bonferroni')
_, bh, _, _ = multipletests(df['Raw_pValue'], alpha=0.05, method='fdr_bh')

# Add to DataFrame
df['Bonferroni_Adj_pValue'] = bonf
df['BH_Adj_pValue'] = bh
df
Out[55]:
Segment Raw_pValue Bonferroni_Adj_pValue BH_Adj_pValue
0 East 0.02 0.08 0.06
1 North 0.03 0.12 0.06
2 South 0.06 0.24 0.08
3 West 0.10 0.40 0.10
Inย [56]:
# Plot p values - raw and adjusted

plt.figure(figsize=(8, 5))

# Plot lines
plt.plot(df.index + 1, df['Raw_pValue'], marker='o', label='Raw p-value')
plt.plot(df.index + 1, df['Bonferroni_Adj_pValue'], marker='^', label='Bonferroni Adj p-value')
plt.plot(df.index + 1, df['BH_Adj_pValue'], marker='s', label='BH Adj p-value')

# Add value labels next to each point
for i in range(len(df)):
    x = i + 1
    plt.text(x + 0.05, df['Raw_pValue'][i], f"{df['Raw_pValue'][i]:.2f}", va='center')
    plt.text(x + 0.05, df['Bonferroni_Adj_pValue'][i], f"{df['Bonferroni_Adj_pValue'][i]:.2f}", va='center')
    plt.text(x + 0.05, df['BH_Adj_pValue'][i], f"{df['BH_Adj_pValue'][i]:.2f}", va='center')

# Axis & labels
plt.xticks(df.index + 1, df['Segment']);
plt.axhline(0.05, color='gray', linestyle='--', label='ฮฑ = 0.05');
plt.xlabel("Segment (Ranked by Significance)");
plt.ylabel("p-value");
plt.title("p-value Correction: Bonferroni vs Benjamini Hochberg (FDR)");
plt.legend();
plt.tight_layout();
plt.show();
[Figure: p-value Correction: Bonferroni vs Benjamini Hochberg (FDR)]

๐Ÿช„ Novelty Effects & Behavioral Decay

๐Ÿ“– Why First Impressions Might Lie (Click to Expand)
๐Ÿช„ Novelty Effects & Behavioral Decayยถ

Even if an A/B test shows a statistically significant lift, that improvement may not last.

This often happens due to novelty effects โ€” short-term spikes in engagement driven by:

  • Curiosity (โ€œWhatโ€™s this new feature?โ€)
  • Surprise (โ€œThis looks different!โ€)
  • Visual attention (e.g., placement or color changes)
๐Ÿ“‰ Common Signs of Novelty Effectsยถ
  • Strong lift in week 1 โ†’ drops by week 3.
  • High initial usage โ†’ no long-term retention.
  • Positive metrics in one segment only (e.g., โ€œnew usersโ€).
๐Ÿงญ What We Do About Itยถ

To address this risk during rollouts:

  • โœ… Monitor metrics over time post-launch (e.g., 7, 14, 28-day retention)
  • โœ… Compare results across early adopters vs late adopters
  • โœ… Run holdout experiments during phased rollout to detect fading impact

๐ŸŽฏ Primacy Effect & Order Bias

๐Ÿ“– When First = Best (Click to Expand)

Sometimes, the position of a variant or option can distort results โ€” especially if it's shown first. This is called the primacy effect, a type of cognitive bias.

It often shows up in:

  • Feed ranking or content ordering experiments
  • Option selection (e.g., first dropdown item)
  • Surveys or in-app prompts
๐Ÿšฉ Common Indicatorsยถ
  • Variant A always performs better regardless of content
  • Metrics drop when position is swapped
  • Discrepancy between test and real-world usage
๐Ÿงญ What We Do About Itยถ

To minimize primacy bias:

  • โœ… Randomize order of options or content
  • โœ… Use position-aware metrics (e.g., click-through by slot)
  • โœ… Validate with follow-up tests using rotated or reversed orders

๐ŸŽฒ Rollout Simulation

๐Ÿ“– Click to Expand

Once statistical significance is established, it's useful to simulate business impact from full rollout.

Assume full exposure to eligible daily traffic, and estimate incremental impact from the observed lift.

This helps stakeholders understand the real-world benefit of implementing the change.

We typically estimate:

  • ๐Ÿ“ˆ Daily lift (e.g., additional conversions, dollars, sessions)
  • ๐Ÿ“ˆ Monthly extrapolation (daily lift ร— 30)
Inย [57]:
def simulate_rollout_impact(
    experiment_result,
    daily_eligible_observations,
    metric_unit='conversions'
):
    """
    Estimate potential impact of rolling out the treatment to all eligible traffic.

    Parameters:
    - experiment_result: dict
        Output of `run_ab_test()` โ€” must contain summary + group_labels
    - daily_eligible_observations: int
        Number of eligible units per day (users, sessions, transactions, etc.)
    - metric_unit: str
        What the metric represents (e.g., 'conversions', 'revenue', 'clicks')

    Prints daily and monthly lift estimates.
    """

    group1, group2 = experiment_result['group_labels']
    summary = experiment_result['summary']

    # Extract means
    mean_control = summary[group1]['mean']
    mean_treatment = summary[group2]['mean']
    observed_lift = mean_treatment - mean_control

    # Impact calculation
    daily_impact = observed_lift * daily_eligible_observations
    monthly_impact = daily_impact * 30

    # Output
    print("\n๐Ÿ“ฆ Rollout Simulation")
    print(f"- Outcome Metric      : {metric_unit}")
    print(f"- Observed Lift       : {observed_lift:.4f} per unit")
    print(f"- Daily Eligible Units: {daily_eligible_observations}")
    print(f"- Estimated Daily Impact   : {daily_impact:,.0f} {metric_unit}/day")
    print(f"- Estimated Monthly Impact : {monthly_impact:,.0f} {metric_unit}/month\n")
Inย [58]:
# Derive daily volume from actual data
daily_traffic_estimate = users.shape[0]  # Assuming full traffic per day

simulate_rollout_impact(
    experiment_result=result,                         # Output from run_ab_test()
    daily_eligible_observations=daily_traffic_estimate,
    metric_unit=test_config['outcome_metric_col']     # Dynamic label like 'engagement_score' or 'revenue'
)
๐Ÿ“ฆ Rollout Simulation
- Outcome Metric      : engagement_score
- Observed Lift       : -0.7146 per unit
- Daily Eligible Units: 1000
- Estimated Daily Impact   : -715 engagement_score/day
- Estimated Monthly Impact : -21,437 engagement_score/month


๐Ÿงช A/B Test Holdouts

๐Ÿ“– Why We Sometimes Don't Ship to 100% (Click to Expand)
๐Ÿงช A/B Test Holdoutsยถ

Even after a successful A/B test, we often maintain a small holdout group during rollout.

This helps us:

  • Track long-term impact beyond the experiment window.
  • Detect novelty fade or unexpected side effects.
  • Maintain a clean โ€œcontrolโ€ for system-wide benchmarking.
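
A minimal sketch of carving out a rollout-time holdout (the 5% share and the rollout_group column are assumptions for illustration, not part of the experiment setup above):

import numpy as np

np.random.seed(my_seed)
holdout_share = 0.05  # assumed: keep 5% of users on the old experience

users_rollout = users.copy()
users_rollout['rollout_group'] = np.random.choice(
    ['rollout', 'holdout'],
    size=len(users_rollout),
    p=[1 - holdout_share, holdout_share]
)
# The holdout keeps the control experience, so long-term impact can be measured
# later with the same machinery used above (e.g., run_ab_test on rollout vs holdout).
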
๐Ÿข Industry Practiceยถ
  • Common at large orgs like Facebook, where teams share a holdout pool for all feature launches.
  • Holdouts help leadership evaluate true impact during performance reviews and roadmap planning.
โš ๏ธ When We Skip Holdoutsยถ
  • Bug fixes or critical updates (e.g., spam, abuse, policy violations).
  • Sensitive changes like content filtering (e.g., child safety flags).

๐Ÿšซ Limits & Alteratives

๐Ÿ“‰ When Not to A/B Test & What to Do Instead (Click to Expand)
๐Ÿ™…โ€โ™€๏ธ When Not to A/B Testยถ
  • Lack of infrastructure โ†’ No tracking, engineering, or experiment setup.
  • Lack of impact → Not worth the effort if the feature has minimal upside; shipping features has downstream implications (support, bugs, operations).
  • Lack of traffic → Can’t reach statistical significance in a reasonable time.
  • Lack of conviction โ†’ No strong hypothesis; testing dozens of variants blindly.
  • Lack of isolation โ†’ Hard to contain exposure (e.g., testing a new logo everyone sees).
๐Ÿงช Alternatives & Edge Casesยถ
  • Use user interviews or logs to gather directional signals.
  • Leverage retrospective data for pre/post comparisons.
  • Consider sequential testing or soft rollouts for low-risk changes.
  • Use design experiments (e.g., multivariate, observational) when randomization isn't feasible.
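
For the pre/post route, here is a minimal sketch using a paired t-test on the same users before and after a launch (the metric_before / metric_after columns are hypothetical, and this design cannot separate the launch effect from seasonality or other concurrent changes):

from scipy import stats

def pre_post_comparison(df, before_col='metric_before', after_col='metric_after'):
    """Paired comparison of the same users before vs after a launch."""
    diff = df[after_col] - df[before_col]
    t_stat, p_val = stats.ttest_rel(df[after_col], df[before_col])
    print(f"Mean change : {diff.mean():+.4f}")
    print(f"T-statistic : {t_stat:.4f}")
    print(f"P-value     : {p_val:.4f}")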

Back to the top ___