
📖 A/B Testing

🗂️ Data Setup

  • ⚙️ Environment Setup
  • 🛠️ Experiment Setup
  • 🔧 Central Control Panel
  • 📥 Read/Generate Data

⚡ Power Analysis

  • ⚙️ Setup Inputs + Config Values
  • 📈 Baseline Estimation from Data
  • 📈 Minimum Detectable Effect
  • 🔍 Test Family
  • 📏 Required Sample Size

🔀 Randomization

  • 🎲 Apply Randomization
  • 🕸️ Network Effects & SUTVA Violations
  • ⚖️ Sample Ratio Mismatch

🧪 AA Testing

  • 🧬 Outcome Similarity Test
  • 📊 AA Test Visualization
  • 🎲 Type I Error Simulation

🧪 A/B Testing

  • 🧾 Summaries
  • 📊 Visualization
  • 🎯 95% Confidence Intervals
  • 📈 Lift Analysis
  • ✅ Final Conclusion
  • ⏱️ How Long

🔍 Post Hoc Analysis

  • 🧩 Segmented Lift
  • 🚦 Guardrail Metrics
  • 🔄 CUPED
  • 🧠 Correcting for Multiple Comparisons
  • 🪄 Novelty Effects & Behavioral Decay
  • 🎯 Primacy Effect & Order Bias
  • 🎲 Rollout Simulation
  • 🧪 A/B Test Holdouts
  • 🚫 Limits & Alternatives

🗂️ Data Setup

⚙️ Environment Setup

In [1]:
# Display Settings
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
from IPython.display import display, HTML

# Set seed
my_seed = 1995

# Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import json
from scipy import stats
from scipy.stats import (
    ttest_ind,
    ttest_rel,
    chi2_contingency,
    mannwhitneyu,
    levene,
    shapiro
)
import statsmodels.api as sm
from statsmodels.stats.power import (
    TTestIndPower,
    TTestPower,
    FTestAnovaPower,
    NormalIndPower
)
from statsmodels.stats.multitest import multipletests
from sklearn.model_selection import train_test_split

from ab_utils_01_data_setup import *
from ab_utils_02_power_analysis import *
from ab_utils_03_randomization import *
from ab_utils_04_aa_testing import *
from ab_utils_05_ab_testing import *
from ab_utils_06_post_hoc import *
import sys
sys.path.insert(0, '..')
from Hypothesis_Testing.ht_utils import print_config_summary

๐Ÿ› ๏ธ Experiment Setup

Inย [2]:
# 1. Main outcome variable you're testing
outcome_metric_col = 'engagement_score'

# 2. Metric type: 'binary', 'continuous', or 'categorical'
outcome_metric_datatype = 'continuous'

# 3. Group assignment (to be generated)
group_labels = ('control', 'treatment')

# 3b. Number of groups in the experiment (e.g., 2 for A/B test, 3 for A/B/C test)
group_count = len(group_labels)

# 4. Experimental design variant: independent or paired
variant = 'independent'  # Options: 'independent' (supported), 'paired' (not supported yet, todo)

# 5. Optional: Unique identifier for each observation (can be user_id, session_id, etc.)
observation_id_col = 'user_id'

# 6. Optional: Pre-experiment metric for CUPED, if used
pre_experiment_metric = 'past_purchase_count'  # Can be None

# Column name used to store assigned group after randomization
group_col = 'group'

# Randomization method to assign users to groups
# Options: 'simple', 'stratified', 'block', 'matched_pair', 'cluster', 'cuped'
randomization_method = "simple"

# Optional: guardrail metric column for simulated outcome data. Set to None to omit.
guardrail_metric_col = 'bounce_rate'

🔧 Central Control Panel

In [3]:
test_config = {
    # Core experiment setup
    'outcome_metric_col'     : outcome_metric_col,         # Main metric to analyze (e.g., 'engagement_score')
    'observation_id_col'     : observation_id_col,         # Unique identifier for each observation
    'pre_experiment_metric'  : pre_experiment_metric,      # Used for CUPED adjustment (if any)
    'outcome_metric_datatype': outcome_metric_datatype,    # One of: 'binary', 'continuous', 'categorical'
    'group_labels'           : group_labels,               # Tuple of (control, treatment) group names
    'group_count'            : group_count,                # Number of groups (usually 2 for A/B tests)
    'variant'                : variant,                    # 'independent' or 'paired'
    'guardrail_metric_col'   : guardrail_metric_col,       # Optional: e.g. 'bounce_rate'; None to omit

    # Diagnostic results - filled after EDA/assumptions check
    'normality'              : None,  # Will be set based on Shapiro-Wilk or visual tests
    'equal_variance'         : None,  # Will be set using Levene's/Bartlett's test
    'family'                 : None   # Test family → 'z_test', 't_test', 'anova', 'chi_square', etc.
}

print_config_summary(test_config)
📋 Hypothesis Test Configuration Summary

🔸 Outcome Metric Col        : engagement_score
🔸 Observation Id Col        : user_id
🔸 Pre Experiment Metric     : past_purchase_count
🔸 Outcome Metric Datatype   : continuous
🔸 Group Labels              : ('control', 'treatment')
🔸 Group Count               : 2
🔸 Variant                   : independent
🔸 Guardrail Metric Col      : bounce_rate
🔸 Normality                 : None
🔸 Equal Variance            : None
🔸 Family                    : None

📥 Read/Generate Data

In [4]:
observations_count = 1000
df = create_dummy_ab_data(observations_count, seed=my_seed, outcome_metric_col=outcome_metric_col, guardrail_metric_col=guardrail_metric_col)
historical_df = create_historical_df(df, outcome_metric_col, guardrail_metric_col, seed=my_seed)
df
historical_df
Out[4]:
user_id engagement_score bounce_rate past_purchase_count platform device_type
0 1 NaN NaN 37.593666 Android mobile
1 2 NaN NaN 35.294211 Android mobile
2 3 NaN NaN 71.011905 iOS desktop
3 4 NaN NaN 35.351784 Android mobile
4 5 NaN NaN 58.179221 Android mobile
... ... ... ... ... ... ...
995 996 NaN NaN 56.227450 iOS mobile
996 997 NaN NaN 57.966997 iOS mobile
997 998 NaN NaN 50.674093 Android mobile
998 999 NaN NaN 51.941265 iOS mobile
999 1000 NaN NaN 38.515887 iOS desktop

1000 rows × 6 columns

Out[4]:
user_id engagement_score bounce_rate past_purchase_count platform device_type
0 1 31.390500 0.560153 37.593666 Android mobile
1 2 27.941317 0.547708 35.294211 Android mobile
2 3 81.517858 0.452955 71.011905 iOS desktop
3 4 28.027676 0.565883 35.351784 Android mobile
4 5 62.268831 0.403212 58.179221 Android mobile
... ... ... ... ... ... ...
995 996 59.341175 0.622611 56.227450 iOS mobile
996 997 61.950496 0.422644 57.966997 iOS mobile
997 998 51.011140 0.544316 50.674093 Android mobile
998 999 52.911898 0.440618 51.941265 iOS mobile
999 1000 32.773830 0.442153 38.515887 iOS desktop

1000 rows × 6 columns



⚡ Power Analysis

📖 Click to Expand

Power analysis helps determine the minimum sample size required to detect a true effect with statistical confidence.

Why It Matters:
  • Avoids underpowered tests (risk of missing real effects)
  • Balances tradeoffs between sample size, Minimum Detectable Effect (MDE), significance level (α), and statistical power (1 - β)

Key Inputs:

Parameter     | Meaning
alpha (α)     | Significance level (probability of false positive), e.g. 0.05
Power (1 - β) | Probability of detecting a true effect, e.g. 0.80 or 0.90
Baseline      | Current outcome (e.g., 10% conversion, $50 revenue)
MDE           | Minimum detectable effect - the smallest meaningful lift (e.g., +2% or +$5)
Std Dev       | Standard deviation of the metric (for continuous outcomes)
Effect Size   | Optional: Cohen's d (for t-tests) or f (for ANOVA)
Groups        | Number of groups (relevant for ANOVA)

This notebook automatically selects the correct formula based on the configured metric type and test family.
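For a continuous metric, the t-test branch of that calculation can be sketched directly with statsmodels; the numbers below are illustrative placeholders, not values from this notebook's config:

```python
import math

from statsmodels.stats.power import TTestIndPower

# Illustrative inputs (assumed for this sketch, not taken from the config above)
alpha, power = 0.05, 0.80
baseline_std = 15.0   # historical std dev of the outcome metric
mde = 5.0             # smallest mean difference worth detecting

effect_size = mde / baseline_std  # Cohen's d
n_per_group = TTestIndPower().solve_power(
    effect_size=effect_size, alpha=alpha, power=power,
    ratio=1.0, alternative='two-sided'
)
print(math.ceil(n_per_group))  # per-group sample size, rounded up
```

Halving the MDE roughly quadruples the required sample size, since n scales with 1/d².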

โš™๏ธ Setup Inputs + Config Values

๐Ÿ“– Click to Expand

These are the core experiment design parameters required for power analysis and statistical testing.

  • alpha: Significance level - the tolerance for false positives (commonly set at 0.05).
  • power: Probability of detecting a true effect - typically 0.80 or 0.90.
  • group_labels: The names of the experimental groups (e.g., 'control', 'treatment').
  • metric_col: Outcome metric column you're analyzing.
  • test_family: Chosen statistical test (e.g., 't_test', 'z_test', 'chi_square') based on assumptions.
  • variant: Experimental design structure - 'independent' or 'paired'.

These inputs drive sample size estimation, test choice, and downstream analysis logic.

In [5]:
# Define Core Inputs

# Use values from your config or plug in manually
alpha = 0.05  # False positive tolerance (Type I error)
power = 0.80  # Statistical power (1 - Type II error)

📈 Baseline Estimation (Pre-Experiment)

📖 Click to Expand

Before running power analysis, we need a baseline estimate of the outcome metric.

  • These values must come from historical data collected before the experiment.
  • They represent the expected behavior of users under the current system (control condition).
  • For binary metrics (e.g., conversion), the baseline is the historical conversion rate.
  • For continuous metrics (e.g., revenue, engagement), we estimate the historical mean and standard deviation.

These estimates allow us to translate the Minimum Detectable Effect (MDE) into a statistical effect size and compute the required sample size.

โš ๏ธ Baselines must be computed from pre-experiment data. Using outcome data from the experiment itself would introduce data leakage.
In [6]:
# 🧮 Data-Driven Baseline Metric from historical data (stored in test_config; only relevant keys are set per test family)
_b = compute_baseline_from_data(historical_df, test_config)
test_config['baseline_rate'] = _b['baseline_rate']
test_config['baseline_mean'] = _b['baseline_mean']
test_config['std_dev'] = _b['std_dev']
📊 Baseline mean (historical): 49.16
📏 Baseline std dev (historical): 14.64

📈 Minimum Detectable Effect

📖 Click to Expand

🎯 Minimum Detectable Effect (MDE) is the smallest business-relevant difference you want your test to catch.

  • It reflects what matters - not what the data happens to show
  • Drives required sample size:
    • Smaller MDE → larger sample
    • Larger MDE → smaller sample

🧠 Choose an MDE based on:

  • What level of uplift would justify launching the feature?
  • What's a meaningful change in your metric - not just statistical noise?
In [7]:
# Minimum Detectable Effect (MDE)
# This is NOT data-driven - it reflects the minimum improvement you care about detecting.
# It should be small enough to catch valuable changes, but large enough to avoid inflating sample size.

# Examples by Metric Type:
# - Binary       : 0.02 → detect a 2% lift in conversion rate (e.g., from 10% to 12%)
# - Categorical  : 0.05 → detect a 5% shift in plan preference (e.g., more users choosing 'premium' over 'basic')
# - Continuous   : 3.0  → detect a 3-point gain in engagement score (e.g., from 50 to 53 avg. score)

mde = 5  # TODO: Change this based on business relevance

๐Ÿ” Test Family

๐Ÿ“– Click to Expand

Selects the appropriate statistical test based on:

  • Outcome data type (binary, continuous, categorical)
  • Distributional assumptions (normality, variance)
  • Number of groups and experiment structure (independent vs paired)

This step automatically maps to the correct test (e.g., t-test, z-test, chi-square, ANOVA).

๐Ÿงช Experiment Type โ†’ Test Family Mapping
Outcome MetricNormalityGroup CountSelected Test Family
binaryโ€”2z_test
binaryโ€”3+chi_square
continuousโœ…2t_test
continuousโœ…3+anova
continuousโŒ2non_parametric (Mann-Whitney U)
continuousโŒ3+non_parametric (Kruskal-Wallis)
categoricalโ€”2chi_square
categoricalโ€”3+chi_square
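The mapping above can be expressed as a small dispatch function. This is a hand-rolled sketch, not the actual `determine_test_family` implementation from the utils module:

```python
def pick_test_family(datatype, normal, n_groups):
    """Map metric type, normality, and group count to a test family (sketch)."""
    if datatype == 'binary':
        return 'z_test' if n_groups == 2 else 'chi_square'
    if datatype == 'categorical':
        return 'chi_square'
    if datatype == 'continuous':
        if normal:
            return 't_test' if n_groups == 2 else 'anova'
        # Non-normal (or normality not yet confirmed) -> non-parametric fallback
        return 'mann_whitney_u_test' if n_groups == 2 else 'kruskal_wallis_test'
    raise ValueError(f"Unknown datatype: {datatype}")

print(pick_test_family('continuous', normal=False, n_groups=2))  # mann_whitney_u_test
```

Note that the run below lands on mann_whitney_u_test while `normality` is still None, consistent with a conservative non-parametric fallback when normality is unconfirmed.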
In [8]:
test_config['family'] = determine_test_family(test_config)
# test_config
print_config_summary(test_config)

print(f"✅ Selected test family: {test_config['family']}")
📋 Hypothesis Test Configuration Summary

🔸 Outcome Metric Col        : engagement_score
🔸 Observation Id Col        : user_id
🔸 Pre Experiment Metric     : past_purchase_count
🔸 Outcome Metric Datatype   : continuous
🔸 Group Labels              : ('control', 'treatment')
🔸 Group Count               : 2
🔸 Variant                   : independent
🔸 Guardrail Metric Col      : bounce_rate
🔸 Normality                 : None
🔸 Equal Variance            : None
🔸 Family                    : mann_whitney_u_test
🔸 Baseline Rate             : None
🔸 Baseline Mean             : 49.16337003981812
🔸 Std Dev                   : 14.640142733622508
✅ Selected test family: mann_whitney_u_test

๐Ÿ“ Required Sample Size

Inย [9]:
required_sample_size = calculate_power_sample_size(
    test_family=test_config['family'],
    variant=test_config.get('variant'),
    alpha=alpha,
    power=power,
    baseline_rate=test_config.get('baseline_rate'),
    mde=mde,
    std_dev=test_config.get('std_dev'),
    effect_size=None,  # Let it compute internally via mde/std
    num_groups=2
)

test_config['required_sample_size'] = required_sample_size
print(f"✅ Required sample size per group: {required_sample_size}")
print(f"👥 Total sample size: {required_sample_size * 2}")
✅ Required sample size per group: 136
👥 Total sample size: 272
In [10]:
print_power_summary(
    test_family=test_config['family'],
    variant=test_config.get('variant'),
    alpha=alpha,
    power=power,
    baseline_rate=test_config.get('baseline_rate'),
    mde=mde,
    std_dev=test_config.get('std_dev'),
    required_sample_size=required_sample_size
)
📈 Power Analysis Summary
- Test: MANN_WHITNEY_U_TEST (independent)
- Significance level (α): 0.05
- Statistical power (1 - β): 0.8
- Std Dev (baseline): 14.64
- MDE (mean difference): 5
- Cohen's d: 0.34

✅ To detect a 5-unit lift in mean outcome,
you need 136 users per group → total 272 users.



🔀 Randomization

📖 Click to Expand

Randomization is used to ensure that observed differences in outcome metrics are due to the experiment, not pre-existing differences.

  • Prevents selection bias (e.g., users self-selecting into groups)
  • Balances confounding factors like platform, region, or past behavior
  • Enables valid inference through statistical testing

📖 Simple Randomization (Click to Expand)

Each user is assigned to control or treatment with equal probability, independent of any characteristics.

✅ When to Use:

  • Sample size is large enough to ensure natural balance
  • No strong concern about confounding variables
  • Need a quick, default assignment strategy

🛠️ How It Works:

  • Assign each user randomly (e.g., 50/50 split)
  • No grouping, segmentation, or blocking involved
  • Groups are expected to balance out on average
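A stand-alone sketch of the idea (the notebook's own `apply_simple_randomization` lives in the utils module and may differ):

```python
import numpy as np
import pandas as pd

def simple_randomize(df, group_col='group', labels=('control', 'treatment'), seed=1995):
    """Assign each row to a group independently with equal probability."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    out[group_col] = rng.choice(labels, size=len(out))
    return out

users = pd.DataFrame({'user_id': range(1, 1001)})
assigned = simple_randomize(users)
print(assigned['group'].value_counts().to_dict())  # roughly 500/500, not exactly
```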

📖 Stratified Sampling (Click to Expand)

Ensures that key segments (e.g., platform, region) are evenly represented across control and treatment.

When to Use
  • User base is naturally skewed (e.g., 70% mobile, 30% desktop)
  • Important to control for known confounders like geography or device
  • You want balance within subgroups, not just overall
How It Works
  • Pick a stratification variable (e.g., platform)
  • Split population into strata (groups)
  • Randomly assign users within each stratum
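The within-stratum assignment can be sketched as follows (illustrative, not the utils implementation; each stratum is shuffled and split as evenly as possible):

```python
import numpy as np
import pandas as pd

def stratified_randomize(df, stratify_col, group_col='group',
                         labels=('control', 'treatment'), seed=1995):
    """Shuffle each stratum, then split it in half across the two groups."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    out[group_col] = labels[1]  # default; first half of each stratum overwritten below
    for _, idx in out.groupby(stratify_col).groups.items():
        shuffled = rng.permutation(np.asarray(idx))
        out.loc[shuffled[: len(shuffled) // 2], group_col] = labels[0]
    return out

users = pd.DataFrame({'user_id': range(200),
                      'platform': ['mobile'] * 140 + ['desktop'] * 60})
assigned = stratified_randomize(users, 'platform')
print(pd.crosstab(assigned['platform'], assigned['group']))
```

Both groups end up mirroring the 70/30 platform mix instead of merely approximating it.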

📖 Block Randomization (Click to Expand)

Groups users into fixed-size blocks and randomly assigns groups within each block.

When to Use
  • Users arrive in time-based batches (e.g., daily cohorts)
  • Sample size is small and needs enforced balance
  • You want to minimize temporal or ordering effects
How It Works
  • Create blocks based on order or ID (e.g., every 10 users)
  • Randomize assignments within each block
  • Ensures near-equal split in every batch

📖 Matched Pair Randomization (Click to Expand)

Participants are paired based on similar characteristics before random group assignment. This reduces variance and improves statistical power by ensuring balance on key covariates.

When to Use
  • Small sample size with high risk of confounding
  • Outcomes influenced by user traits (e.g., age, income, tenure)
  • Need to minimize variance across groups
How It Works
  1. Identify important covariates (e.g., age, purchase history)
  2. Sort users by those variables
  3. Create matched pairs (or small groups)
  4. Randomly assign one to control, the other to treatment
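The four steps above, sketched in code (a hypothetical helper, not the utils `apply_matched_pair_randomization`):

```python
import numpy as np
import pandas as pd

def matched_pair_randomize(df, sort_col, group_col='group',
                           labels=('control', 'treatment'), seed=1995):
    """Sort on a covariate, pair adjacent rows, and flip a coin within each pair."""
    rng = np.random.default_rng(seed)
    out = df.sort_values(sort_col).reset_index(drop=True)
    assignments = []
    for start in range(0, len(out), 2):
        pair = labels if rng.random() < 0.5 else labels[::-1]
        assignments.extend(pair[: len(out) - start])  # tolerates an odd final row
    out[group_col] = assignments
    return out

users = pd.DataFrame({'user_id': range(10),
                      'past_purchase_count': [5, 50, 7, 48, 9, 46, 11, 44, 13, 42]})
assigned = matched_pair_randomize(users, 'past_purchase_count')
print(assigned[['past_purchase_count', 'group']])
```

Every matched pair contributes one user to each group, so the covariate is balanced by construction.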

📖 Cluster Randomization (Click to Expand)

Entire groups or clusters (e.g., cities, stores, schools) are assigned to control or treatment. Used when it's impractical or risky to randomize individuals within a cluster.

When to Use
  • Users naturally exist in groups (e.g., teams, locations, devices)
  • There's a risk of interference between users (e.g., word-of-mouth)
  • Operational or tech constraints prevent individual-level randomization
How It Works
  1. Define the cluster unit (e.g., store, city)
  2. Randomly assign each cluster to control or treatment
  3. All users within the cluster inherit the group assignment
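A sketch of cluster-level assignment (the `city` column here is hypothetical; the utils helper `apply_cluster_randomization` expects one like it):

```python
import numpy as np
import pandas as pd

def cluster_randomize(df, cluster_col, group_col='group',
                      labels=('control', 'treatment'), seed=1995):
    """Randomize whole clusters; every member inherits its cluster's assignment."""
    rng = np.random.default_rng(seed)
    cluster_to_group = {c: rng.choice(labels) for c in df[cluster_col].unique()}
    out = df.copy()
    out[group_col] = out[cluster_col].map(cluster_to_group)
    return out

users = pd.DataFrame({'user_id': range(6),
                      'city': ['NYC', 'NYC', 'SF', 'SF', 'LA', 'LA']})
assigned = cluster_randomize(users, 'city')
print(assigned.groupby('city')['group'].nunique().to_dict())  # each city maps to exactly one group
```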

🎲 Apply Randomization

📖 Click to Expand

In this notebook we randomize the entire dataset (e.g., all 1000 simulated users). This is done purely for demonstration so that downstream analysis has enough data to run.

In real A/B experiments the workflow is different:

  • Power analysis determines the required sample size (e.g., 136 users per group, 272 total).
  • Users are then randomized as they arrive in the product.
  • The experiment continues until the required sample size is reached.
  • Analysis is performed once the target sample size is collected.

In other words, real experiments do not randomize a fixed dataset upfront. Instead, randomization happens dynamically during the experiment as new users enter the system.

โš ๏ธ In this notebook the dataset is simulated beforehand, so randomization is applied to all users at once. In production experimentation platforms (e.g., Optimizely, Statsig, internal experimentation systems), users are assigned to variants at runtime.
In [11]:
n_required = test_config['required_sample_size'] * test_config['group_count']
df = df.sample(n=n_required, random_state=42)

df.head()
Out[11]:
user_id engagement_score bounce_rate past_purchase_count platform device_type
521 522 NaN NaN 25.729102 Android mobile
737 738 NaN NaN 51.861463 iOS mobile
740 741 NaN NaN 42.022993 Android desktop
660 661 NaN NaN 49.573649 Android mobile
411 412 NaN NaN 43.824528 iOS mobile
In [12]:
# Apply randomization method
if randomization_method == "simple":
    df = apply_simple_randomization(df, group_col=group_col, seed=my_seed)

elif randomization_method == "stratified":
    df = apply_stratified_randomization(df, stratify_col='platform', group_col=group_col, seed=my_seed)

elif randomization_method == "block":
    df = apply_block_randomization(df, observation_id_col='user_id', group_col=group_col, block_size=10, seed=my_seed)

elif randomization_method == "matched_pair":
    df = apply_matched_pair_randomization(df, sort_col=pre_experiment_metric, group_col=group_col, group_labels=test_config['group_labels'])

elif randomization_method == "cluster":
    df = apply_cluster_randomization(df, cluster_col='city', group_col=group_col, seed=my_seed)

else:
    raise ValueError(f"❌ Unsupported randomization method: {randomization_method}")

# Randomization only assigns group. Outcome data is collected in the next section (Outcome data).
df
Out[12]:
group user_id engagement_score bounce_rate past_purchase_count platform device_type
521 control 522 NaN NaN 25.729102 Android mobile
737 control 738 NaN NaN 51.861463 iOS mobile
740 control 741 NaN NaN 42.022993 Android desktop
660 treatment 661 NaN NaN 49.573649 Android mobile
411 treatment 412 NaN NaN 43.824528 iOS mobile
... ... ... ... ... ... ... ...
380 treatment 381 NaN NaN 56.269061 iOS mobile
631 control 632 NaN NaN 61.199801 iOS desktop
381 control 382 NaN NaN 48.515995 iOS mobile
490 control 491 NaN NaN 55.112986 Android mobile
118 control 119 NaN NaN 61.520038 iOS desktop

272 rows × 7 columns

๐Ÿ•ธ๏ธ Network Effects & SUTVA Violations

๐Ÿ“– When Randomization Assumptions Break (Click to Expand)

Most A/B tests assume the Stable Unit Treatment Value Assumption (SUTVA) โ€” meaning:

  • A user's outcome depends only on their own treatment assignment.
  • One unit's treatment does not influence another unitโ€™s outcome.
๐Ÿงช Why It Mattersยถ

If users in different groups interact:

  • Control group behavior may be influenced by treatment group exposure.
  • This biases your estimates and dilutes treatment effect.
  • Standard tests may incorrectly accept the null hypothesis due to spillover.

This assumption breaks down in experiments involving social behavior, multi-user platforms, or ecosystem effects.

โš ๏ธ Common Violation Scenariosยถ
  • ๐Ÿ›๏ธ Marketplace platforms (e.g., sellers and buyers interact)
  • ๐Ÿง‘โ€๐Ÿคโ€๐Ÿง‘ Social features (e.g., follows, likes, comments, feeds)
  • ๐Ÿ“ฒ Referrals / network effects (e.g., invites, rewards)
  • ๐Ÿ’ฌ Chat and collaboration tools (e.g., Slack, Teams)
๐Ÿงฉ Solutions (If You Suspect Interference)ยถ
Strategy Description
Cluster Randomization Randomize at group level (e.g., friend group, region, org ID)
Isolation Experiments Only roll out to fully disconnected segments (e.g., one region only)
Network-Based Metrics Include network centrality / exposure as covariates
Post-Experiment Checks Monitor if control group was exposed indirectly (e.g., referrals, shared UIs)
Simulation-Based Designs Use agent-based or graph simulations to estimate contamination risk

โš–๏ธ Sample Ratio Mismatch

๐Ÿ“– Click to Expand

Is group assignment balanced?

  • SRM (Sample Ratio Mismatch) checks whether the observed group sizes match the expected ratio.
  • In a perfect world, random assignment to 'A1' and 'A2' should give ~50/50 split.
  • SRM helps catch bugs in randomization, data logging, or user eligibility filtering.

Real-World Experiment Split Ratios

Scenario Split Why
Default A/B 50 / 50 Maximizes power and ensures fairness
Risky feature 10 / 90 or 20 / 80 Limits user exposure to minimize risk
Ramp-up Step-wise (1-5-25-50โ€ฆ) Gradual rollout to catch issues early
A/B/C Test 33 / 33 / 33 or weighted Compare multiple variants fairly or with bias
High control confidence needed 70 / 30 or 60 / 40 More stability in baseline comparisons
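Under the hood, the SRM check is a chi-square goodness-of-fit test on the observed group counts. A sketch using this run's observed 133/139 split:

```python
from scipy import stats

observed = [133, 139]                  # users in control / treatment
total = sum(observed)
expected = [total * 0.5, total * 0.5]  # planned 50/50 split

chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"Chi2={chi2:.4f}, p={p:.4f}")   # small chi2 / large p -> no evidence of SRM
```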
In [13]:
check_sample_ratio_mismatch(df, group_col=group_col, group_labels=test_config['group_labels'], expected_ratios=[0.5, 0.5], alpha=0.05)
๐Ÿ” Sample Ratio Mismatch (SRM) Check
Group control: 133 users (48.90%) โ€” Expected: 136.0
Group treatment: 139 users (51.10%) โ€” Expected: 136.0

Chi2 Statistic: 0.1324
P-value       : 0.7160
โœ… No SRM โ€” group sizes look balanced.



🧪 AA Testing

📖 Click to Expand

A/A testing is a preliminary experiment where both groups (e.g., "control" and "treatment") receive the exact same experience. It's used to validate the experimental setup before running an actual A/B test.

What Are We Checking?

  • Are users being assigned fairly and randomly?
  • Are key outcome metrics statistically similar across groups?
  • Can we trust the experimental framework?

Why A/A Testing Matters

  • Validates Randomization - Confirms the groups are balanced at baseline (no bias or leakage)
  • Detects SRM (Sample Ratio Mismatch) - Ensures the actual split (e.g., 50/50) matches what was intended
  • Estimates Variability - Helps calibrate variance for accurate power calculations later
  • Trust Check - Catches bugs in assignment logic, event tracking, or instrumentation

A/A Test Process

  1. Randomly assign users into two equal groups - Just like you would for an A/B test (e.g., control vs treatment)
  2. Measure key outcome - This depends on your experiment type:
    • binary → conversion rate
    • continuous → avg. revenue, time spent
    • categorical → feature adoption, plan selected
  3. Run statistical test:
    • binary → Z-test or Chi-square
    • continuous → t-test
    • categorical → Chi-square test
  4. Check SRM - Use a chi-square goodness-of-fit test to detect assignment imbalances

Possible Outcomes

Result                               | Interpretation
No significant difference            | ✅ Randomization looks good. Test setup is sound.
Statistically significant difference | ⚠️ Something's off - check assignment logic, instrumentation, or sample leakage

Run A/A tests whenever you launch a new experiment framework, roll out a new randomizer, or need to build stakeholder trust.

In [14]:
# Experiment has run; outcome and guardrail data come in (simulated via add_outcome_metrics). A/A: no treatment effect.
df = add_outcome_metrics(df, group_col=group_col, group_labels=test_config['group_labels'], outcome_metric_col=test_config['outcome_metric_col'], guardrail_metric_col=test_config.get('guardrail_metric_col') or guardrail_metric_col, treatment_effect=False, seed=my_seed)
df.head()
Out[14]:
group user_id engagement_score bounce_rate past_purchase_count platform device_type
521 control 522 31.390500 0.572531 25.729102 Android mobile
737 control 738 27.941317 0.484713 51.861463 iOS mobile
740 control 741 81.517858 0.710801 42.022993 Android desktop
660 treatment 661 28.027676 0.389262 49.573649 Android mobile
411 treatment 412 62.268831 0.610558 43.824528 iOS mobile

🧬 Outcome Similarity Test

📖 Click to Expand

Compares the outcome metric across groups to ensure no significant differences exist when there shouldn't be any - usually used during A/A testing or pre-experiment validation.

  • Helps detect setup issues like biased group assignment or data leakage.
  • Null Hypothesis: No difference in outcomes between control and treatment.
  • Uses the same statistical test as the main A/B test (e.g., t-test, z-test, chi-square).
In [15]:
df.groupby("group")[test_config['outcome_metric_col']].mean() # TODO, move this into function

# Outcome similarity test only.
_ = run_outcome_similarity_test(
    df=df,
    group_col='group',
    metric_col=test_config['outcome_metric_col'],
    test_family=test_config['family'],
    variant=test_config.get('variant'),
    group_labels=test_config['group_labels'],
    alpha=0.05,
    verbose=True
)
Out[15]:
group
control      46.639234
treatment    49.714740
Name: engagement_score, dtype: float64
๐Ÿ“ Outcome Similarity Check


๐Ÿง  Interpretation:
Used a Mann-Whitney U test to compare medians of 'engagement_score' across groups (non-parametric).
Null Hypothesis: Distributions are identical across groups.

We use ฮฑ = 0.05
โœ… p = 0.1748 โ‰ฅ ฮฑ โ†’ Failed to reject null. No significant difference.

📊 AA Test Visualization

In [16]:
visualize_aa_distribution(
    df,
    group_col='group',
    metric_col=test_config['outcome_metric_col'],
    test_family=test_config['family'],
    group_labels=test_config['group_labels'],
    variant=test_config.get('variant')
)

🎲 Type I Error Simulation

📖 Click to Expand
🔁 Repeated A/A Tests

While a single A/A test helps detect obvious flaws in group assignment (like SRM or data leakage), it's still a one-off check. To gain confidence in your randomization method, we simulate multiple A/A tests using the same logic:

  • Each run reassigns users randomly into control and treatment (with no actual change)
  • We then run the statistical test between groups for each simulation
  • We track how often the test reports a false positive (p < α), which estimates the Type I error rate

In theory, if your setup is unbiased and α = 0.05, you'd expect about 5% of simulations to return a significant result - this validates that your A/B framework isn't "trigger-happy."

📊 What this tells you:
  • Too many significant p-values → your framework is too noisy (bad randomization, poor test choice)
  • Near 5% = healthy noise level, expected by design

This step is optional but highly recommended when you're:

  • Trying out a new randomization strategy
  • Validating an internal experimentation framework
  • Stress-testing your end-to-end pipeline
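The loop inside `simulate_aa_type1_error_rate` presumably looks something like this sketch (a t-test is used here for brevity, on made-up data where the null is true by construction):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1995)
alpha, runs = 0.05, 500
metric = rng.normal(50, 15, size=2000)   # one population, so no true effect exists

false_positives = 0
for _ in range(runs):
    shuffled = rng.permutation(metric)   # fresh random "assignment" each run
    _, p = ttest_ind(shuffled[:1000], shuffled[1000:])
    false_positives += p < alpha

print(f"Type I error rate: {false_positives / runs:.2%}")  # should land near alpha
```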
In [17]:
_ = simulate_aa_type1_error_rate(
    df=df,
    metric_col=test_config['outcome_metric_col'],
    group_labels=test_config['group_labels'],
    test_family=test_config['family'],
    variant=test_config.get('variant'),
    runs=100,
    alpha=0.05
)
📈 Type I Error Rate Estimate: 1/100 = 1.00%

            🧠 Summary Interpretation:
            We simulated 100 A/A experiments using random group assignment (no actual treatment).

            Test: MANN_WHITNEY_U_TEST (independent)
            Metric: engagement_score
            Alpha: 0.05

            False positives (p < α): 1 / 100
            → Estimated Type I Error Rate: 1.00%

            This is within expected range for α = 0.05.
            → ✅ Test framework is behaving correctly - no bias or sensitivity inflation.
📖 Pseudo A/A Test (Click to Expand)

A pseudo A/A test is an offline validation technique used when historical experiment data already exists. Instead of running a live A/A experiment, analysts take data from a completed experiment and temporarily treat both groups as if they received the same experience.

The goal is to check whether the experimentation analysis pipeline behaves correctly when there is no true treatment effect. By re-running the statistical test under the assumption that both groups are identical, we can verify that the analysis does not produce systematic false positives.

Pseudo A/A tests are commonly used to validate analysis code, experiment dashboards, or statistical tooling before running new experiments.

โš ๏ธ Since pseudo A/A tests reuse historical experiment data rather than live traffic, they are only a diagnostic check and do not replace proper A/A validation in a production experimentation platform.



🧪 A/B Testing

🔗 For test selection (e.g., Z-test, t-test), refer to 📖 Hypothesis Testing Notebook

📖 Click to Expand
🧪 A/B Testing - Outcome Comparison

This section compares the outcome metric between control and treatment groups using the appropriate statistical test based on the experiment type.

📌 Metric Tracked:
  • Primary metric: Depends on use case:
    • Binary: Conversion rate (clicked or not)
    • Continuous: Average engagement, revenue, time spent
    • Categorical: Plan type, user tier, etc.
  • Unit of analysis: Unique user or unique observation

🔬 Outcome Analysis Steps:
  • Choose the right statistical test based on experiment_type:
    • 'binary' → Z-test for proportions
    • 'continuous_independent' → Two-sample t-test
    • 'continuous_paired' → Paired t-test
    • 'categorical' → Chi-square test of independence
  • Calculate test statistics, p-values, and confidence intervals
  • Visualize the comparison to aid interpretation
In [18]:
# Simulate experiment data (outcome + guardrail) with treatment effect for A/B analysis.
df = add_outcome_metrics(df, group_col=group_col, group_labels=test_config['group_labels'], outcome_metric_col=test_config['outcome_metric_col'], guardrail_metric_col=test_config.get('guardrail_metric_col') or guardrail_metric_col, treatment_effect=True, seed=my_seed)
df.head()
Out[18]:
group user_id engagement_score bounce_rate past_purchase_count platform device_type
521 control 522 31.390500 0.490186 25.729102 Android mobile
737 control 738 27.941317 0.094631 51.861463 iOS mobile
740 control 741 81.517858 0.531660 42.022993 Android desktop
660 treatment 661 31.301273 0.564170 49.573649 Android mobile
411 treatment 412 65.362110 0.572919 43.824528 iOS mobile
In [19]:
result = run_ab_test(
    df=df,
    group_col='group',
    metric_col=test_config['outcome_metric_col'],
    group_labels=test_config['group_labels'],
    test_family=test_config['family'],
    variant=test_config.get('variant'),
    alpha=0.05
)
print_config_summary(result) # TODO: Print nested json properly
# result
📋 Hypothesis Test Configuration Summary

🔸 Test Family    : mann_whitney_u_test
🔸 Variant        : independent
🔸 Group Labels   : ('control', 'treatment')
🔸 Alpha          : 0.05
🔸 Summary        : {'control': {'n': 133, 'mean': np.float64(46.639233969426925), 'std': 14.203174151273695, 'sum': None}, 'treatment': {'n': 139, 'mean': np.float64(54.66250808310677), 'std': 16.47955206608026, 'sum': None}}
🔸 Test           : Mann-Whitney U test
🔸 U Stat         : 6750.0
🔸 P Value        : 0.00012097959260162445

๐Ÿงพ Summaries

Inย [20]:
summarize_ab_test_result(result)
=============================================
๐Ÿงช A/B Test Result Summary [MANN_WHITNEY_U_TEST]
=============================================

๐Ÿ“Š Hypothesis Test Result
Test used: Mann-Whitney U test
U-statistic: 6750.0000
P-value    : 0.0001
โœ… Statistically significant difference detected.

๐Ÿ“‹ Group Summary:

n mean std sum
control 133.0 46.639234 14.203174 NaN
treatment 139.0 54.662508 16.479552 NaN
๐Ÿ“ˆ Lift Analysis
- Absolute Lift   : 8.0233
- Percentage Lift : 17.20%
- 95% CI for Lift : [4.3719, 11.6746]
=============================================

๐Ÿ“Š Visualization

Inย [21]:
plot_ab_test_results(result)
๐Ÿ“Š Visualization:

๐ŸŽฏ 95% Confidence Intervals
for the outcome in each group

๐Ÿ“– Click to Expand
  • The 95% confidence interval gives a range in which we expect the true value of the outcome metric to fall for each group.
  • If the confidence intervals do not overlap, that is strong evidence the difference is statistically significant.
  • If they do overlap, that does not guarantee insignificance (you still need the p-value to decide), but it suggests caution when interpreting lift.
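A per-group interval can be sketched with a normal approximation (mean ± z·SE). This is a sketch of the idea, not necessarily how `plot_confidence_intervals` computes its bars; the helper name `group_ci` is hypothetical.

```python
import numpy as np
from scipy import stats

def group_ci(values, confidence=0.95):
    """Normal-approximation CI for a group's mean: mean ± z * (sd / sqrt(n))."""
    values = np.asarray(values)
    se = values.std(ddof=1) / np.sqrt(len(values))
    z = stats.norm.ppf(0.5 + confidence / 2)  # ≈ 1.96 for 95%
    return values.mean() - z * se, values.mean() + z * se

# Simulated data matching the control-group summary above (n=133, mean≈46.6, sd≈14.2)
rng = np.random.default_rng(1995)
sample = rng.normal(46.64, 14.20, 133)
lo, hi = group_ci(sample)
print(f"95% CI for group mean: [{lo:.2f}, {hi:.2f}]")
```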
Inย [22]:
plot_confidence_intervals(result)

๐Ÿ“ˆ Lift Analysis
i.e., the 95% confidence interval for the difference in outcomes

๐Ÿ“– Click to Expand

This confidence interval helps quantify uncertainty around the observed lift between treatment and control groups. It answers:

  • How large is the difference between groups?
  • How confident are we in this lift estimate?

We compute a 95% CI for the difference in means (or proportions), not just for each group. If this interval does not include 0, we can reasonably trust there's a true difference. If it does include 0, the observed difference might be due to random chance.

This complements the p-value โ€” while p-values tell us if the difference is significant, CIs tell us how big the effect is, and how uncertain we are.
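For the difference itself, a normal-approximation sketch using the group summaries printed above reproduces the reported interval (illustrative helper, not necessarily the internals of `compute_lift_confidence_interval`):

```python
import numpy as np
from scipy import stats

def lift_ci(mean_c, std_c, n_c, mean_t, std_t, n_t, confidence=0.95):
    """CI for (treatment mean - control mean) via the unpooled standard error."""
    diff = mean_t - mean_c
    se = np.sqrt(std_c**2 / n_c + std_t**2 / n_t)
    z = stats.norm.ppf(0.5 + confidence / 2)
    return diff, (diff - z * se, diff + z * se)

# Group summaries from the notebook output above
diff, (lo, hi) = lift_ci(46.6392, 14.2032, 133, 54.6625, 16.4796, 139)
print(f"lift = {diff:.4f}, 95% CI = [{lo:.4f}, {hi:.4f}]")
# → lift = 8.0233, 95% CI = [4.3720, 11.6746]
```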

Inย [23]:
compute_lift_confidence_interval(result)
=============================================
๐Ÿ“ˆ 95% CI for Difference in Outcome [mann_whitney_u_test]
=============================================
- Absolute Lift (diff in means): 8.0233
- 95% CI for difference        : [4.3719, 11.6746]
โœ… Likely positive impact (CI > 0)
=============================================

โœ… Final Conclusion

Inย [24]:
print_final_ab_test_summary(result)
========================================
          ๐Ÿ“Š FINAL A/B TEST SUMMARY
========================================
๐Ÿ‘ฅ  Control Avg outcome         :  46.6392
๐Ÿงช  Treatment Avg outcome         :  54.6625
๐Ÿ“ˆ  Absolute lift              :  8.0233
๐Ÿ“Š  Percentage lift            :  17.20%
๐Ÿงช  P-value (from Mann-Whitney U test):  0.0001
----------------------------------------
โœ… RESULT: Statistically significant difference detected.
========================================

โฑ๏ธ How Long

๐Ÿ“– Click to Expand

The duration of an A/B test depends on how quickly you reach the required sample size per group, as estimated during your power analysis.

โœ… Key Inputs
  • Daily volume of eligible observations (users, sessions, or orders โ€” depends on your unit of analysis)
  • Required sample size per group (from power analysis)
  • Traffic split ratio (e.g., 50/50, 10/90, 33/33/33)
๐Ÿงฎ Formula
Test Duration (in days) =
Required Sample Size per Group รท (Daily Eligible Observations ร— Group Split Proportion)

This ensures the experiment runs long enough to detect the expected effect with the desired confidence and power.
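The formula can be sketched directly (the helper name `days_needed` is illustrative, not the notebook's `estimate_test_duration`):

```python
import math

def days_needed(required_per_group, daily_eligible, split_proportion):
    """Duration = required sample per group / daily observations landing in that group."""
    return math.ceil(required_per_group / (daily_eligible * split_proportion))

# 136 users needed per group, 1,000 eligible users/day, 50/50 split
print(days_needed(136, 1000, 0.5))  # → 1
```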

๐Ÿ’ก Planning Tips
  1. Estimate required sample size using power analysis (based on effect size, baseline, alpha, and power)
  2. Understand your traffic:
    • Whatโ€™s your average daily eligible traffic?
    • What unit of analysis is used (user, session, impression)?
  3. Apply group split:
    • e.g., for a 50/50 A/B test, each group gets 50% of traffic
  4. Estimate days using the formula above.
๐Ÿง  Real-World Considerations
  • โœ… Ramp-Up Period
    Gradually increase traffic exposure: 5% โ†’ 25% โ†’ 50% โ†’ full traffic.
    Helps catch bugs, stability issues, and confounding edge cases early.
  • โœ… Cool-Down Buffer
    Avoid ending tests on weekends, holidays, or during unusual traffic spikes.
    Add buffer days so your conclusions arenโ€™t skewed by anomalies.
  • โœ… Trust Checks Before Analysis
    • A/A testing to verify setup
    • SRM checks to confirm user distribution
    • Monitor guardrail metrics (e.g., bounce rate, latency, load time)
๐Ÿ—ฃ๏ธ Common Practitioner Advice
โ€œWe calculate sample size using power analysis, then divide by daily traffic per group. But we always factor in buffer days โ€” for ramp-up, trust checks, and stability. Better safe than sorry.โ€

โ€œPower analysis is the starting point. But we donโ€™t blindly stop when we hit N. We monitor confidence intervals, metric stability, and coverage to make sure weโ€™re making decisions the business can trust.โ€
๐Ÿ“– Monitoring Dashboard (Click to Expand)
  • Overall Test Health
    • Start/end date, traffic ramp-up %, time remaining
    • SRM (Sample Ratio Mismatch) indicator
    • P-value and effect size summary (updated daily)
  • Primary Metric Tracking
    • Daily trends for primary outcome (conversion, revenue, etc.)
    • Cumulative lift + confidence intervals
    • Statistical significance tracker (p-value, test stat)
  • Guardrail Metrics
    • Bounce rate, load time, checkout errors, etc.
    • Alert thresholds (e.g., +10% increase in latency)
    • Trend vs baseline and prior experiments
  • Segment Drilldowns
    • Platform (iOS vs Android), geography, user tier
    • Detect heterogeneous treatment effects
    • Option to toggle test results per segment
  • Cohort Coverage
    • Total users assigned vs eligible
    • Daily inclusion and exclusion trends
    • Debugging filters (e.g., why user X didnโ€™t get assigned)
  • Variance & Stability Checks
    • Volatility of key metrics
    • Pre vs post baseline comparisons
    • Funnel conversion variance analysis
  • Notes & Annotations
    • Manual tagging of major incidents (e.g., bug fix deployed, pricing change)
    • Timeline of changes affecting experiment interpretation
Inย [25]:
daily_eligible_users = 1000
allocation_ratios = (0.5, 0.5)
buffer_days = 2

test_duration_result = estimate_test_duration(
    required_sample_size_per_group=test_config['required_sample_size'],
    daily_eligible_users=daily_eligible_users,
    allocation_ratios=allocation_ratios,
    buffer_days=buffer_days,
    test_family=test_config['family']
)
test_duration_result
๐Ÿงฎ Estimated Test Duration
- Test family               : mann_whitney_u_test
- Required sample per group : 136
- Daily eligible traffic    : 1000
- Allocation ratio          : (0.5, 0.5)
- Longest group runtime     : 1 days
- Buffer days               : 2
โœ… Total estimated duration : 3 days

Out[25]:
{'test_family': 'mann_whitney_u_test',
 'per_group_days': [np.float64(1.0), np.float64(1.0)],
 'longest_group_runtime': 1,
 'recommended_total_duration': 3}

Back to the top


๐Ÿ” Post Hoc Analysis

๐Ÿ“– Click to Expand
After statistical significance, post-hoc analysis helps connect results to business confidence.
It's not just did it work โ€” but how, for whom, and at what cost or benefit?

๐Ÿง  Why Post Hoc Analysis Matters

  • Segments may respond differently โ€” average lift may hide underperformance in subgroups
  • Guardrails may show collateral damage (e.g., slower load time, higher churn)
  • Stakeholders need impact translation โ€” what does this mean in revenue, retention, or strategy?

๐Ÿ”Ž Typical Post Hoc Questions

  • Segment Lift
    • Did certain platforms, geos, cohorts, or user types benefit more?
    • Any negative lift in high-value user segments?
  • Guardrail Checks
    • Did the treatment impact non-primary metrics (e.g., latency, engagement, bounce rate)?
    • Were alert thresholds breached?
  • Business Impact Simulation
    • How does the observed lift scale to 100% of eligible users?
    • Whatโ€™s the projected change in conversions, revenue, or user satisfaction?
  • Edge Case Discovery
    • Any bugs, instrumentation gaps, or unexpected usage patterns?
    • Did any user types get excluded disproportionately?

๐Ÿ“Š What to Report

Area What to Show
Segment Analysis Table or chart showing lift per segment, sorted by effect size or risk
Guardrail Metrics Summary table of guardrails vs baseline, with thresholds or annotations
Revenue Simulation Projected uplift ร— traffic volume ร— conversion = business impact
Confidence Range 95% CI for key metrics per segment (wherever possible)
Rollout Readiness Any blockers, mitigations, or next steps if full rollout is considered

๐Ÿ’ก Pro Tip
Even if your p-value says โ€œyes,โ€ business rollout is a risk-based decision.
Post-hoc analysis is where statistical rigor meets product judgment.

๐Ÿงฉ Segmented Lift

๐Ÿ“– Click to Expand

Segmented lift tells us how different user segments responded to the treatment.

Why It Matters:

  • Uncovers hidden heterogeneity โ€” The overall average might mask variation across platforms, geographies, or user tiers.
  • Identifies high-risk or high-reward cohorts โ€” Some segments might benefit more, while others could be negatively impacted.
  • Guides rollout and targeting decisions โ€” Helps decide where to prioritize feature exposure, or where to mitigate risk.

Typical Segments:

  • Device type (e.g., mobile vs desktop)
  • Region (e.g., North vs South)
  • User lifecycle (e.g., new vs returning)
  • Platform (e.g., iOS vs Android)
"Segmentation answers who is benefiting (or suffering) โ€” not just whether it worked on average."
Inย [26]:
analyze_segment_lift(
    df=df,
    test_config=test_config,
    segment_cols=['platform', 'device_type'], # , 'user_tier', 'region'
    min_count_per_group=30,
    visualize=True
)
๐Ÿ”Ž Segmenting by: platform
platform count_control count_treatment mean_control mean_treatment std_control std_treatment lift p_value_lift
0 Android 56 48 44.881922 53.659185 13.900282 16.633431 8.777262 None
1 iOS 77 91 47.917279 55.191734 14.374087 16.465483 7.274455 None
๐Ÿ”Ž Segmenting by: device_type
device_type count_control count_treatment mean_control mean_treatment std_control std_treatment lift p_value_lift
0 mobile 87 106 46.029603 54.846641 13.205630 16.346295 8.817038 None
1 desktop 46 33 47.792232 54.071050 16.012127 17.144749 6.278819 None

๐Ÿšฆ Guardrail Metrics

๐Ÿ“– Click to Expand

Guardrail metrics are non-primary metrics tracked during an experiment to ensure the feature doesn't create unintended negative consequences.

We monitor them alongside the main success metric to:

  • ๐Ÿ“‰ Catch regressions in user behavior or system performance
  • ๐Ÿ” Detect trade-offs (e.g., conversion โ†‘ but bounce rate โ†‘ too)
  • ๐Ÿ›‘ Block rollouts if a feature does more harm than good
๐Ÿงช How We Check
  • Run statistical tests on each guardrail metric just like we do for the primary metric
  • Use the same experiment type (binary, continuous, etc.) for evaluation
  • Report p-values and lift to assess significance and direction
  • Focus more on risk detection than optimization
๐Ÿ“Š Common Guardrail Metrics
  • UX Health: Bounce Rate, Session Length, Engagement
  • Performance: Page Load Time, API Latency, CPU Usage
  • Reliability: Error Rate, Crash Rate, Timeout Errors
  • Behavioral: Scroll Depth, Page Views per Session
โœ… When to Act
  • If the treatment significantly worsens a guardrail metric โ†’ investigate
  • If the primary metric improves but guardrails suffer, assess trade-offs
  • Use p-values, lift, and domain context to guide decision-making
๐Ÿง  Why Guardrails Matter
โ€œWe donโ€™t just care if a metric moves โ€” we care what else it moved. Guardrails give us confidence that improvements arenโ€™t hiding regressions elsewhere.โ€
Inย [27]:
run_guardrail_analysis(df, test_config, group_col='group', alpha=0.05)
๐Ÿšฆ Guardrail Metric Check โ†’ 'bounce_rate'
Hypothesis (two-sided t-test): Hโ‚€ โ€” no difference in mean vs Hโ‚ โ€” means differ.
- control   : 0.5534
- treatment : 0.5733
- Difference   : +0.0198
- P-value (t-test): 0.2849
๐ŸŸก No statistically significant change โ€” guardrail looks stable.

๐Ÿ”„ CUPED

๐Ÿ“– Click to Expand

CUPED (Controlled-experiment Using Pre-Experiment Data): a statistical adjustment that uses pre-experiment behavior to reduce variance and improve power. It helps detect smaller effects without increasing sample size.

When to Use
  • You have reliable pre-experiment metrics (e.g., past spend, engagement)
  • You want to reduce variance and improve test sensitivity
  • Youโ€™re dealing with small lifts or costly sample sizes
How It Works
  1. Identify a pre-period metric correlated with your outcome
  2. Use regression to compute an adjustment (theta)
  3. Subtract the correlated component from your outcome metric
  4. Analyze the adjusted metric instead of the raw one
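The four steps can be sketched in a few lines (a minimal sketch of the CUPED math; the notebook's `apply_cuped` may differ in details):

```python
import numpy as np

def cuped_adjust(outcome, pre):
    # Step 2: theta is the OLS slope of outcome on the pre-period covariate
    theta = np.cov(pre, outcome, ddof=1)[0, 1] / np.var(pre, ddof=1)
    # Step 3: subtract the correlated component (centering preserves the mean)
    return outcome - theta * (pre - pre.mean())

rng = np.random.default_rng(1995)
pre = rng.normal(50, 10, 1_000)                  # pre-experiment metric
outcome = 0.8 * pre + rng.normal(0, 5, 1_000)    # outcome correlated with pre
adjusted = cuped_adjust(outcome, pre)
print(f"std before: {outcome.std(ddof=1):.2f}, after: {adjusted.std(ddof=1):.2f}")
```

The reduction in standard deviation is roughly 1 − sqrt(1 − ρ²) for correlation ρ between the pre-metric and the outcome, so a tiny reduction (like the 0.24% printed below) suggests the chosen pre-metric is barely correlated with the outcome.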
Inย [28]:
df = apply_cuped(
    df=df,
    pre_metric='past_purchase_count',
    outcome_metric_col=test_config['outcome_metric_col'],
    group_col='group',
    group_labels=test_config['group_labels']
)

df.head()
Out[28]:
group user_id engagement_score bounce_rate past_purchase_count platform device_type engagement_score_cuped_adjusted
521 control 522 31.390500 0.490186 25.729102 Android mobile 29.264534
737 control 738 27.941317 0.094631 51.861463 iOS mobile 23.656065
740 control 741 81.517858 0.531660 42.022993 Android desktop 78.045548
660 treatment 661 31.301273 0.564170 49.573649 Android mobile 27.205060
411 treatment 412 65.362110 0.572919 43.824528 iOS mobile 61.740941
Inย [29]:
# TODO: move these into apply_cuped()
original_std = df[test_config['outcome_metric_col']].std()
cuped_std = df[f"{test_config['outcome_metric_col']}_cuped_adjusted"].std()

print("Variance Reduction from CUPED")
print("--------------------------------")
print(f"Original std dev : {original_std:.3f}")
print(f"CUPED std dev    : {cuped_std:.3f}")
print(f"Reduction        : {(1 - cuped_std/original_std)*100:.2f}%")
Variance Reduction from CUPED
--------------------------------
Original std dev : 15.896
CUPED std dev    : 15.859
Reduction        : 0.24%
Inย [30]:
result_cuped = run_ab_test(
    df=df,
    group_col='group',
    metric_col=f"{test_config['outcome_metric_col']}_cuped_adjusted",
    group_labels=test_config['group_labels'],
    test_family=test_config['family'],
    variant=test_config.get('variant')
)

summarize_ab_test_result(result_cuped)
=============================================
๐Ÿงช A/B Test Result Summary [MANN_WHITNEY_U_TEST]
=============================================

๐Ÿ“Š Hypothesis Test Result
Test used: Mann-Whitney U test
U-statistic: 6790.0000
P-value    : 0.0002
โœ… Statistically significant difference detected.

๐Ÿ“‹ Group Summary:

n mean std sum
control 133.0 42.684917 14.179971 NaN
treatment 139.0 50.562556 16.462328 NaN
๐Ÿ“ˆ Lift Analysis
- Absolute Lift   : 7.8776
- Percentage Lift : 18.46%
- 95% CI for Lift : [4.2310, 11.5242]
=============================================

๐Ÿง  Correcting for Multiple Comparisons

๐Ÿ“– Why p-values can't always be trusted

When we test multiple segments, multiple metrics, or multiple variants, we increase the risk of false positives (Type I errors). This is known as the Multiple Comparisons Problem, and it is dangerous in data-driven decision-making.

๐Ÿ“‰ Example Scenario:ยถ

We run A/B tests on:

  • Overall population โœ…
  • By platform โœ…
  • By user tier โœ…
  • By region โœ…

If we test 10 hypotheses at 0.05 significance level, the chance of at least one false positive โ‰ˆ 40%.
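The ≈40% figure follows from the family-wise error rate, 1 − (1 − α)^m, for m independent tests:

```python
alpha, m = 0.05, 10
fwer = 1 - (1 - alpha) ** m  # probability of at least one false positive
print(f"P(at least one false positive) = {fwer:.1%}")  # → 40.1%
```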

โœ… Correction Methodsยถ
Method Use Case Risk
Bonferroni Very strict, controls Family-Wise Error Rate (FWER) โ„๏ธ Conservative
Benjamini-Hochberg Controls False Discovery Rate (FDR) ๐Ÿ”ฅ Balanced
๐Ÿง  In Practice:ยถ

We calculate raw p-values for each segment, and then apply corrections to get adjusted p-values.

If even the adjusted p-values are significant โ†’ result is robust.

โ„๏ธ Bonferroni Correctionยถ
๐Ÿ“– FWER Control (Click to Expand) Bonferroni is the most **conservative** correction method. It adjusts the p-value threshold by dividing it by the number of comparisons.
  • Formula: adjusted_alpha = alpha / num_tests
  • Or: adjusted_p = p * num_tests
  • If even one adjusted p-value < 0.05, itโ€™s very likely real

๐Ÿ“Œ Best for: High-risk decisions (e.g., medical trials, irreversible launches)
โš ๏ธ Drawback: May miss true positives (higher Type II error)

๐Ÿ”ฌ Benjamini-Hochberg (BH) Procedureยถ
๐Ÿ“– FDR Control (Click to Expand)

BH controls the expected proportion of false discoveries (i.e., false positives among all positives). It:

  • Ranks p-values from smallest to largest
  • Compares each to (i/m) * alpha, where:
    • i = rank
    • m = total number of tests

๐Ÿง  Important: After adjustment, BH enforces monotonicity by capping earlier (smaller) ranks to not exceed later ones.

In simple terms: adjusted p-values can only decrease as rank increases.

The largest p-value that satisfies this inequality becomes the threshold โ€” all smaller p-values are considered significant.

๐Ÿ“Œ Best for: Exploratory research, product experiments with many segments
๐Ÿ’ก Advantage: More power than Bonferroni, still controls errors

Inย [31]:
# Original inputs
segment_names = ['North', 'South', 'East', 'West']
p_vals = [0.03, 0.06, 0.02, 0.10]

# Create DataFrame and sort by raw p-values BEFORE correction
df_pvalues = pd.DataFrame({
    'Segment': segment_names,
    'Raw_pValue': p_vals
}).sort_values('Raw_pValue').reset_index(drop=True)

# Apply corrections to the sorted p-values
_, bonf, _, _ = multipletests(df_pvalues['Raw_pValue'], alpha=0.05, method='bonferroni')
_, bh, _, _ = multipletests(df_pvalues['Raw_pValue'], alpha=0.05, method='fdr_bh')

# Add to DataFrame
df_pvalues['Bonferroni_Adj_pValue'] = bonf
df_pvalues['BH_Adj_pValue'] = bh
df_pvalues

# TODO: decision from p-value?
Out[31]:
Segment Raw_pValue Bonferroni_Adj_pValue BH_Adj_pValue
0 East 0.02 0.08 0.06
1 North 0.03 0.12 0.06
2 South 0.06 0.24 0.08
3 West 0.10 0.40 0.10
Inย [32]:
#TODO: club with earlier cell, function possible?

# Plot p values - raw and adjusted
plt.figure(figsize=(8, 5))

# Plot lines
plt.plot(df_pvalues.index + 1, df_pvalues['Raw_pValue'], marker='o', label='Raw p-value')
plt.plot(df_pvalues.index + 1, df_pvalues['Bonferroni_Adj_pValue'], marker='^', label='Bonferroni Adj p-value')
plt.plot(df_pvalues.index + 1, df_pvalues['BH_Adj_pValue'], marker='s', label='BH Adj p-value')

# Add value labels next to each point
for i in range(len(df_pvalues)):
    x = i + 1
    plt.text(x + 0.05, df_pvalues['Raw_pValue'][i], f"{df_pvalues['Raw_pValue'][i]:.2f}", va='center')
    plt.text(x + 0.05, df_pvalues['Bonferroni_Adj_pValue'][i], f"{df_pvalues['Bonferroni_Adj_pValue'][i]:.2f}", va='center')
    plt.text(x + 0.05, df_pvalues['BH_Adj_pValue'][i], f"{df_pvalues['BH_Adj_pValue'][i]:.2f}", va='center')

# Axis & labels
plt.xticks(df_pvalues.index + 1, df_pvalues['Segment']);
plt.axhline(0.05, color='gray', linestyle='--', label='ฮฑ = 0.05');
plt.xlabel("Segment (Ranked by Significance)");
plt.ylabel("p-value");
plt.title("p-value Correction: Bonferroni vs Benjamini Hochberg (FDR)");
plt.legend();
plt.tight_layout();
plt.show();

๐Ÿช„ Novelty Effects & Behavioral Decay

๐Ÿ“– Why First Impressions Might Lie (Click to Expand)
๐Ÿช„ Novelty Effects & Behavioral Decayยถ

Even if an A/B test shows a statistically significant lift, that improvement may not last.

This often happens due to novelty effects โ€” short-term spikes in engagement driven by:

  • Curiosity (โ€œWhatโ€™s this new feature?โ€)
  • Surprise (โ€œThis looks different!โ€)
  • Visual attention (e.g., placement or color changes)
๐Ÿ“‰ Common Signs of Novelty Effectsยถ
  • Strong lift in week 1 โ†’ drops by week 3.
  • High initial usage โ†’ no long-term retention.
  • Positive metrics in one segment only (e.g., โ€œnew usersโ€).
๐Ÿงญ What We Do About Itยถ

To address this risk during rollouts:

  • โœ… Monitor metrics over time post-launch (e.g., 7, 14, 28-day retention)
  • โœ… Compare results across early adopters vs late adopters
  • โœ… Run holdout experiments during phased rollout to detect fading impact

๐ŸŽฏ Primacy Effect & Order Bias

๐Ÿ“– When First = Best (Click to Expand)

Sometimes, the position of a variant or option can distort results โ€” especially if it's shown first. This is called the primacy effect, a type of cognitive bias.

It often shows up in:

  • Feed ranking or content ordering experiments
  • Option selection (e.g., first dropdown item)
  • Surveys or in-app prompts
๐Ÿšฉ Common Indicatorsยถ
  • Variant A always performs better regardless of content
  • Metrics drop when position is swapped
  • Discrepancy between test and real-world usage
๐Ÿงญ What We Do About Itยถ

To minimize primacy bias:

  • โœ… Randomize order of options or content
  • โœ… Use position-aware metrics (e.g., click-through by slot)
  • โœ… Validate with follow-up tests using rotated or reversed orders

๐ŸŽฒ Rollout Simulation

๐Ÿ“– Click to Expand

Once statistical significance is established, it's useful to simulate business impact from full rollout.

Assume full exposure to eligible daily traffic, and estimate incremental impact from the observed lift.

This helps stakeholders understand the real-world benefit of implementing the change.

We typically estimate:

  • ๐Ÿ“ˆ Daily lift (e.g., additional conversions, dollars, sessions)
  • ๐Ÿ“ˆ Monthly extrapolation (daily lift ร— 30)
Inย [33]:
# Derive daily volume from actual data
daily_traffic_estimate = df.shape[0]  # Assuming full traffic per day

simulate_rollout_impact(
    experiment_result=result,                         # Output from run_ab_test()
    daily_eligible_observations=daily_traffic_estimate,
    metric_unit=test_config['outcome_metric_col']     # Dynamic label like 'engagement_score' or 'revenue'
)
๐Ÿ“ฆ Rollout Simulation
- Outcome Metric      : engagement_score
- Observed Lift       : 8.0233 per unit
- Daily Eligible Units: 272
- Estimated Daily Impact   : 2,182 engagement_score/day
- Estimated Monthly Impact : 65,470 engagement_score/month

๐Ÿงช A/B Test Holdouts

๐Ÿ“– Why We Sometimes Don't Ship to 100% (Click to Expand)
๐Ÿงช A/B Test Holdoutsยถ

Even after a successful A/B test, we often maintain a small holdout group during rollout.

This helps us:

  • Track long-term impact beyond the experiment window.
  • Detect novelty fade or unexpected side effects.
  • Maintain a clean โ€œcontrolโ€ for system-wide benchmarking.
๐Ÿข Industry Practiceยถ
  • Common at large orgs like Facebook, where teams share a holdout pool for all feature launches.
  • Holdouts help leadership evaluate true impact during performance reviews and roadmap planning.
โš ๏ธ When We Skip Holdoutsยถ
  • Bug fixes or critical updates (e.g., spam, abuse, policy violations).
  • Sensitive changes like content filtering (e.g., child safety flags).

๐Ÿšซ Limits & Alteratives

๐Ÿ“‰ When Not to A/B Test & What to Do Instead (Click to Expand)
๐Ÿ™…โ€โ™€๏ธ When Not to A/B Testยถ
  • Lack of infrastructure โ†’ No tracking, engineering, or experiment setup.
  • Lack of impact โ†’ Not worth the effort if the feature has minimal upside, shipping features has downstream implications (support, bugs, operations)..
  • Lack of traffic โ†’ Canโ€™t reach stat sig in a reasonable time.
  • Lack of conviction โ†’ No strong hypothesis; testing dozens of variants blindly.
  • Lack of isolation โ†’ Hard to contain exposure (e.g., testing a new logo everyone sees).
๐Ÿงช Alternatives & Edge Casesยถ
  • Use user interviews or logs to gather directional signals.
  • Leverage retrospective data for pre/post comparisons.
  • Consider sequential testing or soft rollouts for low-risk changes.
  • Use design experiments (e.g., multivariate, observational) when randomization isn't feasible.

Back to the top