AB Testing
- Environment Setup
- Experiment Setup
- Central Control Panel
- Read/Generate Data
- Setup Inputs + Config Values
- Baseline Estimation from Data
- Minimum Detectable Effect
- Test Family
- Required Sample Size
- Summaries
- Visualization
- 95% Confidence Intervals
- Lift Analysis
- Final Conclusion
- How Long
- Segmented Lift
- Guardrail Metrics
- CUPED
- Correcting for Multiple Comparisons
- Novelty Effects & Behavioral Decay
- Primacy Effect & Order Bias
- Rollout Simulation
- A/B Test Holdouts
- Limits & Alternatives
# Display Settings
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
from IPython.display import display, HTML
# Set Seed
my_seed=1995
# Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import json
from scipy import stats
from scipy.stats import (
    ttest_ind,
    ttest_rel,
    chi2_contingency,
    mannwhitneyu,
    levene,
    shapiro
)
import statsmodels.api as sm
from statsmodels.stats.power import (
    TTestIndPower,
    TTestPower,
    FTestAnovaPower,
    NormalIndPower
)
from statsmodels.stats.multitest import multipletests
from sklearn.model_selection import train_test_split
from ab_utils_01_data_setup import *
from ab_utils_02_power_analysis import *
from ab_utils_03_randomization import *
from ab_utils_04_aa_testing import *
from ab_utils_05_ab_testing import *
from ab_utils_06_post_hoc import *
import sys
sys.path.insert(0, '..')
from Hypothesis_Testing.ht_utils import print_config_summary
# 1. Main outcome variable you're testing
outcome_metric_col = 'engagement_score'
# 2. Metric type: 'binary', 'continuous', or 'categorical'
outcome_metric_datatype = 'continuous'
# 3. Group assignment (to be generated)
group_labels = ('control', 'treatment')
# 3b. Number of groups in the experiment (e.g., 2 for A/B test, 3 for A/B/C test)
group_count = len(group_labels)
# 4. Experimental design variant: independent or paired
variant = 'independent' # Options: 'independent' (supported), 'paired' (not supported yet, todo)
# 5. Optional: Unique identifier for each observation (can be user_id, session_id, etc.)
observation_id_col = 'user_id'
# 6. Optional: Pre-experiment metric for CUPED, if used
pre_experiment_metric = 'past_purchase_count' # Can be None
# Column name used to store assigned group after randomization
group_col = 'group'
# Randomization method to assign users to groups
# Options: 'simple', 'stratified', 'block', 'matched_pair', 'cluster', 'cuped'
randomization_method = "simple"
# Optional: guardrail metric column for simulated outcome data. Set to None to omit.
guardrail_metric_col = 'bounce_rate'
test_config = {
    # Core experiment setup
    'outcome_metric_col'     : outcome_metric_col,       # Main metric to analyze (e.g., 'engagement_score')
    'observation_id_col'     : observation_id_col,       # Unique identifier for each observation
    'pre_experiment_metric'  : pre_experiment_metric,    # Used for CUPED adjustment (if any)
    'outcome_metric_datatype': outcome_metric_datatype,  # One of: 'binary', 'continuous', 'categorical'
    'group_labels'           : group_labels,             # Tuple of (control, treatment) group names
    'group_count'            : group_count,              # Number of groups (usually 2 for A/B tests)
    'variant'                : variant,                  # 'independent' or 'paired'
    'guardrail_metric_col'   : guardrail_metric_col,     # Optional: e.g. 'bounce_rate'; None to omit
    # Diagnostic results: filled after EDA/assumptions check
    'normality'              : None,  # Will be set based on Shapiro-Wilk or visual tests
    'equal_variance'         : None,  # Will be set using Levene's/Bartlett's test
    'family'                 : None   # Test family: 'z_test', 't_test', 'anova', 'chi_square', etc.
}
print_config_summary(test_config)
Hypothesis Test Configuration Summary
Outcome Metric Col : engagement_score
Observation Id Col : user_id
Pre Experiment Metric : past_purchase_count
Outcome Metric Datatype : continuous
Group Labels : ('control', 'treatment')
Group Count : 2
Variant : independent
Guardrail Metric Col : bounce_rate
Normality : None
Equal Variance : None
Family : None
observations_count = 1000
df = create_dummy_ab_data(observations_count, seed=my_seed, outcome_metric_col=outcome_metric_col, guardrail_metric_col=guardrail_metric_col)
historical_df = create_historical_df(df, outcome_metric_col, guardrail_metric_col, seed=my_seed)
df
historical_df
| | user_id | engagement_score | bounce_rate | past_purchase_count | platform | device_type |
|---|---|---|---|---|---|---|
| 0 | 1 | NaN | NaN | 37.593666 | Android | mobile |
| 1 | 2 | NaN | NaN | 35.294211 | Android | mobile |
| 2 | 3 | NaN | NaN | 71.011905 | iOS | desktop |
| 3 | 4 | NaN | NaN | 35.351784 | Android | mobile |
| 4 | 5 | NaN | NaN | 58.179221 | Android | mobile |
| ... | ... | ... | ... | ... | ... | ... |
| 995 | 996 | NaN | NaN | 56.227450 | iOS | mobile |
| 996 | 997 | NaN | NaN | 57.966997 | iOS | mobile |
| 997 | 998 | NaN | NaN | 50.674093 | Android | mobile |
| 998 | 999 | NaN | NaN | 51.941265 | iOS | mobile |
| 999 | 1000 | NaN | NaN | 38.515887 | iOS | desktop |
1000 rows × 6 columns
| | user_id | engagement_score | bounce_rate | past_purchase_count | platform | device_type |
|---|---|---|---|---|---|---|
| 0 | 1 | 31.390500 | 0.560153 | 37.593666 | Android | mobile |
| 1 | 2 | 27.941317 | 0.547708 | 35.294211 | Android | mobile |
| 2 | 3 | 81.517858 | 0.452955 | 71.011905 | iOS | desktop |
| 3 | 4 | 28.027676 | 0.565883 | 35.351784 | Android | mobile |
| 4 | 5 | 62.268831 | 0.403212 | 58.179221 | Android | mobile |
| ... | ... | ... | ... | ... | ... | ... |
| 995 | 996 | 59.341175 | 0.622611 | 56.227450 | iOS | mobile |
| 996 | 997 | 61.950496 | 0.422644 | 57.966997 | iOS | mobile |
| 997 | 998 | 51.011140 | 0.544316 | 50.674093 | Android | mobile |
| 998 | 999 | 52.911898 | 0.440618 | 51.941265 | iOS | mobile |
| 999 | 1000 | 32.773830 | 0.442153 | 38.515887 | iOS | desktop |
1000 rows × 6 columns
Power Analysis
Power analysis helps determine the minimum sample size required to detect a true effect with statistical confidence.
Why It Matters:
- Avoids underpowered tests (risk of missing real effects)
- Balances tradeoffs between sample size, Minimum Detectable Effect (MDE), significance level (α), and statistical power (1 - β)
Key Inputs:
| Parameter | Meaning |
|---|---|
| alpha (α) | Significance level (probability of false positive), e.g. 0.05 |
| Power (1 - β) | Probability of detecting a true effect, e.g. 0.80 or 0.90 |
| Baseline | Current outcome (e.g., 10% conversion, $50 revenue) |
| MDE | Minimum detectable effect: the smallest meaningful lift (e.g., +2% or +$5) |
| Std Dev | Standard deviation of the metric (for continuous outcomes) |
| Effect Size | Optional: Cohen's d (for t-tests) or f (for ANOVA) |
| Groups | Number of groups (relevant for ANOVA) |
This notebook automatically selects the appropriate formula based on the test family stored in test_config['family'] and the design variant.
Setup Inputs + Config Values
These are the core experiment design parameters required for power analysis and statistical testing.
- alpha: Significance level, the tolerance for false positives (commonly set at 0.05).
- power: Probability of detecting a true effect, typically 0.80 or 0.90.
- group_labels: The names of the experimental groups (e.g., 'control', 'treatment').
- metric_col: Outcome metric column you're analyzing.
- test_family: Chosen statistical test (e.g., 't_test', 'z_test', 'chi_square') based on assumptions.
- variant: Experimental design structure, 'independent' or 'paired'.
These inputs drive sample size estimation, test choice, and downstream analysis logic.
# Define Core Inputs
# Use values from your config or plug in manually
alpha = 0.05 # False positive tolerance (Type I error)
power = 0.80 # Statistical power (1 - Type II error)
Baseline Estimation (Pre-Experiment)
Before running power analysis, we need a baseline estimate of the outcome metric.
- These values must come from historical data collected before the experiment.
- They represent the expected behavior of users under the current system (control condition).
- For binary metrics (e.g., conversion), the baseline is the historical conversion rate.
- For continuous metrics (e.g., revenue, engagement), we estimate the historical mean and standard deviation.
These estimates allow us to translate the Minimum Detectable Effect (MDE) into a statistical effect size and compute the required sample size.
⚠️ Baselines must be computed from pre-experiment data. Using outcome data from the experiment itself would introduce data leakage.
# Data-driven baseline metric from historical data (stored in test_config; only relevant keys are set per test family)
_b = compute_baseline_from_data(historical_df, test_config)
test_config['baseline_rate'] = _b['baseline_rate']
test_config['baseline_mean'] = _b['baseline_mean']
test_config['std_dev'] = _b['std_dev']
Baseline mean (historical): 49.16
Baseline std dev (historical): 14.64
Minimum Detectable Effect
Minimum Detectable Effect (MDE) is the smallest business-relevant difference you want your test to catch.
- It reflects what matters, not what the data happens to show
- Drives required sample size:
  - Smaller MDE → larger sample
  - Larger MDE → smaller sample
Choose an MDE based on:
- What level of uplift would justify launching the feature?
- What's a meaningful change in your metric, not just statistical noise?
# Minimum Detectable Effect (MDE)
# This is NOT data-driven; it reflects the minimum improvement you care about detecting.
# It should be small enough to catch valuable changes, but large enough to avoid inflating sample size.
# Examples by Metric Type:
# - Binary      : 0.02 → detect a 2-percentage-point lift in conversion rate (e.g., from 10% to 12%)
# - Categorical : 0.05 → detect a 5-percentage-point shift in plan preference (e.g., more users choosing 'premium' over 'basic')
# - Continuous  : 3.0  → detect a 3-point gain in engagement score (e.g., from 50 to 53 avg. score)
mde = 5 # TODO: Change this based on business relevance
Test Family
Selects the appropriate statistical test based on:
- Outcome data type (binary, continuous, categorical)
- Distributional assumptions (normality, variance)
- Number of groups and experiment structure (independent vs paired)
This step automatically maps to the correct test (e.g., t-test, z-test, chi-square, ANOVA).
Experiment Type → Test Family Mapping
| Outcome Metric | Normality | Group Count | Selected Test Family |
|---|---|---|---|
| binary | n/a | 2 | z_test |
| binary | n/a | 3+ | chi_square |
| continuous | Yes | 2 | t_test |
| continuous | Yes | 3+ | anova |
| continuous | No | 2 | non_parametric (Mann-Whitney U) |
| continuous | No | 3+ | non_parametric (Kruskal-Wallis) |
| categorical | n/a | 2 | chi_square |
| categorical | n/a | 3+ | chi_square |
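The mapping can be restated as a small lookup function. This is only a sketch: `pick_test_family` is a hypothetical name, not the notebook's `determine_test_family` (whose implementation is not shown), and treating an unknown normality flag (`None`) as "not normal" is an assumption chosen because the run in this notebook ends up with `mann_whitney_u_test` while `normality` is still `None`. The `kruskal_wallis_test` label is likewise illustrative.

```python
from typing import Optional


def pick_test_family(datatype: str, group_count: int, normal: Optional[bool] = None) -> str:
    """Hypothetical restatement of the mapping table above."""
    if datatype == "binary":
        return "z_test" if group_count == 2 else "chi_square"
    if datatype == "categorical":
        return "chi_square"
    if datatype == "continuous":
        if normal:
            return "t_test" if group_count == 2 else "anova"
        # Assumption: unknown normality (None) falls back to a non-parametric test.
        return "mann_whitney_u_test" if group_count == 2 else "kruskal_wallis_test"
    raise ValueError(f"Unsupported outcome datatype: {datatype!r}")


print(pick_test_family("continuous", 2))               # normality not yet assessed
print(pick_test_family("continuous", 3, normal=True))
print(pick_test_family("binary", 2))
```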
test_config['family'] = determine_test_family(test_config)
# test_config
print_config_summary(test_config)
print(f"✅ Selected test family: {test_config['family']}")
Hypothesis Test Configuration Summary
Outcome Metric Col : engagement_score
Observation Id Col : user_id
Pre Experiment Metric : past_purchase_count
Outcome Metric Datatype : continuous
Group Labels : ('control', 'treatment')
Group Count : 2
Variant : independent
Guardrail Metric Col : bounce_rate
Normality : None
Equal Variance : None
Family : mann_whitney_u_test
Baseline Rate : None
Baseline Mean : 49.16337003981812
Std Dev : 14.640142733622508
✅ Selected test family: mann_whitney_u_test
required_sample_size = calculate_power_sample_size(
    test_family=test_config['family'],
    variant=test_config.get('variant'),
    alpha=alpha,
    power=power,
    baseline_rate=test_config.get('baseline_rate'),
    mde=mde,
    std_dev=test_config.get('std_dev'),
    effect_size=None,  # Let it compute internally via mde/std
    num_groups=2
)
test_config['required_sample_size'] = required_sample_size
print(f"✅ Required sample size per group: {required_sample_size}")
print(f"Total sample size: {required_sample_size * 2}")
✅ Required sample size per group: 136
Total sample size: 272
print_power_summary(
    test_family=test_config['family'],
    variant=test_config.get('variant'),
    alpha=alpha,
    power=power,
    baseline_rate=test_config.get('baseline_rate'),
    mde=mde,
    std_dev=test_config.get('std_dev'),
    required_sample_size=required_sample_size
)
Power Analysis Summary
- Test: MANN_WHITNEY_U_TEST (independent)
- Significance level (α): 0.05
- Statistical power (1 - β): 0.8
- Std Dev (baseline): 14.64
- MDE (mean difference): 5
- Cohen's d: 0.34
✅ To detect a 5-unit lift in mean outcome, you need 136 users per group → total 272 users.
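As a rough cross-check of these numbers, the standard normal-approximation formula for two equal-sized independent groups, n = 2 · ((z₁₋α/₂ + z_power) / d)², can be evaluated with the standard library alone. This is a sketch, not the notebook's `calculate_power_sample_size` (its internals are not shown); `sample_size_per_group` is a hypothetical helper. The approximation lands at about 135 per group, slightly below the reported 136, which would be consistent with the helper applying a t-distribution or similar correction (an assumption).

```python
import math
from statistics import NormalDist


def sample_size_per_group(mde: float, std_dev: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Normal-approximation sample size per group for a two-sided, two-sample comparison."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = NormalDist().inv_cdf(power)          # quantile matching the desired power
    d = mde / std_dev                              # standardized effect size (Cohen's d)
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


baseline_std = 14.640142733622508  # historical std dev reported above
cohens_d = 5 / baseline_std
n_per_group = sample_size_per_group(mde=5, std_dev=baseline_std)
print(f"Cohen's d: {cohens_d:.2f}, approx. users per group: {n_per_group}")
```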
Randomization
Randomization is used to ensure that observed differences in outcome metrics are due to the experiment, not pre-existing differences.
- Prevents selection bias (e.g., users self-selecting into groups)
- Balances confounding factors like platform, region, or past behavior
- Enables valid inference through statistical testing
Simple Randomization
Each user is assigned to control or treatment with equal probability, independent of any characteristics.
When to Use:
- Sample size is large enough to ensure natural balance
- No strong concern about confounding variables
- Need a quick, default assignment strategy
How It Works:
- Assign each user randomly (e.g., 50/50 split)
- No grouping, segmentation, or blocking involved
- Groups are expected to balance out on average
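A minimal standard-library sketch of the idea follows. It is not the notebook's `apply_simple_randomization` (not shown here); `simple_randomize` and the 272-user range are illustrative assumptions.

```python
import random


def simple_randomize(user_ids, labels=("control", "treatment"), seed=1995):
    """Assign each user independently, with equal probability per label."""
    rng = random.Random(seed)
    return {uid: rng.choice(labels) for uid in user_ids}


assignment = simple_randomize(range(1, 273))
counts = {label: list(assignment.values()).count(label) for label in ("control", "treatment")}
print(counts)  # roughly, but not exactly, a 50/50 split
```

Because every assignment is an independent coin flip, the split is balanced only on average, which is why a sample ratio check is still worthwhile afterwards.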
Stratified Sampling
Ensures that key segments (e.g., platform, region) are evenly represented across control and treatment.
When to Use
- User base is naturally skewed (e.g., 70% mobile, 30% desktop)
- Important to control for known confounders like geography or device
- You want balance within subgroups, not just overall
How It Works
- Pick a stratification variable (e.g., platform)
- Split population into strata (groups)
- Randomly assign users within each stratum
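The three steps can be sketched without pandas as below. `stratified_randomize` is a hypothetical helper (the notebook's `apply_stratified_randomization` is not shown), and the 70/30 mobile/desktop population is made up for illustration.

```python
import random
from collections import defaultdict


def stratified_randomize(users, stratum_key, labels=("control", "treatment"), seed=1995):
    """Shuffle users within each stratum, then alternate labels for an even split per stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for user in users:
        strata[user[stratum_key]].append(user["user_id"])
    assignment = {}
    for ids in strata.values():
        rng.shuffle(ids)  # random order inside the stratum
        for position, uid in enumerate(ids):
            assignment[uid] = labels[position % len(labels)]
    return assignment


# Illustrative population: 70% mobile, 30% desktop
users = [{"user_id": i, "platform": "mobile" if i % 10 < 7 else "desktop"} for i in range(100)]
assignment = stratified_randomize(users, "platform")
mobile_groups = [assignment[u["user_id"]] for u in users if u["platform"] == "mobile"]
print(mobile_groups.count("control"), mobile_groups.count("treatment"))
```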
Block Randomization
Groups users into fixed-size blocks and randomly assigns groups within each block.
When to Use
- Users arrive in time-based batches (e.g., daily cohorts)
- Sample size is small and needs enforced balance
- You want to minimize temporal or ordering effects
How It Works
- Create blocks based on order or ID (e.g., every 10 users)
- Randomize assignments within each block
- Ensures near-equal split in every batch
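A short sketch of block assignment, assuming users are already in arrival order. `block_randomize` is a hypothetical name and not the notebook's `apply_block_randomization`.

```python
import random


def block_randomize(user_ids, block_size=10, labels=("control", "treatment"), seed=1995):
    """Within each consecutive block, shuffle a balanced pool of labels."""
    rng = random.Random(seed)
    ordered = list(user_ids)
    assignment = {}
    for start in range(0, len(ordered), block_size):
        block = ordered[start:start + block_size]
        # Balanced label pool for this block (near-equal if the block size is odd)
        pool = [labels[i % len(labels)] for i in range(len(block))]
        rng.shuffle(pool)
        assignment.update(zip(block, pool))
    return assignment


assignment = block_randomize(range(1, 41), block_size=10)
first_block = [assignment[uid] for uid in range(1, 11)]
print(first_block.count("control"), first_block.count("treatment"))
```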
Matched-Pair Randomization
Participants are paired based on similar characteristics before random group assignment. This reduces variance and improves statistical power by ensuring balance on key covariates.
When to Use
- Small sample size with high risk of confounding
- Outcomes influenced by user traits (e.g., age, income, tenure)
- Need to minimize variance across groups
How It Works
- Identify important covariates (e.g., age, purchase history)
- Sort users by those variables
- Create matched pairs (or small groups)
- Randomly assign one to control, the other to treatment
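The pairing steps can be sketched as follows, using a single sort variable as the matching covariate. `matched_pair_randomize` is a hypothetical helper rather than the notebook's `apply_matched_pair_randomization`, and the six sample values are illustrative.

```python
import random


def matched_pair_randomize(users, sort_key, labels=("control", "treatment"), seed=1995):
    """Sort by a covariate, pair adjacent users, and randomly split each pair."""
    rng = random.Random(seed)
    ordered = sorted(users, key=lambda u: u[sort_key])
    assignment = {}
    for i in range(0, len(ordered) - 1, 2):
        pair = [ordered[i]["user_id"], ordered[i + 1]["user_id"]]
        rng.shuffle(pair)  # coin flip decides who gets which label
        assignment[pair[0]] = labels[0]
        assignment[pair[1]] = labels[1]
    return assignment  # with an odd number of users, the last one stays unassigned


values = [37.6, 35.3, 71.0, 35.4, 58.2, 56.2]
users = [{"user_id": i, "past_purchase_count": v} for i, v in enumerate(values)]
assignment = matched_pair_randomize(users, "past_purchase_count")
print(assignment)
```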
Cluster Randomization
Entire groups or clusters (e.g., cities, stores, schools) are assigned to control or treatment. Used when it's impractical or risky to randomize individuals within a cluster.
When to Use
- Users naturally exist in groups (e.g., teams, locations, devices)
- There's a risk of interference between users (e.g., word-of-mouth)
- Operational or tech constraints prevent individual-level randomization
How It Works
- Define the cluster unit (e.g., store, city)
- Randomly assign each cluster to control or treatment
- All users within the cluster inherit the group assignment
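A minimal sketch of cluster-level assignment is shown below. `cluster_randomize` is hypothetical (the notebook's `apply_cluster_randomization` is not shown), and the `city` field is an illustrative cluster unit.

```python
import random


def cluster_randomize(users, cluster_key, labels=("control", "treatment"), seed=1995):
    """Randomize whole clusters; every user inherits the cluster's group."""
    rng = random.Random(seed)
    clusters = sorted({u[cluster_key] for u in users})  # sorted for reproducibility
    rng.shuffle(clusters)
    cluster_to_group = {c: labels[i % len(labels)] for i, c in enumerate(clusters)}
    return {u["user_id"]: cluster_to_group[u[cluster_key]] for u in users}


cities = ["Austin", "Boston", "Chicago", "Denver"]
users = [{"user_id": i, "city": cities[i % 4]} for i in range(20)]
assignment = cluster_randomize(users, "city")
print({u["city"]: assignment[u["user_id"]] for u in users})
```

Note that the effective sample size is closer to the number of clusters than the number of users, so analysis should account for within-cluster correlation.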
In this notebook we randomize the entire dataset (e.g., all 1000 simulated users). This is done purely for demonstration so that downstream analysis has enough data to run.
In real A/B experiments the workflow is different:
- Power analysis determines the required sample size (e.g., 136 users per group, 272 in total).
- Users are then randomized as they arrive in the product.
- The experiment continues until the required sample size is reached.
- Analysis is performed once the target sample size is collected.
In other words, real experiments do not randomize a fixed dataset upfront. Instead, randomization happens dynamically during the experiment as new users enter the system.
⚠️ In this notebook the dataset is simulated beforehand, so randomization is applied to all users at once. In production experimentation platforms (e.g., Optimizely, Statsig, internal experimentation systems), users are assigned to variants at runtime.
n_required = test_config['required_sample_size'] * test_config['group_count']
df = df.sample(n=n_required, random_state=42)
df.head()
| | user_id | engagement_score | bounce_rate | past_purchase_count | platform | device_type |
|---|---|---|---|---|---|---|
| 521 | 522 | NaN | NaN | 25.729102 | Android | mobile |
| 737 | 738 | NaN | NaN | 51.861463 | iOS | mobile |
| 740 | 741 | NaN | NaN | 42.022993 | Android | desktop |
| 660 | 661 | NaN | NaN | 49.573649 | Android | mobile |
| 411 | 412 | NaN | NaN | 43.824528 | iOS | mobile |
# Apply randomization method
if randomization_method == "simple":
    df = apply_simple_randomization(df, group_col=group_col, seed=my_seed)
elif randomization_method == "stratified":
    df = apply_stratified_randomization(df, stratify_col='platform', group_col=group_col, seed=my_seed)
elif randomization_method == "block":
    df = apply_block_randomization(df, observation_id_col='user_id', group_col=group_col, block_size=10, seed=my_seed)
elif randomization_method == "matched_pair":
    df = apply_matched_pair_randomization(df, sort_col=pre_experiment_metric, group_col=group_col, group_labels=test_config['group_labels'])
elif randomization_method == "cluster":
    df = apply_cluster_randomization(df, cluster_col='city', group_col=group_col, seed=my_seed)
else:
    raise ValueError(f"Unsupported randomization method: {randomization_method}")
# Randomization only assigns group. Outcome data is collected in the next section (Outcome data).
df
| | group | user_id | engagement_score | bounce_rate | past_purchase_count | platform | device_type |
|---|---|---|---|---|---|---|---|
| 521 | control | 522 | NaN | NaN | 25.729102 | Android | mobile |
| 737 | control | 738 | NaN | NaN | 51.861463 | iOS | mobile |
| 740 | control | 741 | NaN | NaN | 42.022993 | Android | desktop |
| 660 | treatment | 661 | NaN | NaN | 49.573649 | Android | mobile |
| 411 | treatment | 412 | NaN | NaN | 43.824528 | iOS | mobile |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 380 | treatment | 381 | NaN | NaN | 56.269061 | iOS | mobile |
| 631 | control | 632 | NaN | NaN | 61.199801 | iOS | desktop |
| 381 | control | 382 | NaN | NaN | 48.515995 | iOS | mobile |
| 490 | control | 491 | NaN | NaN | 55.112986 | Android | mobile |
| 118 | control | 119 | NaN | NaN | 61.520038 | iOS | desktop |
272 rows × 7 columns
When Randomization Assumptions Break
Most A/B tests assume the Stable Unit Treatment Value Assumption (SUTVA), meaning:
- A user's outcome depends only on their own treatment assignment.
- One unit's treatment does not influence another unit's outcome.
Why It Matters
If users in different groups interact:
- Control group behavior may be influenced by treatment group exposure.
- This biases your estimates and dilutes the treatment effect.
- Standard tests may fail to reject the null hypothesis because spillover shrinks the observed difference.
This assumption breaks down in experiments involving social behavior, multi-user platforms, or ecosystem effects.
Common Violation Scenarios
- Marketplace platforms (e.g., sellers and buyers interact)
- Social features (e.g., follows, likes, comments, feeds)
- Referrals / network effects (e.g., invites, rewards)
- Chat and collaboration tools (e.g., Slack, Teams)
Solutions (If You Suspect Interference)
| Strategy | Description |
|---|---|
| Cluster Randomization | Randomize at group level (e.g., friend group, region, org ID) |
| Isolation Experiments | Only roll out to fully disconnected segments (e.g., one region only) |
| Network-Based Metrics | Include network centrality / exposure as covariates |
| Post-Experiment Checks | Monitor if control group was exposed indirectly (e.g., referrals, shared UIs) |
| Simulation-Based Designs | Use agent-based or graph simulations to estimate contamination risk |
Sample Ratio Mismatch
Is group assignment balanced?
- SRM (Sample Ratio Mismatch) checks whether the observed group sizes match the expected ratio.
- In a perfect world, random assignment to 'A1' and 'A2' should give ~50/50 split.
- SRM helps catch bugs in randomization, data logging, or user eligibility filtering.
Real-World Experiment Split Ratios
| Scenario | Split | Why |
|---|---|---|
| Default A/B | 50 / 50 | Maximizes power and ensures fairness |
| Risky feature | 10 / 90 or 20 / 80 | Limits user exposure to minimize risk |
| Ramp-up | Step-wise (1-5-25-50…) | Gradual rollout to catch issues early |
| A/B/C Test | 33 / 33 / 33 or weighted | Compare multiple variants fairly or with bias |
| High control confidence needed | 70 / 30 or 60 / 40 | More stability in baseline comparisons |
check_sample_ratio_mismatch(df, group_col=group_col, group_labels=test_config['group_labels'], expected_ratios=[0.5, 0.5], alpha=0.05)
Sample Ratio Mismatch (SRM) Check
Group control: 133 users (48.90%) - Expected: 136.0
Group treatment: 139 users (51.10%) - Expected: 136.0
Chi2 Statistic: 0.1324
P-value : 0.7160
✅ No SRM: group sizes look balanced.
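The check above is a chi-square goodness-of-fit test, and for the two-group case it can be reproduced with the standard library. This sketch is not the notebook's `check_sample_ratio_mismatch`; `srm_check` is a hypothetical helper, and it relies on the fact that with one degree of freedom the chi-square tail probability equals `erfc(sqrt(chi2 / 2))`.

```python
import math


def srm_check(observed, expected_ratios, alpha=0.05):
    """Chi-square goodness-of-fit test for a two-group split (1 degree of freedom)."""
    if len(observed) != 2:
        raise ValueError("this sketch only handles exactly two groups")
    total = sum(observed)
    expected = [total * ratio for ratio in expected_ratios]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function for df = 1
    return chi2, p_value, p_value < alpha


chi2, p_value, srm_detected = srm_check([133, 139], [0.5, 0.5])
print(f"chi2={chi2:.4f}, p={p_value:.4f}, SRM detected: {srm_detected}")
```

With the observed counts of 133 and 139, the statistic and p-value agree with the values reported above. For three or more groups, a general chi-square implementation such as `scipy.stats.chisquare` would be needed instead.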
A/A Testing
A/A testing is a preliminary experiment where both groups (e.g., "control" and "treatment") receive the exact same experience. It's used to validate the experimental setup before running an actual A/B test.
What Are We Checking?
- Are users being assigned fairly and randomly?
- Are key outcome metrics statistically similar across groups?
- Can we trust the experimental framework?
Why A/A Testing Matters
- Validates Randomization: confirms the groups are balanced at baseline (no bias or leakage)
- Detects SRM (Sample Ratio Mismatch): ensures the actual split (e.g., 50/50) matches what was intended
- Estimates Variability: helps calibrate variance for accurate power calculations later
- Trust Check: catches bugs in assignment logic, event tracking, or instrumentation
A/A Test Process
- Randomly assign users into two equal groups, just like you would for an A/B test (e.g., control vs treatment)
- Measure key outcome, which depends on your experiment type:
  - binary → conversion rate
  - continuous → avg. revenue, time spent
  - categorical → feature adoption, plan selected
- Run statistical test:
  - binary → Z-test or Chi-square
  - continuous → t-test
  - categorical → Chi-square test
- Check SRM: use a chi-square goodness-of-fit test to detect assignment imbalances
Possible Outcomes
| Result | Interpretation |
|---|---|
| No significant difference | ✅ Randomization looks good. Test setup is sound. |
| Statistically significant difference | ⚠️ Something's off: check assignment logic, instrumentation, or sample leakage |
Run A/A tests whenever you launch a new experiment framework, roll out a new randomizer, or need to build stakeholder trust.
# Experiment has run; outcome and guardrail data come in (simulated via add_outcome_metrics). A/A: no treatment effect.
df = add_outcome_metrics(df, group_col=group_col, group_labels=test_config['group_labels'], outcome_metric_col=test_config['outcome_metric_col'], guardrail_metric_col=test_config.get('guardrail_metric_col') or guardrail_metric_col, treatment_effect=False, seed=my_seed)
df.head()
| | group | user_id | engagement_score | bounce_rate | past_purchase_count | platform | device_type |
|---|---|---|---|---|---|---|---|
| 521 | control | 522 | 31.390500 | 0.572531 | 25.729102 | Android | mobile |
| 737 | control | 738 | 27.941317 | 0.484713 | 51.861463 | iOS | mobile |
| 740 | control | 741 | 81.517858 | 0.710801 | 42.022993 | Android | desktop |
| 660 | treatment | 661 | 28.027676 | 0.389262 | 49.573649 | Android | mobile |
| 411 | treatment | 412 | 62.268831 | 0.610558 | 43.824528 | iOS | mobile |
Outcome Similarity Test
Compares the outcome metric across groups to ensure no significant differences exist when there shouldn't be any. It is usually used during A/A testing or pre-experiment validation.
- Helps detect setup issues like biased group assignment or data leakage.
- Null Hypothesis: No difference in outcomes between control and treatment.
- Uses the same statistical test as the main A/B test (e.g., t-test, z-test, chi-square).
df.groupby("group")[test_config['outcome_metric_col']].mean() # TODO, move this into function
# Outcome similarity test only.
_ = run_outcome_similarity_test(
    df=df,
    group_col='group',
    metric_col=test_config['outcome_metric_col'],
    test_family=test_config['family'],
    variant=test_config.get('variant'),
    group_labels=test_config['group_labels'],
    alpha=0.05,
    verbose=True
)
group
control      46.639234
treatment    49.714740
Name: engagement_score, dtype: float64
Outcome Similarity Check
Interpretation: Used a Mann-Whitney U test to compare medians of 'engagement_score' across groups (non-parametric).
Null Hypothesis: Distributions are identical across groups.
We use α = 0.05
p = 0.1748 ≥ α → Failed to reject null. No significant difference.
visualize_aa_distribution(
    df,
    group_col='group',
    metric_col=test_config['outcome_metric_col'],
    test_family=test_config['family'],
    group_labels=test_config['group_labels'],
    variant=test_config.get('variant')
)
Type I Error Simulation
Repeated A/A Tests
While a single A/A test helps detect obvious flaws in group assignment (like SRM or data leakage), it's still a one-off check. To gain confidence in your randomization method, we simulate multiple A/A tests using the same logic:
- Each run reassigns users randomly into control and treatment (with no actual change)
- We then run the statistical test between groups for each simulation
- We track how often the test reports a false positive (p < α), which estimates the Type I error rate
In theory, if your setup is unbiased and α = 0.05, you'd expect about 5% of simulations to return a significant result. This validates that your A/B framework isn't "trigger-happy."
What this tells you:
- Too many significant p-values → your framework is too noisy (bad randomization, poor test choice)
- Near 5% = healthy noise level, expected by design
This step is optional but highly recommended when you're:
- Trying out a new randomization strategy
- Validating an internal experimentation framework
- Stress-testing your end-to-end pipeline
_ = simulate_aa_type1_error_rate(
    df=df,
    metric_col=test_config['outcome_metric_col'],
    group_labels=test_config['group_labels'],
    test_family=test_config['family'],
    variant=test_config.get('variant'),
    runs=100,
    alpha=0.05
)
Type I Error Rate Estimate: 1/100 = 1.00%
Summary Interpretation:
We simulated 100 A/A experiments using random group assignment (no actual treatment).
Test: MANN_WHITNEY_U_TEST (independent)
Metric: engagement_score
Alpha: 0.05
False positives (p < α): 1 / 100
→ Estimated Type I Error Rate: 1.00%
This is within expected range for α = 0.05.
→ ✅ Test framework is behaving correctly: no bias or sensitivity inflation.
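The repeated A/A procedure can be sketched with the standard library as follows. This is not the notebook's `simulate_aa_type1_error_rate`: the helper names are hypothetical, the outcome values are synthetic draws using the historical mean and standard deviation reported earlier, and a large-sample two-sample z-test stands in for the Mann-Whitney U test to keep the example dependency-free.

```python
import math
import random
from statistics import NormalDist, mean, variance


def two_sample_p_value(a, b):
    """Two-sided p-value from a large-sample z-test on the difference in means."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_type1_rate(values, runs=500, alpha=0.05, seed=1995):
    """Repeatedly re-split the same outcomes at random and count false positives."""
    rng = random.Random(seed)
    values = list(values)
    half = len(values) // 2
    false_positives = 0
    for _ in range(runs):
        rng.shuffle(values)  # fresh random assignment, no treatment effect
        if two_sample_p_value(values[:half], values[half:]) < alpha:
            false_positives += 1
    return false_positives / runs


data_rng = random.Random(0)
outcomes = [data_rng.gauss(49.16, 14.64) for _ in range(272)]  # synthetic A/A outcomes
rate = simulate_type1_rate(outcomes)
print(f"Estimated Type I error rate: {rate:.1%}")
```

With only 100 runs, as in the cell above, the estimate is noisy (roughly plus or minus 4 percentage points around 5%), so more runs give a more reliable read.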
Pseudo A/A Test
A pseudo A/A test is an offline validation technique used when historical experiment data already exists. Instead of running a live A/A experiment, analysts take data from a completed experiment and temporarily treat both groups as if they received the same experience.
The goal is to check whether the experimentation analysis pipeline behaves correctly when there is no true treatment effect. By re-running the statistical test under the assumption that both groups are identical, we can verify that the analysis does not produce systematic false positives.
Pseudo A/A tests are commonly used to validate analysis code, experiment dashboards, or statistical tooling before running new experiments.
⚠️ Since pseudo A/A tests reuse historical experiment data rather than live traffic, they are only a diagnostic check and do not replace proper A/A validation in a production experimentation platform.
For test selection (e.g., Z-test, t-test), refer to the Hypothesis Testing Notebook
A/B Testing - Outcome Comparison
This section compares the outcome metric between control and treatment groups using the appropriate statistical test based on the experiment type.
Metric Tracked:
- Primary metric: Depends on use case:
  - Binary: Conversion rate (clicked or not)
  - Continuous: Average engagement, revenue, time spent
  - Categorical: Plan type, user tier, etc.
- Unit of analysis: Unique user or unique observation
Outcome Analysis Steps:
- Choose the right statistical test based on experiment_type:
  - 'binary' → Z-test for proportions
  - 'continuous_independent' → Two-sample t-test
  - 'continuous_paired' → Paired t-test
  - 'categorical' → Chi-square test of independence
- Calculate test statistics, p-values, and confidence intervals
- Visualize the comparison to aid interpretation
# Simulate experiment data (outcome + guardrail) with treatment effect for A/B analysis.
df = add_outcome_metrics(df, group_col=group_col, group_labels=test_config['group_labels'], outcome_metric_col=test_config['outcome_metric_col'], guardrail_metric_col=test_config.get('guardrail_metric_col') or guardrail_metric_col, treatment_effect=True, seed=my_seed)
df.head()
| | group | user_id | engagement_score | bounce_rate | past_purchase_count | platform | device_type |
|---|---|---|---|---|---|---|---|
| 521 | control | 522 | 31.390500 | 0.490186 | 25.729102 | Android | mobile |
| 737 | control | 738 | 27.941317 | 0.094631 | 51.861463 | iOS | mobile |
| 740 | control | 741 | 81.517858 | 0.531660 | 42.022993 | Android | desktop |
| 660 | treatment | 661 | 31.301273 | 0.564170 | 49.573649 | Android | mobile |
| 411 | treatment | 412 | 65.362110 | 0.572919 | 43.824528 | iOS | mobile |
result = run_ab_test(
    df=df,
    group_col='group',
    metric_col=test_config['outcome_metric_col'],
    group_labels=test_config['group_labels'],
    test_family=test_config['family'],
    variant=test_config.get('variant'),
    alpha=0.05
)
print_config_summary(result) # TODO: Print nested json properly
# result
Hypothesis Test Configuration Summary
Test Family : mann_whitney_u_test
Variant : independent
Group Labels : ('control', 'treatment')
Alpha : 0.05
Summary : {'control': {'n': 133, 'mean': np.float64(46.639233969426925), 'std': 14.203174151273695, 'sum': None}, 'treatment': {'n': 139, 'mean': np.float64(54.66250808310677), 'std': 16.47955206608026, 'sum': None}}
Test : Mann-Whitney U test
U Stat : 6750.0
P Value : 0.00012097959260162445
summarize_ab_test_result(result)
=============================================
A/B Test Result Summary [MANN_WHITNEY_U_TEST]
=============================================
Hypothesis Test Result
Test used: Mann-Whitney U test
U-statistic: 6750.0000
P-value : 0.0001
✅ Statistically significant difference detected.
Group Summary:
| | n | mean | std | sum |
|---|---|---|---|---|
| control | 133.0 | 46.639234 | 14.203174 | NaN |
| treatment | 139.0 | 54.662508 | 16.479552 | NaN |
Lift Analysis
- Absolute Lift : 8.0233
- Percentage Lift : 17.20%
- 95% CI for Lift : [4.3719, 11.6746]
=============================================
plot_ab_test_results(result)
Visualization:
95% Confidence Intervals for outcome in groups
- The 95% confidence interval gives a range in which we expect the true group value (mean or conversion rate, depending on the metric) to fall for each group.
- If the confidence intervals do not overlap, it's strong evidence that the difference is statistically significant.
- If they do overlap, it doesn't guarantee insignificance: you still need the p-value to decide, but it suggests caution when interpreting lift.
plot_confidence_intervals(result)
Lift Analysis
AKA 95% Confidence Intervals for (difference in outcomes)
This confidence interval helps quantify uncertainty around the observed lift between treatment and control groups. It answers:
- How large is the difference between groups?
- How confident are we in this lift estimate?
We compute a 95% CI for the difference in means (or proportions), not just for each group. If this interval does not include 0, we can reasonably trust there's a true difference. If it does include 0, the observed difference might be due to random chance.
This complements the p-value: while p-values tell us if the difference is significant, CIs tell us how big the effect is, and how uncertain we are.
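To make the interval for the difference concrete, here is a minimal sketch that uses the unpooled standard error and a normal approximation, fed with the group summary statistics reported earlier. The function name `diff_in_means_ci` is an assumption for illustration; it is not claimed to be how `compute_lift_confidence_interval` is implemented, although it lands close to the interval that function reports.

```python
from math import sqrt
from statistics import NormalDist


def diff_in_means_ci(mean_c, std_c, n_c, mean_t, std_t, n_t, confidence=0.95):
    """Normal-approximation CI for (treatment mean - control mean)."""
    diff = mean_t - mean_c
    se = sqrt(std_c**2 / n_c + std_t**2 / n_t)  # unpooled standard error
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, diff - z * se, diff + z * se


# Inputs copied from the group summary table above
lift, ci_low, ci_high = diff_in_means_ci(
    mean_c=46.639234, std_c=14.203174, n_c=133,
    mean_t=54.662508, std_t=16.479552, n_t=139,
)

print(f"lift = {lift:.4f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
print("CI excludes 0:", ci_low > 0 or ci_high < 0)
```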
compute_lift_confidence_interval(result)
=============================================
95% CI for Difference in Outcome [mann_whitney_u_test]
=============================================
- Absolute Lift (diff in means): 8.0233
- 95% CI for difference : [4.3719, 11.6746]
Likely positive impact (CI > 0)
=============================================
print_final_ab_test_summary(result)
========================================
FINAL A/B TEST SUMMARY
========================================
Control Avg outcome : 46.6392
Treatment Avg outcome : 54.6625
Absolute lift : 8.0233
Percentage lift : 17.20%
P-value (from Mann-Whitney U test): 0.0001
----------------------------------------
RESULT: Statistically significant difference detected.
========================================
How Long
The duration of an A/B test depends on how quickly you reach the required sample size per group, as estimated during your power analysis.
Key Inputs
- Daily volume of eligible observations (users, sessions, or orders, depending on your unit of analysis)
- Required sample size per group (from power analysis)
- Traffic split ratio (e.g., 50/50, 10/90, 33/33/33)
Formula
Test Duration (in days) = Required Sample Size per Group ÷ (Daily Eligible Observations × Group Split Proportion)
With an unequal split, the group with the smallest share fills slowest, so it sets the duration.
This ensures the experiment runs long enough to detect the expected effect with the desired confidence and power.
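The formula above can be sketched in a few lines. This is a hypothetical helper (`sketch_test_duration` is not one of the notebook's utilities) that rounds each group's runtime up to whole days, takes the slowest group, and adds buffer days. The first call uses the inputs this notebook uses: 136 per group, 1000 eligible users per day, a 50/50 split and 2 buffer days.

```python
from math import ceil


def sketch_test_duration(required_per_group, daily_eligible, allocation_ratios, buffer_days=0):
    """Days needed for the slowest-filling group, plus optional buffer days."""
    per_group_days = [
        ceil(required_per_group / (daily_eligible * ratio))
        for ratio in allocation_ratios
    ]
    longest = max(per_group_days)
    return {
        "per_group_days": per_group_days,
        "longest": longest,
        "total": longest + buffer_days,
    }


even_split = sketch_test_duration(136, 1000, (0.5, 0.5), buffer_days=2)
uneven_split = sketch_test_duration(136, 1000, (0.1, 0.9), buffer_days=2)

print(even_split)
print(uneven_split)
```

The uneven 10/90 split shows why the smallest group matters: it needs more days to collect the same sample.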
Planning Tips
- Estimate required sample size using power analysis (based on effect size, baseline, alpha, and power)
- Understand your traffic:
  - What's your average daily eligible traffic?
  - What unit of analysis is used (user, session, impression)?
- Apply group split:
  - e.g., for a 50/50 A/B test, each group gets 50% of traffic
- Estimate days using the formula above.
Real-World Considerations
- Ramp-Up Period
  Gradually increase traffic exposure: 5% → 25% → 50% → full traffic.
  Helps catch bugs, stability issues, and confounding edge cases early.
- Cool-Down Buffer
  Avoid ending tests on weekends, holidays, or during unusual traffic spikes.
  Add buffer days so your conclusions aren't skewed by anomalies.
- Trust Checks Before Analysis
  - A/A testing to verify setup
  - SRM checks to confirm user distribution
  - Monitor guardrail metrics (e.g., bounce rate, latency, load time)
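One common way to run the SRM check listed above is a chi-square goodness-of-fit test of observed group counts against the planned split. The sketch below uses `scipy.stats.chisquare`; the helper name `srm_check` and the strict `alpha=0.001` threshold are assumptions chosen for illustration, not values taken from this notebook's utilities. The counts are the group sizes reported in the test results (133 control, 139 treatment).

```python
from scipy.stats import chisquare


def srm_check(observed_counts, expected_ratios, alpha=0.001):
    """Chi-square goodness-of-fit test of observed counts vs. the planned split."""
    total = sum(observed_counts)
    expected = [total * ratio for ratio in expected_ratios]
    stat, p_value = chisquare(f_obs=observed_counts, f_exp=expected)
    return {
        "chi2": float(stat),
        "p_value": float(p_value),
        "srm_detected": bool(p_value < alpha),  # small p-value => split looks off
    }


srm = srm_check(observed_counts=[133, 139], expected_ratios=[0.5, 0.5])
print(srm)
```

A detected mismatch usually points to an assignment or logging problem, so it is worth resolving before reading any metric results.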
Common Practitioner Advice
"We calculate sample size using power analysis, then divide by daily traffic per group. But we always factor in buffer days for ramp-up, trust checks, and stability. Better safe than sorry."
"Power analysis is the starting point. But we don't blindly stop when we hit N. We monitor confidence intervals, metric stability, and coverage to make sure we're making decisions the business can trust."
Monitoring Dashboard
- Overall Test Health
- Start/end date, traffic ramp-up %, time remaining
- SRM (Sample Ratio Mismatch) indicator
- P-value and effect size summary (updated daily)
- Primary Metric Tracking
- Daily trends for primary outcome (conversion, revenue, etc.)
- Cumulative lift + confidence intervals
- Statistical significance tracker (p-value, test stat)
- Guardrail Metrics
- Bounce rate, load time, checkout errors, etc.
- Alert thresholds (e.g., +10% increase in latency)
- Trend vs baseline and prior experiments
- Segment Drilldowns
- Platform (iOS vs Android), geography, user tier
- Detect heterogeneous treatment effects
- Option to toggle test results per segment
- Cohort Coverage
- Total users assigned vs eligible
- Daily inclusion and exclusion trends
- Debugging filters (e.g., why user X didn't get assigned)
- Variance & Stability Checks
- Volatility of key metrics
- Pre vs post baseline comparisons
- Funnel conversion variance analysis
- Notes & Annotations
- Manual tagging of major incidents (e.g., bug fix deployed, pricing change)
- Timeline of changes affecting experiment interpretation
daily_eligible_users = 1000
allocation_ratios = (0.5, 0.5)
buffer_days = 2
test_duration_result = estimate_test_duration(
required_sample_size_per_group=test_config['required_sample_size'],
daily_eligible_users=daily_eligible_users,
allocation_ratios=allocation_ratios,
buffer_days=buffer_days,
test_family=test_config['family']
)
test_duration_result
Estimated Test Duration
- Test family : mann_whitney_u_test
- Required sample per group : 136
- Daily eligible traffic : 1000
- Allocation ratio : (0.5, 0.5)
- Longest group runtime : 1 days
- Buffer days : 2
Total estimated duration : 3 days
{'test_family': 'mann_whitney_u_test',
'per_group_days': [np.float64(1.0), np.float64(1.0)],
'longest_group_runtime': 1,
'recommended_total_duration': 3}
Post Hoc Analysis
After statistical significance, post-hoc analysis helps connect results to business confidence.
It's not just "did it work?" but how, for whom, and at what cost or benefit.
Why Post Hoc Analysis Matters
- Segments may respond differently: average lift may hide underperformance in subgroups
- Guardrails may show collateral damage (e.g., slower load time, higher churn)
- Stakeholders need impact translation: what does this mean in revenue, retention, or strategy?
Typical Post Hoc Questions
- Segment Lift
- Did certain platforms, geos, cohorts, or user types benefit more?
- Any negative lift in high-value user segments?
- Guardrail Checks
- Did the treatment impact non-primary metrics (e.g., latency, engagement, bounce rate)?
- Were alert thresholds breached?
- Business Impact Simulation
- How does the observed lift scale to 100% of eligible users?
- What's the projected change in conversions, revenue, or user satisfaction?
- Edge Case Discovery
- Any bugs, instrumentation gaps, or unexpected usage patterns?
- Did any user types get excluded disproportionately?
What to Report
| Area | What to Show |
|---|---|
| Segment Analysis | Table or chart showing lift per segment, sorted by effect size or risk |
| Guardrail Metrics | Summary table of guardrails vs baseline, with thresholds or annotations |
| Revenue Simulation | Projected uplift × traffic volume × conversion = business impact |
| Confidence Range | 95% CI for key metrics per segment (wherever possible) |
| Rollout Readiness | Any blockers, mitigations, or next steps if full rollout is considered |
Pro Tip
Even if your p-value says "yes," business rollout is a risk-based decision.
Post-hoc analysis is where statistical rigor meets product judgment.
Segmented Lift
Segmented lift tells us how different user segments responded to the treatment.
Why It Matters:
- Uncovers hidden heterogeneity: the overall average might mask variation across platforms, geographies, or user tiers.
- Identifies high-risk or high-reward cohorts: some segments might benefit more, while others could be negatively impacted.
- Guides rollout and targeting decisions: helps decide where to prioritize feature exposure, or where to mitigate risk.
Typical Segments:
- Device type (e.g., mobile vs desktop)
- Region (e.g., North vs South)
- User lifecycle (e.g., new vs returning)
- Platform (e.g., iOS vs Android)
"Segmentation answers who is benefiting (or suffering), not just whether it worked on average."
analyze_segment_lift(
df=df,
test_config=test_config,
segment_cols=['platform', 'device_type'], # , 'user_tier', 'region'
min_count_per_group=30,
visualize=True
)
Segmenting by: platform
| | platform | count_control | count_treatment | mean_control | mean_treatment | std_control | std_treatment | lift | p_value_lift |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Android | 56 | 48 | 44.881922 | 53.659185 | 13.900282 | 16.633431 | 8.777262 | None |
| 1 | iOS | 77 | 91 | 47.917279 | 55.191734 | 14.374087 | 16.465483 | 7.274455 | None |
Segmenting by: device_type
| | device_type | count_control | count_treatment | mean_control | mean_treatment | std_control | std_treatment | lift | p_value_lift |
|---|---|---|---|---|---|---|---|---|---|
| 0 | mobile | 87 | 106 | 46.029603 | 54.846641 | 13.205630 | 16.346295 | 8.817038 | None |
| 1 | desktop | 46 | 33 | 47.792232 | 54.071050 | 16.012127 | 17.144749 | 6.278819 | None |
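The `lift` column in tables like the ones above is the treatment mean minus the control mean within each segment. A minimal pandas sketch of that calculation follows; the `segment_lift` helper and the tiny hand-made `toy` dataset are purely illustrative assumptions and do not reproduce `analyze_segment_lift` or the notebook's data.

```python
import pandas as pd

# Tiny hand-made dataset for illustration only (not the notebook's data)
toy = pd.DataFrame({
    "group": ["control", "control", "treatment", "treatment",
              "control", "control", "treatment", "treatment"],
    "platform": ["iOS", "iOS", "iOS", "iOS",
                 "Android", "Android", "Android", "Android"],
    "engagement_score": [40.0, 50.0, 55.0, 65.0, 42.0, 46.0, 44.0, 48.0],
})


def segment_lift(df, segment_col, metric_col, group_col="group"):
    """Mean of the metric per segment and group, plus treatment - control lift."""
    means = df.pivot_table(index=segment_col, columns=group_col,
                           values=metric_col, aggfunc="mean")
    means["lift"] = means["treatment"] - means["control"]
    return means


lift_by_platform = segment_lift(toy, "platform", "engagement_score")
print(lift_by_platform)
```

In practice each segment also needs enough observations per group (the notebook's call uses `min_count_per_group=30`) before its lift is worth interpreting.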
Guardrail Metrics
Guardrail metrics are non-primary metrics tracked during an experiment to ensure the feature doesn't create unintended negative consequences.
We monitor them alongside the main success metric to:
- Catch regressions in user behavior or system performance
- Detect trade-offs (e.g., conversion ↑ but bounce rate ↑ too)
- Block rollouts if a feature does more harm than good
How We Check
- Run statistical tests on each guardrail metric just like we do for the primary metric
- Match the test to each guardrail metric's data type (binary, continuous, etc.)
- Report p-values and lift to assess significance and direction
- Focus more on risk detection than optimization
Common Guardrail Metrics
| Type | Examples |
|---|---|
| UX Health | Bounce Rate, Session Length, Engagement |
| Performance | Page Load Time, API Latency, CPU Usage |
| Reliability | Error Rate, Crash Rate, Timeout Errors |
| Behavioral | Scroll Depth, Page Views per Session |
When to Act
- If the treatment significantly worsens a guardrail metric → investigate
- If the primary metric improves but guardrails suffer, assess trade-offs
- Use p-values, lift, and domain context to guide decision-making
Why Guardrails Matter
"We don't just care if a metric moves; we care what else it moved. Guardrails give us confidence that improvements aren't hiding regressions elsewhere."
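A guardrail check of this kind can be sketched as a two-sided Welch t-test plus a direction check. Everything below is an assumption for illustration: the `guardrail_check` helper is hypothetical, the data is synthetic, and `run_guardrail_analysis` may differ in both test choice and output.

```python
import numpy as np
from scipy.stats import ttest_ind


def guardrail_check(control, treatment, alpha=0.05, higher_is_worse=True):
    """Flag a guardrail only if it moved significantly AND in the harmful direction."""
    _, p_value = ttest_ind(treatment, control, equal_var=False)  # Welch t-test
    diff = float(np.mean(treatment) - np.mean(control))
    worsened = diff > 0 if higher_is_worse else diff < 0
    return {
        "difference": diff,
        "p_value": float(p_value),
        "flag": bool(p_value < alpha and worsened),
    }


# Synthetic bounce rates for illustration only
rng = np.random.default_rng(1995)
control_bounce = rng.normal(loc=0.55, scale=0.15, size=200)

stable = guardrail_check(control_bounce, control_bounce)           # identical groups
regressed = guardrail_check(control_bounce, control_bounce + 0.2)  # clear regression

print("stable   :", stable)
print("regressed:", regressed)
```

Separating significance from direction keeps the check focused on risk detection: a significant improvement in a guardrail does not raise a flag.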
run_guardrail_analysis(df, test_config, group_col='group', alpha=0.05)
Guardrail Metric Check: 'bounce_rate'
Hypothesis (two-sided t-test): H0: no difference in mean vs H1: means differ.
- control : 0.5534
- treatment : 0.5733
- Difference : +0.0198
- P-value (t-test): 0.2849
No statistically significant change: guardrail looks stable.
CUPED
Controlled-experiment Using Pre-Experiment Data: a statistical adjustment that uses pre-experiment behavior to reduce variance and improve power. It helps detect smaller effects without increasing sample size.
When to Use
- You have reliable pre-experiment metrics (e.g., past spend, engagement)
- You want to reduce variance and improve test sensitivity
- You're dealing with small lifts or costly sample sizes
How It Works
- Identify a pre-period metric correlated with your outcome
- Use regression to compute an adjustment coefficient, theta = Cov(outcome, pre) / Var(pre)
- Subtract the correlated component from your outcome metric
- Analyze the adjusted metric instead of the raw one
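The steps above can be sketched with NumPy. This is the textbook form of the adjustment, which centers the pre-period metric so the overall mean of the outcome is preserved; it is an assumption that may not match what `apply_cuped` does internally, and the data here is synthetic.

```python
import numpy as np


def cuped_adjust(outcome, pre):
    """Return the CUPED-adjusted outcome and theta = Cov(outcome, pre) / Var(pre)."""
    theta = np.cov(outcome, pre, ddof=1)[0, 1] / np.var(pre, ddof=1)
    adjusted = outcome - theta * (pre - pre.mean())  # centering keeps the mean unchanged
    return adjusted, theta


# Synthetic data for illustration: outcome is strongly related to the pre-period metric
rng = np.random.default_rng(1995)
pre = rng.normal(50, 10, size=1000)
outcome = 0.8 * pre + rng.normal(0, 5, size=1000)

adjusted, theta = cuped_adjust(outcome, pre)

print(f"theta             : {theta:.3f}")
print(f"original std dev  : {outcome.std(ddof=1):.3f}")
print(f"adjusted std dev  : {adjusted.std(ddof=1):.3f}")
```

The variance reduction depends on how strongly the pre-period metric correlates with the outcome; a weakly correlated covariate yields little gain.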
df = apply_cuped(
df=df,
pre_metric='past_purchase_count',
outcome_metric_col=test_config['outcome_metric_col'],
group_col='group',
group_labels=test_config['group_labels']
)
df.head()
| | group | user_id | engagement_score | bounce_rate | past_purchase_count | platform | device_type | engagement_score_cuped_adjusted |
|---|---|---|---|---|---|---|---|---|
| 521 | control | 522 | 31.390500 | 0.490186 | 25.729102 | Android | mobile | 29.264534 |
| 737 | control | 738 | 27.941317 | 0.094631 | 51.861463 | iOS | mobile | 23.656065 |
| 740 | control | 741 | 81.517858 | 0.531660 | 42.022993 | Android | desktop | 78.045548 |
| 660 | treatment | 661 | 31.301273 | 0.564170 | 49.573649 | Android | mobile | 27.205060 |
| 411 | treatment | 412 | 65.362110 | 0.572919 | 43.824528 | iOS | mobile | 61.740941 |
# TODO: move these into apply_cuped()
original_std = df[test_config['outcome_metric_col']].std()
cuped_std = df[f"{test_config['outcome_metric_col']}_cuped_adjusted"].std()
print("Variance Reduction from CUPED")
print("--------------------------------")
print(f"Original std dev : {original_std:.3f}")
print(f"CUPED std dev : {cuped_std:.3f}")
print(f"Reduction : {(1 - cuped_std/original_std)*100:.2f}%")
Variance Reduction from CUPED
--------------------------------
Original std dev : 15.896
CUPED std dev : 15.859
Reduction : 0.24%
result_cuped = run_ab_test(
df=df,
group_col='group',
metric_col=f"{test_config['outcome_metric_col']}_cuped_adjusted",
group_labels=test_config['group_labels'],
test_family=test_config['family'],
variant=test_config.get('variant')
)
summarize_ab_test_result(result_cuped)
=============================================
A/B Test Result Summary [MANN_WHITNEY_U_TEST]
=============================================
Hypothesis Test Result
Test used: Mann-Whitney U test
U-statistic: 6790.0000
P-value : 0.0002
Statistically significant difference detected.
Group Summary:
| | n | mean | std | sum |
|---|---|---|---|---|
| control | 133.0 | 42.684917 | 14.179971 | NaN |
| treatment | 139.0 | 50.562556 | 16.462328 | NaN |
Lift Analysis
- Absolute Lift : 7.8776
- Percentage Lift : 18.46%
- 95% CI for Lift : [4.2310, 11.5242]
=============================================
Why p-values can't always be trusted
When we test multiple segments, multiple metrics, or multiple variants, we increase the risk of false positives (Type I errors). This is known as the Multiple Comparisons Problem, and it's dangerous in data-driven decision-making.
Example Scenario:
We run A/B tests on:
- Overall population
- By platform
- By user tier
- By region
If we test 10 independent hypotheses at the 0.05 significance level and no real effect exists, the chance of at least one false positive is 1 - 0.95^10 ≈ 40%.
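The 40% figure follows from the family-wise error rate for independent tests when every null hypothesis is true. A small sketch (the function name is chosen here for illustration):

```python
def familywise_error_rate(alpha, num_tests):
    """P(at least one false positive) across independent tests when every null is true."""
    return 1 - (1 - alpha) ** num_tests


for m in (1, 4, 10, 20):
    print(f"{m:>2} tests -> FWER = {familywise_error_rate(0.05, m):.3f}")
```

Correlated tests (for example, overlapping segments) inflate the error rate less than this independent-case figure, but the direction of the problem is the same.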
Correction Methods
| Method | Use Case | Risk |
|---|---|---|
| Bonferroni | Very strict, controls Family-Wise Error Rate (FWER) | Conservative |
| Benjamini-Hochberg | Controls False Discovery Rate (FDR) | Balanced |
In Practice:
We calculate raw p-values for each segment, and then apply corrections to get adjusted p-values.
If even the adjusted p-values are significant → result is robust.
Bonferroni Correction
FWER Control
Bonferroni is the most conservative correction method. It adjusts the p-value threshold by dividing it by the number of comparisons.
- Formula: adjusted_alpha = alpha / num_tests
- Or: adjusted_p = min(p * num_tests, 1)
- If an adjusted p-value is still below 0.05, the result stays significant even after accounting for every comparison
Best for: High-risk decisions (e.g., medical trials, irreversible launches)
Drawback: May miss true positives (higher Type II error)
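As a quick worked example of the formula, the sketch below applies the Bonferroni adjustment by hand to the four example p-values used in this notebook's correction demo. The helper name `bonferroni_adjust` is an assumption; in practice `multipletests(..., method='bonferroni')` does the same job.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(p * m, 1.0) for p in p_values]


raw_p = [0.02, 0.03, 0.06, 0.10]
bonf_adjusted = bonferroni_adjust(raw_p)
still_significant = [p < 0.05 for p in bonf_adjusted]

print([round(p, 2) for p in bonf_adjusted])
print(still_significant)
```

With four comparisons, a raw p-value must fall below 0.05 / 4 = 0.0125 to survive, so none of these examples remains significant.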
Benjamini-Hochberg (BH) Procedure
FDR Control
BH controls the expected proportion of false discoveries (i.e., false positives among all positives). It:
- Ranks p-values from smallest to largest
- Compares each to (i/m) * alpha, where i = rank and m = total number of tests
Important: When computing adjusted p-values, BH enforces monotonicity by capping earlier (smaller) ranks so they do not exceed later ones.
In simple terms: adjusted p-values never decrease as rank increases.
The largest p-value that satisfies this inequality becomes the threshold: it and all smaller p-values are considered significant.
Best for: Exploratory research, product experiments with many segments
Advantage: More power than Bonferroni, while still controlling the false discovery rate
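The ranking and monotonicity steps can be written out by hand with NumPy. This is a sketch for intuition (the helper name `bh_adjust` is an assumption), using the same four illustrative p-values as the notebook's demo; for real use, `multipletests(..., method='fdr_bh')` is the safer choice.

```python
import numpy as np


def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)                          # indices that sort p ascending
    scaled = p[order] * m / np.arange(1, m + 1)    # p_(i) * m / i for rank i
    # Enforce monotonicity: running minimum taken from the largest rank downward
    monotone = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(monotone, 0, 1)      # restore the original order
    return adjusted


segments = ["North", "South", "East", "West"]
bh_adjusted = bh_adjust([0.03, 0.06, 0.02, 0.10])

for name, p_adj in zip(segments, bh_adjusted):
    print(f"{name:<5} adjusted p = {p_adj:.2f}")
```

Here the smallest raw p-value (0.02, scaled to 0.08) is capped at 0.06 by the next rank, which is the monotonicity rule in action.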
# Original inputs
segment_names = ['North', 'South', 'East', 'West']
p_vals = [0.03, 0.06, 0.02, 0.10]
# Create DataFrame and sort by raw p-values BEFORE correction
df_pvalues = pd.DataFrame({
'Segment': segment_names,
'Raw_pValue': p_vals
}).sort_values('Raw_pValue').reset_index(drop=True)
# Apply corrections to the sorted p-values
_, bonf, _, _ = multipletests(df_pvalues['Raw_pValue'], alpha=0.05, method='bonferroni')
_, bh, _, _ = multipletests(df_pvalues['Raw_pValue'], alpha=0.05, method='fdr_bh')
# Add to DataFrame
df_pvalues['Bonferroni_Adj_pValue'] = bonf
df_pvalues['BH_Adj_pValue'] = bh
df_pvalues
# TODO: decision from p-value?
| | Segment | Raw_pValue | Bonferroni_Adj_pValue | BH_Adj_pValue |
|---|---|---|---|---|
| 0 | East | 0.02 | 0.08 | 0.06 |
| 1 | North | 0.03 | 0.12 | 0.06 |
| 2 | South | 0.06 | 0.24 | 0.08 |
| 3 | West | 0.10 | 0.40 | 0.10 |
#TODO: club with earlier cell, function possible?
# Plot p values - raw and adjusted
plt.figure(figsize=(8, 5))
# Plot lines
plt.plot(df_pvalues.index + 1, df_pvalues['Raw_pValue'], marker='o', label='Raw p-value')
plt.plot(df_pvalues.index + 1, df_pvalues['Bonferroni_Adj_pValue'], marker='^', label='Bonferroni Adj p-value')
plt.plot(df_pvalues.index + 1, df_pvalues['BH_Adj_pValue'], marker='s', label='BH Adj p-value')
# Add value labels next to each point
for i in range(len(df_pvalues)):
x = i + 1
plt.text(x + 0.05, df_pvalues['Raw_pValue'][i], f"{df_pvalues['Raw_pValue'][i]:.2f}", va='center')
plt.text(x + 0.05, df_pvalues['Bonferroni_Adj_pValue'][i], f"{df_pvalues['Bonferroni_Adj_pValue'][i]:.2f}", va='center')
plt.text(x + 0.05, df_pvalues['BH_Adj_pValue'][i], f"{df_pvalues['BH_Adj_pValue'][i]:.2f}", va='center')
# Axis & labels
plt.xticks(df_pvalues.index + 1, df_pvalues['Segment']);
plt.axhline(0.05, color='gray', linestyle='--', label='α = 0.05');
plt.xlabel("Segment (Ranked by Significance)");
plt.ylabel("p-value");
plt.title("p-value Correction: Bonferroni vs Benjamini Hochberg (FDR)");
plt.legend();
plt.tight_layout();
plt.show();
Novelty Effects & Behavioral Decay
Why First Impressions Might Lie
Even if an A/B test shows a statistically significant lift, that improvement may not last.
This often happens due to novelty effects: short-term spikes in engagement driven by:
- Curiosity ("What's this new feature?")
- Surprise ("This looks different!")
- Visual attention (e.g., placement or color changes)
Common Signs of Novelty Effects
- Strong lift in week 1 → drops by week 3.
- High initial usage → no long-term retention.
- Positive metrics in one segment only (e.g., "new users").
What We Do About It
To address this risk during rollouts:
- Monitor metrics over time post-launch (e.g., 7, 14, 28-day retention)
- Compare results across early adopters vs late adopters
- Run holdout experiments during phased rollout to detect fading impact
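One simple way to monitor for decay is to track lift per week and compare it with the first week. The sketch below uses a small hand-made table of weekly group means; the column names and numbers are illustrative assumptions, not results from this experiment.

```python
import pandas as pd

# Hand-made weekly group means for illustration only
weekly = pd.DataFrame({
    "week": [1, 1, 2, 2, 3, 3],
    "group": ["control", "treatment"] * 3,
    "mean_engagement": [50.0, 60.0, 50.0, 55.0, 50.0, 51.0],
})

lift_by_week = (
    weekly.pivot(index="week", columns="group", values="mean_engagement")
          .assign(lift=lambda t: t["treatment"] - t["control"])
)
# Share of the week-1 lift that remains in each later week
lift_by_week["share_of_week1_lift"] = lift_by_week["lift"] / lift_by_week["lift"].iloc[0]

print(lift_by_week)
```

A share that keeps shrinking toward zero is consistent with a novelty effect, although a proper check would also put confidence intervals around each weekly lift.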
Primacy Effect & Order Bias
When First = Best
Sometimes, the position of a variant or option can distort results, especially if it's shown first. This is called the primacy effect, a type of cognitive bias.
It often shows up in:
- Feed ranking or content ordering experiments
- Option selection (e.g., first dropdown item)
- Surveys or in-app prompts
Common Indicators
- The first-listed variant always performs better regardless of content
- Metrics drop when position is swapped
- Discrepancy between test and real-world usage
What We Do About It
To minimize primacy bias:
- Randomize order of options or content
- Use position-aware metrics (e.g., click-through by slot)
- Validate with follow-up tests using rotated or reversed orders
Rollout Simulation
Once statistical significance is established, it's useful to simulate business impact from full rollout.
Assume full exposure to eligible daily traffic, and estimate incremental impact from the observed lift.
This helps stakeholders understand the real-world benefit of implementing the change.
We typically estimate:
- Daily lift (e.g., additional conversions, dollars, sessions)
- Monthly extrapolation (daily lift × 30)
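The arithmetic behind this extrapolation is simple enough to sketch directly. The helper below is hypothetical (it is not `simulate_rollout_impact`), and it assumes the observed per-unit lift holds unchanged at full exposure. The inputs are this notebook's observed lift (8.0233) and its 272 observations treated as one day of traffic.

```python
def sketch_rollout_impact(lift_per_unit, daily_eligible_units, days_per_month=30):
    """Scale a per-unit lift to daily and monthly totals (naive linear extrapolation)."""
    daily = lift_per_unit * daily_eligible_units
    return {"daily": daily, "monthly": daily * days_per_month}


impact = sketch_rollout_impact(lift_per_unit=8.0233, daily_eligible_units=272)

print(f"daily   : {impact['daily']:,.0f}")
print(f"monthly : {impact['monthly']:,.0f}")
```

Feeding in the lower and upper bounds of the lift's confidence interval instead of the point estimate gives stakeholders a range rather than a single number.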
# Derive daily volume from actual data
daily_traffic_estimate = df.shape[0] # Assuming full traffic per day
simulate_rollout_impact(
experiment_result=result, # Output from run_ab_test()
daily_eligible_observations=daily_traffic_estimate,
metric_unit=test_config['outcome_metric_col'] # Dynamic label like 'engagement_score' or 'revenue'
)
Rollout Simulation
- Outcome Metric : engagement_score
- Observed Lift : 8.0233 per unit
- Daily Eligible Units: 272
- Estimated Daily Impact : 2,182 engagement_score/day
- Estimated Monthly Impact : 65,470 engagement_score/month
A/B Test Holdouts
Why We Sometimes Don't Ship to 100%
Even after a successful A/B test, we often maintain a small holdout group during rollout.
This helps us:
- Track long-term impact beyond the experiment window.
- Detect novelty fade or unexpected side effects.
- Maintain a clean "control" for system-wide benchmarking.
Industry Practice
- Common at large orgs like Facebook, where teams share a holdout pool for all feature launches.
- Holdouts help leadership evaluate true impact during performance reviews and roadmap planning.
When We Skip Holdouts
- Bug fixes or critical updates (e.g., spam, abuse, policy violations).
- Sensitive changes like content filtering (e.g., child safety flags).
Limits & Alternatives
When Not to A/B Test & What to Do Instead
When Not to A/B Test
- Lack of infrastructure: no tracking, engineering, or experiment setup.
- Lack of impact: not worth the effort if the feature has minimal upside, since shipping features has downstream implications (support, bugs, operations).
- Lack of traffic: can't reach statistical significance in a reasonable time.
- Lack of conviction: no strong hypothesis; testing dozens of variants blindly.
- Lack of isolation: hard to contain exposure (e.g., testing a new logo everyone sees).
Alternatives & Edge Cases
- Use user interviews or logs to gather directional signals.
- Leverage retrospective data for pre/post comparisons.
- Consider sequential testing or soft rollouts for low-risk changes.
- Use design experiments (e.g., multivariate, observational) when randomization isn't feasible.