
📖 Hypothesis Testing¶

🗂️ Data Setup

  • ⚙️ Define Test Configuration
  • 🧪 Generate Data from Config

📈 EDA

  • 👥 Group Count from Data
  • 📏 Sample Size from Data
  • 📊 Outcome Type from Data
  • 🔍 Check Distribution (Normality)
  • 📏 Check Variance Equality
  • 🔍 Visual Check

🛠️ Test Setup

  • 📏 Infer Parametric Flag
  • 🧪 Validate Configuration Dictionary

🧪 Hypothesis Testing

  • 🧭 Determine Test To Run
  • 🧠 Print Hypothesis Statement
  • 🧪 Run Hypothesis Test

🗂️ Data Setup¶

📖 Notebook User Guide (click to expand)
▶️ Run order¶
  • Run cells top to bottom.
  • After changing config, re-run from Define Test Configuration onward.
✅ Valid config¶
  • outcome_type: continuous | binary | categorical | count
  • group_count: one-sample | two-sample | multi-sample
  • group_relationship: independent | paired
  • ❌ Invalid examples: multi-sample + paired, one-sample + count, one-sample + categorical, paired + count
    → Validate Configuration will raise on invalid combos.
1️⃣ One-sample¶
  • Set config['population_mean'] = <value> (e.g. 0) in the config cell.
  • Do this before running Validate Configuration.
📊 Data¶
  • Use Generate Data from Config to create df, or load your own CSV.
  • Own data must match the expected data structure for your design.
🔄 Pipeline (in order)¶
  1. Config → Validate
  2. Generate data
  3. EDA (group count, sample size, outcome, distribution, variance, visuals)
  4. Validate again
  5. Determine test → Hypothesis statement → Run test

⚙️ Define Test Configuration¶

In [20]:
# Display Settings
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
from IPython.display import display, HTML
import warnings
import json
%load_ext autoreload
%autoreload 2

# UDF
my_seed = 1995
from ht_utils import *
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
In [ ]:
config = {
    # ----------------------------
    # Experiment Inputs
    # ----------------------------
    'outcome_type': 'continuous',        # 'continuous', 'binary', 'categorical', 'count'
    'group_relationship': 'independent', # 'independent', 'paired'. One-sample is independent by definition.
    'group_count': 'two-sample',         # 'one-sample', 'two-sample', 'multi-sample'
    'distribution': None,                # None, 'normal', 'non-normal'
    'variance_equal': None,              # None, 'equal', 'unequal'
    
    # For one-sample only: set reference value (e.g. 0) → uncomment and set before Validate:
    # 'population_mean': 0,
    
    # ----------------------------
    # Simulated Data
    # ----------------------------
    'sample_size': 100,                  # per group
    'effect_size': 0.5,                  # for generating synthetic difference
    
    # ----------------------------
    # Assumption Flags (Inferred later)
    # ----------------------------
    'tail_type': 'two-tailed',           # 'one-tailed', 'two-tailed'
    'parametric': None,                  # True or False → to be inferred later
    'alpha': 0.05,                       # significance level
    
    # ----------------------------
    # Test Decision
    # ----------------------------
    'test_name': None,                   # e.g. two_sample_ttest_welch. → to be inferred later
    'H_0': None,                         # Null Hypo → to be filled later   
    'H_a': None                          # Alt Hypo → to be filled later    
}
In [ ]:
print_config_summary(config)
📋 Hypothesis Test Configuration Summary

🔸 Outcome Type         : continuous
🔸 Group Relationship   : independent
🔸 Group Count          : two-sample
🔸 Distribution         : None
🔸 Variance Equal       : None
🔸 Tail Type            : two-tailed
🔸 Sample Size          : 100
🔸 Effect Size          : 0.5
🔸 Parametric           : None
🔸 Alpha                : 0.05
🔸 Test Name            : None
🔸 H 0                  : None
🔸 H A                  : None
In [23]:
validate_config(config)
✅ Config validated successfully.

🧪 Generate Data from Config¶

📐 Expected data structure (click to expand)

One-sample — one column: value

value
1.23
-0.45
2.10
0.67
1.88

Two-sample independent — columns: group, value

group value
A 0.52
A 1.11
B 1.03
B 0.89
A 0.41

Two-sample paired — columns: group_A, group_B (optional: id)

id group_A group_B
0 0.12 0.58
1 1.02 1.49
2 -0.33 0.21
3 0.77 1.22
4 0.45 0.91

Multi-sample independent — columns: group, value

group value
A 0.31
B 0.95
C 1.12
A 0.44
B 0.78

Column names must match exactly: value, group, group_A, group_B.

In [24]:
df = generate_data_from_config(config)
# df = pd.read_csv("hypothesis_testing_data.csv")
df
Out[24]:
group value
0 A -1.240633
1 A -1.470579
2 A 2.101191
3 A -1.464822
4 A 0.817922
... ... ...
195 B 0.816507
196 B 0.218070
197 B 0.151804
198 B 0.072379
199 B -0.048232

200 rows × 2 columns


Back to the top

📈 EDA¶

👥 Group Count from Data¶

📖 Group Count Check (Click to Expand)
🧠 What Does group_count Represent?¶

group_count determines the structural design of the hypothesis test:

  • One-sample → A single group compared to a reference value
  • Two-sample → Two groups being compared
  • Multi-sample → More than two independent groups

This controls which family of tests is eligible (t-test, ANOVA, etc.).


🔍 How Do We Infer It from Data?¶

The logic depends on the study structure:

✅ Independent Design¶

If the dataset contains a group column:

  • 1 unique group → one-sample
  • 2 unique groups → two-sample
  • More than 2 groups → multi-sample

If there is no group column, we assume a one-sample test.


🔁 Paired Design¶

If the dataset contains:

  • group_A and group_B columns

We treat this as:

→ two-sample (paired)

Paired multi-sample designs are not supported in this module.


⚠️ Why This Matters¶

The number of groups directly determines:

  • Whether we use t-test vs ANOVA
  • Whether variance assumptions are checked
  • Whether post-hoc testing might be required

Incorrect group structure leads to incorrect test selection.
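The rules above can be sketched in a few lines. This is a hypothetical reimplementation (`sketch_group_count` is not the `ht_utils` function, whose internals may differ):

```python
# Hypothetical sketch of the group-count rules described above.
import pandas as pd

def sketch_group_count(df: pd.DataFrame) -> str:
    cols = set(df.columns)
    if {"group_A", "group_B"} <= cols:   # paired layout → two-sample (paired)
        return "two-sample"
    if "group" in cols:
        k = df["group"].nunique()        # count distinct group labels
        if k == 1:
            return "one-sample"
        if k == 2:
            return "two-sample"
        return "multi-sample"
    return "one-sample"                  # no group column → one-sample

df_demo = pd.DataFrame({"group": ["A", "A", "B"], "value": [0.1, 0.2, 0.3]})
print(sketch_group_count(df_demo))  # two-sample
```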


In [25]:
config['group_count'] = infer_group_count_from_data(config, df)
🔄 Step: Infer Group Count from Dataset
📊 Detected 2 group(s) → group_count = 'two-sample'

📏 Sample Size from Data¶

📖 Sample Size Check (Click to Expand)
🧠 What Does sample_size Represent?¶

It depends on the study design:

  • One-sample test → Total number of observations
  • Two-sample (independent) → Per-group sample size
    • We use the minimum group size (conservative approach)
  • Two-sample (paired) → Number of paired observations
  • Multi-sample → Total number of rows (can be extended later)

⚖️ Why Use Minimum Group Size for Independent Tests?¶

If group sizes are unequal:

  • Many test assumptions behave best under balanced design.
  • Using the smallest group size is statistically conservative.
  • Prevents overstating effective sample size.
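A minimal sketch of the same rules (hypothetical; `infer_sample_size_from_data` in `ht_utils` may handle more cases):

```python
# Hypothetical sketch of the sample-size rules described above.
import pandas as pd

def sketch_sample_size(df: pd.DataFrame, group_relationship: str) -> int:
    cols = set(df.columns)
    if {"group_A", "group_B"} <= cols:                     # paired → number of pairs
        return len(df)
    if "group" in cols and group_relationship == "independent":
        return int(df.groupby("group").size().min())       # conservative: smallest group
    return len(df)                                         # one-sample / fallback: total rows

df_demo = pd.DataFrame({"group": ["A"] * 100 + ["B"] * 80,
                        "value": range(180)})
print(sketch_sample_size(df_demo, "independent"))  # 80
```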

In [26]:
config['sample_size'] = infer_sample_size_from_data(config, df)
📊 Synced sample_size → 100

📊 Outcome Type from Data¶

📖 Outcome Type Check (Click to Expand)
🧠 What Does outcome_type Represent?¶

outcome_type defines the nature of the dependent variable being tested.

It determines:

  • Which statistical tests are eligible
  • Whether normality assumptions apply
  • Whether variance equality is relevant
  • Whether parametric logic is meaningful

Correct classification is essential for correct test selection.


🔍 How Do We Infer It from Data?¶

We inspect:

  • Data type (dtype)
  • Number of unique values
  • Value patterns (e.g., only 0/1)

🟢 Binary¶

If values are only {0, 1} →
outcome_type = 'binary'

Used for:

  • Proportion tests
  • McNemar
  • Two-proportion z-test

🟣 Categorical¶

If values are strings or non-numeric categories →
outcome_type = 'categorical'

Used for:

  • Chi-square tests

🟡 Count¶

If values are:

  • Integers
  • Non-negative
  • More than two unique levels

→ outcome_type = 'count'

Used for:

  • Poisson-based tests

🔵 Continuous¶

If values are numeric (float or many unique numeric levels) →
outcome_type = 'continuous'

Used for:

  • t-tests
  • ANOVA
  • Non-parametric rank tests
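The classification rules above can be sketched as a single function (hypothetical; the real `infer_outcome_type_from_data` may treat edge cases differently):

```python
# Hypothetical sketch of the outcome-type rules described above.
import pandas as pd

def sketch_outcome_type(values: pd.Series) -> str:
    vals = values.dropna()
    if not pd.api.types.is_numeric_dtype(vals):
        return "categorical"                      # strings / non-numeric labels
    if set(vals.unique()) <= {0, 1}:
        return "binary"                           # only 0/1
    if (pd.api.types.is_integer_dtype(vals)
            and (vals >= 0).all()
            and vals.nunique() > 2):
        return "count"                            # non-negative integers, >2 levels
    return "continuous"                           # floats / many numeric levels

print(sketch_outcome_type(pd.Series([0, 1, 1, 0])))      # binary
print(sketch_outcome_type(pd.Series([0.5, 1.2, -0.3])))  # continuous
```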

In [27]:
config['outcome_type'] = infer_outcome_type_from_data(config, df)
🔄 Step: Infer Outcome Type from Dataset
📊 Inferred outcome_type → 'continuous'

🔍 Check Distribution (Normality)¶

📖 Normality Check (Click to Expand)
📘 Why Are We Checking for Normality?¶

Many hypothesis tests (like the t-test) rely on the assumption that the outcome variable is normally distributed.
This is particularly important when working with continuous outcome variables in small to moderate-sized samples.

🧪 Test Used: Shapiro-Wilk¶

The Shapiro-Wilk test checks whether the sample comes from a normal distribution.

  • Null Hypothesis (H₀): The data follows a normal distribution
  • Alternative Hypothesis (H₁): The data does not follow a normal distribution
🧠 Interpretation:¶
  • p > 0.05 → Fail to reject H₀ → The data is likely normally distributed ✅
  • p < 0.05 → Reject H₀ → The data is likely non-normal ⚠️

We check this per group, and only if the outcome variable is continuous.

❗Note:¶
  • No need to check normality for binary, categorical, or count data.
  • For paired tests, we assess normality on the differences between paired observations.
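A minimal per-group Shapiro-Wilk check, as described above. This is a sketch on synthetic data, not the notebook's `infer_distribution_from_data` (which adds logging and paired-difference handling):

```python
# Per-group Shapiro-Wilk normality check (sketch with synthetic data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1995)
groups = {"A": rng.normal(0.0, 1.0, 100),
          "B": rng.normal(0.5, 1.0, 100)}

alpha = 0.05
all_normal = True
for name, x in groups.items():
    stat, p = stats.shapiro(x)
    all_normal &= p > alpha          # fail to reject H0 → treat group as normal
    print(f"Group {name}: Shapiro-Wilk p = {p:.4f}")

distribution = "normal" if all_normal else "non-normal"
print(distribution)
```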
In [28]:
config['distribution'] = infer_distribution_from_data(config, df)
🔍 Step: Infer Distribution of Outcome Variable
📘 Checking if the outcome variable follows a normal distribution
   Using Shapiro-Wilk Test
   H₀: Data comes from a normal distribution
   H₁: Data does NOT come from a normal distribution

• Two-sample (independent) case → testing both groups
• Group A → Shapiro-Wilk p = 0.4534 → Fail to reject H₀ ✅ (likely a normal distribution)
• Group B → Shapiro-Wilk p = 0.7145 → Fail to reject H₀ ✅ (likely a normal distribution)
✅ Both groups are likely drawn from normal distributions
📦 Final Decision → config['distribution'] = `normal`
📖 Normality Check using KS Test (Click to Expand)
📘 Why Use the Kolmogorov–Smirnov Test?¶

The Kolmogorov–Smirnov (KS) test is another method to assess whether a sample follows a specified distribution — in this case, a normal distribution.

Unlike Shapiro-Wilk (which is specifically designed for normality),
the KS test compares the sample distribution to a theoretical normal distribution fitted using the sample’s mean and standard deviation.

🧪 Test Used: Kolmogorov–Smirnov (KS)¶

The KS test evaluates the maximum distance between:

  • The empirical distribution function (EDF) of the sample

  • The cumulative distribution function (CDF) of the fitted normal distribution

  • Null Hypothesis (H₀): The data follows a normal distribution

  • Alternative Hypothesis (H₁): The data does not follow a normal distribution

🧠 Interpretation:¶
  • p > 0.05 → Fail to reject H₀ → The data is likely normally distributed ✅
  • p < 0.05 → Reject H₀ → The data is likely non-normal ⚠️

We apply this test per group, and only if the outcome variable is continuous.

⚠️ Important Notes:¶
  • The KS test assumes the comparison distribution is fully specified.
    Since we estimate mean and standard deviation from the data, this is technically an approximation.
  • KS is generally less powerful than Shapiro-Wilk for detecting departures from normality in small samples.
  • For large samples, even minor deviations from normality may become statistically significant.
  • Visual diagnostics (like Q-Q plots) should complement formal tests.
❗Reminder:¶
  • No need to check normality for binary, categorical, or count data.
  • For paired tests, assess normality on the differences between paired observations.
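The KS variant described above, on synthetic data. Note the caveat: fitting the mean and standard deviation from the sample makes this an approximation. A sketch, not the notebook's `infer_distribution_from_data_ks`:

```python
# KS test against a normal distribution fitted from the sample (sketch).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1995)
x = rng.normal(loc=0.0, scale=1.0, size=100)

mu, sigma = x.mean(), x.std(ddof=1)
stat, p = stats.kstest(x, "norm", args=(mu, sigma))  # H0: sample ~ N(mu, sigma)
print(f"KS statistic = {stat:.4f}, p = {p:.4f}")
verdict = "normal" if p > 0.05 else "non-normal"
```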
In [29]:
# config['distribution'] = infer_distribution_from_data_ks(config, df)
# pretty_json(config_2)
# print_config_summary(config_2)
📖 Q-Q Plot for Normality (Click to Expand)
📘 Why Use a Q-Q Plot?¶

A Q-Q (Quantile–Quantile) plot visually compares the distribution of your sample data
to a theoretical normal distribution.

It helps identify:

  • Skewness
  • Heavy tails
  • Outliers
  • Systematic deviations from normality
🧪 How It Works¶

The plot compares:

  • Observed quantiles (from your data)
  • Expected quantiles (from a normal distribution)

If the data is normally distributed, the points should fall approximately along a straight line.

🧠 Interpretation:¶
  • Points closely follow the line → Data is likely normal ✅
  • Systematic curvature or tail deviations → Data may be non-normal ⚠️
❗Why Use This Alongside Statistical Tests?¶
  • Formal tests (Shapiro, KS) can be overly sensitive in large samples.
  • Q-Q plots help understand where deviations occur.
  • Mild deviations may not meaningfully impact parametric tests.

For paired tests, assess the distribution of differences.
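A quick Q-Q plot using `scipy.stats.probplot`, which also reports the correlation of the points with the reference line. This is a sketch on synthetic data; `qq_plot_normality` in `ht_utils` handles groups and paired differences:

```python
# Q-Q plot against a normal reference line (sketch with synthetic data).
import numpy as np
import matplotlib
matplotlib.use("Agg")                      # render off-screen
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1995)
x = rng.normal(size=100)

fig, ax = plt.subplots()
(osm, osr), (slope, intercept, r) = stats.probplot(x, dist="norm", plot=ax)
ax.set_title(f"Q-Q plot (line fit r = {r:.3f})")   # r near 1 → points hug the line
plt.close(fig)
print(f"Correlation with the reference line: r = {r:.3f}")
```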

In [30]:
qq_plot_normality(config, df)
📊 Step: Visual Normality Check using Q-Q Plot
If points fall approximately along the straight line → data is likely normal.

(Q-Q plot rendered here)

📏 Check Variance Equality¶

📖 Equal Variance Check (Click to Expand)
📘 Why Are We Checking for Equal Variance?¶

When comparing two independent groups using a parametric test (like a two-sample t-test),
we assume that both groups have equal variances — this is called the homogeneity of variance assumption.

Failing to meet this assumption can lead to incorrect conclusions if the wrong test is applied.


🧪 Test Used: Levene’s Test¶

Levene’s Test checks whether the spread (variance) of values is roughly the same in both groups.

  • Null Hypothesis (H₀): Variance in Group A = Variance in Group B
  • Alternative Hypothesis (H₁): Variances are different between the groups

🧠 Interpretation:¶
  • p > 0.05 → Fail to reject H₀ → Variances are likely equal ✅
  • p < 0.05 → Reject H₀ → Variances are likely unequal ⚠️

✅ When Should You Check This?¶
  • ✔️ Check only when:

    • You’re comparing two groups
    • The groups are independent
    • The outcome is continuous
  • ❌ Don’t check if:

    • The test is one-sample
    • The groups are paired (since variance of differences is what matters)
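Levene's test as described above, sketched on synthetic data (not the notebook's `infer_variance_equality`):

```python
# Levene's test for equal variance between two groups (sketch).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1995)
a = rng.normal(0.0, 1.0, 100)
b = rng.normal(0.5, 1.0, 100)

stat, p = stats.levene(a, b, center="median")   # median-centered → robust version
variance_equal = "equal" if p > 0.05 else "unequal"
print(f"Levene statistic = {stat:.4f}, p = {p:.4f} → {variance_equal}")
```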
In [31]:
config['variance_equal'] = infer_variance_equality(config, df)
📏 **Step: Infer Equality of Variance Across Groups**
📘 We're checking if the spread (variance) of the outcome variable is similar across groups A and B.
   This is important for choosing between a **pooled t-test** vs **Welch’s t-test**.
🔬 Test Used: Levene’s Test for Equal Variance
   H₀: Variance in Group A = Variance in Group B
   H₁: Variances are different

📊 Levene’s Test Result:
• Test Statistic = 1.0751
• p-value        = 0.3011
✅ Fail to reject H₀ → Variances appear equal across groups

📦 Final Decision → config['variance_equal'] = `equal`

📏 Alternative Variance Tests (FYI)¶

📖 Other Tests for Equal Variance (Click to Expand)
🧪 1️⃣ Bartlett’s Test¶
  • Assumes data is normally distributed
  • More powerful than Levene’s under strict normality
  • Sensitive to non-normal data

Use when:

  • Normality assumption is strongly satisfied.

🧪 2️⃣ Fligner–Killeen Test¶
  • Non-parametric
  • Does not assume normality
  • More robust to skewness and outliers

Use when:

  • Data is clearly non-normal
  • You want a distribution-free alternative.

📝 Note¶

Levene’s test is generally preferred in practice because it is more robust to non-normality.
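Both alternatives are available in scipy; here they are side by side on synthetic groups with deliberately unequal spread (an illustrative sketch):

```python
# Bartlett vs Fligner-Killeen on groups with clearly different variance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1995)
a = rng.normal(0.0, 1.0, 100)
b = rng.normal(0.0, 2.0, 100)          # deliberately wider spread

bt_stat, bt_p = stats.bartlett(a, b)   # assumes normality
fk_stat, fk_p = stats.fligner(a, b)    # distribution-free
print(f"Bartlett p = {bt_p:.4f}, Fligner-Killeen p = {fk_p:.4f}")
```

Both should flag the unequal spread here; they diverge mainly when the data is non-normal, where Bartlett over-rejects.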


🔍 Visual Check¶

📖 Distribution (Click to Expand)
In [32]:
visualize_distribution(config, df)
visualize_variance_boxplot_annotated(config, df)
📊 Step: Visual Distribution Overview (Side-by-Side)

(side-by-side distribution plots rendered here)
📊 Visual Check: Spread Comparison Between Groups
   (Spread = how much values vary within each group)

📋 Spread Summary:
         std_dev  variance    median       IQR
group
A       1.070873  1.146769 -0.146503  1.562018
B       0.964252  0.929781  0.317251  1.363176
(annotated boxplot rendered here)
🧠 Business Interpretation:
✅ The spread of values across groups is very similar.
   Variability does not appear meaningfully different.

Back to the top

🛠️ Test Setup¶

📖 Test Settings Explanation (Click to Expand)
📊 Test Type (test_type)¶

This setting defines the type of test you want to perform.

  • one_sample: Comparing the sample mean against a known value (e.g., a population mean).
  • two_sample: Comparing the means of two independent groups (e.g., A vs B).
  • paired: Comparing means from the same group at two different times (before vs after).
  • proportions: Comparing proportions (e.g., the conversion rates of two groups).

Example: You might want to test if the mean age of two groups of people (Group A and Group B) differs, or if the proportion of people who converted in each group is different.

📏 Tail Type (tail_type)¶

This setting determines whether you are performing a one-tailed or two-tailed test.

  • one_tailed: You are testing if the value is greater than or less than the reference value (directional).
  • two_tailed: You are testing if the value is different from the reference value, either higher or lower (non-directional).

Example:

  • One-tailed: Testing if new treatment increases sales (you only care if it's greater).
  • Two-tailed: Testing if there is any difference in sales between two treatments (it could be either an increase or decrease).
🧮 Parametric (parametric)¶

This setting indicates whether the test is parametric or non-parametric.

  • True (Parametric): This means we assume that the data follows a certain distribution, often a normal distribution. The most common parametric tests are t-tests and z-tests. Parametric tests are generally more powerful if the assumptions are met.
  • False (Non-Parametric): Non-parametric tests don’t assume any specific distribution. These are used when the data doesn’t follow a normal distribution or when the sample size is small. Examples include Mann-Whitney U (alternative to the t-test) and Wilcoxon Signed-Rank (alternative to paired t-test).

Why does this matter?
Parametric tests tend to be more powerful because they make assumptions about the distribution of the data (e.g., normality). Non-parametric tests are more flexible and can be used when these assumptions are not met, but they may be less powerful.

📊 Equal Variance (equal_variance)¶

This setting is used specifically for two-sample t-tests.

  • True: Assumes that the two groups have equal variances (i.e., the spread of data is the same in both groups). This is used for the pooled t-test.
  • False: Assumes the two groups have different variances. This is used for the Welch t-test, which is more robust when the assumption of equal variances is violated.

Why is this important?
If the variances are not equal, using a pooled t-test (which assumes equal variance) can lead to incorrect conclusions. The Welch t-test is safer when in doubt about the equality of variances.
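In scipy, this switch maps directly onto the `equal_var` argument of `ttest_ind` (pooled when `True`, Welch when `False`). A sketch on synthetic data; with equal group sizes the two statistics coincide, though the degrees of freedom (and hence p-values) differ:

```python
# Pooled vs Welch's two-sample t-test via scipy's equal_var flag.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1995)
a = rng.normal(0.0, 1.0, 100)
b = rng.normal(0.5, 1.0, 100)

t_pooled, p_pooled = stats.ttest_ind(a, b, equal_var=True)    # pooled t-test
t_welch,  p_welch  = stats.ttest_ind(a, b, equal_var=False)   # Welch's t-test
print(f"pooled: t = {t_pooled:.3f}, p = {p_pooled:.4f}")
print(f"Welch : t = {t_welch:.3f}, p = {p_welch:.4f}")
```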

🔑 Significance Level (alpha)¶

The alpha level is your threshold for statistical significance.

  • Commonly set at 0.05, this means that you are willing to accept a 5% chance of wrongly rejecting the null hypothesis (i.e., a 5% chance of a Type I error).
  • If the p-value (calculated from your test) is less than alpha, you reject the null hypothesis. If it's greater than alpha, you fail to reject the null hypothesis.

Example:

  • alpha = 0.05 means there’s a 5% risk of concluding that a treatment has an effect when it actually doesn’t.
🎯 Putting It All Together¶

For instance, let's say you're testing if a new feature (Group A) increases user engagement compared to the existing feature (Group B). Here’s how each configuration works together:

  • test_type = 'two_sample': You're comparing two independent groups (A vs B).
  • tail_type = 'two_tailed': You’re testing if there’s any difference (increase or decrease) in engagement.
  • parametric = True: You assume the data is normally distributed, so a t-test will be appropriate.
  • equal_variance = True: You assume the two groups have equal variance, so you’ll use a pooled t-test.
  • alpha = 0.05: You’re using a 5% significance level for your hypothesis test.

📏 Infer Parametric Flag¶

📖 Parametric vs Non-Parametric (Click to Expand)
📘 What Does "Parametric" Mean?¶

A parametric test assumes that the data follows a known distribution — typically a normal distribution.

These tests also often assume:

  • Equal variances between groups (for two-sample cases)
  • Independent samples

When those assumptions are met, parametric tests are more powerful (i.e., they detect real effects more easily).


🔁 What Happens If Assumptions Don’t Hold?¶

You should use a non-parametric test — these don’t rely on strong distributional assumptions and are more robust, especially for small sample sizes or skewed data.

Examples:

| Parametric Test | Non-Parametric Alternative |
| --- | --- |
| Two-sample t-test | Mann-Whitney U test |
| Paired t-test | Wilcoxon Signed-Rank test |
| ANOVA | Kruskal-Wallis test |

🧠 How We Decide Here¶

In our pipeline, a test is parametric only if:

  • The outcome variable is continuous
  • The data is normally distributed
  • The variance is equal, if applicable (or marked "NA" for paired designs)

If these aren’t all true, we default to a non-parametric test.
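The core of that rule fits in a few lines. This is a hypothetical sketch keyed off the same config fields; the notebook's `infer_parametric_flag` also consults `variance_equal` where applicable:

```python
# Hypothetical sketch of the parametric / non-parametric decision above.
def sketch_parametric_flag(config: dict) -> bool:
    # Continuous outcome + normal distribution → parametric;
    # anything else falls back to a non-parametric test.
    return (config.get("outcome_type") == "continuous"
            and config.get("distribution") == "normal")

print(sketch_parametric_flag({"outcome_type": "continuous",
                              "distribution": "normal"}))       # True
print(sketch_parametric_flag({"outcome_type": "continuous",
                              "distribution": "non-normal"}))   # False
```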

In [33]:
config['parametric'] = infer_parametric_flag(config)
📏 Step: Decide Between Parametric vs Non-Parametric Approach
🔍 Distribution of outcome = `normal`
✅ Normal distribution → Proceeding with a parametric test

📦 Final Decision → config['parametric'] = `True`

🧪 Validate Configuration Dictionary¶

In [34]:
# pretty_json(config)
print_config_summary(config)
📋 Hypothesis Test Configuration Summary

🔸 Outcome Type         : continuous
🔸 Group Relationship   : independent
🔸 Group Count          : two-sample
🔸 Distribution         : normal
🔸 Variance Equal       : equal
🔸 Tail Type            : two-tailed
🔸 Sample Size          : 100
🔸 Effect Size          : 0.5
🔸 Parametric           : True
🔸 Alpha                : 0.05
🔸 Test Name            : None
🔸 H 0                  : None
🔸 H A                  : None
In [35]:
validate_config(config)
✅ Config validated successfully.

Back to the top

🧪 Hypothesis Testing¶

🧭 Determine Test To Run¶

📖 How We Select the Right Statistical Test (Click to Expand)
🧠 What Are We Doing Here?¶

Based on all the configuration values you’ve either set or inferred (outcome_type, group_relationship, distribution, etc),
we determine which statistical test is most appropriate for your hypothesis.

This is the decision engine of the pipeline.


⚙️ How the Logic Works¶

We go through structured rules based on:

| Config Field | What it Affects |
| --- | --- |
| outcome_type | Binary / continuous / categorical |
| group_count | One-sample / two-sample / multi |
| group_relationship | Independent or paired |
| distribution | Normal or non-normal |
| variance_equal | Determines pooled vs Welch’s t-test |
| parametric | Whether to use parametric approach |

🧪 Example Mappings:¶
| Scenario | Selected Test |
| --- | --- |
| Continuous, 2 groups, normal, equal variance | Two-sample t-test (pooled) |
| Continuous, 2 groups, non-normal | Mann-Whitney U |
| Binary, 2 groups, independent | Proportions z-test |
| Continuous, paired, non-normal | Wilcoxon Signed-Rank |
| 3+ groups, categorical outcome | Chi-square test |
📖 Test Selection Matrix (Click to Expand)
| # | 💼 Example Business Problem | 📊 Outcome Variable Type | 📈 Outcome Distribution | 👥 Group Count | 🔗 Groups Type | ✅ Recommended Test | 📝 Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Is average order value different from $50? | Continuous | Normal | One-Sample | Not Applicable | One-sample t-test | - |
| 2 | Do users who saw recs spend more time on site? | Continuous | Normal | Two-Sample | Independent | Two-sample t-test | Use if variances are equal; Welch’s t-test if not |
| 3 | Did users spend more after redesign? | Continuous | Normal | Two-Sample | Paired | Paired t-test | Use only if paired differences are roughly Normal |
| 4 | Does time spent differ across A/B/C? | Continuous | Normal | Multi-Sample (3+) | Independent | ANOVA | Use Welch ANOVA if group variances differ |
| 5 | Is average order value different from $50 (skewed)? | Continuous | Non-Normal | One-Sample | Not Applicable | Wilcoxon Signed-Rank Test | Use when normality is violated; tests median. Sign Test is an alternative with fewer assumptions |
| 6 | Is revenue different between coupon A vs B? | Continuous | Non-Normal | Two-Sample | Independent | Mann-Whitney U test | Use when data is skewed or has outliers |
| 7 | Did time on site change (skewed)? | Continuous | Non-Normal | Two-Sample | Paired | Wilcoxon signed-rank test | For paired non-normal distributions |
| 8 | Does spend differ across segments? | Continuous | Non-Normal | Multi-Sample (3+) | Independent | Kruskal-Wallis test | Non-parametric version of ANOVA |
| 9 | Is conversion rate different from 10%? | Binary | Not Applicable | One-Sample | Not Applicable | One-proportion z-test | Use binomial exact test if sample size is small |
| 10 | Does new CTA improve conversion? | Binary | Not Applicable | Two-Sample | Independent | Proportions z-test | Use when counts are raw; chi-square for independence. Fisher’s Exact if expected counts are low |
| 11 | Do users convert more after badges? | Binary | Not Applicable | Two-Sample | Paired | McNemar’s test | Used for 2×2 paired binary outcomes |
| 12 | Do plan choices differ across layout options? | Categorical | Not Applicable | Multi-Sample (3+) | Independent | Chi-square test | Requires expected frequency ≥5 in each cell. Use Fisher’s Exact if assumption fails |
| 13 | Do users add more items to cart? | Count | Poisson | Two-Sample | Independent | Poisson / NB test | Use Negative Binomial if variance > mean |
| 14 | Is effect still significant after adjusting for device & region? | Any | Not Applicable | Any | Any | Regression (linear / logistic) | Use to control for covariates / confounders |
| 15 | What’s the probability that B beats A? | Any | Not Applicable | Two-Sample | Any | Bayesian A/B test | Posterior probability; no p-value |
| 16 | Is observed lift statistically rare? | Any | Not Applicable | Two-Sample | Any | Permutation / Bootstrap | Use when parametric assumptions are violated |
📖 Test Selection FlowChart (Click to Expand)
[What is your outcome variable type?]  
  |
  +--> 📏 Continuous
  |     |
  |     +--> Is the outcome distribution normal?
  |           |
  |           +--> ✅ Yes
  |           |     |
  |           |     +--> 👥 Group Count = One-Sample ----------> 🧪 One-sample t-test
  |           |     +--> 👥 Group Count = Two-Sample
  |           |     |           +-->  🔗 Groups Type = Independent
  |           |     |           |            |
  |           |     |           |            +--> Are variances equal?
  |           |     |           |            |
  |           |     |           |            +--> ✅ Yes ------> 🧪 Two-sample t-test (pooled)
  |           |     |           |            +--> ❌ No  ------> 🧪 Welch’s t-test
  |           |     |           +--> 🔗 Groups Type = Paired --> 🧪 Paired t-test
  |           |     +--> 👥 Group Count = Multi-Sample (3+)
  |           |                              |
  |           |                              +--> Are variances equal?
  |           |                              |
  |           |                              +--> ✅ Yes ------> 🧪 ANOVA
  |           |                              +--> ❌ No  ------> 🧪 Welch ANOVA
  |           |
  |           +--> ❌ No (Non-Normal)
  |                 |
  |                 +--> 👥 Group Count = One-Sample ----------> 🧪 Wilcoxon Signed-Rank Test
  |                 +--> 👥 Group Count = Two-Sample
  |                 |     +--> 🔗 Groups Type = Independent ---> 🧪 Mann-Whitney U Test
  |                 |     +--> 🔗 Groups Type = Paired --------> 🧪 Wilcoxon Signed-Rank Test
  |                 +--> 👥 Group Count = Multi-Sample (3+) ---> 🧪 Kruskal-Wallis Test
  |
  +--> ⚪ Binary
  |     |
  |     +--> 👥 Group Count = One-Sample -----------------------> 🧪 One-proportion z-test
  |     +--> 👥 Group Count = Two-Sample
  |     |     +--> 🔗 Groups Type = Independent ---------------> 🧪 Proportions z-test
  |     |     +--> 🔗 Groups Type = Paired --------------------> 🧪 McNemar’s Test
  |
  +--> 🟪 Categorical
  |     |
  |     +--> 👥 Group Count = Multi-Sample (3+) ---------------> 🧪 Chi-square Test
  |
  +--> 🔢 Count
  |     |
  |     +--> Distribution = Poisson
  |           +--> 👥 Group Count = Two-Sample ---------------> 🧪 Poisson or Negative Binomial Test
  |
  +--> 🧠 Any
        |
        +--> Want to control for covariates? -------------> 📉 Regression (Linear or Logistic)
        +--> Prefer probability over p-values? -----------> 📊 Bayesian A/B Test
        +--> Assumptions violated / custom metric? -------> 🔁 Permutation or Bootstrap
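The continuous branch of the flowchart can be sketched as a small decision function. This is a hypothetical condensed version: the test-name strings `two_sample_ttest_pooled` and `two_sample_ttest_welch` match the notebook's output, the other names and the function itself are illustrative, and `determine_test_to_run` covers all branches:

```python
# Hypothetical sketch of the decision engine (continuous branch only).
def sketch_determine_test(config: dict) -> str:
    if config["outcome_type"] != "continuous":
        return "see_full_matrix"            # binary/categorical/count paths omitted
    normal = config["distribution"] == "normal"
    count = config["group_count"]
    if count == "one-sample":
        return "one_sample_ttest" if normal else "wilcoxon_signed_rank"
    if count == "two-sample":
        if config["group_relationship"] == "paired":
            return "paired_ttest" if normal else "wilcoxon_signed_rank"
        if not normal:
            return "mann_whitney_u"
        return ("two_sample_ttest_pooled"
                if config["variance_equal"] == "equal"
                else "two_sample_ttest_welch")
    return "anova" if normal else "kruskal_wallis"   # multi-sample

demo = {"outcome_type": "continuous", "group_count": "two-sample",
        "group_relationship": "independent", "distribution": "normal",
        "variance_equal": "equal"}
print(sketch_determine_test(demo))  # two_sample_ttest_pooled
```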
In [36]:
config['test_name'] = determine_test_to_run(config)
🧭 Step: Determine Which Statistical Test to Use
📦 Inputs:
• Outcome Type         = `continuous`
• Group Count          = `two-sample`
• Group Relationship   = `independent`
• Distribution         = `normal`
• Equal Variance       = `equal`
• Parametric Flag      = `True`

🔍 Matching against known test cases...

✅ Selected Test: `two_sample_ttest_pooled`

🧠 Print Hypothesis Statement¶

📖 Hypothesis Structure & Interpretation (Click to Expand)
📘 What Are We Doing Here?¶

This step generates a formal hypothesis statement for the selected test.

Every statistical test is built around two competing ideas:

  • H₀ (Null Hypothesis) → The “status quo”. There is no effect, no difference, or no association.
  • H₁ (Alternative Hypothesis) → There is an effect, difference, or relationship.

🧪 Examples of Hypothesis Pairs¶
| Test Type | Null Hypothesis (H₀) | Alternative (H₁) |
| --- | --- | --- |
| One-sample t-test | The average value equals the reference value | The average value is different (or higher / lower) |
| Two-sample t-test | The average outcome is the same in both groups | The average outcome differs between the groups |
| Welch’s t-test | The average outcome is the same in both groups | The average outcome differs between the groups |
| Paired t-test | The average change between before and after is zero | The average change is not zero |
| ANOVA | The average outcome is the same across all groups | At least one group has a different average outcome |
| Welch ANOVA | The average outcome is the same across all groups | At least one group has a different average outcome |
| Wilcoxon Signed-Rank Test | The typical value equals the reference (or no typical change observed) | The typical value differs (or typical change exists) |
| Mann-Whitney U Test | The outcome pattern is the same in both groups | One group generally has higher values than the other |
| Kruskal-Wallis Test | The outcome pattern is the same across all groups | At least one group differs in outcome pattern |
| One-proportion z-test | The conversion rate equals the reference rate | The conversion rate differs from the reference rate |
| Proportions z-test | The conversion rate is the same in both groups | The conversion rate differs between groups |
| McNemar’s test | The proportion of “Yes” responses is the same before and after | The proportion changes after the intervention |
| Chi-square test | Category preferences are the same across groups | Category preferences differ across groups |
| Poisson test | The average event rate is the same in both groups | The event rate differs between groups |
| Negative Binomial test | The average event rate is the same in both groups | The event rate differs between groups |
| Regression (linear/logistic) | The variable being tested has no meaningful impact on the outcome | The variable being tested impacts the outcome |
| Bayesian A/B test | Version B is not more likely to outperform Version A | Version B is more likely to outperform Version A |
| Permutation / Bootstrap | The observed difference is consistent with random chance | The observed difference is unlikely due to random chance |
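Mechanically, this step is a lookup from the selected test name to an (H₀, H₁) pair. A hypothetical two-entry sketch (the first pair matches the notebook's output; `print_hypothesis_statement` holds the full mapping and tail-type handling):

```python
# Hypothetical sketch: map test name → (H0, H1) statement pair.
HYPOTHESES = {
    "two_sample_ttest_pooled": (
        "The outcome (mean/proportion) is the same across groups A and B.",
        "The outcome differs between groups."),
    "mann_whitney_u": (
        "The outcome pattern is the same in both groups.",
        "One group generally has higher values than the other."),
}

h0, h1 = HYPOTHESES["two_sample_ttest_pooled"]
print("H0:", h0)
print("H1:", h1)
```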

In [37]:
config['H_0'], config['H_a'] = print_hypothesis_statement(config)
🧠 Step: Generate Hypothesis Statement
🔍 Selected Test        : `two_sample_ttest_pooled`
🔍 Tail Type            : `two-tailed`

📜 Hypothesis Statement:
• H₀: The outcome (mean/proportion) is the same across groups A and B.
• H₁: The outcome differs between groups.

🧪 Run Hypothesis Test¶

📖 Running the Hypothesis Test (Click to Expand)
🧠 What Happens in This Step?¶

This function takes in your final config + dataset and executes the appropriate test — outputting:

  • The test statistic (e.g., t, z, chi², U, F)
  • The p-value
  • Whether the result is statistically significant
🧪 Interpreting the Output¶
| Field | What It Means |
| --- | --- |
| statistic | The test result (e.g., t-score, chi-square, etc.) |
| p_value | Probability of a result at least this extreme by chance, assuming H₀ is true |
| significant | True if p < alpha (reject H₀), else False |
| alpha | The pre-defined significance threshold (typically 0.05) |
📏 Significance Logic¶
  • If p < alpha → reject the null hypothesis
  • If p ≥ alpha → fail to reject the null
⚠️ Robustness¶

The function handles different test types:

  • Parametric (e.g., t-tests, ANOVA)
  • Non-parametric (e.g., Wilcoxon, Mann-Whitney)
  • Binary proportions (e.g., z-test, McNemar)
  • Multi-group (e.g., ANOVA, chi-square)
  • A graceful test_not_found fallback for unmatched configurations
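The significance logic itself is the same for every test. A sketch wrapping it around a pooled t-test on synthetic data (the `result` field names follow the table above; `run_hypothesis_test` adds test dispatch and reporting):

```python
# Significance decision around a single test's statistic and p-value (sketch).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1995)
a = rng.normal(0.0, 1.0, 100)
b = rng.normal(0.5, 1.0, 100)

alpha = 0.05
stat, p_value = stats.ttest_ind(a, b, equal_var=True)   # pooled two-sample t-test
result = {"statistic": round(float(stat), 4),
          "p_value": round(float(p_value), 4),
          "significant": bool(p_value < alpha),          # reject H0 iff p < alpha
          "alpha": alpha}
print(result)
```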
In [38]:
_ = run_hypothesis_test(config, df)
📜 Hypothesis Being Tested:
• H₀: The outcome (mean/proportion) is the same across groups A and B.
• H₁: The outcome differs between groups.


🧪 Step: Run Hypothesis Test
✅ Selected Test        : `two_sample_ttest_pooled`
🔍 Significance Threshold (α) : 0.05

🚀 Executing statistical test...

📊 Test Summary: Two Sample Ttest Pooled

🧪 Technical Result
• Test Statistic (t-statistic) = -3.0942
• P-value            = 0.0023
• Alpha (α)          = 0.05
(test result plot rendered here)
• Conclusion         = ✅ Statistically significant → Reject H₀

📈 Interpretation
• The observed difference is unlikely due to random variation.

💼 Business Insight
• Group A mean = -0.08
• Group B mean = 0.36
• Lift = 0.45 (-528.26%)
🏆 Group B outperforms Group A — and the effect is statistically significant.

Back to the top