A/B Testing Use
A/B testing is the industry standard for causal analysis. When an A/B test does not seem possible, see the notes at the end of our sample size calculator for other research design possibilities: one might be able to change \(\alpha\), \(\beta\), \(k\), or even the metric's probability (\(p\)) calculations to make experimentation possible. Better to try first than to forfeit the game.
A/B testing is less well suited to the following situations:
- Measuring return business
- Measuring referrals
- Measurement periods that are long
- Finding out what might be missing, which an A/B test cannot tell us
- Decisions where A/B results should not be the sole criterion
Definitions & Hypothesis
Before one begins experimentation, it is a fantastic idea to become familiar with the grammar of experimentation.
Let us begin with the case where we want to test whether two population means are equal.
- The Null Hypothesis \(H_0\) means there is no difference in means.
- The Alternative Hypothesis \(H_1\) means there is a difference in means.
Hypothesis Outcome Possibilities
- \(\alpha\) = Probability of rejecting the null when the null is true.
- \(\beta\) = Probability of failing to reject the null when the null is false.
Truth | Null is | Decision: FAIL TO REJECT NULL | Decision: REJECT NULL |
---|---|---|---|
Truth | TRUE | Correct \(1-\alpha\) | Type 1 Error \(\alpha\) |
Truth | FALSE | Type 2 Error \(\beta\) | Correct \(1-\beta\) |
- \(\alpha\), Type 1 Error: observing a difference when none exists. A false positive, declaring something that is not there.
- \(\beta\), Type 2 Error: failing to observe a difference when one exists. A false negative, failing to declare something that is there.
- \(Power = 1-\beta\): “sensitivity”, the probability of detecting an effect when one actually exists.
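To make the relationship between \(\alpha\), \(\beta\), and power concrete, here is a minimal sketch using only the Python standard library. It approximates the power of a two-sided test on two proportions under the normal approximation. The function name `approx_power` and the example rates (10% versus 12%, 2000 users per group) are illustrative assumptions, not values from this post.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(p0, p1, n, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test.

    Assumes n observations in each group and the normal approximation.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)

    # Standard error used for the rejection threshold (pooled, equal group sizes).
    p_pool = (p0 + p1) / 2
    se_null = sqrt(p_pool * (1 - p_pool) * 2 / n)

    # Standard error of the observed difference when the true rates are p0 and p1.
    se_alt = sqrt(p0 * (1 - p0) / n + p1 * (1 - p1) / n)

    d = p1 - p0
    # Probability the observed difference lands in either rejection region.
    reject_high = 1 - std_normal.cdf((z_crit * se_null - d) / se_alt)
    reject_low = std_normal.cdf((-z_crit * se_null - d) / se_alt)
    return reject_high + reject_low


alpha = 0.05
power = approx_power(p0=0.10, p1=0.12, n=2000, alpha=alpha)  # hypothetical rates
beta = 1 - power
print(f"alpha={alpha:.2f}  power={power:.2f}  beta={beta:.2f}")
```

When the two true rates are equal, the function returns \(\alpha\) itself, which matches the definition of a Type 1 Error: the only way to reject is by chance.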
Assumptions
- Independent
- Normally Distributed
- Same Variance
- Random Sample
Two-Tailed Example
Before measuring conversion, it is a good idea to remember that adoption is a key metric that comes first. I have overlooked this small but critically important metric, specifically on new product launches, by charging ahead in the excitement of creating conversion metrics.
How can a conversion rate be normally distributed when the outcomes are binary? They are indeed binary. Thanks to the Central Limit Theorem, however, the sample proportion of many Bernoulli outcomes is approximately normally distributed, with standard deviation \(\sqrt{p(1-p)/n}\). For a great introductory course, see A/B Testing on Udacity.
In this example, we will use a metric called click-through probability, which has a success/failure type of distribution. After checking our assumptions, we assume normality for the statistical calculations.
Example Parameters
\(\begin{aligned} n_0 = 2000 \\ X_0 = 150 \\ n_1 = 2000 \\ X_1 = 300 \\ \hat p_0 = \frac{X_0}{n_0} = \frac{150}{2000}=.075\\ \hat p_1 = \frac{X_1}{n_1} = \frac{300}{2000}=.15\\ \end{aligned}\)
Is it normal?
\(\begin{aligned} n{\hat p} > 5 \\ n(1-\hat p) > 5 \\ \end{aligned}\)
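As a quick sketch, the rule of thumb above can be checked for both groups in the example. The helper name `normal_approx_ok` is hypothetical, and the threshold of 5 is taken from the inequalities above.

```python
def normal_approx_ok(n, p_hat, threshold=5):
    """Return True when both n*p_hat and n*(1 - p_hat) exceed the threshold."""
    return n * p_hat > threshold and n * (1 - p_hat) > threshold


# Example parameters from this post: (group label, sample size, successes).
groups = [("group 0", 2000, 150), ("group 1", 2000, 300)]

for label, n, x in groups:
    p_hat = x / n
    print(label, "normal approximation reasonable:", normal_approx_ok(n, p_hat))
```

With 2000 observations per group, both the expected successes and the expected failures are far above 5, so the normal approximation looks reasonable here.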
Find Confidence Interval for Null
Find the margin of error \(m\) from the standard error and the \(z\)-score (1.96 for a 95% confidence level), then calculate the confidence interval.
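A minimal sketch of this calculation in Python follows, using the group 0 counts from the example parameters. The function name `proportion_ci` is an assumption for illustration, and the \(z\)-score is taken from the standard normal distribution rather than hard-coded.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation confidence interval for a single proportion."""
    p_hat = successes / n
    # Two-sided z-score, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standard_error = sqrt(p_hat * (1 - p_hat) / n)
    margin = z * standard_error
    return p_hat, margin, (p_hat - margin, p_hat + margin)


p_hat, margin, (low, high) = proportion_ci(successes=150, n=2000)
print(f"{p_hat:.3f} +/- {margin:.3f} = ({low:.3f}, {high:.3f})")
```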
\[\begin{aligned} \sigma_0 = \sqrt{\frac{\hat p_0(1-\hat p_0)}{n_0}} \\ m = z\sigma_0\\ m = 1.96 \times \sqrt{\frac{.075(1-.075)}{2000}}=.012 \\ .075 \pm .012 = (.063,.087) \\ \end{aligned}\]
Pooled Standard Error
\[\begin{aligned} \hat p_{pool} = \frac{X_0 + X_1}{n_0+n_1}=.1125\\ SE_{pool} = \sqrt{\hat p_{pool}(1-\hat p_{pool})(\frac{1}{n_0}+\frac{1}{n_1})}\\ \hat d = \hat p_1-\hat p_0 = .15-.075 = .075 \\ SE_{pool} = \sqrt{.1125(1-.1125)(\frac{1}{2000}+\frac{1}{2000})}=.01\\ m = z \, SE_{pool} = 1.96 \times .01 = .0196 \end{aligned}\]
Result
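- Since \(\hat d = .075\) is larger than \(m = .0196\), we can reject the null hypothesis.
- The 95% confidence interval for the difference is \(\hat d \pm m = .075 \pm .0196\), so click-through probability changed by at least 5.54 and up to 9.46 percentage points.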
- Note: Null is no difference and Alternative is that there is a difference.
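The pooled calculation and the decision above can be reproduced with a short Python sketch. The function name `pooled_two_proportion_test` and the returned dictionary keys are assumptions for illustration. The sketch follows this post in reusing the pooled standard error for the interval around \(\hat d\).

```python
from math import sqrt
from statistics import NormalDist


def pooled_two_proportion_test(x0, n0, x1, n1, alpha=0.05):
    """Two-sided test of equal proportions using the pooled standard error."""
    p0_hat = x0 / n0
    p1_hat = x1 / n1
    p_pool = (x0 + x1) / (n0 + n1)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n0 + 1 / n1))

    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    d_hat = p1_hat - p0_hat
    margin = z * se_pool

    return {
        "d_hat": d_hat,
        "se_pool": se_pool,
        "margin": margin,
        # Reject the null when the observed difference exceeds the margin.
        "reject_null": abs(d_hat) > margin,
        "ci": (d_hat - margin, d_hat + margin),
    }


result = pooled_two_proportion_test(x0=150, n0=2000, x1=300, n1=2000)
low, high = result["ci"]
print(f"d_hat={result['d_hat']:.3f}  margin={result['margin']:.4f}")
print(f"reject null: {result['reject_null']}  CI=({low:.4f}, {high:.4f})")
```

Because the code does not round \(SE_{pool}\) to .01 before multiplying, intermediate values can differ from the hand calculation in the last digit or so.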
We will leave it here with one question: is this practically significant for the business?
Possible Outcomes from A/B Tests
- Significant Statistically and Practically
- Neutral (less than practical significance) and not Statistically Significant
- Neutral (less than practical significance) but Statistically Significant
- Not enough power, could be positive or negative: repeat
- Best guess, but the confidence interval includes zero: repeat
- Best guess, but the CI includes values that are not practically significant: repeat
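One way to read these outcomes is to compare the confidence interval for the difference against a practical significance threshold. The sketch below is only one possible mapping: the function `classify_outcome`, the threshold `d_min`, and the value 0.02 are hypothetical and would need to come from the business, not from this post.

```python
def classify_outcome(ci_low, ci_high, d_min):
    """Classify a confidence interval against a practical threshold d_min > 0."""
    includes_zero = ci_low <= 0 <= ci_high
    # Whole interval sits inside the band of changes too small to matter.
    within_band = -d_min < ci_low and ci_high < d_min
    # Whole interval sits beyond the practical threshold on one side.
    beyond_band = ci_low >= d_min or ci_high <= -d_min

    if beyond_band:
        return "statistically and practically significant"
    if within_band and includes_zero:
        return "neutral: neither statistically nor practically significant"
    if within_band:
        return "statistically significant but below practical significance"
    if includes_zero:
        return "inconclusive: interval includes zero, repeat with more power"
    return "statistically significant, practical significance uncertain: repeat"


# Interval from the two-tailed example, with an assumed threshold of 2 points.
print(classify_outcome(0.0554, 0.0946, d_min=0.02))
```

The three "repeat" cases in the list correspond to intervals that are too wide to settle the question, which is why gathering more data is the suggested next step.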