A/B Experimentation Review

A/B Testing Use

A/B testing is the industry standard for causal analysis. When A/B testing does not seem possible, see the notes at the end of our sample size calculator for other research design possibilities. One might be able to change \(\alpha, \beta, k\) or even the metric’s probability (\(p\)) calculations to make experimentation possible. Better to try first than to forfeit the game.

A/B testing is less ideal in the following situations:

Measuring return business
Measuring referrals
When the measurement period is long
It does not tell us what might be missing
An A/B test might not be the sole criterion to use

Definitions & Hypothesis

Before one begins experimentation, it is a fantastic idea to become familiar with the grammar of experimentation.

Let us begin with the case where we want to test whether two population means are equal.

The Null Hypothesis \(H_0\) means there is no difference in means.
The Alternative Hypothesis \(H_1\) means there is a difference in means.

\[\begin{aligned} H_0: \mu_1=\mu_2 \\ H_1: \mu_1\neq\mu_2 \\ \end{aligned}\]

Hypothesis Outcome Possibilities

\(\alpha\) = Probability of rejecting the null when the null is true.
\(\beta\) = Probability of failing to reject the null when the null is false.

	Decision: FAIL TO REJECT NULL	Decision: REJECT NULL
Truth: null is TRUE	Correct \(1-\alpha\)	Type 1 Error \(\alpha\)
Truth: null is FALSE	Type 2 Error \(\beta\)	Correct \(1-\beta\)

\(\alpha\) Type 1 Error: Observing a difference when none exists. A false positive, declaring something that is not there.
\(\beta\) Type 2 Error: Failing to observe a difference when one exists. A false negative, failing to declare something that is there.
\(Power = 1-\beta\), or “sensitivity”: the probability of detecting an effect when one actually exists.
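
To make \(\alpha\), \(\beta\), and power concrete, here is a minimal sketch that approximates the power of a two-sided test on two proportions using the normal approximation. The function name `approx_power` and the rates passed to it are assumptions for illustration (they are treated as hypothetical true rates, not as results from this post).

```python
from math import sqrt
from statistics import NormalDist


def approx_power(p0, p1, n, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test.

    p0, p1: assumed true rates in control and treatment (hypothetical inputs).
    n: sample size per group (equal group sizes are assumed).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)

    # Standard error under the null (pooled) and under the alternative.
    p_pool = (p0 + p1) / 2
    se_null = sqrt(2 * p_pool * (1 - p_pool) / n)
    se_alt = sqrt(p0 * (1 - p0) / n + p1 * (1 - p1) / n)

    d = abs(p1 - p0)
    # Probability that the observed difference falls beyond either critical value.
    upper = 1 - std_normal.cdf((z_crit * se_null - d) / se_alt)
    lower = std_normal.cdf((-z_crit * se_null - d) / se_alt)
    return upper + lower


power = approx_power(0.075, 0.15, 2000)
beta = 1 - power  # Type 2 error rate under these assumed rates
print(f"power={power:.4f}, beta={beta:.4f}")
```

When the two rates are equal there is no real effect, so the function returns \(\alpha\): the chance of a false positive.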

Assumptions

Independent
Normally Distributed
Same Variance
Random Sample

Two-Tailed Example

First, before measuring conversion, it is a good idea to remember that adoption is a key metric that comes earlier. I have overlooked this small but critically important metric, specifically on new product launches, while charging ahead in the excitement of creating conversion metrics.

How can a conversion rate be normally distributed when the outcomes are binary? The outcomes are indeed binary. Thanks to the Central Limit Theorem, the sample proportion of many Bernoulli outcomes is approximately normally distributed, with the standard deviation shown below. For a great introductory course, see A/B Testing on Udacity.

In this example, we will use a metric called click-through probability, which follows a success/failure type of distribution. After checking our assumptions, we will assume normality for the statistical calculations.

Example Parameters

\[\begin{aligned} n_0 &= 2000 \\ X_0 &= 150 \\ n_1 &= 2000 \\ X_1 &= 300 \\ \hat p_0 &= \frac{X_0}{n_0} = \frac{150}{2000} = .075 \\ \hat p_1 &= \frac{X_1}{n_1} = \frac{300}{2000} = .15 \end{aligned}\]

Is it normal?

\(\begin{aligned} N{\hat p} > 5 \\ N(1-\hat p) > 5 \\ \end{aligned}\)
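
As a quick sketch, the rule of thumb above can be checked for both groups of the example. The helper name `normal_approx_ok` is an assumption for illustration.

```python
def normal_approx_ok(x, n, threshold=5):
    """Rule-of-thumb check: both n*p_hat and n*(1 - p_hat) must exceed the threshold."""
    p_hat = x / n
    return n * p_hat > threshold and n * (1 - p_hat) > threshold


# Example parameters from this post: 150 of 2000 (control), 300 of 2000 (treatment).
control_ok = normal_approx_ok(150, 2000)
treatment_ok = normal_approx_ok(300, 2000)
print(control_ok, treatment_ok)
```

Both groups have far more than 5 successes and 5 failures, so the normal approximation is reasonable here.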

Find Confidence Interval for Null

Find the margin of error \(m\) from the standard error and the \(Z_{score}\) (1.96 for a 95% confidence level), then calculate the confidence interval.

\[\begin{aligned} \sigma_0 = \sqrt{\frac{p_0(1-p_0)}{n_0}} \\ m = z\sigma_0\\ m = 1.96 * \sqrt{\frac{.075(1-.075)}{2000}}=.012 \\ .075 \pm .012 = (.063,.087) \\ \end{aligned}\]
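
The same interval can be reproduced in a few lines. This is a minimal sketch using only the Python standard library; `proportion_ci` is a hypothetical helper name, and the z-score is derived from the confidence level rather than hard-coded.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(x, n, confidence=0.95):
    """Normal-approximation confidence interval for a single proportion."""
    p_hat = x / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    se = sqrt(p_hat * (1 - p_hat) / n)                   # standard error
    m = z * se                                           # margin of error
    return p_hat - m, p_hat + m


low, high = proportion_ci(150, 2000)
print(f"({low:.3f}, {high:.3f})")
```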

Pooled Standard Error

\[\begin{aligned} \hat p_{pool} &= \frac{X_0 + X_1}{n_0+n_1} = \frac{450}{4000} = .1125 \\ SE_{pool} &= \sqrt{\hat p_{pool}(1-\hat p_{pool})\left(\frac{1}{n_0}+\frac{1}{n_1}\right)} = \sqrt{.1125(1-.1125)\left(\frac{1}{2000}+\frac{1}{2000}\right)} = .01 \\ \hat d &= \hat p_1-\hat p_0 = .15-.075 = .075 \\ m &= z \cdot SE_{pool} = 1.96 * .01 = .0196 \end{aligned}\]
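
The pooled calculation can be sketched as follows. The function name `pooled_se` is an assumption for illustration, and 1.96 is the z-score for a 95% confidence level.

```python
from math import sqrt


def pooled_se(x0, n0, x1, n1):
    """Pooled standard error for the difference between two proportions."""
    p_pool = (x0 + x1) / (n0 + n1)
    return sqrt(p_pool * (1 - p_pool) * (1 / n0 + 1 / n1))


se = pooled_se(150, 2000, 300, 2000)
d_hat = 300 / 2000 - 150 / 2000  # observed difference in click-through probability
m = 1.96 * se                    # margin of error at 95% confidence
print(f"se={se:.4f}, d_hat={d_hat:.3f}, m={m:.4f}")
```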

Result

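Since \(\hat d = .075\) is larger than \(m = .0196\), we can reject the null hypothesis.
The confidence interval for the difference is \(\hat d \pm m = .075 \pm .0196 = (.0554, .0946)\), so the click-through probability changed by at least 5.54 and up to 9.46 percentage points.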
Note: Null is no difference and Alternative is that there is a difference.
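
An equivalent way to reach the same decision is to compute a z statistic and a two-sided p-value. The sketch below uses the pooled standard error and the standard library only; `two_proportion_z_test` is a hypothetical helper name, not a library function.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x0, n0, x1, n1):
    """Two-sided z-test for H0: p0 == p1, using the pooled standard error."""
    p0, p1 = x0 / n0, x1 / n1
    p_pool = (x0 + x1) / (n0 + n1)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n0 + 1 / n1))
    z = (p1 - p0) / se
    # Two-sided p-value: probability of a result at least this extreme under H0.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


z, p_value = two_proportion_z_test(150, 2000, 300, 2000)
reject_null = p_value < 0.05  # alpha = 0.05, matching z = 1.96
print(f"z={z:.2f}, reject_null={reject_null}")
```

Rejecting when the p-value is below \(\alpha\) is the same rule as rejecting when \(\hat d\) exceeds the margin of error \(m\).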

We will leave it here with one question: is this practically significant for the business?

Possible Outcomes from A/B Tests

Significant Statistically and Practically
Neutral (less than practical significance) and not Statistically Significant
Neutral (less than practical significance) and Statistically Significant
Not enough power, could be positive or negative - repeat
Best guess, but confidence interval could include zero - repeat
Best guess, but confidence interval might not be practically significant - repeat
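
One way to read the outcomes above is to compare the confidence interval for the difference against a practical significance threshold. The sketch below is an interpretation, not a rule stated in this post: the function `classify_outcome`, the threshold name `d_min`, and its value of 0.02 are all assumptions for illustration.

```python
def classify_outcome(d_hat, m, d_min):
    """Map a difference d_hat with margin m onto the outcomes listed above.

    d_min is a hypothetical practical significance threshold chosen by the business.
    Only the magnitude of d_hat is used, so a negative change is read symmetrically.
    """
    d = abs(d_hat)
    low, high = d - m, d + m
    if low >= d_min:
        return "significant statistically and practically"
    if high < d_min:
        # The whole interval sits inside the "not practically significant" band.
        if low <= 0:
            return "neutral, not statistically significant"
        return "neutral, statistically significant"
    # From here on the interval crosses d_min, so the result is inconclusive.
    if low <= -d_min:
        return "not enough power - repeat"
    if low <= 0:
        return "best guess, CI could include zero - repeat"
    return "best guess, CI might not be practically significant - repeat"


# Example from this post (d_hat = .075, m = .0196) with an assumed d_min of 0.02.
outcome = classify_outcome(0.075, 0.0196, 0.02)
print(outcome)
```

With a different threshold the same experiment could land in a different bucket, which is why the practical significance level should be agreed on before the test is run.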
