Central Limit Theorem, Violations & Remedy

Normal Distribution

About 68% of values drawn from a normal distribution are within one standard deviation σ away from the mean; about 95% of the values lie within two standard deviations; and about 99.7% are within three standard deviations. This fact is known as the 68-95-99.7 (empirical) rule, or the 3-sigma rule.
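The three percentages can be checked directly from the normal CDF. The sketch below uses only the standard library; the helper name normal_coverage is made up for illustration.

```python
import math

def normal_coverage(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    # P(|Z| <= k) = erf(k / sqrt(2)) for a standard normal Z.
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sigma: {normal_coverage(k):.4f}")
```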

What is the Central Limit Theorem (CLT)?

In its most general form, and under some conditions (which include finite variance), the theorem states that suitably normalized averages of independently drawn random variables converge in distribution to the normal distribution. In other words, the sample average becomes approximately normally distributed when the number of observations is sufficiently large.
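A small simulation can make this convergence visible. The sketch below is only an illustration under assumed settings: it draws from an Exponential(1) distribution (a choice made here, not taken from the text) and measures how the skewness of the sample mean shrinks toward 0, the value for a normal distribution, as the sample size grows.

```python
import random
import statistics

def sample_means(n, reps, rng):
    """Means of `reps` independent samples of size n from Exponential(1)."""
    return [statistics.fmean(rng.expovariate(1.0) for _ in range(n))
            for _ in range(reps)]

def skewness(xs):
    """Population skewness: third standardized moment (0 for a normal distribution)."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return statistics.fmean(((x - m) / s) ** 3 for x in xs)

rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (1, 10, 100):
    means = sample_means(n, 5000, rng)
    print(f"n={n:>3}: skewness of the sample mean = {skewness(means):.2f}")
```

For the exponential distribution the skewness of the mean is 2/sqrt(n) in theory, so the printed values should fall roughly in that pattern.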

Independence Violations

Bootstrap (FB)

Delta Method (MS, Airbnb, Uber, LinkedIn)

Outliers

Unusually high values in the number of orders or in the shopping cart value push a group's average upward. If the control group happens, purely by chance, to contain a customer with extremely high values, then at a relatively small test size this alone can turn a positive result into a negative one, even a significantly negative one.
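A toy example shows the effect. All numbers below are hypothetical orders per customer, chosen only to illustrate how one extreme customer in the control group can flip the sign of the measured difference.

```python
from statistics import fmean

# Hypothetical orders per customer in a small test.
control = [2, 3, 2, 4, 3, 2, 3, 2, 3, 2]
treatment = [3, 3, 3, 4, 3, 3, 4, 3, 3, 3]

def lift(control, treatment):
    """Difference in mean orders, treatment minus control."""
    return fmean(treatment) - fmean(control)

print("lift without outlier:", round(lift(control, treatment), 2))

# One extreme customer lands in the control group purely by chance.
control_with_outlier = control + [60]
print("lift with outlier:   ", round(lift(control_with_outlier, treatment), 2))
```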

High Skew

Traditional methods for calculating confidence intervals have a problem: they assume that the underlying data follows a certain distribution, namely a normal distribution. The left graphic shows a perfect (theoretical) normal distribution, in which the number of orders fluctuates around a positive average value. In the example, most customers order five times; more or fewer orders occur less often. The graphic on the right shows the reality.

Assuming an average conversion rate of 5%, 95% of customers do not buy at all. Most buyers have probably placed one or two orders, and a few customers order extreme quantities.

Such distributions are referred to as “right skewed” and they have an influence on the validity of the confidence intervals, especially at a small test size. In essence, the intervals can no longer be reliably calculated. The central limit theorem implies these distortions will carry less weight with very large samples, but how “incorrect” the confidence intervals actually are depends on how much the data deviates from a perfectly normal distribution.
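How far off the intervals are can be estimated by simulation. The sketch below assumes a made-up orders-per-visitor distribution (95% zeros and a rare value of 50) and counts how often a nominal 95% normal-approximation interval actually contains the true mean, at a small and at a large sample size. The distribution, sample sizes, and repetition count are illustrative assumptions.

```python
import math
import random

# Hypothetical orders-per-visitor distribution: 95% buy nothing, a few buy a lot.
VALUES = [0, 1, 2, 50]
WEIGHTS = [0.95, 0.03, 0.015, 0.005]
TRUE_MEAN = sum(v * w for v, w in zip(VALUES, WEIGHTS))

def normal_ci(sample, z=1.96):
    """Classic normal-approximation interval: mean +/- z * s / sqrt(n)."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    half = z * math.sqrt(var / n)
    return mean - half, mean + half

def coverage(n, reps, rng):
    """Fraction of simulated samples whose interval contains the true mean."""
    hits = 0
    for _ in range(reps):
        lo, hi = normal_ci(rng.choices(VALUES, weights=WEIGHTS, k=n))
        hits += lo <= TRUE_MEAN <= hi
    return hits / reps

rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (100, 5000):
    print(f"n={n}: coverage of nominal 95% interval = {coverage(n, 300, rng):.2f}")
```

With the small sample, most draws contain no extreme buyer at all, so the interval is both too narrow and centered too low; with the large sample the coverage moves much closer to the nominal 95%.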

As a consequence, it is worthwhile to look at the data itself rather than relying only on the classic t-test. There are other methods that provide reliable results when the underlying data is not normally distributed.

1. U-Test

The Mann-Whitney U-Test (Wilcoxon rank-sum test) is an alternative to the t-test when the data deviates greatly from the normal distribution.
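The test compares ranks rather than means, which is why a few extreme values have little influence on it. Below is a minimal standard-library sketch that uses the normal approximation with a tie correction and no continuity correction; the function name and the sample data are hypothetical. In practice a library routine such as scipy.stats.mannwhitneyu would normally be used instead.

```python
import math
from collections import Counter

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns (U statistic for x, approximate p-value).
    """
    n1, n2 = len(x), len(y)
    # U counts the pairs (xi, yj) with xi > yj; tied pairs count one half.
    u = sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)
    n = n1 + n2
    # Tie correction: sum of t^3 - t over the sizes t of groups of equal values.
    tie_sizes = Counter(list(x) + list(y)).values()
    tie_term = sum(t ** 3 - t for t in tie_sizes) / (n * (n - 1))
    sigma = math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term))
    if sigma == 0:  # every value is identical, so there is nothing to test
        return u, 1.0
    z = (u - n1 * n2 / 2) / sigma
    return u, math.erfc(abs(z) / math.sqrt(2))

# Hypothetical orders per customer, with one extreme buyer in the variant.
control_orders = [0, 0, 0, 0, 1, 0, 2, 0, 0, 1]
variant_orders = [0, 1, 0, 2, 1, 0, 3, 0, 1, 12]
u_stat, p_value = mann_whitney_u(variant_orders, control_orders)
print(f"U = {u_stat}, approximate p = {p_value:.3f}")
```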

2. Robust statistics

Methods from robust statistics are used when the data is not normally distributed or is distorted by outliers. Here, average values and variances are calculated in such a way that unusually high or low values have little or no influence on them.

A robust alternative to the average is the median.

Another robust alternative to the mean is the trimmed mean: the mean of a subset of the data, formed by omitting the lowest and the highest x% of the values. In the limit of trimming 50% from each end, the trimmed mean is equivalent to the median, so it is often seen as a compromise between the standard mean and the robust median.
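A minimal sketch of the trimmed mean, assuming the same proportion is cut from each end and rounded down to whole observations; the function name and the sample data are hypothetical.

```python
import statistics

def trimmed_mean(values, proportion):
    """Mean after dropping the lowest and highest `proportion` of the data."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    xs = sorted(values)
    k = int(len(xs) * proportion)  # observations to cut from each end
    return statistics.fmean(xs[k:len(xs) - k])

# Hypothetical orders per customer with one extreme buyer.
orders = [1, 1, 2, 2, 2, 3, 3, 3, 4, 120]
print("mean:            ", statistics.fmean(orders))
print("10% trimmed mean:", trimmed_mean(orders, 0.10))
print("median:          ", statistics.median(orders))
```

On this sample the single value of 120 dominates the ordinary mean, while the trimmed mean and the median stay close to the typical customer.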

3. Bootstrapping

This so-called non-parametric procedure works without any distribution assumption and provides reliable estimates for confidence levels and intervals. It belongs to the resampling methods, which estimate the distribution of a statistic from the observed data itself by repeated random sampling (with replacement, in the case of the bootstrap).
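One common variant is the percentile bootstrap. The sketch below resamples each group with replacement, recomputes the difference in means each time, and reads the interval off the sorted differences. The function name, the defaults, and the data are assumptions made for illustration.

```python
import random
import statistics

def bootstrap_diff_ci(control, treatment, reps=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(treatment) - mean(control)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        # Resample each group with replacement, keeping its original size.
        c = rng.choices(control, k=len(control))
        t = rng.choices(treatment, k=len(treatment))
        diffs.append(statistics.fmean(t) - statistics.fmean(c))
    diffs.sort()
    lo = diffs[int(reps * alpha / 2)]
    hi = diffs[int(reps * (1 - alpha / 2)) - 1]
    return lo, hi

# Hypothetical orders per visitor: mostly zeros, a few buyers, one extreme value.
control = [0] * 90 + [1] * 6 + [2] * 3 + [15]
treatment = [0] * 88 + [1] * 7 + [2] * 4 + [3]
low, high = bootstrap_diff_ci(control, treatment)
print(f"95% bootstrap interval for the difference in means: [{low:.3f}, {high:.3f}]")
```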

You may have to filter out spam and fraud to de-bias the data. One way to figure out whether filtering is biasing or de-biasing the data is to slice the data and then calculate the metric for each slice after filtering. If the filter affects any slice disproportionately, then you may be biasing your data with the filter.
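The slicing check can be sketched as follows. The row layout (slice name, value), the country slices, and the spam threshold are all hypothetical; the point is only to compare the metric per slice before and after the filter.

```python
from collections import defaultdict
from statistics import fmean

def metric_shift_by_slice(rows, keep):
    """rows: (slice_name, value) pairs; keep: predicate deciding which rows survive.

    Returns {slice_name: (metric_before, metric_after, relative_change)},
    where the metric is the mean value in the slice.
    """
    before = defaultdict(list)
    after = defaultdict(list)
    for row in rows:
        name, value = row
        before[name].append(value)
        if keep(row):
            after[name].append(value)
    shifts = {}
    for name, values in before.items():
        b = fmean(values)
        a = fmean(after[name]) if after[name] else float("nan")
        shifts[name] = (b, a, (a - b) / b if b else float("nan"))
    return shifts

# Hypothetical clicks per user, sliced by country; values of 100+ count as spam.
rows = [("US", 5), ("US", 6), ("US", 4), ("US", 5),
        ("BR", 5), ("BR", 4), ("BR", 200), ("BR", 6)]
shifts = metric_shift_by_slice(rows, keep=lambda row: row[1] < 100)
for name, (b, a, rel) in shifts.items():
    print(f"{name}: before={b:.2f} after={a:.2f} change={rel:+.0%}")
```

A slice whose metric moves far more than the others after filtering is the one to inspect before trusting the filtered data.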

To remove weekly effects when looking at, say, total active cookies over time, use a week-over-week comparison, i.e. divide the current value by the value from a week ago. Alternatively, use year-over-year.
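A minimal sketch of the week-over-week ratio on a daily series. The numbers are made up: two weeks of active cookies with a strong weekday pattern and 10% growth from the first week to the second.

```python
def week_over_week(daily, lag=7):
    """Ratio of each day's value to the value `lag` days earlier.

    The first `lag` days have no earlier value to compare with and are skipped.
    """
    return [daily[i] / daily[i - lag] for i in range(lag, len(daily))]

# Hypothetical daily active cookies, Monday to Sunday.
week1 = [100, 110, 120, 115, 105, 60, 50]
week2 = [v * 1.1 for v in week1]  # same weekly shape, 10% higher
ratios = week_over_week(week1 + week2)
print([round(r, 2) for r in ratios])
```

The raw series swings by a factor of more than two within a week, while every ratio sits at about 1.1, so the weekday pattern cancels out and only the growth remains.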