Here are a few questions to ask yourself when deciding whether to run an A/B test:
- Do I have an important question? Will answering this question make an impact worth the effort of running a test?
- What else could I be doing with my energy? Running experiments and being creative and visionary are two completely different brain modes. Are you distracting yourself from making valuable insights?
- Can I run a large enough test? Plug your baseline conversion rate and “minimum detectable effect” into a sample size calculator to see whether you have a chance of even reaching statistical significance.
- Do I really understand testing? Try running A/A tests to get a feel for how misleading “results” can be. If you still feel testing is worth it, read and understand some of the resources below to learn how to run a reliable test.
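To make the "large enough test" question concrete, here is a minimal sketch of what a sample size calculator does under the hood. It assumes a two-sided two-proportion z-test, a 5% significance level, and 80% power; the function name and the example inputs (5% baseline, 10% relative lift) are illustrative assumptions, not values from any particular tool.

```python
import math
from statistics import NormalDist

def sample_size_per_group(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate visitors needed per variation (two-sided two-proportion z-test)."""
    p1 = baseline                        # control conversion rate
    p2 = baseline * (1 + relative_mde)   # smallest treated rate we want to detect
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

# Hypothetical inputs: 5% baseline conversion, 10% relative minimum detectable effect.
n_small_effect = sample_size_per_group(baseline=0.05, relative_mde=0.10)
n_large_effect = sample_size_per_group(baseline=0.05, relative_mde=0.30)
print(n_small_effect, n_large_effect)
```

Small effects on low baseline rates need tens of thousands of visitors per variation, which is often the quickest way to find out that a test is not worth running.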
Sanity Check of Experimental Designs
When doing experiments, it’s very important to run sanity checks both before you run the experiment and when you measure the results.
(1) Pre-experiment AA test
An A/A test is an A/B test in which the two variations that form the user experience are identical. It helps marketers examine the correctness of the setup and the reliability of an A/B testing platform. In practice, the eligible population is randomly split into test and control groups, but both groups receive the same treatment.
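If the sample size is big enough and randomization is done right, the test should show no statistically significant difference between the groups, i.e. a p-value above 5%. Even a correctly working setup will still dip below 5% in about one run out of twenty purely by chance.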
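One way to get a feel for this is to simulate many A/A tests. The sketch below is a toy simulation, not a model of any real testing tool: both groups share the same assumed 10% conversion rate, each run is evaluated once with a pooled two-proportion z-test, and all names and parameter values are illustrative.

```python
import math
import random

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_a / n_a - conv_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))

def simulate_aa_false_positive_rate(runs=1000, n_per_group=1000,
                                    rate=0.10, alpha=0.05, seed=0):
    """Share of simulated A/A tests that look 'significant' at level alpha."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(runs):
        # Both groups draw conversions from the same true rate.
        conv_a = sum(rng.random() < rate for _ in range(n_per_group))
        conv_b = sum(rng.random() < rate for _ in range(n_per_group))
        if two_proportion_p_value(conv_a, n_per_group, conv_b, n_per_group) < alpha:
            false_positives += 1
    return false_positives / runs

fp_rate = simulate_aa_false_positive_rate()
print(f"share of A/A runs flagged significant: {fp_rate:.3f}")
```

If randomization and analysis are sound, the share should land near the chosen significance level; a much larger share in a real A/A program points to a setup problem.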
A/A testing is done when organizations are taking up a new implementation of an A/B testing tool. Running an A/A test at that time can help them with:
- Checking the accuracy of an A/B Testing tool
A/A testing is a good technique for checking the health of a tool’s integration with a website. It is also a good way to check the quality of the execution (choice of variation and stickiness), the data collection and integrity of the tool, and that no data is lost or altered. In most other cases, the A/A test is a method of double-checking the effectiveness and accuracy of the A/B testing software. You should look to see whether the software reports a statistically significant difference (at the 95% confidence level) between the control and the variation. If it does, that’s a problem, and you’ll want to check that the software is correctly implemented on your website or mobile app.
- Setting a baseline conversion rate for future A/B tests
- Deciding a minimum sample size and test duration
- Providing pre-experiment data that can later be used for CUPED-style variance reduction
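As a rough illustration of the last item, CUPED adjusts each user's in-experiment metric by their pre-experiment metric: the adjusted value is y - theta * (x - mean(x)), with theta = cov(x, y) / var(x). The sketch below uses synthetic data and hypothetical names. In a real experiment theta would normally be estimated on the pooled test and control data, which is an assumption the toy example glosses over.

```python
import random
from statistics import mean, pvariance

def cuped_adjust(y, x):
    """Return CUPED-adjusted outcomes y - theta * (x - mean(x))."""
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / len(x)
    theta = cov_xy / pvariance(x)
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]

# Synthetic data: the in-experiment metric is correlated with the pre-period metric.
rng = random.Random(42)
pre = [rng.gauss(100, 20) for _ in range(5000)]      # pre-experiment metric per user
post = [0.8 * p + rng.gauss(0, 10) for p in pre]     # in-experiment metric per user

adjusted = cuped_adjust(post, pre)
print(f"variance before: {pvariance(post):.1f}, after: {pvariance(adjusted):.1f}")
```

The adjustment leaves the mean unchanged and shrinks the variance in proportion to how strongly the pre-period metric predicts the outcome, so the same test can reach significance with less traffic.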
On November 11, 2012, the Copyhackers team began an A/A split test on their homepage.
On the 18th, 6 days later, their testing tool declared a winner with 95% confidence. For the sake of accuracy, though, the team decided to let the test run one more day, at which point their software declared a winner at a confidence level of 99.6%.
Their homepage was performing nearly 24% better than the exact same page, and there was only a 0.4% chance the result was a false positive, according to the software. Still, the team let the test run for about three more days, and the differences eventually evened out.
But that’s not the point. The point is that the testing tool declared a winner too early. If the Copyhackers team hadn’t kept it running, they would’ve incorrectly assumed there was an issue with their experiment.
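The early "winner" in this story is what repeated checking tends to produce. The toy simulation below compares two ways of reading the same A/A test: stopping at the first daily check that looks significant versus looking only once at the end. The traffic numbers, the 14-day duration, and the function names are assumptions chosen for illustration. It does not reconstruct the Copyhackers data.

```python
import math
import random

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_a / n_a - conv_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))

def aa_run(rng, days=14, visitors_per_day=200, rate=0.10, alpha=0.05):
    """Return (significant at any daily peek, significant on the final day)."""
    conv_a = conv_b = n = 0
    any_peek_significant = False
    final_significant = False
    for _ in range(days):
        # Identical true conversion rate in both arms: this is an A/A test.
        conv_a += sum(rng.random() < rate for _ in range(visitors_per_day))
        conv_b += sum(rng.random() < rate for _ in range(visitors_per_day))
        n += visitors_per_day
        final_significant = p_value(conv_a, n, conv_b, n) < alpha
        any_peek_significant = any_peek_significant or final_significant
    return any_peek_significant, final_significant

rng = random.Random(1)
results = [aa_run(rng) for _ in range(500)]
peeking_rate = sum(peek for peek, _ in results) / len(results)
final_rate = sum(final for _, final in results) / len(results)
print(f"'winner' at some daily peek: {peeking_rate:.2f}, at the end only: {final_rate:.2f}")
```

Checking every day and stopping at the first significant reading declares a "winner" far more often than the nominal 5%, which is why a fixed sample size or a sequential testing method matters.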
In an A/A test, a web page is A/B tested against an identical variation. When there is absolutely no difference between the control and the variation, it is expected that the result will be inconclusive. However, in cases where an A/A test provides a winner between two identical variations, there is a problem. The reasons could be any of the following:
- The A/B testing tool you’re using has not been set up correctly or is unreliable, or the test was not conducted correctly.
- The data being reported by your website is wrong or duplicated.
- The A/A test needs to run longer: the sample size collected so far is too small.
In a nutshell, the two main problems inherent in A/A testing are:
- Ever-present element of randomness in any experimental setup
- Requirement of a large sample size
(2) Post-experiment check for sample ratio mismatch
If a chi-square test shows that the test/control ratio of the experiment is significantly different from what it was designed for, then ideally we’d want to discard the data and re-run the test.
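A minimal sketch of that check, assuming a two-group experiment: compare the observed group sizes with the designed split using a chi-square goodness-of-fit test with one degree of freedom. The function name and the visitor counts are made up for illustration. Many teams use a strict threshold (for example 0.001) for this check, but the right cut-off is a judgment call.

```python
import math

def srm_p_value(n_test, n_control, expected_test_share=0.5):
    """Chi-square goodness-of-fit test of observed group sizes against the designed split."""
    total = n_test + n_control
    exp_test = total * expected_test_share
    exp_control = total - exp_test
    chi2 = ((n_test - exp_test) ** 2 / exp_test
            + (n_control - exp_control) ** 2 / exp_control)
    # With 1 degree of freedom, the chi-square tail probability is erfc(sqrt(chi2 / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2))

# Hypothetical counts for an experiment designed as a 50/50 split.
chi2_ok, p_ok = srm_p_value(50_000, 50_100)
chi2_bad, p_bad = srm_p_value(50_000, 50_900)
print(f"near-even split: chi2={chi2_ok:.2f}, p={p_ok:.3f}")
print(f"lopsided split:  chi2={chi2_bad:.2f}, p={p_bad:.4f}")
```

A very small p-value says the split itself is off, so the assignment or logging pipeline should be investigated before any metric comparison is trusted.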
Running an A/A test should be a relatively rare occurrence.
There are two kinds of A/A test:
- A “Pure” two variation test
- An AB test with a “Calibration Variation”
Here are some of the advantages and disadvantages of these kinds of A/A tests.
The Pure Two-Variation A/A Test
With this approach, you select a high-traffic page and set up a test in your A/B testing tool. It will have the Control variation and a second variation with no changes.
Advantages: This test will complete in the shortest timeframe because all traffic is dedicated to the test.
Disadvantages: Nothing is learned about your visitors–well, almost. See below.
The Calibration Variation A/A Test
This approach involves adding what we call a “Calibration Variation” to the design of an A/B test. This test will have a Control variation, one or more “B” variations that are being tested, and another variation with no changes from the Control. When the test is complete, you will have learned something from the “B” variations and will also have “calibrated” the tool with an A/A test variation.
Advantages: You can do an A/A test without stopping your AB testing program.
Disadvantages: This approach is statistically tricky. The more variations you add to a test, the less traffic each one receives and the higher the chance that at least one comparison produces a false positive. It will also drain traffic from the A/B test variations, requiring the test to run longer to reach statistical significance.
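To see why extra variations are statistically tricky, it helps to compute the chance of at least one false positive across several comparisons against the Control. The sketch below assumes the comparisons are independent, which is a simplification since they share one Control, and shows the Bonferroni correction as one simple remedy among several.

```python
def family_wise_error_rate(num_comparisons, alpha=0.05):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** num_comparisons

def bonferroni_alpha(num_comparisons, alpha=0.05):
    """Per-comparison significance level that keeps the overall rate at or below alpha."""
    return alpha / num_comparisons

for k in (1, 2, 3, 5):
    print(k, round(family_wise_error_rate(k), 3), round(bonferroni_alpha(k), 4))
```

With two "B" variations plus a calibration variation there are three comparisons against the Control, so the uncorrected chance of a spurious winner is already well above 5%.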
Propensity Score Matching Method for Remedies
A propensity score is the probability of a unit (e.g., person, classroom, school) being assigned to a particular treatment given a set of observed covariates. Propensity scores are used to reduce selection bias by equating groups based on these covariates.
PSM is normally for cases of causal inference and simple selection bias in non-experimental settings in which: (i) few units in the non-treatment comparison group are comparable to the treatment units; and (ii) selecting a subset of comparison units similar to the treatment unit is difficult because units must be compared across a high-dimensional set of pretreatment characteristics.
PSM uses a predicted probability of group membership (e.g., treatment vs. control group), based on observed predictors and usually obtained from logistic regression, to create a counterfactual group. Propensity scores may also be used for matching or as covariates, alone or with other matching variables or covariates.
The basic steps to propensity score matching are:
- Collect and prepare the data.
- Estimate the propensity scores. The true scores are unknown, but can be estimated by many methods including: discriminant analysis, logistic regression, and random forests. The “best” method is up for debate, but one of the more popular methods is logistic regression.
- Match the participants using the estimated scores.
- Evaluate the covariates for an even spread across groups. The scores are good estimates of the true propensity scores if the matching process successfully distributes covariates over the treated/untreated groups (Ho et al., 2007).
Example: Run logistic regression:
- Dependent variable: Y = 1 if the unit participates (test group); Y = 0 otherwise (control group).
In survey sampling, Y = 1 for the survey sample and Y = 0 for the total population from which the survey was sampled.
- Choose appropriate confounders (variables hypothesized to be associated with both treatment and outcome)
- Obtain propensity score: predicted probability (p) or log[p/(1 − p)].
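The logistic regression step above can be sketched in plain Python. This is a minimal illustration on synthetic data, not a production estimator: it fits the model by batch gradient descent, and the two covariates, their true coefficients, and all function names are assumptions. In practice a statistics library would normally do the fitting.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear_score(w, x):
    """Intercept w[0] plus the weighted sum of covariates."""
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))

def fit_logistic(X, y, lr=0.5, epochs=300):
    """Fit logistic regression by batch gradient descent; returns [intercept, coefs...]."""
    n = len(X)
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, yi in zip(X, y):
            err = sigmoid(linear_score(w, x)) - yi
            grad[0] += err
            for j, xj in enumerate(x, start=1):
                grad[j] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

# Synthetic data: two standardized confounders drive participation (Y = 1).
rng = random.Random(7)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(500)]
y = [1 if rng.random() < sigmoid(0.8 * x1 - 0.5 * x2) else 0 for x1, x2 in X]

w = fit_logistic(X, y)
scores = [sigmoid(linear_score(w, x)) for x in X]   # propensity score p
logits = [math.log(p / (1 - p)) for p in scores]    # log[p / (1 - p)]
print([round(wj, 2) for wj in w])
```

Either `scores` or `logits` can then be carried into the balance checks and the matching step.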
Check that propensity score is balanced across treatment and comparison groups, and check that covariates are balanced across treatment and comparison groups within strata of the propensity score.
- Use standardized differences or graphs to examine distributions
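A standardized difference for a single covariate can be computed as the difference in group means divided by the pooled standard deviation. The sketch below uses made-up ages. An absolute value below roughly 0.1 is a commonly quoted rule of thumb for acceptable balance, not a hard rule.

```python
import math
from statistics import mean, variance

def standardized_difference(treated, control):
    """Difference in means divided by the pooled standard deviation of the two groups."""
    pooled_sd = math.sqrt((variance(treated) + variance(control)) / 2)
    return (mean(treated) - mean(control)) / pooled_sd

# Hypothetical covariate (age) in two comparison groups.
age_treated = [30, 35, 40, 45, 50]
age_control_balanced = [28, 33, 41, 44, 52]
age_control_imbalanced = [20, 22, 25, 27, 31]

smd_balanced = standardized_difference(age_treated, age_control_balanced)
smd_imbalanced = standardized_difference(age_treated, age_control_imbalanced)
print(round(smd_balanced, 3), round(smd_imbalanced, 3))
```

Running the same function on every covariate, before and after matching, gives a compact balance table.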
Match each participant to one or more nonparticipants on propensity score:
- Nearest neighbor matching
- Caliper matching: comparison units within a certain width of the propensity score of the treated units get matched, where the width is generally a fraction of the standard deviation of the propensity score
- Mahalanobis metric matching in conjunction with PSM
- Stratification matching
- Difference-in-differences matching (kernel and local linear weights)
- Exact matching
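As an illustration of the first two options in the list above, here is a minimal sketch of greedy 1:1 nearest-neighbor matching without replacement, restricted to a caliper. The scores and the fixed caliper of 0.05 are assumptions for the example. As described above, the caliper is usually set as a fraction of the standard deviation of the propensity score.

```python
def match_nearest_neighbor(treated_scores, control_scores, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching without replacement, within a caliper.

    Returns a list of (treated_index, control_index) pairs.
    """
    available = dict(enumerate(control_scores))  # controls not yet matched
    pairs = []
    for t_idx, t_score in enumerate(treated_scores):
        if not available:
            break
        # Closest remaining control on the propensity score.
        c_idx = min(available, key=lambda i: abs(available[i] - t_score))
        if abs(available[c_idx] - t_score) <= caliper:
            pairs.append((t_idx, c_idx))
            del available[c_idx]
        # Treated units with no control inside the caliper stay unmatched.
    return pairs

# Hypothetical propensity scores.
treated = [0.30, 0.62, 0.90]
control = [0.10, 0.28, 0.33, 0.60, 0.65]
pairs = match_nearest_neighbor(treated, control)
print(pairs)
```

The treated unit at 0.90 has no control within the caliper and is dropped, which is the overlap problem in miniature. Greedy matching also depends on the order in which treated units are processed.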
Verify that covariates are balanced across treatment and comparison groups in the matched or weighted sample
Run the multivariate analysis on the new (matched or weighted) sample:
- Use analyses appropriate for non-independent matched samples if more than one nonparticipant is matched to each participant
Note: When you have multiple matches for a single treated observation, it is essential to use Weighted Least Squares rather than OLS.
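To make the note above concrete: when the only regressor is the treatment indicator, weighted least squares reduces to a difference of weighted means, so the sketch below computes that directly. Giving each of the k controls matched to one participant a weight of 1/k is one common convention and an assumption here. The outcome values are made up.

```python
def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def matched_effect(treated_outcomes, matched_control_outcomes):
    """Weighted difference in means for a matched sample.

    matched_control_outcomes[i] holds the outcomes of the controls matched
    to treated unit i; each of its k controls gets weight 1/k.
    """
    control_values, control_weights = [], []
    for controls in matched_control_outcomes:
        for y_c in controls:
            control_values.append(y_c)
            control_weights.append(1.0 / len(controls))
    treated_mean = sum(treated_outcomes) / len(treated_outcomes)
    return treated_mean - weighted_mean(control_values, control_weights)

# Hypothetical outcomes: the first participant has two matches, the second has one.
treated_y = [10.0, 12.0]
controls_y = [[8.0, 9.0], [11.0]]

effect_weighted = matched_effect(treated_y, controls_y)
effect_unweighted = sum(treated_y) / 2 - sum(8.0 + 9.0 + 11.0 for _ in [0]) / 3
print(effect_weighted, round(effect_unweighted, 3))
```

Without the weights, participants with many matches would dominate the control mean, and the unweighted comparison drifts away from the weighted one.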
Like other matching procedures, PSM estimates an average treatment effect from observational data. The key advantage of PSM, at the time of its introduction, was that by combining covariates linearly into a single score, it balances treatment and control groups on a large number of covariates without losing a large number of observations. If units in the treatment and control groups were instead balanced on a large number of covariates one at a time, large numbers of observations would be needed to overcome the “dimensionality problem”, whereby each new balancing covariate increases the minimum necessary number of observations in the sample geometrically.
One disadvantage of PSM is that it only accounts for observed (and observable) covariates. Factors that affect assignment to treatment and outcome but that cannot be observed cannot be accounted for in the matching procedure. As the procedure only controls for observed variables, any hidden bias due to latent variables may remain after matching. Another issue is that PSM requires large samples, with substantial overlap between treatment and control groups.
General concerns with matching have also been raised by Judea Pearl, who has argued that hidden bias may actually increase because matching on observed variables may unleash bias due to dormant unobserved confounders. Similarly, Pearl has argued that bias reduction can only be assured (asymptotically) by modelling the qualitative causal relationships between treatment, outcome, observed and unobserved covariates. Confounding occurs when the experimenter is unable to control for alternative, non-causal explanations for an observed relationship between independent and dependent variables. Such control should satisfy the “backdoor criterion” of Pearl.