Multivariate Tests (Orthogonal Design)

Because resources are limited, it is very important to get the most information from each experiment you do.

Well-designed experiments can produce significantly more information and often require fewer runs than haphazard or unplanned experiments.

Also, a well-designed experiment will ensure that you can evaluate the effects that you have identified as important.

As a rule, we tend to accept a confidence level of 95%. This means that you accept a 5% probability of a type I error (“alpha error,” or a false positive): when there is in reality no effect at all, the test will still report a significant effect in 5% of cases. This low error probability, however, is valid only in the case of ONE test variant. It grows quickly when you introduce multiple variants. Assuming the comparisons are independent:

Cumulative alpha = 1-(1-Alpha)^n

Alpha = selected significance level, as a rule 0.05

n = number of test variants in your test (without the control)

The more variants being tested, the higher the risk of an incorrect decision.
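As a rough illustration of the formula above, the following Python sketch tabulates the cumulative alpha for a few variant counts. The function name `cumulative_alpha` is made up for this example, and the formula assumes that the comparisons are independent of each other.

```python
def cumulative_alpha(alpha, n_variants):
    """Chance of at least one false positive across n independent comparisons."""
    return 1 - (1 - alpha) ** n_variants


if __name__ == "__main__":
    # Print the cumulative alpha for 1, 3, 5 and 10 variants at alpha = 0.05.
    for n in (1, 3, 5, 10):
        print(n, round(cumulative_alpha(0.05, n), 4))
```

With three variants plus a control, the chance of at least one false positive is already about 14%, and with ten variants it is about 40%.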

As you increase the number of metrics, you can use a higher confidence level to overcome false positives.

A different method used in practice is the Bonferroni correction. It has the advantages of being simple, making no assumptions, and being guaranteed to give an α_overall at least as low as you have specified.

To use it, calculate

α_individual = α_overall / n

n = number of metrics you are testing

For example, if you want α_overall to be 0.05 and there are 5 metrics, then α_individual will be 0.05/5 = 0.01.
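The Bonferroni adjustment can be sketched in a few lines of Python. The function names and the p-values below are hypothetical and only serve to show the mechanics: each metric is compared against the overall alpha divided by the number of metrics.

```python
def bonferroni_alpha(alpha_overall, n_metrics):
    """Per-metric significance level under the Bonferroni correction."""
    return alpha_overall / n_metrics


def significant_after_bonferroni(p_values, alpha_overall=0.05):
    """Flag each p-value that stays significant at the corrected level."""
    threshold = bonferroni_alpha(alpha_overall, len(p_values))
    return [p <= threshold for p in p_values]


if __name__ == "__main__":
    # Hypothetical p-values for five metrics; the corrected threshold is 0.01.
    p_values = [0.004, 0.02, 0.03, 0.2, 0.5]
    print(bonferroni_alpha(0.05, len(p_values)))
    print(significant_after_bonferroni(p_values))
```

Only the first metric passes the corrected threshold here, although three of the five would have passed an uncorrected 0.05.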

Bonferroni methods may be very conservative. Alternatives include the closed testing procedure, the Boole-Bonferroni bound, and the Holm-Bonferroni method. The α_overall above is often referred to as the familywise error rate (FWER). Another quantity you can control is the false discovery rate (FDR), defined as the expected value of (# false positives)/(# rejections). Controlling the FDR makes sense if you have a large number of metrics (for example, 200).
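To make the difference to plain Bonferroni concrete, here is a minimal sketch of the Holm-Bonferroni step-down procedure. The function name and the p-values are hypothetical. The idea is to sort the p-values, compare the smallest against alpha/m, the next against alpha/(m-1), and so on, stopping at the first one that fails.

```python
def holm_bonferroni(p_values, alpha_overall=0.05):
    """Return one boolean per p-value: True where the hypothesis is rejected."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The threshold loosens with each step: alpha/m, alpha/(m-1), ...
        if p_values[idx] <= alpha_overall / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: stop at the first non-rejection
    return reject


if __name__ == "__main__":
    # Hypothetical p-values for four metrics.
    print(holm_bonferroni([0.005, 0.015, 0.02, 0.04]))
```

For these four hypothetical p-values, Holm-Bonferroni rejects all four hypotheses, whereas plain Bonferroni (threshold 0.05/4 = 0.0125) would reject only the first.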

An alternative to using multiple metrics is to use an ‘Overall Evaluation Criterion’ (OEC).

If this is too complicated, the number of test variants can be limited right from the start. Based on experience, no more than three variants plus the control should be run in one test. In doing so, the variants should always be well-conceived and hypothesis-driven.

There are different opinions on whether one user should be allowed in a number of tests at the same time.

While some argue that this is not a problem because the effects balance out, others see contamination effects as a result of parallel testing: interactions between tests and variants can lead to non-linear effects, greater variances and sinking validity.


If the risk is low and traffic overlap between the tests is manageable, two tests can absolutely be allowed to run in parallel, so that one and the same user can be in multiple tests.

There are two solutions for tests that interact strongly:

Solution A: Play it safe; run tests separately from one another

All common testing tools allow you to divide traffic among a number of tests. It is possible, for example, to test the placement of the recommendations on 50% of the traffic, while the other 50% is used to test the sale articles in the recommendations. This procedure ensures that each user sees only one of the running test concepts. In this case, you should make sure that the test runtime is appropriately extended for every test, because each test receives only half of the traffic.

If a 50/50 split is not possible, tests can always be run sequentially. In this case, you should prioritize the tests and schedule them accordingly in your roadmap.

Solution B: Multivariate testing

Another possibility is to combine the test concepts into one multivariate test. This makes sense if the test concepts optimize for the same goal and run on the same page (like the example regarding the product recommendations). This means the variants must be reasonably combinable. If one test, for example, changes the main navigation and the other changes the UVPs in checkout, then combining them makes little sense because there are hardly any interactions between the concepts.
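Combining two test concepts into one multivariate test means running every combination of their levels. The sketch below enumerates those combinations for the recommendations example. The factor names and level labels are hypothetical placeholders, not taken from any particular testing tool.

```python
from itertools import product

# Hypothetical factors and levels for the product-recommendations example.
factors = {
    "placement": ["current", "new"],
    "sale_articles": ["hidden", "shown"],
}


def full_factorial(factors):
    """List every combination of factor levels as a dict per test cell."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


cells = full_factorial(factors)

if __name__ == "__main__":
    for cell in cells:
        print(cell)
```

Two factors with two levels each give four cells, one of which (both factors at their current level) serves as the control. Each added factor or level multiplies the number of cells, and with it the traffic you need.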


Plan an experiment

How much you prepare before starting experimentation depends on your problem. You might want to go through the following steps:

  1. Define the problem. Developing a good problem statement helps ensure you are studying the correct variables. At this step, you identify the questions that you want to answer.
  2. Define the goal. A well-defined goal will ensure that the experiment answers the correct questions and yields practical, usable information. At this step, you define the goals of the experiment.
  3. Develop an experimental plan that will provide meaningful information. Be sure to consider relevant background information, such as theoretical principles, and knowledge obtained through observation or previous experimentation. For example, you might need to identify which factors or process conditions affect process performance and contribute to process variability. Or, if the process is already established and you have identified influential factors, you might want to determine optimal process conditions.
  4. Ensure the process and measurement systems are in control. Ideally, both the process and the measurements should be in statistical control as measured by a functioning statistical process control (SPC) system. Even if you do not have the process completely in control, you must be able to reproduce process settings. You also need to determine the variability in the measurement system. If the variability in your system is greater than the difference/effect that you consider important, experimentation will not yield useful results.

Designing an Experiment

  1. Choose subject: What are the units in the population you are going to run the test on? (unit of diversion)
  2. Choose population: What population are you going to use (US only?)
  3. Size
  4. Duration

Typically you want to assign people and not events; if you assign events, the same user may see different variants. If you assign by person, you typically use a cookie, but a cookie may change from one platform to another. The alternative is to use a user ID.
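One common way to keep a user in the same variant across visits is to derive the variant from a hash of the user ID. The following is a minimal sketch of that idea, not a description of how any particular testing tool works. The function name, the experiment name and the variant labels are hypothetical.

```python
import hashlib


def assign_variant(user_id, experiment, variants):
    """Deterministically map a user to one variant of one experiment."""
    # Including the experiment name gives each experiment its own split.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]


if __name__ == "__main__":
    variants = ["control", "variant_a", "variant_b"]
    for user in ("user-1", "user-2", "user-3"):
        print(user, assign_variant(user, "recommendation-placement", variants))
```

Because the assignment depends only on the user ID and the experiment name, the same user lands in the same variant on every platform where that ID is known.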

Screening phase

In many process development and manufacturing applications, the number of potential variables (factors) is large. Screening (process characterization) is used to reduce the number of factors by identifying the most important factors that affect product quality. This reduction lets you concentrate process improvement efforts on the few most important factors. Different types of screening designs can screen different types of terms and detect or model curvature. If necessary, further optimization experiments can be done to model more complex interactions or to more precisely define the nature of the response surface.

The following designs are often used for screening:

  • Definitive screening designs can estimate complex models for a small number of important factors that were in an experiment with many factors.
  • 2-level full and fractional factorial designs are used extensively in industry.
  • Plackett-Burman designs have low resolution, but their usefulness in some screening experimentation and robustness testing is widely recognized.

Optimization phase

After you have identified the important terms by screening, you need to determine the optimal values for the experimental factors. Optimal factor values depend on the process goal. For example, you might want to maximize process yield or reduce product variability.

Verification phase

Verification involves performing a subsequent experiment at the predicted optimal conditions to confirm the optimization results. For example, you can do a few verification runs at the optimal settings, then obtain a confidence interval for the mean response.

Aliasing, also known as confounding, occurs in fractional factorial designs because the design does not include all of the combinations of factor levels. 
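A small example may make aliasing more tangible. The sketch below builds a half fraction of a three-factor, two-level design using the generator C = A*B, which is one standard choice and is assumed here purely for illustration. In the resulting four runs, the column for factor C is identical to the column for the A*B interaction, so the two effects cannot be told apart.

```python
from itertools import product

# Half fraction of a 2^3 design: choose A and B freely, then set C = A*B.
half_fraction = [(a, b, a * b) for a, b in product((-1, 1), repeat=2)]

# Column of the A*B interaction and column of the main effect C.
ab_column = [a * b for a, b, _ in half_fraction]
c_column = [c for _, _, c in half_fraction]

# If the two columns are identical, C is aliased with the A*B interaction.
aliased = ab_column == c_column

if __name__ == "__main__":
    for run in half_fraction:
        print(run)
    print("C aliased with A*B:", aliased)
```

The half fraction needs only four runs instead of eight, and the price is that any estimate of the effect of C also contains the A*B interaction.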

What is a block?

A block is a categorical variable that explains variation in the response variable that is not caused by the factors. Although each measurement should be taken under consistent experimental conditions (other than the factors that are being varied as part of the experiment), this is not always possible. Use blocks in designed experiments and analysis to minimize bias and variance of the error because of nuisance factors.

For example, the day on which the runs are performed can be a block.

What is a hard-to-change factor?

A hard-to-change factor is a factor that is difficult to randomize completely because of time or cost constraints. For example, temperature is a common hard-to-change factor because adjusting temperature often requires significant time to stabilize.

Hard-to-change factors are often confused with blocking variables. However, there are several important differences between blocks and hard-to-change factors:

  • In a blocked design, the blocks are nuisance factors that are only included in a design to obtain a more precise estimate of the error term. However, you are interested in estimating the effect of hard-to-change factors, such as how temperature affects the moisture of a cake.
  • In a blocked experiment, the interaction between the blocking variable and the factors is not of interest. When you have a hard-to-change factor, you might be interested in interactions between the hard-to-change variable and other factors in the experiment.
  • Designs with hard-to-change and easy-to-change factors have two different sizes of experimental units. The hard-to-change factors are applied to a large experimental unit. Within this unit, the observational units are small experimental units used to study the easy-to-change factors. With a block design, the experimental units are all the same size.
  • Blocks are usually random factors while hard-to-change factors are usually fixed.
  • Blocks are a collection of experimental units. Hard-to-change factors are applied to the experimental units.

What is orthogonality?

Two vectors are orthogonal if the sum of the products of their corresponding elements is 0. For example, consider the following vectors a and b:

a = (2, 3, 5, 0)
b = (–4, 1, 1, 4)

You can multiply the corresponding elements of the vectors to show the following result:

a*b = 2(–4) + 3(1) + 5(1) + 0(4) = –8 + 3 + 5 + 0 = 0

This shows that the two vectors are orthogonal.
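The same check can be written as a short Python sketch. The helper name `dot` is chosen for this example, and the vector values are the ones used in the arithmetic above.

```python
def dot(u, v):
    """Sum of the products of corresponding elements of two vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(u, v))


a = [2, 3, 5, 0]
b = [-4, 1, 1, 4]

if __name__ == "__main__":
    # A dot product of 0 means the two vectors are orthogonal.
    print(dot(a, b))
```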

The concept of orthogonality is important in Design of Experiments because it says something about independence. Experimental analysis of an orthogonal design is usually straightforward because you can estimate each main effect and interaction independently. If your design is not orthogonal, either by plan or by accidental loss of data, your interpretation might not be as straightforward.

Consider a 2^3 full factorial design (3 factors, each with 2 levels) with eight runs. Each row below is one run, and the columns are the factors A, B and C.
A B C
1 –1 –1
1 –1 1
–1 –1 1
–1 1 –1
–1 1 1
–1 –1 –1
1 1 1
1 1 –1

To show that each column (vector) is orthogonal to the other columns, multiply A*B, A*C and B*C.

  • A*B = 1(–1) +1(–1) – 1(–1) – 1(1) – 1(1) – 1(–1) + 1(1) + 1(1) = –4 + 4 = 0
  • A*C = 1(–1) +1(1) – 1(1) – 1(–1) – 1(1) – 1(–1) + 1(1) + 1(–1) = –4 + 4 = 0
  • B*C = –1(–1) – 1(1) – 1(1) + 1(–1) + 1(1) – 1(–1) + 1(1) + 1(–1) = –4 + 4 = 0

So in a sense, factor A is estimated independently from B and C and vice versa.
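The pairwise products above can also be checked programmatically. This sketch uses the eight runs listed earlier, in the same order, and additionally checks that the A*B interaction column is orthogonal to C. The variable names are chosen for this example.

```python
from itertools import combinations

# The eight runs of the 2^3 full factorial, as (A, B, C).
design = [
    (1, -1, -1), (1, -1, 1), (-1, -1, 1), (-1, 1, -1),
    (-1, 1, 1), (-1, -1, -1), (1, 1, 1), (1, 1, -1),
]

# Split the runs into one column (vector) per factor.
columns = {name: [run[i] for run in design] for i, name in enumerate("ABC")}


def dot(u, v):
    """Sum of the products of corresponding elements."""
    return sum(x * y for x, y in zip(u, v))


# Dot product of every pair of main-effect columns.
pairwise = {
    f"{p}*{q}": dot(columns[p], columns[q]) for p, q in combinations("ABC", 2)
}

# The interaction column A*B is also orthogonal to the main effect C.
ab_column = [a * b for a, b in zip(columns["A"], columns["B"])]
ab_dot_c = dot(ab_column, columns["C"])

if __name__ == "__main__":
    print(pairwise)
    print(ab_dot_c)
```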

The estimates for the effects and coefficients will remain unchanged when you remove interactions from the model. The other output will change because the experimental error (MSE) is adjusted accordingly, with more degrees of freedom.

In conclusion, a designed experiment is orthogonal if the effects of any factor balance out (sum to zero) across the effects of the other factors. Orthogonality guarantees that the effect of one factor or interaction can be estimated separately from the effect of any other factor or interaction in the model.

In general, coefficients can be calculated from this formula:

b = (X'X)^(–1) X'Y

Term    Description
b       the vector of coefficients
X       the design matrix
Y       the response vector

For a balanced, orthogonal design with no covariates, coefficients for main effects have a simple relationship to factor means.
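For an orthogonal two-level design coded as –1 and +1, the least-squares solution b = (X'X)^(–1) X'Y simplifies, because X'X is n times the identity matrix. Each coefficient is then the dot product of its column with the response, divided by the number of runs. The sketch below illustrates this with a hypothetical, noise-free response, so the coefficients are recovered exactly. It also shows that the main effect of A (mean at +1 minus mean at –1) is twice its coefficient.

```python
# The eight runs of a 2^3 full factorial, as (A, B, C).
design = [
    (1, -1, -1), (1, -1, 1), (-1, -1, 1), (-1, 1, -1),
    (-1, 1, 1), (-1, -1, -1), (1, 1, 1), (1, 1, -1),
]


def response(a, b, c):
    """Hypothetical noise-free response, used only for illustration."""
    return 10 + 2 * a - 1 * b + 0.5 * c


y = [response(*run) for run in design]
n = len(design)

# Model columns: an intercept column of ones plus one column per factor.
columns = {"Intercept": [1] * n}
for i, name in enumerate("ABC"):
    columns[name] = [run[i] for run in design]

# Orthogonal design: each coefficient is (column . y) / n.
coefficients = {
    name: sum(x * yi for x, yi in zip(col, y)) / n
    for name, col in columns.items()
}

# Main effect of A as a difference of factor-level means.
high = [yi for run, yi in zip(design, y) if run[0] == 1]
low = [yi for run, yi in zip(design, y) if run[0] == -1]
effect_a = sum(high) / len(high) - sum(low) / len(low)

if __name__ == "__main__":
    print(coefficients)
    print(effect_a)
```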

With the results of an MVT, you naturally first check which variant has achieved the highest (and significant) uplift. However, this only tells you which combination of factors achieved the uplift. It is also important to analyze the influence of the individual factors on the conversion rate. This can be done with the help of an analysis of variance (ANOVA). This method isolates the effect of the individual factors (in the example, colors and layout) on the conversion rate.

Validate the results in the follow-up test

To increase faith in your MVT results, you can also validate the test winner by running a subsequent A/B test. You simply run the winning combination against the relevant control.

The effects of individual factors can be calculated in isolation (e.g. colors)

Analysis of variance (ANOVA) is a collection of statistical models and their associated procedures (such as “variation” among and between groups) used to analyze the differences among group means. ANOVA was developed by the statistician and evolutionary biologist Ronald Fisher. In the ANOVA setting, the observed variance in a particular variable is partitioned into components attributable to different sources of variation. In its simplest form, ANOVA provides a statistical test of whether or not the means of several groups are equal, and therefore generalizes the t-test to more than two groups. ANOVA is useful for comparing (testing) three or more means (groups or variables) for statistical significance. It is conceptually similar to multiple two-sample t-tests, but is more conservative (results in less type I error) [1].
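The partitioning of variance can be illustrated with a minimal one-way ANOVA sketch. It computes the F statistic as the ratio of the between-group mean square to the within-group mean square. The function name and the three small groups of numbers are hypothetical. Turning F into a p-value requires the F distribution, which is left out here.

```python
def one_way_anova_f(groups):
    """Return (F statistic, df between groups, df within groups)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means)
    )
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


if __name__ == "__main__":
    # Hypothetical measurements for three groups.
    print(one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))
```

A large F value means that the group means differ by more than the spread within the groups would suggest.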

In statistics, multivariate analysis of variance (MANOVA) is a procedure for comparing multivariate sample means. As a multivariate procedure, it is used when there are two or more dependent variables [1], and is typically followed by significance tests involving individual dependent variables separately. It helps to answer [2]:

  1. Do changes in the independent variable(s) have significant effects on the dependent variables?
  2. What are the relationships among the dependent variables?
  3. What are the relationships among the independent variables?

MANOVA is a generalized form of univariate analysis of variance (ANOVA),[1] although, unlike univariate ANOVA, it uses the covariance between outcome variables in testing the statistical significance of the mean differences.

One thing to be wary of is Simpson’s paradox, where the effect in aggregate may indicate one trend, and at a granular level may show an opposite trend.
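A small numeric sketch shows how this can happen. The visitor and conversion counts below are hypothetical and chosen so that the control wins within each device segment while the treatment wins in aggregate, simply because the two arms have a very different mix of desktop and mobile traffic.

```python
# Hypothetical (conversions, visitors) per arm and device segment.
data = {
    "control": {"desktop": (81, 87), "mobile": (192, 263)},
    "treatment": {"desktop": (234, 270), "mobile": (55, 80)},
}


def segment_rate(arm, segment):
    """Conversion rate of one arm within one segment."""
    conversions, visitors = data[arm][segment]
    return conversions / visitors


def overall_rate(arm):
    """Conversion rate of one arm across all segments."""
    conversions = sum(c for c, _ in data[arm].values())
    visitors = sum(v for _, v in data[arm].values())
    return conversions / visitors


# Winner within each segment, and winner on the aggregated data.
segment_winner = {
    seg: max(data, key=lambda arm: segment_rate(arm, seg))
    for seg in ("desktop", "mobile")
}
overall_winner = max(data, key=overall_rate)

if __name__ == "__main__":
    print(segment_winner)
    print(overall_winner)
```

Checking the results per segment, and making sure the segment mix is comparable between arms, helps to catch this kind of reversal.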

The effect may change as you roll the change out more widely. There could also be seasonal effects. For example, students on summer break behave very differently than when they are back in school. The same applies to Black Friday and other holidays. One way to handle this is to leave a small sample out as a hold-out group and track it over time.
