Because resources are limited, it is very important to get the most information from each experiment you do.
Well-designed experiments can produce significantly more information and often require fewer runs than haphazard or unplanned experiments.
Also, a well-designed experiment will ensure that you can evaluate the effects that you have identified as important.
As a rule, we accept a confidence level of 95%. This means accepting a 5% probability of a type 1 error (an "alpha error," or false positive): in 5% of all cases, an effect is declared significant even though in reality there is none. This low error probability, however, holds only when there is ONE test variant. It compounds quickly as you introduce multiple variants:
Cumulative alpha = 1 − (1 − alpha)^n

alpha = selected significance level, as a rule 0.05
n = number of test variants in your test (excluding the control)
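The formula can be checked with a few lines of Python; the variant counts below are illustrative:

```python
def cumulative_alpha(alpha: float, n: int) -> float:
    """Probability of at least one false positive across n test variants,
    each compared to the control at significance level alpha."""
    return 1 - (1 - alpha) ** n

# With the usual alpha = 0.05, the cumulative error grows quickly:
print(cumulative_alpha(0.05, 1))   # ~0.05 with one variant
print(cumulative_alpha(0.05, 3))   # ~0.14 with three variants
print(cumulative_alpha(0.05, 10))  # ~0.40 with ten variants
```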
As you increase the number of metrics, one remedy is to require a higher confidence level (a lower alpha) so that false positives do not accumulate.
A different method used in practice is the Bonferroni correction. It has the advantages of being simple, making no assumptions, and guaranteeing a familywise error rate no higher than the level you have specified.
To use it, divide the desired overall significance level alpha by the number of metrics n and test each metric at alpha/n.
For example, if you want an overall alpha of 0.05 and there are 5 metrics, then each metric is tested at 0.05/5 = 0.01.
Bonferroni methods may be very conservative. Alternatives include the closed testing procedure, the Boole-Bonferroni bound, and the Holm-Bonferroni method. The error rate above is often referred to as the familywise error rate (FWER). Another approach is to control the false discovery rate (FDR), defined as (# false positives)/(# rejections). FDR makes sense if you have a large number of metrics (say, 200).
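A minimal sketch of the two corrections discussed above, assuming the per-metric p-values have already been computed:

```python
def bonferroni_alpha(alpha: float, n_metrics: int) -> float:
    """Per-metric significance level that keeps the FWER at alpha."""
    return alpha / n_metrics

def holm_bonferroni(p_values, alpha=0.05):
    """Holm-Bonferroni step-down procedure: returns a list of booleans
    saying whether each p-value (in original order) is significant."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    significant = [False] * len(p_values)
    for rank, i in enumerate(order):
        # Compare the k-th smallest p-value against alpha / (n - k)
        if p_values[i] <= alpha / (len(p_values) - rank):
            significant[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return significant

# Per-metric alpha for 5 metrics at an overall 0.05: 0.05 / 5 = 0.01
threshold = bonferroni_alpha(0.05, 5)
flags = holm_bonferroni([0.01, 0.04, 0.03])
```

Holm-Bonferroni controls the same FWER as plain Bonferroni but rejects at least as many hypotheses, which is why it is usually preferred when the extra step is acceptable.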
An alternative to using multiple metrics is to use an ‘Overall Evaluation Criterion’ (OEC).
If this is too complicated, the number of test variants can be limited right from the start. Based on experience, no more than three variants plus the control should be run in one test. In doing so, the variants should always be well-conceived and hypothesis-driven.
There are different opinions on whether one user should be allowed in a number of tests at the same time.
While some argue that this is not a problem because the effects balance out, others see contamination effects as a result of parallel testing: interactions between tests and variants can lead to non-linear effects, greater variance, and reduced validity.
If the risk is low and traffic overlap between the tests is manageable, two tests can absolutely be allowed to run in parallel, so that one and the same user can be in multiple tests.
There are two solutions for tests that interact strongly:
Solution A: Play it safe; run tests separately from one another
All common testing tools allow dividing traffic among a number of tests. It is possible, for example, to test the placement of the recommendations with 50% of traffic, while the other 50% is used for the sale articles in the recommendations. This procedure ensures that each user sees only one of the running test concepts. In this case, make sure that the runtime is extended appropriately for every test, since each receives only part of the traffic.
If a 50/50 split is not possible, tests can always be run sequentially. In this case, prioritize the tests accordingly and synchronize them in your roadmap.
Solution B: Multivariate testing
Another possibility is to combine the test concepts into one multivariate test. This makes sense if the test concepts optimize toward the same goal and run on the same page (as in the example of the product recommendations), which means the variants must be reasonably combinable. If one test, for example, changes the main navigation and the other changes the UVPs in checkout, a multivariate test makes little sense because there are hardly any interactions between the concepts.
Plan an experiment
How much you prepare before starting experimentation depends on your problem. You might want to go through the following steps:
- Define the problem
- Developing a good problem statement helps ensure you are studying the correct variables. At this step, you identify the questions that you want to answer.
- Define the goal
- A well-defined goal will ensure that the experiment answers the correct questions and yields practical, usable information. At this step, you define the goals of the experiment.
- Develop an experimental plan that will provide meaningful information
- Be sure to consider relevant background information, such as theoretical principles, and knowledge obtained through observation or previous experimentation. For example, you might need to identify which factors or process conditions affect process performance and contribute to process variability. Or, if the process is already established and you have identified influential factors, you might want to determine optimal process conditions.
- Ensure the process and measurement systems are in control
- Ideally, both the process and the measurements should be in statistical control as measured by a functioning statistical process control (SPC) system. Even if you do not have the process completely in control, you must be able to reproduce process settings. You also need to determine the variability in the measurement system. If the variability in your system is greater than the difference/effect that you consider important, experimentation will not yield useful results.
Designing an Experiment
- Choose subject: What are the units in the population you are going to run the test on? (unit of diversion)
- Choose population: What population are you going to use (US only?)
Typically you want to divert on people rather than events, since the same user may otherwise see different changes. If you divert on a person, you typically use a cookie, which may change by platform; the alternative is to use a user id.
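One common way to implement the unit of diversion is to hash a stable identifier (cookie or user id) into a bucket, so the same user always sees the same variant. A sketch, where the experiment name and variant list are illustrative:

```python
import hashlib

def assign_variant(user_id: str, experiment: str, variants: list) -> str:
    """Deterministically map a user to a variant for a given experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

# The same user always lands in the same variant of the same experiment:
v1 = assign_variant("user-123", "reco_placement", ["control", "A", "B"])
v2 = assign_variant("user-123", "reco_placement", ["control", "A", "B"])
```

Including the experiment name in the hash input decorrelates assignments across experiments, which matters when the same user can be in several tests at once.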
In many process development and manufacturing applications, the number of potential variables (factors) is large. Screening (process characterization) is used to reduce the number of factors by identifying the most important factors that affect product quality. This reduction lets you concentrate process improvement efforts on the few most important factors. Different types of screening designs can screen different types of terms and detect or model curvature. If necessary, further optimization experiments can be done to model more complex interactions or to more precisely define the nature of the response surface.
The following designs are often used for screening:
- Definitive screening designs can estimate complex models for a small number of important factors identified from an experiment with many factors.
- 2-level full and fractional factorial designs are used extensively in industry.
- Plackett-Burman designs have low resolution, but their usefulness in some screening experimentation and robustness testing is widely recognized.
After you have identified the important terms by screening, you need to determine the optimal values for the experimental factors. Optimal factor values depend on the process goal. For example, you might want to maximize process yield or reduce product variability.
Verification involves performing a subsequent experiment at the predicted optimal conditions to confirm the optimization results. For example, you can do a few verification runs at the optimal settings, then obtain a confidence interval for the mean response.
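A sketch of the verification step: take a few runs at the predicted optimal settings and compute a t-based confidence interval for the mean response. The run values below are made up for illustration; 2.776 is the two-sided 95% t critical value for 4 degrees of freedom.

```python
import math
from statistics import mean, stdev

def mean_ci(samples, t_crit):
    """Confidence interval for the mean, given a t critical value
    for n - 1 degrees of freedom."""
    m = mean(samples)
    half = t_crit * stdev(samples) / math.sqrt(len(samples))
    return m - half, m + half

# Five verification runs at the predicted optimum (illustrative data):
runs = [91.2, 90.8, 91.5, 91.0, 91.3]
low, high = mean_ci(runs, t_crit=2.776)
print(f"95% CI for mean response: ({low:.2f}, {high:.2f})")
```

If the interval contains the predicted optimal response, the optimization result is confirmed at that confidence level.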
Aliasing, also known as confounding, occurs in fractional factorial designs because the design does not include all of the combinations of factor levels.
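Aliasing can be made concrete with a 2^(3-1) half fraction built from the generator C = AB: the column for factor C is identical to the A×B interaction column, so their effects cannot be separated.

```python
from itertools import product

# Half fraction of a 2^3 design using the generator C = A*B
# (levels coded as -1 / +1).
runs = [(a, b, a * b) for a, b in product([-1, 1], repeat=2)]

for a, b, c in runs:
    print(f"A={a:+d}  B={b:+d}  C={c:+d}  A*B={a*b:+d}")

# In every run, C equals A*B: the main effect of C is aliased
# (confounded) with the AB interaction.
```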
What is a block?
A block is a categorical variable that explains variation in the response variable that is not caused by the factors. Although each measurement should be taken under consistent experimental conditions (other than the factors that are being varied as part of the experiment), this is not always possible. Use blocks in designed experiments and analysis to minimize bias and variance of the error because of nuisance factors.
For example, a block can be the day on which the runs are performed.
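A minimal sketch of a randomized block design with day as the block, using illustrative treatment names: randomization happens only within each day, so day-to-day variation is separated from the treatment effect.

```python
import random

treatments = ["A", "B"]
days = ["Mon", "Tue", "Wed"]  # each day is one block

plan = {}
for day in days:
    order = treatments * 2       # two replicates of each treatment per day
    random.shuffle(order)        # randomize run order within the block only
    plan[day] = order

for day, order in plan.items():
    print(day, order)
```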
What is a hard-to-change factor?
A hard-to-change factor is a factor that is difficult to randomize completely because of time or cost constraints. For example, temperature is a common hard-to-change factor because adjusting temperature often requires significant time to stabilize.
Hard-to-change factors are often confused with blocking variables. However, there are several important differences between blocks and hard-to-change factors:
- In a blocked design, the blocks are nuisance factors that are only included in a design to obtain a more precise estimate of the error term. However, you are interested in estimating the effect of hard-to-change factors, such as how temperature affects the moisture of a cake.
- In a blocked experiment, the interaction between the blocking variable and the factors is not of interest. When you have a hard-to-change factor, you might be interested in interactions between the hard-to-change variable and other factors in the experiment.
- Designs with hard-to-change and easy-to-change factors have two different sizes of experimental units. The hard-to-change factors are applied to a large experimental unit. Within this unit, the observational units are small experimental units used to study the easy-to-change factors. With a block design, the experimental units are all the same size.
- Blocks are usually random factors while hard-to-change factors are usually fixed.
- Blocks are a collection of experimental units. Hard-to-change factors are applied to the experimental units.
What is orthogonality?
Two vectors are orthogonal if the sum of the products of their corresponding elements is 0. For example, consider the vectors a = (2, 3, 5, 0) and b = (−4, 1, 1, 4).
Multiplying the corresponding elements and summing gives the following result:
a·b = 2(−4) + 3(1) + 5(1) + 0(4) = −8 + 3 + 5 + 0 = 0
This shows that the two vectors are orthogonal.
The concept of orthogonality is important in Design of Experiments because it says something about independence. Experimental analysis of an orthogonal design is usually straightforward because you can estimate each main effect and interaction independently. If your design is not orthogonal, either by plan or by accidental loss of data, your interpretation might not be as straightforward.
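The same check in code, applied to the two vectors implied by the product shown above, a = (2, 3, 5, 0) and b = (−4, 1, 1, 4):

```python
def dot(u, v):
    """Sum of products of corresponding elements."""
    return sum(x * y for x, y in zip(u, v))

a = [2, 3, 5, 0]
b = [-4, 1, 1, 4]
print(dot(a, b))  # 0 -> the vectors are orthogonal
```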
For example, in a full factorial design in which 3 factors each have 2 levels, every combination of levels is run, giving 2^3 = 8 runs.
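Enumerating such a 2^3 full factorial in code, with illustrative factor names:

```python
from itertools import product

factors = ["temperature", "pressure", "time"]  # illustrative names
levels = [-1, 1]                               # low / high coding

# Every combination of levels across the 3 factors: 2**3 = 8 runs.
runs = list(product(levels, repeat=len(factors)))
for run in runs:
    print(dict(zip(factors, run)))
```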