Multi-Armed Bandits and Contextual Bandits
Posted by lhmay on Apr 29, 2018 in Data Science
Multi-armed bandit algorithms use machine learning to minimize opportunity cost and regret. They're more efficient than traditional A/B tests because they shift traffic toward winning variations gradually, instead of forcing you to wait for a "final answer" at the end of an experiment. They're faster because samples that would have gone to obviously inferior variations can be reassigned to potential winners. The extra data collected on the high-performing variations helps separate the "good" arms from the "best" ones more quickly.
When to Use Bandits
The situations where bandit testing seems to flourish are:
 Headlines and Short-Term Campaigns
 Automation for Scale
 Targeting
 Blending Optimization with Attribution
Bandits are appropriate any time testing is required. As discussed above, traditional A/B testing can be thought of as a special case of the multi-armed bandit problem, but it is certainly not the optimal way of balancing exploration with exploitation. Therefore, by definition, it cannot hurt to reimagine any A/B testing framework as a multi-armed bandit problem. Moreover, significant gains may be obtained by switching to a more sophisticated testing algorithm such as epsilon-greedy. That said, the degree of improvement that can be expected depends on several factors.
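To make that trade-off concrete, here is a small simulation (the conversion rates and parameters are invented for illustration, not from the post) comparing a fixed 50/50 A/B allocation against an epsilon-greedy bandit on two arms; over the same traffic, the bandit typically accumulates more conversions:

```python
import random

def simulate(policy, true_rates, rounds=10000, seed=0):
    """Run a 2-arm experiment and return total conversions collected."""
    rng = random.Random(seed)
    pulls, wins, total = [0, 0], [0, 0], 0
    for _ in range(rounds):
        arm = policy(rng, pulls, wins)
        r = 1 if rng.random() < true_rates[arm] else 0
        pulls[arm] += 1
        wins[arm] += r
        total += r
    return total

def ab_5050(rng, pulls, wins):
    # classic A/B test: fixed 50/50 allocation for the whole experiment
    return rng.randrange(2)

def epsilon_greedy(rng, pulls, wins, eps=0.1):
    # explore 10% of the time (and until every arm has been tried),
    # otherwise play the arm with the best empirical conversion rate
    if rng.random() < eps or 0 in pulls:
        return rng.randrange(2)
    return max(range(2), key=lambda a: wins[a] / pulls[a])

rates = [0.05, 0.10]  # hypothetical true conversion rates
print(simulate(ab_5050, rates), "conversions with 50/50 A/B")
print(simulate(epsilon_greedy, rates), "conversions with epsilon-greedy")
```

The fixed split keeps sending half the visitors to the 5% arm for the entire test; the bandit reassigns most of that traffic once the 10% arm pulls ahead.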
Short-term applications
When the window of time to exploit is less than or equal to the length of time required to complete an A/B test, large gains can be made by using a multi-armed bandit approach to begin exploiting as early as possible. This can occur, for example, in short-term holiday promotions.
Capturing dynamic changes
When the phenomenon being tested changes significantly enough that the results of an A/B test can become invalid over time, multi-armed bandits provide an alternative to repeated retesting. By continuously exploring, one can maintain an optimal strategy.
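One simple way to track such drift, as a sketch (the class and step size are my own illustration), is to replace the plain sample mean with an exponential recency-weighted average, i.e. a constant step size, so that older observations decay and the bandit adapts when an arm's conversion rate changes:

```python
class DriftingArmEstimate:
    """Reward estimate that favors recent observations, so a bandit
    can adapt when the underlying conversion rate drifts over time."""
    def __init__(self, step=0.1):
        self.step = step   # constant step size; higher = faster forgetting
        self.value = 0.0

    def update(self, reward):
        # exponential recency-weighted average: recent rewards dominate,
        # unlike a plain sample mean that weighs all history equally
        self.value += self.step * (reward - self.value)

est = DriftingArmEstimate(step=0.1)
for r in [1, 1, 1, 0, 0, 0, 0, 0]:  # arm degrades partway through
    est.update(r)
print(round(est.value, 3))  # estimate has moved down toward the recent 0s
```

A plain mean over the same data would sit at 3/8 = 0.375; the decayed estimate is noticeably lower, reflecting the recent failures.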
Automating testing process
Multi-armed bandits can also provide value by eliminating the need for analysts to repeatedly intervene and rerun A/B tests.
Contextual Bandit
Contextual bandits use additional side information (the context) to make a better decision among the actions of a MAB.
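As a minimal sketch (the context names and action names are my own illustration, not from the post), a contextual epsilon-greedy policy can keep a separate reward estimate per (context, action) pair:

```python
import random
from collections import defaultdict

pulls = defaultdict(int)      # (context, action) -> times played
rewards = defaultdict(float)  # (context, action) -> total reward

def choose(context, actions, epsilon=0.2, rng=random):
    """Epsilon-greedy over the estimates for this specific context."""
    if rng.random() < epsilon:
        return rng.choice(actions)  # explore
    def mean(a):  # exploit: best empirical mean reward in this context
        n = pulls[(context, a)]
        return rewards[(context, a)] / n if n else 0.0
    return max(actions, key=mean)

def record(context, action, reward):
    pulls[(context, action)] += 1
    rewards[(context, action)] += reward

# hypothetical usage: mobile visitors may prefer a different layout
a = choose("mobile", ["layout_a", "layout_b"])
record("mobile", a, 1.0)
```

The only change from a plain MAB is that the estimates are keyed by context, so "mobile" and "desktop" visitors can converge to different winners.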
 Epsilon-greedy: a good balance between simplicity and performance
Keep track of the number of pulls of each lever and the total reward received from it. For example: 20% of the time, choose a lever at random; the other 80% of the time, choose the lever with the highest expectation of reward.
In computer science, a greedy algorithm is one that always takes the action that seems best at that moment. So an epsilon-greedy algorithm is an almost fully greedy algorithm: most of the time it picks the option that makes sense at that moment, but occasionally it explores a random option instead.

```python
import random

counts = {}  # lever -> number of times played
totals = {}  # lever -> total reward given by that lever

def choose(levers, epsilon=0.1):
    if random.random() < epsilon:
        # exploration! choose a random lever epsilon of the time.
        choice = random.choice(levers)
    else:
        # exploitation! for each lever, calculate the expectation of
        # reward: the total reward given by that lever divided by the
        # number of trials of that lever.
        def expected(lever):
            n = counts.get(lever, 0)
            return totals.get(lever, 0.0) / n if n else 0.0
        # choose the lever with the greatest expectation of reward.
        choice = max(levers, key=expected)
    # increment the number of times the chosen lever has been played.
    counts[choice] = counts.get(choice, 0) + 1
    # store test data in redis, choice in session key, etc.
    return choice

def reward(choice, amount):
    # add the reward to the total for the given lever.
    totals[choice] = totals.get(choice, 0.0) + amount
```
Explore: randomly select an action a certain percentage of the time (say 20%)
Exploit (play greedy): pick the current best the rest of the time (say 80%)
There are some pros and cons to the epsilon-greedy method. Pros include:
It’s simple and easy to implement.
It’s usually effective.
It’s not as affected by seasonality.
Other exploration algorithms for contextual bandits include:
 Explore-first
 Bagging
 Online cover
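Explore-first, for instance, spends a fixed up-front budget sampling the arms uniformly and then commits to the empirical best for the remaining traffic. A rough sketch (the rates, budget, and round counts are illustrative, not from the post):

```python
import random

def explore_first(true_rates, explore_rounds=600, total_rounds=5000, seed=1):
    """Explore-first: sample arms uniformly for a fixed budget,
    then commit to the empirically best arm for the rest."""
    rng = random.Random(seed)
    k = len(true_rates)
    pulls, wins, total = [0] * k, [0] * k, 0
    for t in range(total_rounds):
        if t < explore_rounds:
            arm = t % k  # uniform round-robin exploration phase
        else:
            # commit phase: play the arm with the best observed rate
            arm = max(range(k), key=lambda a: wins[a] / pulls[a])
        r = 1 if rng.random() < true_rates[arm] else 0
        pulls[arm] += 1
        wins[arm] += r
        total += r
    return total, pulls

total, pulls = explore_first([0.04, 0.06, 0.09])
```

The obvious weakness, compared with epsilon-greedy, is that all exploration happens at the start: if the world drifts after the commit, explore-first never notices.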
Advantages
 Optimal strategy for maximizing successes.
 No requirement for predetermined sample sizes or other parameters.
 Codifying the process arguably makes ad-hoc alterations (p-hacking) less likely.
 Higher levels of significance become practical.
 Unlimited peeking.
 The test can incorporate prior knowledge or risk assessment, via the choice of the initial sampling weights.
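One concrete way prior knowledge can enter, as the last point suggests, is through the initial weights of a Bayesian bandit. A sketch using Thompson sampling with Beta priors (the prior counts and arm names are invented for illustration):

```python
import random

class BetaArm:
    """Thompson sampling arm with a Beta prior. Setting alpha/beta above
    1 encodes prior belief, e.g. alpha=3, beta=97 ~ 'roughly 3% conversion
    rate, with the confidence of about 100 prior observations'."""
    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha = alpha  # prior + observed successes
        self.beta = beta    # prior + observed failures

    def sample(self, rng=random):
        return rng.betavariate(self.alpha, self.beta)

    def update(self, reward):
        if reward:
            self.alpha += 1
        else:
            self.beta += 1

arms = {"control": BetaArm(3, 97), "variant": BetaArm(1, 1)}

def choose(rng=random):
    # play the arm whose posterior sample is highest this round
    return max(arms, key=lambda name: arms[name].sample(rng))

picked = choose()
arms[picked].update(1)
```

The uninformed variant starts with a flat Beta(1, 1) prior, so early on it gets explored heavily; the control's informative prior anchors it near 3% until data says otherwise.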
Disadvantages
 Sampling ratios must be adjusted. Google Analytics content experiments already run multi-armed bandit tests for you, but for other tools you may need to use a calculator and update the sampling ratio yourself.
 Convergence is slower relative to total sample size. A fixed 50/50 sampling ratio aims for the fewest total samples, whereas a multi-armed bandit aims for the fewest total failures.
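If you do have to update the sampling ratio yourself, one common Bayesian-bandit recipe is to allocate traffic in proportion to each variant's estimated probability of being the best, computed by Monte Carlo sampling from Beta posteriors. A sketch (the observed counts are illustrative):

```python
import random

def prob_best(successes, trials, draws=20000, seed=0):
    """Estimate each variant's probability of being best by repeatedly
    sampling from Beta posteriors; use the result as the traffic split."""
    rng = random.Random(seed)
    k = len(successes)
    best_counts = [0] * k
    for _ in range(draws):
        samples = [
            rng.betavariate(1 + successes[i], 1 + trials[i] - successes[i])
            for i in range(k)
        ]
        # count which variant won this posterior draw
        best_counts[samples.index(max(samples))] += 1
    return [c / draws for c in best_counts]

# hypothetical data: 30/1000 vs 45/1000 conversions so far
weights = prob_best(successes=[30, 45], trials=[1000, 1000])
```

The returned weights sum to 1 and can be plugged in directly as the new sampling ratio, with the better-performing variant receiving most of the traffic.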
MAB vs A/B testing: practical considerations
MAB: finds a good-enough variant while maximizing return; versions with a higher conversion rate receive a higher proportion of traffic.
A/B: versions that aren't performing as well continue to receive the same amount of traffic as the versions with a higher conversion rate.
Comparing A/B testing and multi-armed bandit algorithms head to head is misguided because they are clearly meant for different purposes. A/B testing is meant for strict experiments where the focus is on statistical significance, whereas multi-armed bandit algorithms are meant for continuous optimization where the focus is on maintaining a higher average conversion rate.
If one of your variants is performing so poorly that business goals are threatened, of course you're going to want to shut off that variant. Similarly, people will often start variants with a low traffic allocation and then increase it over time (ramp-up). Finally, sometimes people change a variant's content midway through a test.
All three of the above can distort the results. For A/B testing, valid results require that each time period of the test be represented in equal parts; if you change the traffic allocation, certain periods become over- or underrepresented. Shutting off a variant in the middle of the test has the same effect. And if the content of a variant is changed, the test result becomes a strange mixture of different hypotheses and concepts that ultimately provides no insight at all.
Here we need to find a healthy balance between scientific rigor and validity on the one hand and practical business concerns on the other.
Here are some tips to achieve that balance:
 If a variant is shut down within the first few days after the start of the test, it is recommended to restart the test without that variant. In this case, the loss of time is not too high.
 Rather than changing a variant during the test, it is better to start a new test and run the adapted variant against the control.
 If you ramp traffic up from day to day, make sure that you cover a sufficient amount of time at the maximum desired traffic level, with traffic remaining stable across that time.
MAB in Practice
Reassign traffic
Batches
Only on less visible features
 UI/Copy change template
 ML algo parameter (e.g. ads)
 Heavily used in recommendation systems
Example: for a 3-week promotion where the goal is to maximize sales, a MAB is used.
MAB Limitations:
Not great for iterating and learning. In the long run, continual exploration still wastes traffic on losing arms.
Why not stop the MAB when done?
A MAB needs more traffic than an A/B test to reach statistical significance, even more so when the delta between arms is small, and it is vulnerable to seasonality, time-of-day cycles, etc.: it may wrongly allocate traffic to worse arms.