Multi-Armed Bandits and Contextual-Bandit

Multi-armed bandit uses machine learning algorithms to minimize opportunity costs and minimize regret. They’re more efficient because they move traffic towards winning variations gradually, instead of forcing you to wait for a “final answer” at the end of an experiment. They’re faster because samples that would have gone to obviously inferior variations can be assigned to potential winners. The extra data collected on the high-performing variations can help separate the “good” arms from the “best” ones more quickly.”

When to Use Bandits

The situations where bandit testing seems to flourish are:

  • Headlines and Short-Term Campaigns
  • Automation for Scale
  • Targeting
  • Blending Optimization with Attribution

Bandits are appropriate any time that testing is required. As discussed above, traditional A-B testing can be thought of as a special case of the multi-bandit problem, but it is certainly not the optimal way of balancing exploration with exploitation. Therefore, by definition, it cannot hurt to reimagine any A-B testing framework as a multi-armed bandit problem. Moreover, significant gains may be obtained by switching to a more sophisticated testing algorithm such as epsilon-greedy. That said, the degree of improvements that can be expected depend on several factors.

Short term applications

When the window of time to exploit is less than or equal to the length of time required to complete an A-B test, large gains can be made by using a multi-armed bandit approach in order to begin exploiting as early as possible. This can occur, for example, in short term holiday promotions.

Capturing dynamic changes

When the phenomenon being tested changes significantly enough that the results of an A-B test can become invalid over time, multi-arm bandits provide an alternative to repeatedly retesting. By continuously exploring, one can maintain an optimal strategy.

Automating testing process

Multi-armed bandits can also provide value by eliminating the need for repeated intervention by analysts in order to perform repeated A-B tests.

 Image result for multi-armed bandit



Contextual Bandit

Use additional information to make a better decision among all actions of MAB

  • Epsilon-greedy:  a good balance between simplicity and performance

keep track of the number of pulls of the lever and the amount of rewards we have received from that lever. For example: 20% of the time, we choose a lever at random. The other 80% of the time, we choose the lever that has the highest expectation of rewards.

In computer science, a greedy algorithm is one that always takes the action that seems best at that moment. So, an epsilon-greedy algorithm is almost a fully greedy algorithm – most of the time it picks the option that makes sense at that moment.

  • def choose():
        if math.random() < 0.1:
            # exploration!
            # choose a random lever 10% of the time.
            # exploitation!
            # for each lever, 
                # calculate the expectation of reward. 
                # This is the number of trials of the lever divided by the total reward 
                # given by that lever.
            # choose the lever with the greatest expectation of reward.
        # increment the number of times the chosen lever has been played.
        # store test data in redis, choice in session key, etc..
    def reward(choice, amount):
        # add the reward to the total for the given lever.


Explore: randomly select action certain percent of time (say 20%)

Exploit (play greedy): pick the current best percent of time (say 80%)

epsilon greedy bandit test

There are some pros and cons to the epsilon-Greedy method. Pros include:

It’s simple and easy to implement.

It’s usually effective.

It’s not as affected by seasonality.

  • Explore-first
  • Bagging
  • Online cover


  • Optimum strategy for maximum successes.
  • No requirement for predetermined sample sizes or other parameters.
  • Codifying the process arguably makes ad-hoc alterations (p-hacking) less likely.
  • Higher levels of significance become practical.
  • Unlimited peeking.
  • The test can incorporate prior knowledge or risk assessment, via the choice of the initial sampling weights.


  • Sampling ratios must be adjusted. Google Analytics content experiments already run multi-armed bandit tests for you, but for other tools you may need to use acalculator and update the sampling ratio yourself.
  • Convergence is slower relative to total sample size. A fixed 50/50 sampling ratio aims for the fewest total samples, whereas multi-armed bandit aims for the fewest total failures.

MAB vs A/B testing: practical considerations

MAB: Find a good enough variant, while maximizing return, versions with the higher conversion rate receive a higher proportion of traffic

AB: versions that aren’t performing as well continue to receive the same amount of traffic as the versions with a higher conversion rate.

comparing A/B testing and multi-armed bandit algorithms head to head is wrong because they are clearly meant for different purposes. A/B testing is meant for strict experiments where focus is on statistical significance, whereas multi-armed bandit algorithms are meant for continuous optimization where focus is on maintaining higher average conversion rate.

If one of your variants is performing so poorly that business goals are threatened, of course you’re going to want to shut off this variant. Similarly, people will often start variants with low traffic allocation and then increase over time (ramp-up). Finally, sometimes people change midway through a test.

All three of the above can distort the results: For A/B testing, the test results are representative of the entire testing time and, that is, in equal parts. If you change the traffic this leads to certain time periods being over- or under-represented.

Shutting off a variant in the middle of the test has the same effect. If the content of a variant is changed, the test result will be a weird mixture of different hypotheses and concepts that finally no longer provides any insight at all.

Here we need to find a healthy balance of scientific rigor and validity & practical business concerns.

Here are some tips to achieve that balance:

  • If a variant is shut down as early as the first few days after the start of the test, it is recommended that the test should be started again but without the variant. In this case, the loss of time is not too high.
  • Before you change the variants during the test, better to start a new test and test the adapted variant against the control.
  • If you change the traffic to a higher level from day to day, make sure that you cover a sufficient amount of time with the maximum desired traffic quantity, whereby the traffic remains stable across the time.

MBA in Practice

Reassign traffic


Only on less visible features

  • UI/Copy change template
  • ML algo parameter (e.g. ads)
  • Heavily used in recommendation system

example: 3 weeks of promotion, to maximize sales, MAB is used.

MAB Limitations:

Not great for iterations and learn. In the long run, always exploring still wastes traffic on losing arm.

Why not stop the MAB when done?

MAB needs more traffic to reach statistical significance, more so when delta is small, vulnerable to seasonality/time cycle etc: may wrongly allocate traffic to worse arms.

Comments & Responses

Leave a Reply

Your email address will not be published. Required fields are marked *