Multi-Armed Bandit

Reinforcement-learning-powered recommendation engines.

Satsuko VanAntwerp
Rat's Nest

--

The August edition of Design+AI was jam-packed with theories, terminology, and ideas. We welcomed Inmar Giovoni — current Autonomy Engineering Manager at Uber ATG, and former Head of Data Science at Kobo — to share a case study with us. Inmar talked us through the thinking and decisions her Kobo team made when using reinforcement learning to suggest ebooks to their customers. It took a little longer than usual to get this write-up out — here is a roundup of the discussion from the event.

Multi-Armed Bandit

Inmar’s team wanted a data-driven way to determine the optimal arrangement of book carousels on the Kobo website to drive book purchases. This was a challenge: they needed to know which variables to include on the website, as well as in what arrangement to display them, for each customer segment. So, she used the multi-armed bandit algorithm.

Multi-Armed Bandit (MAB) algorithms are a form of reinforcement learning. The name comes from slot machines, a.k.a. one-armed bandits. Imagine you are at the casino in front of a row of slot machines. You want to maximize your winnings and have a limited amount of money to gamble. Since you have no prior knowledge about which machines pay out more often, you just start playing them, adjusting which machines you play, in what order, and how often, in order to bias towards the machines that maximize your reward. That is the concept behind MAB: you try a number of options, but once you have a sense of which one is offering the most success, you play that one more than the others (also known as the exploration/exploitation trade-off — more on that below).
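To make the idea concrete, here is a minimal sketch of one common MAB strategy, epsilon-greedy (my illustration, not necessarily the variant Kobo used). Most of the time it plays the machine with the best observed payout so far, and a small fraction of the time it picks a machine at random; the payout probabilities below are made up.

```python
import random

# Made-up payout probabilities for three slot machines (unknown to the player).
true_payout = [0.05, 0.12, 0.08]

pulls = [0, 0, 0]   # times each machine has been played
wins = [0, 0, 0]    # total reward observed per machine
epsilon = 0.1       # fraction of plays spent exploring at random

for _ in range(10_000):
    if random.random() < epsilon or 0 in pulls:
        arm = random.randrange(3)                              # explore
    else:
        arm = max(range(3), key=lambda a: wins[a] / pulls[a])  # exploit the best average so far
    pulls[arm] += 1
    wins[arm] += 1 if random.random() < true_payout[arm] else 0

print(pulls)  # most plays end up on machine 1, the best payer
```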

So, Inmar’s team took this method and applied it to organizing the variables on their e-book website — to figure out which book carousel combinations and recommendations resulted in more online book sales.

MAB is different from (better than!) A/B testing in two major ways.

One of the most exciting things about MAB is that it enables you to boost intended outcomes while testing is still live. For Inmar’s team, this meant that once a particular combination of book carousels (genre, popular now, recommended for you) started to perform well, the algorithm would boost that combination and decrease the number of customers seeing a combination that was performing less well. This is interesting for companies because it minimizes the money lost during testing. In contrast, with A/B testing, even if option B is yielding a less desirable outcome (fewer sales), half of the customers are shown this option until the end of the test — meaning that the company is potentially missing out on sales it could have captured by sending those customers to option A.
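To see the difference in action, here is a rough simulation (my own sketch with made-up conversion rates, not Kobo's code). A fixed A/B test splits visitors 50/50 for the whole test, while a simple Thompson-sampling bandit shifts traffic toward the better layout as evidence accumulates, so fewer customers are shown the weaker option.

```python
import random

rates = {"A": 0.10, "B": 0.06}   # hypothetical conversion rates, unknown to both methods
visitors = 10_000

# Classic A/B test: a fixed 50/50 split for the whole test.
ab_sales = sum(random.random() < rates[random.choice("AB")] for _ in range(visitors))

# Bandit (Thompson sampling): Beta(1, 1) priors, updated after every visitor.
successes = {"A": 1, "B": 1}
failures = {"A": 1, "B": 1}
bandit_sales = 0
for _ in range(visitors):
    arm = max("AB", key=lambda a: random.betavariate(successes[a], failures[a]))
    if random.random() < rates[arm]:
        successes[arm] += 1
        bandit_sales += 1
    else:
        failures[arm] += 1

print(ab_sales, bandit_sales)  # the bandit usually captures more sales during the test
```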

As you might imagine, maximizing profits while testing is desirable for companies. MAB also has healthcare applications — for example, when running clinical trials for a new pharmaceutical drug. Using MAB in this setting means that, if the new drug being tested is producing good results for participants, the trial administrators can let more participants receive the actual drug and fewer receive the placebo. Thus, MAB has the potential to improve health outcomes for people, and maybe even save lives.

Furthermore, MAB can be useful where there are numerous possible combinations of variables to consider — more than would be feasible to test in an A/B test, or at least very costly and time-consuming to do so. This was the case for the Kobo website; there were so many different ways to arrange and display book options that testing them all with an A/B test would have taken far too long.
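To put a (made-up) number on that: even a modest page produces far more layouts than a classic A/B test can reasonably cover.

```python
from math import perm

# Hypothetical example: 8 candidate carousels, 4 slots on the page.
layouts = perm(8, 4)   # ordered arrangements of 4 carousels chosen from 8
print(layouts)         # 1680 distinct layouts, far too many to A/B test one by one
```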

Contextual Multi-Armed Bandit

Inmar’s team recognized the diversity of people’s reading tastes and wanted to build more personalization into their algorithm — to make it smarter and more delightful for customers over time. So, they tweaked the MAB into a Contextual Multi-Armed Bandit. Here is how it worked: if I am someone who normally reads mystery novels (bucket a), sometimes reads leadership and management books (bucket b), and once in a blue moon reads a biography (bucket c), the contextual MAB starts to take these preferences into account.

This tweak means that the algorithm would generally show me books and carousels that match the segments I am closest to (buckets a and b), but occasionally it would recommend books from categories that I very rarely choose (bucket c). This starts to mimic what people do in real life and allows for the serendipity of seeing books that I might be interested in but that fall outside my usual buying behaviour.
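Here is a rough sketch of how context changes the picture, using a simple per-segment epsilon-greedy bandit as an illustration (not whatever Kobo actually implemented). The bandit keeps separate statistics for each reader segment, so my history steers which carousel it favours for me, while a small exploration rate still occasionally surfaces the biography carousel.

```python
import random
from collections import defaultdict

carousels = ["mystery", "leadership", "biography"]
epsilon = 0.1

# One set of (plays, clicks) counters per user segment: the "context".
stats = defaultdict(lambda: {c: [0, 0] for c in carousels})

def pick_carousel(segment):
    """Epsilon-greedy choice conditioned on the user's segment."""
    counters = stats[segment]
    if random.random() < epsilon or any(plays == 0 for plays, _ in counters.values()):
        return random.choice(carousels)  # explore, including rarely-chosen categories
    return max(carousels, key=lambda c: counters[c][1] / counters[c][0])  # exploit

def record_feedback(segment, carousel, clicked):
    stats[segment][carousel][0] += 1
    stats[segment][carousel][1] += int(clicked)

# Hypothetical usage: a mystery-loving reader mostly sees the mystery carousel,
# but the exploration rate occasionally surfaces biography as well.
choice = pick_carousel("mystery_reader")
record_feedback("mystery_reader", choice, clicked=(choice == "mystery"))
```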

Some Terminology

Cold Start Problem

What happens when a company wants to recommend a book, a movie, or a song to a new customer — but they don’t know anything about what the customer likes or dislikes, their behaviours, their purchasing habits, or their routines? That is the cold start problem: the challenge of not having information about a new user or customer when they first join a platform, which makes it difficult to segment them. It is also the reason that recommendations improve the longer you use a platform — the better the algorithm knows you and your preferences, the better the recommendations can be.
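One common way around it (my assumption, not something Inmar described) is to start a new user from an uninformative prior, so early recommendations are close to random and sharpen as the user's own feedback comes in.

```python
import random

# Beta(1, 1) is a uniform prior: with no data, every carousel looks equally promising.
new_user = {"mystery": (1, 1), "leadership": (1, 1), "biography": (1, 1)}

def recommend(user_stats):
    # Thompson sampling: sample a plausible click rate per carousel, recommend the best draw.
    return max(user_stats, key=lambda c: random.betavariate(*user_stats[c]))

print(recommend(new_user))  # essentially a coin flip for a cold-start user; it sharpens as counts grow
```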

Exploration / Exploitation

Imagine you’re at a restaurant. Do you order the thing on the menu that you know you’ll like? Or do you risk trying something new in case you’re missing out? At what point do you decide that you’ve tried enough different things and you just want the turkey burger? There is a similar trade-off happening with the MAB. The exploration/exploitation trade-off in reinforcement learning is illustrated by the multi-armed bandit problem, where the algorithm must decide between acquiring new knowledge and maximizing reward.
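For the curious, one classic way this trade-off is formalized is the UCB1 rule (a standard reference point, not something discussed at the event): each option is scored by its average reward plus a bonus that grows for options tried less often, so the algorithm exploits what looks best while still exploring the under-sampled dishes.

```python
from math import log, sqrt

def ucb1_score(avg_reward, times_tried, total_tries):
    """Average reward plus an exploration bonus for under-sampled options."""
    return avg_reward + sqrt(2 * log(total_tries) / times_tried)

# Hypothetical menu stats: (average enjoyment, times ordered)
menu = {"turkey burger": (0.80, 40), "daily special": (0.60, 3)}
total = sum(times for _, times in menu.values())

scores = {dish: ucb1_score(avg, n, total) for dish, (avg, n) in menu.items()}
print(max(scores, key=scores.get))  # the rarely-tried special can still win via its exploration bonus
```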

Where else have you seen the Multi-Armed Bandit used? What about the Contextual MAB?

-Satsuko

References

In writing this post, I discovered a great podcast on data science and machine learning, with some excellent episodes on the multi-armed bandit and reinforcement learning.
