Getting directions to a new location these days can be as simple as accessing a driving app like Google Maps. These types of apps give recommendations on which routes are optimal between point A and point B. How they reach their conclusions is dependent on an aspect of learning theory known as a multi-armed bandit problem. “Multi-armed bandit problems explore the general theme of how to get computers to run experiments and learn about the world through that experimentation the way people do,” says Robert D. Kleinberg, Computer Science.
The originators of this type of problem saw it as similar to a gambler with limited money trying to figure out which one-armed bandit slot machine gives the best odds before he runs out of cash. However in this case, the bandit has many possible solutions, or arms.
Your Navigation App—How the Algorithm Works
“A navigation app has an algorithm that conducts a sequence of experiments that try out each road segment as frequently as needed in order to estimate how good each segment is,” Kleinberg says. “Then it strings that experiential information together with knowledge of the road map and the potential ways to combine different segments into routes. By having data about each segment, the algorithm can make statistically accurate predictions about how quickly a driver using that route can get from point A to point B.”
Those who use a navigation app can end up taking a suggested route that actually ends up slower or so out-of-the-way that they wonder why the app suggested it in the first place. “The app may actually use person X, who logged in, say at 8:05, to run driving experiments that may be relevant to person Y, who logs in at 8:10,” Kleinberg explains. “Learning at the scale of an entire population like that creates great opportunities to learn much more efficiently. But it also creates game theoretic challenges because when you’re running experiments, you don’t always do what the individual user thinks is optimal.”
Product Recommendation Platforms—Another Multi-Armed Bandit Problem
This aspect of multi-armed bandit problems is especially salient to something like Amazon.com product recommendations, Kleinberg notes. Buyers on large platforms with many products are inclined to choose either the product they’ve bought before or one positively reviewed by many users. Yet, there could be a new product which doesn’t yet have reviews. “It’s in the platform’s interest to get as many people as possible to try the new product because it could be better,” Kleinberg says. “We’ll never know unless some people try it despite the prevailing belief that something else is better.” Along with recommendations, the platform can use incentives, such as sale prices, to encourage buyers to try the new product. But what is the trade-off between how much the platform is willing to lower prices and how much experimentation a customer is willing to engage in?
Tackling this question of how to bridge the gap between what the platform wants and what the individual wants, Kleinberg joined together with Peter I. Frazier, Operations Research and Information Engineering; Jon Kleinberg, Computer Science/Information Science; and David Kempe at the University of Southern California to explore the most optimal way to engage in what they termed incentivized exploration. The researchers created an equation that tells the designers of a platform, such as Amazon, the best learning outcomes they can achieve for a given investment.
“The import of our paper comes from delineating in quantitative terms, with a limited budget for experimentation, how much loss you should expect to experience in the long-run recommendation quality of your system,” Kleinberg says. “Knowing roughly what shape that trade-off has could offer useful guidance for people who are planning to launch a recommendation service or thinking of ways to refine their interface.”
Hearing Your Favorite Song, Again and Again—Recharging Bandits
“If you hear your favorite song and your enjoyment level is at five stars, and then you hear it again a minute later, you might only give it one star.”
For another paper, Kleinberg collaborated with Nicole Immorlica, senior researcher at Microsoft Research, to tackle a multi-armed bandit problem they called recharging bandits. These types of problems center on scheduling recurrent interventions that prefer to be spaced further apart in time. Think a song recommendation platform such as Pandora, where you enjoy hearing your favorite song but don’t want to hear it over and over. “The idea is that after you perform one of these actions and get benefit from it, say listening to your favorite song, it takes a while for that action to recharge that same amount of benefit again,” Kleinberg explains. “If you hear your favorite song and your enjoyment level is at five stars, and then you hear it again a minute later, you might only give it one star. If you hear it an hour later, it might be up to two stars, but if you hear it three days after that, your enjoyment has recharged back to five stars.”
Scheduling problems, even without learning, are very complicated to solve, and recharging bandit problems bring the additional complication of many parameters that must be learned about every alternative. “It’s not just how good or how bad a particular solution is on a one-dimensional continuum,” Kleinberg says, “but you also have to look at the dynamics of how the action recharges over time after you do it.”
Kleinberg and Immorlica created a learning algorithm that engages in multi-armed bandit-style experimentation and then takes that data and solves, to near optimality, this type of scheduling problem. “There are two forms of our algorithm,” Kleinberg says. “One is a practical algorithm you can run fairly quickly and is guaranteed to get at least 81 percent as good as the optimal schedule. The other one gets 99.9 percent as good as the optimal schedule but has an astronomically long run time. So depending on your computational resources, you can accept the 81 percent solution, or you can invest a huge amount of computation and get a solution that’s extremely close to 100 percent.”
Always a Passionate Mathematician
Kleinberg has been passionately drawn to math from his earliest days. “I can’t remember a time when I didn’t experience the world through the lens of math,” he says. He recalls moments from his childhood: creating a mathematical diagram to explain the odds of winning on the television game show $25,000 Pyramid; working out a formula to calculate the degrees of relationship between cousins in his family tree; creating a parabolic curve in elementary school out of segments of string passed through holes drilled in cardboard on an x-y axis—the same parabolic curve that shows up when the results of the incentivized exploration equation are plotted.
“I wish I could go back in time and talk to my fourth-grade self,” he says. “I’d say, ‘You know that art project you’re doing right now? One day you’re going to write a paper to solve a math problem and the answer is going to be that curve.’”