Optimistic Bayesian Sampling in Contextual-Bandit Problems
by Benedict C. May, Nathan Korda, Anthony Lee and David S. Leslie
In sequential decision problems in an unknown environment, the decision maker often faces a
dilemma over whether to explore to discover more about the environment, or to exploit current
knowledge. We address the exploration-exploitation dilemma in a general setting encompassing
both standard and contextualised bandit problems. The contextual bandit problem has recently
resurfaced in attempts to maximise click-through rates in web-based applications, a task with significant
commercial interest.
In this article we consider an approach of Thompson (1933) which makes use of samples from
the posterior distributions for the instantaneous value of each action. We extend the approach by
introducing a new algorithm, Optimistic Bayesian Sampling (OBS), in which the probability of
playing an action increases with the uncertainty in the estimate of the action value. This results in
better directed exploratory behaviour.
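To illustrate the two sampling rules, here is a minimal sketch for a Bernoulli bandit with independent Beta(1, 1) priors on each arm. The class names, the choice of prior, and the use of the posterior mean as the exploitative value estimate are illustrative assumptions, not the paper's exact implementation; the optimistic rule simply never uses a posterior sample that falls below the exploitative estimate.

```python
import numpy as np

class BernoulliThompson:
    """Thompson sampling with independent Beta(1, 1) priors per arm (illustrative)."""

    def __init__(self, n_arms, rng=None):
        self.alpha = np.ones(n_arms)  # 1 + observed successes per arm
        self.beta = np.ones(n_arms)   # 1 + observed failures per arm
        self.rng = rng or np.random.default_rng()

    def select(self):
        # Play the arm whose posterior sample of the mean reward is largest.
        samples = self.rng.beta(self.alpha, self.beta)
        return int(np.argmax(samples))

    def update(self, arm, reward):
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward


class BernoulliOBS(BernoulliThompson):
    """Optimistic Bayesian Sampling: never score an arm below its exploitative value."""

    def select(self):
        samples = self.rng.beta(self.alpha, self.beta)
        means = self.alpha / (self.alpha + self.beta)  # posterior mean as exploitative estimate
        # Optimistic value: the larger of the posterior sample and the posterior mean,
        # so arms with greater posterior uncertainty are played more often but are
        # never penalised by an unluckily low draw.
        return int(np.argmax(np.maximum(samples, means)))
```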
We prove that, under unrestrictive assumptions, both approaches result in optimal behaviour
with respect to the average reward criterion of Yang and Zhu (2002). We implement OBS and
measure its performance in simulated Bernoulli bandit and linear regression domains, and also
when tested with the task of personalised news article recommendation on a Yahoo! Front Page
Today Module data set. We find that OBS performs competitively when compared to recently
proposed benchmark algorithms and outperforms Thompson's method throughout.
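As a purely illustrative exercise (not a reproduction of the paper's experiments), the sketch above can be run on a small simulated Bernoulli bandit; the arm success probabilities and horizon below are arbitrary.

```python
def run(policy, probs, horizon, rng):
    """Simulate one run and return the cumulative reward (illustrative harness)."""
    total = 0.0
    for _ in range(horizon):
        arm = policy.select()
        reward = float(rng.random() < probs[arm])
        policy.update(arm, reward)
        total += reward
    return total

rng = np.random.default_rng(0)
probs = [0.1, 0.5, 0.6]  # arbitrary true success probabilities
for cls in (BernoulliThompson, BernoulliOBS):
    print(cls.__name__, run(cls(len(probs), rng), probs, horizon=10_000, rng=rng))
```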
Keywords: multi-armed bandits, contextual bandits, exploration-exploitation, sequential allocation,
Thompson sampling.
Full text of the paper (pdf),
which appeared in the Journal of Machine Learning Research, volume 13 (2012), pages 2069-2106.