In this post, I would like to discuss some strategies for evaluating contextual multi-armed bandit (cMAB) models in an unbiased offline evaluation setting. The setting is not only applicable to cMAB models but can also serve as a standard process for evaluating any machine learning model trained on online feedback data.
Review of Contextual Multi-armed Bandit Problems
A cMAB problem is a sequential decision-making process. Bandit problems involve making a decision in each round. Once a decision is made, an observation is collected and the corresponding reward is computed. The contextual-bandit formalism generalizes this classic setting by introducing contextual information into the interaction loop.
Formally, we define a set of actions $A = \{1, 2, \dots, K\}$, a contextual vector $x_t \in X$, a reward vector $r_t = (r_{t,1}, \dots, r_{t,K})$, where each $r_{t,a} \in [0, 1]$, and a policy $\pi: X \rightarrow A$. A contextual bandit describes a round-by-round interaction between a learner and the environment. At each round $t$, the problem can be decomposed into the following steps (a minimal simulation sketch follows the list):
- The environment chooses $(x_t, r_t)$ from some unknown distribution $D$. Only $x_t$ is revealed to the learner while the reward vector $r_t$ is not.
- Upon observing $x_t$, the learner uses some policy $\pi$ to choose an action $a_t \in A$, and in return observes the corresponding reward $r_{t, a_t}$.
- (Optional) The policy is revised with the data collected for this round.
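To make the interaction loop concrete, here is a minimal simulation sketch in Python. The linear-logistic environment, the feature dimension, and the epsilon-greedy learner are illustrative assumptions, not part of the formulation above.

```python
# A minimal sketch of the round-by-round cMAB interaction loop (illustrative).
import numpy as np

rng = np.random.default_rng(0)
K, d, T = 5, 10, 1000                      # actions, context dimension, rounds
theta_true = rng.normal(size=(K, d))       # hidden environment parameters (assumed)

def environment():
    """Draw (x_t, r_t) i.i.d.; only x_t will be shown to the learner."""
    x = rng.normal(size=d)
    r = rng.binomial(1, 1 / (1 + np.exp(-theta_true @ x)))  # one reward per action
    return x, r

def policy(x, theta_hat, eps=0.1):
    """A simple epsilon-greedy policy on top of a linear score (illustrative)."""
    if rng.random() < eps:
        return int(rng.integers(K))
    return int(np.argmax(theta_hat @ x))

theta_hat = np.zeros((K, d))
for t in range(T):
    x_t, r_t = environment()               # step 1: environment draws (x_t, r_t)
    a_t = policy(x_t, theta_hat)           # step 2: learner picks an action ...
    reward = r_t[a_t]                      # ... and observes only r_{t, a_t}
    # step 3 (optional): revise the policy with the data from this round
    theta_hat[a_t] += 0.01 * (reward - theta_hat[a_t] @ x_t) * x_t
```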
Here, we assume the policy $\pi_{\theta}$ is parameterized by an unknown parameter $\theta$. Ideally, we would like to choose the policy maximizing the expected reward:

$$\pi^{*} = \arg\max_{\pi} \, \mathbb{E}_{(x, r) \sim D}\left[ r_{\pi(x)} \right]$$
An important fact in cMAB is that only the rewards of the chosen actions are observed. An online learner must therefore find a good balance between exploration and exploitation, a central challenge in bandit problems. For offline policy evaluation, such partial observation raises even further difficulties. Data collected from a contextual bandit is often of the form $(x_t, a_t, r_{t, a_t})$, where $a_t$ is the action chosen for context $x_t$ when collecting the data, and $r_{t, a_t}$ is the corresponding reward. If this data is used to evaluate a policy $\pi$ that chooses a different action $\pi(x_t) \neq a_t$, then we simply do not have the reward signal to evaluate the policy in that context.
Policy Evaluation
In order to evaluate different policies in an unbiased manner, we need to gather data using a logging policy. This policy is specially designed so that the data can be re-used for offline evaluation. As pointed out in [1], randomized data collection has been known to be a critical condition for drawing valid causal conclusions. In [1], the authors proposed the following data collection procedure. At each round,
- The environment chooses $(x_t, r_t)$ i.i.d. from some unknown distribution $D$, and only reveals the context $x_t$.
- Based on $x_t$, one computes a multinomial distribution $p_t(\cdot)$ over the actions $A$. A random action $a_t$ is drawn according to this distribution, and the corresponding reward $r_{t, a_t}$ and probability mass $p_t(a_t)$ are logged. Note that $p_t$ may depend on $x_t$.
A key point here is that, rather than choosing the optimal action at each round, we choose a random action according to some distribution over the actions, and the probability of selecting that action is recorded.
At the end of the process, we have a dataset containing data of the form $(x_t, a_t, r_{t, a_t}, p_t(a_t))$. In statistics, the probabilities $p_t(a_t)$ are also known as propensity scores.
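Below is a sketch of what such a logging loop might look like, assuming the same kind of simulated environment as in the earlier sketch and a softmax scoring function for the logging policy; both are illustrative assumptions. The essential parts are that the action is sampled rather than maximized, and that the propensity $p_t(a_t)$ is stored with each record.

```python
# Randomized data collection with logged propensity scores (illustrative sketch).
import numpy as np

rng = np.random.default_rng(1)
K, d, T = 5, 10, 10_000                    # actions, context dimension, logged rounds
theta_true = rng.normal(size=(K, d))       # hidden environment parameters (assumed)
theta_log = rng.normal(size=(K, d))        # parameters of the logging policy (assumed)

def environment():
    """Draw (x_t, r_t) i.i.d. from an assumed linear-logistic model."""
    x = rng.normal(size=d)
    r = rng.binomial(1, 1 / (1 + np.exp(-theta_true @ x)))
    return x, r

logged = []                                # records of the form (x_t, a_t, r_{t,a_t}, p_t(a_t))
for t in range(T):
    x_t, r_t = environment()
    scores = theta_log @ x_t
    p_t = np.exp(scores - scores.max())    # softmax over actions; p_t depends on x_t
    p_t /= p_t.sum()
    a_t = rng.choice(K, p=p_t)             # sample an action instead of taking the argmax
    logged.append((x_t, a_t, r_t[a_t], p_t[a_t]))  # log the propensity score as well
```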
In order to evaluate the value of a policy $\pi$ on such a dataset, the following estimator is used:
$$\hat{V}(\pi) = \frac{1}{T} \sum_{t=1}^{T} \frac{r_{t, a_t} \, \mathbb{1}\{\pi(x_t) = a_t\}}{p_t(a_t)} \tag{1}$$
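A direct transcription of estimator (1) might look as follows; the `logged` list of tuples and the `target_policy` callable are assumed interfaces, matching the logging sketch above.

```python
# Inverse propensity scoring (IPS) estimate of V(pi), following Equation (1).
def ips_value(logged, target_policy):
    """Offline estimate of the value of a deterministic policy from randomized logs.

    logged: list of (x_t, a_t, r_t_at, p_t_at) tuples with propensity scores.
    target_policy: callable mapping a context to an action (assumed interface).
    """
    total = 0.0
    for x_t, a_t, r_t_at, p_t_at in logged:
        if target_policy(x_t) == a_t:      # indicator 1{pi(x_t) = a_t}
            total += r_t_at / p_t_at       # importance-weight by the logged propensity
    return total / len(logged)
```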
This evaluation method is also used in [2] and [3].

Unbiased Offline Evaluation
We define the value of a policy as:
$$V(\pi) = \mathbb{E}_{(x, r) \sim D}\left[ \sum_{a \in A} \pi(a \mid x) \, r_a \right] \tag{2}$$
When the policy is deterministic, meaning that, given a contextual vector $x$, the arm $\pi(x)$ is always chosen, the above value becomes:

$$V(\pi) = \mathbb{E}_{(x, r) \sim D}\left[ r_{\pi(x)} \right] \tag{3}$$
In both cases, we want to prove:

$$\mathbb{E}\left[ \hat{V}(\pi) \right] = V(\pi) \tag{4}$$
In the following discussion, we focus on the deterministic case. Let us expand the left-hand side as:

$$\mathbb{E}\left[ \hat{V}(\pi) \right] = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{(x_t, r_t) \sim D, \, a_t \sim p_t}\left[ \frac{r_{t, a_t} \, \mathbb{1}\{\pi(x_t) = a_t\}}{p_t(a_t)} \right] \tag{5}$$
The important step in the proof is the following identity, obtained by taking the expectation over the randomly drawn arm $a_t \sim p_t$:

$$\mathbb{E}_{a_t \sim p_t}\left[ \frac{r_{t, a_t} \, \mathbb{1}\{\pi(x_t) = a_t\}}{p_t(a_t)} \right] = \sum_{a \in A} p_t(a) \, \frac{r_{t, a} \, \mathbb{1}\{\pi(x_t) = a\}}{p_t(a)} = r_{t, \pi(x_t)} \tag{6}$$
Thus, in expectation, we have:

$$\mathbb{E}\left[ \hat{V}(\pi) \right] = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{(x_t, r_t) \sim D}\left[ r_{t, \pi(x_t)} \right] \tag{7}$$
As the pairs $(x_t, r_t)$ are sampled i.i.d. from $D$, each $r_{t, \pi(x_t)}$ is a random sample of the reward obtained under policy $\pi$, so the expectation in Equation (7) amounts to $V(\pi)$ as defined in Equation (3), which proves Equation (4). Note that the key part is Equation (6): we need to select arms stochastically, with non-zero probability on every arm the evaluated policy may choose.
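As a sanity check of this argument, one can simulate the whole pipeline and compare the IPS estimate against a Monte Carlo estimate of the true value, which is only available in simulation. The environment and the uniform-random logging policy below are again illustrative assumptions.

```python
# End-to-end check of the unbiasedness argument (Equations 4-7) on simulated data.
import numpy as np

rng = np.random.default_rng(2)
K, d, T = 5, 10, 100_000                   # actions, context dimension, logged rounds
theta_true = rng.normal(size=(K, d))       # hidden environment parameters (assumed)
theta_target = rng.normal(size=(K, d))     # parameters of the policy to evaluate (assumed)

def environment():
    x = rng.normal(size=d)
    r = rng.binomial(1, 1 / (1 + np.exp(-theta_true @ x)))
    return x, r

def target_policy(x):
    """A deterministic policy pi(x) to be evaluated offline."""
    return int(np.argmax(theta_target @ x))

# Collect randomized logs with a uniform logging policy, so p_t(a) = 1/K for all a.
ips_sum, true_sum = 0.0, 0.0
for t in range(T):
    x_t, r_t = environment()
    a_t = int(rng.integers(K))
    if target_policy(x_t) == a_t:
        ips_sum += r_t[a_t] / (1.0 / K)    # Equation (1) with uniform propensities
    true_sum += r_t[target_policy(x_t)]    # oracle value, only observable in simulation

print("IPS estimate :", ips_sum / T)       # the two numbers should be close,
print("Monte Carlo V:", true_sum / T)      # reflecting Equation (4)
```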
References

- [1] L. Li, S. Chen, J. Kleban, and A. Gupta. Counterfactual estimation and optimization of click metrics for search engines. Technical Report MSR-TR-2014-32, Microsoft Research, 2014.
- [2] L. Li, W. Chu, J. Langford, and X. Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, WSDM '11, pages 297–306, New York, NY, USA, 2011. ACM.
- [3] A. L. Strehl, J. Langford, L. Li, and S. Kakade. Learning from logged implicit exploration data. In NIPS, pages 2217–2225, 2010.
Comments

Hi,
I have a question about the data gathering: the performance, user experience, etc. will be affected when randomly choosing an action, right? How is this handled?
Thanks
This “random” is NOT uniformly random; therefore, you can still generate a randomized yet user-friendly result set.