

Machine Learning as an Engineering Discipline

Recently, Leon Bottou delivered two exciting keynote speeches on machine learning as an engineering discipline. I suggest that anyone who uses machine learning techniques to build software read through them at least once. They contain deep thinking about machine learning and software engineering.

  1. Two big challenges in machine learning, given at ICML 2015.
  2. How Big Data changes Statistical Machine Learning, given at IEEE BigData 2015.

Structured Sparse Regression for Recommender Systems

Feature-based latent factor models have received increasing attention in recent years due to their capability to effectively solve the cold-start problem. Many feature-based collaborative filtering (CF) models have been proposed recently, and they can be grouped into two categories. The first category includes all variants of latent factor models (LFM), which have proven to be an effective approach to personalization and recommender systems. The core of LFM is to learn user-specific and item-specific features from user-item interactions and utilize these features for future predictions/recommendations. State-of-the-art LFM exploit low-rank latent spaces of users and items and treat latent factors learnt from user-item historical data as features. This type of model has achieved significant success in a number of applications, including the Netflix competition. The second category is the factorization machine (FM), which explicitly learns the mapping function from features to rating scores, circumventing the dependency on user/item latent factors found in latent factor models and resulting in an effective model for the cold-start problem [9].

Although these feature-based CF models have been shown to be effective, they do not utilize feature structure information. For example, conventional latent factor models (e.g., matrix or tensor factorization models) like RLFM [1, 2] learn mapping functions from user/item features to user/item latent vectors, assuming the features have a flat first-order structure. Later, [10] showed that this kind of mapping can be extended to any non-linear model. Although the formalism is flexible, it leaves too much room for practitioners to choose which non-linear model to use for a particular application. It is also hard to incorporate human prior knowledge on the feature structure into the framework except through careful feature engineering, and the proposed inference algorithm is difficult to use in large-scale settings. Similar to RLFM, Gantner et al. [4] proposed a model to explicitly learn the mapping function from features to latent factors, resulting in an effective model for the cold-start problem, but it still makes the flat first-order feature structure assumption. In another line of work, Rendle et al. [9] proposed a more compact model called the factorization machine (FM). FM is essentially a second-order regression model that directly maps the user-item-event concatenated features to a rating score by learning implicit mapping functions from features to latent factors, again yielding an effective model for the cold-start problem. However, encoding structural human prior information still boils down to sophisticated feature engineering, and it is not clear how to incorporate heterogeneous feature structures into model training to enhance rating prediction performance. Though FM considers second-order feature structure, it simply uses all feature pairs for prediction.

A large body of work in the sparse coding area has shown that many natural signals have a sparse representation in terms of basic components, and that a sparse model often outperforms a dense one while also having a variable selection effect. Inspired by this, a sparse model that uses an appropriate subset of feature pairs might perform better. In practice, human prior knowledge or explicit structure information about these features is also sometimes available. For example, the topical categories of news articles may naturally be organized into hierarchies or graphs. Another good example is demographic information about users, especially geo-locations that align with the countries, states and cities defined in real-world geo-political settings. Such prior knowledge and structures are invaluable for better user understanding and profiling, and eventually for better recommendation results.

However, it is not straightforward to encode this kind of structural prior knowledge into state-of-the-art recommendation models like RLFM and FM. One approach might be to construct features capturing these structures and embed them into regression models, but the interplay between constructing such features optimally and training a better regression model that maps them to latent features becomes non-trivial. Some previous work imposes structural information on latent variable models, which are not necessarily directed graphical models. For instance, He et al. [5] proposed a general learning framework which induces sparsity on the undirected graphical model imposed on the vector of latent factors. Although that paper shares a similar idea with our framework, their work cannot handle heterogeneous types of features and complex dependencies, and it is hard to link it to the state-of-the-art LFM used in CF settings. Along the same line of undirected graphical models, Min et al. [8] proposed sparse high-order Boltzmann machines, aiming to capture dependencies between latent variables. Again, it is not obvious how to plug that model into state-of-the-art CF approaches.

In this paper, we propose a structured sparse second-order regression model that incorporates structural prior knowledge in a principled way. The notion of feature types is introduced such that different types of features can have different structures (e.g., topical categories versus geographical locations). We consider two kinds of structures. For inter-typed features (of different kinds), the model learns sparse relationships between different types of features (e.g., age and gender). For intra-typed features (of the same kind) that form a hierarchy (tree), e.g., the country-state-city tree for geo-location feature terms, the model learns a sparse hierarchical structure on the tree such that if a parent feature edge (or, interchangeably, a feature pair) is selected, then all its descendant edges are selected as well, and if a parent edge is removed then all its descendant edges are removed too.
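To make the intra-typed (tree) structure concrete, the following minimal Python sketch, written for this post under simplified assumptions (a tiny hypothetical country-state-city tree, one coefficient per edge), computes a tree-structured group-lasso penalty in the spirit of [6, 11]: there is one l2 group per subtree, so zeroing out a group removes an edge together with all of its descendant edges.

import numpy as np

# Hypothetical toy hierarchy: each node owns one coefficient in w.
tree = {
    'country': {'idx': 0, 'children': ['state']},
    'state':   {'idx': 1, 'children': ['city']},
    'city':    {'idx': 2, 'children': []},
}

def subtree_indices(node):
    # Indices of `node` and all of its descendants.
    idxs = [tree[node]['idx']]
    for child in tree[node]['children']:
        idxs.extend(subtree_indices(child))
    return idxs

def hierarchical_penalty(w, lam=1.0):
    # One l2-norm group per subtree: if a node's group is driven to
    # zero, that node's edge and every descendant edge are removed.
    return lam * sum(np.linalg.norm(w[subtree_indices(n)]) for n in tree)

w = np.array([0.8, 0.3, 0.0])  # country and state edges kept, city pruned
print(hierarchical_penalty(w))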

You can view the paper in detail here: [PDF].

References

  1. D. Agarwal and B.-C. Chen. Regression-based latent factor models. In Proceedings of KDD, pages 19–28. ACM, 2009.
  2. D. Agarwal and B.-C. Chen. fLDA: matrix factorization through latent Dirichlet allocation. In Proceedings of WSDM, pages 91–100. ACM, 2010.
  3. T. Chen, W. Zhang, Q. Lu, K. Chen, Z. Zheng, and Y. Yu. SVDFeature: A toolkit for feature-based collaborative filtering. The Journal of Machine Learning Research, 13(1):3619–3622, Dec. 2012.
  4. Z. Gantner, L. Drumond, C. Freudenthaler, S. Rendle, and L. Schmidt-Thieme. Learning attribute-to-feature mappings for cold-start recommendations. In Proceedings of the 2010 IEEE International Conference on Data Mining, ICDM ’10, pages 176–185, 2010.
  5. Y. He, Y. Qi, K. Kavukcuoglu, and H. Park. Learning the dependency structure of latent factors. In Advances in Neural Information Processing Systems 25, pages 2375–2383. 2012.
  6. R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for hierarchical sparse coding. The Journal of Machine Learning Research, 12:2297–2334, 2011.
  7. Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, Aug. 2009.
  8. M. R. Min, X. Ning, C. Cheng, and M. Gerstein. Interpretable sparse high-order Boltzmann machines. In AISTATS, pages 614–622, 2014.
  9. S. Rendle. Factorization machines with libfm. ACM Transactions on Intelligent Systems and Technology, 3(3):57:1–57:22, May 2012.
  10. L. Zhang, D. Agarwal, and B.-C. Chen. Generalizing matrix factorization through flexible regression priors. In Proceedings of RecSys, pages 13–20. ACM, 2011.
  11. P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, pages 3468–3497, 2009.

Poisson Matrix Factorization

In recent years, Matrix Factorization (MF) methods have played a pivotal role in research on Collaborative Filtering and Recommender Systems. The basic assumption is that the data can be formed into an N \times M matrix X, where N is the number of users and M is the number of items. Each rating X_{i,j} is modeled as a dot-product of the latent factor \theta_{i} for user i and the latent factor \phi_{j} for item j. In classic Probabilistic Matrix Factorization [1], the rating X_{i,j} is defined as:

(1)   \begin{equation*}X_{i,j} \sim \mathcal{N}(\theta_{i}^{T}\phi_{j}, \sigma^{2})\end{equation*}

where each user factor and each item factor is drawn from another set of Gaussians:

(2)   \begin{align*}\theta_{i} \sim \mathcal{N}(0, \sigma_{U}^{2}I) \\\phi_{j} \sim \mathcal{N}(0, \sigma_{I}^{2}I)\end{align*}
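To illustrate the generative process in Equations (1) and (2), here is a minimal numpy sketch (the dimensions and variances are arbitrary choices for this post):

import numpy as np

rng = np.random.default_rng(0)
N, M, K = 100, 50, 5                 # users, items, latent dimension
sigma, sigma_U, sigma_I = 0.5, 1.0, 1.0

theta = rng.normal(0.0, sigma_U, size=(N, K))  # user factors, Eq. (2)
phi = rng.normal(0.0, sigma_I, size=(M, K))    # item factors, Eq. (2)
X = rng.normal(theta @ phi.T, sigma)           # ratings, Eq. (1)

print((X < 0).mean())  # a large fraction of the "ratings" is negative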

This definition has two issues:

  1. It does not prevent ratings from becoming negative, which is a natural consequence of the Gaussian assumption.
  2. If no heuristics are applied, the model needs to explain all zero ratings as well, dramatically impacting predictive performance.

To demonstrate the second point, note that the log likelihood of the data can be given as:

(3)   \begin{equation*}\log P(X \, | \, \theta, \phi ) = - \frac{1}{2\sigma^{2}} \sum_{i} \sum_{j} ( X_{i,j} - \theta_{i}^{T}\phi_{j})^{2} - \frac{1}{2} N \times M \log \sigma^{2} + C\end{equation*}

where C is a constant. Obviously, even if X_{i,j} is zero, the model still needs to explain those ratings. In [1], the authors use an indicator variable for whether a rating is observed and simply ignore the unobserved ones, which might be effective but is not justified within the model. In [2], the authors do not ignore unobserved or zero ratings but assign them a small confidence value, which is still a heuristic.

Recently, Prem Gopalan et al. [3, 4, 5, 6] proposed a new model called Poisson Factorization (PF) to address these two issues. The central idea is to replace the Gaussian assumption with a Poisson distribution:

(4)   \begin{equation*}X_{i,j} \sim \mathrm{Poisson}(\theta_{i}^{T}\phi_{j})\end{equation*}

where \theta_{i} and \phi_{j} are drawn from user-specific and item-specific Gamma distributions. One obvious outcome of this model is that X_{i,j} is modeled as a non-negative value and the latent factors are non-negative as well. For the second issue mentioned above, consider the probability of the data:

(5)   \begin{equation*}P(X_{i,j} \, | \, \theta_{i}, \phi_{j}) = (\theta_{i}^{T}\phi_{j})^{X_{i,j}}\frac{\exp \left\{ -\theta_{i}^{T} \phi_{j} \right\}}{X_{i,j}!}\end{equation*}

and therefore, summing over all user-item pairs, the log probability is:

(6)   \begin{equation*}\log P(X \, | \, \theta, \phi) = \sum_{i,j} \left\{ X_{i,j}\log \left( \theta_{i}^{T}\phi_{j} \right) - \log X_{i,j}! - \theta_{i}^{T} \phi_{j} \right\}\end{equation*}

As 0! = 1, if X_{i,j} = 0 then the first term vanishes and \log X_{i,j}! = \log 1 = 0, so only the terms with non-zero X_{i,j} remain (and the remaining -\theta_{i}^{T}\phi_{j} term factorizes into (\sum_{i}\theta_{i})^{T}(\sum_{j}\phi_{j}), which is cheap to compute). Therefore, as mentioned in the papers, the model by definition supports sparse matrices.
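This sparsity property is easy to exploit in code. The following sketch (my own, not from the papers) evaluates Equation (6) while touching only the non-zero entries, stored as COO triplets; the -\theta_{i}^{T}\phi_{j} term over all pairs collapses into a single dot product of factor sums:

import numpy as np
from scipy.special import gammaln

def pf_log_likelihood(rows, cols, vals, theta, phi):
    # Rates theta_i^T phi_j for the non-zero entries only.
    rates = np.einsum('ik,ik->i', theta[rows], phi[cols])
    # X log(rate) - log X!, using gammaln(X + 1) == log X!
    nonzero_terms = np.sum(vals * np.log(rates) - gammaln(vals + 1.0))
    # The -theta_i^T phi_j term runs over ALL (i, j) pairs but
    # factorizes into (sum_i theta_i)^T (sum_j phi_j).
    total_rate = theta.sum(axis=0) @ phi.sum(axis=0)
    return nonzero_terms - total_rate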

Another interesting property of PF, mentioned in [5, 6], is that we can rewrite the Poisson observation model as a two-stage process in which a user u first decides on a budget b_{u} to spend on items, and then spends this budget rating the items she is interested in:

(7)   \begin{align*}b_{u} &\sim \mathrm{Poisson}(\theta_{u}^{T}\textstyle\sum_{j} \phi_{j}) \\ {}[X_{u,1},\cdots,X_{u,M}] &\sim \mathrm{Multinomial}\Bigl(b_{u}, \Bigl[\frac{\theta_{u}^{T}\phi_{1}}{\theta_{u}^{T}\sum_{j}\phi_{j}}, \cdots, \frac{\theta_{u}^{T}\phi_{M}}{\theta_{u}^{T}\sum_{j}\phi_{j}}\Bigr]\Bigr)\end{align*}

This shows that learning a PF model for user-item ratings is effectively the same as learning a budget for each user while also learning how that budget is distributed across items.

References

  1. Ruslan Salakhutdinov, Andriy Mnih: Probabilistic Matrix Factorization. NIPS 2007: 1257-1264
  2. Yifan Hu, Yehuda Koren, Chris Volinsky: Collaborative Filtering for Implicit Feedback Datasets. ICDM 2008: 263-272
  3. Prem Gopalan, Laurent Charlin, David M. Blei: Content-based recommendations with Poisson factorization. NIPS 2014: 3176-3184
  4. Prem Gopalan, Francisco J. Ruiz, Rajesh Ranganath, David M. Blei: Bayesian Nonparametric Poisson Factorization for Recommendation Systems. AISTATS 2014: 275-283
  5. Prem Gopalan, Jake M. Hofman, David M. Blei: Scalable Recommendation with Poisson Factorization. CoRR abs/1311.1704 (2013)
  6. Prem Gopalan, Jake M. Hofman, David M. Blei: Scalable Recommendation with Poisson Factorization. UAI 2015.

Smoothing Techniques for Complex Relationships between Probability Distributions

Imagine you want to estimate the probability that a user u in San Francisco clicks on a particular news article i. We represent this probability as P(a_{i} =1 \, | \, u), where a_{i} is a binary variable encoding whether the user clicks on article i or not. The simplest way to estimate this probability is Maximum Likelihood Estimation (MLE), where P is computed as:

(1)   \begin{equation*}P( a_{i} = 1 \, | \, u ) = \frac{C(i,u)}{N(i,u)}\end{equation*}

where C(i,u) is the number of times user u clicks on article i and N(i,u) is the number of times article i is shown to the user u.

Everything works fine until we face a new article, say j, that has never been shown to the user before. In this case, C(j, u) is effectively zero. We could show the article to the user a couple of times, hoping to gather a few clicks, but this is not guaranteed. Moreover, we should be able to estimate P before showing the article to the user, when N is also zero!

In statistical modeling, a common practice for the situation above is to impose a prior distribution over P in which pseudo counts are added to C and N, yielding a non-zero ratio in Equation (1). For example, one could specify a Beta distribution over P, whose posterior distribution is another Beta distribution, making the inference straightforward. A thorough study of how these smoothing techniques apply to Information Retrieval can be found in [1].
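As a minimal sketch, with a Beta(\alpha, \beta) prior the posterior mean simply adds pseudo counts to Equation (1) (the values of \alpha and \beta below are arbitrary):

def smoothed_ctr(clicks, views, alpha=1.0, beta=20.0):
    # Posterior mean of P(a_i = 1 | u) under a Beta(alpha, beta) prior.
    # For an unseen article (clicks = views = 0) this falls back to the
    # prior mean alpha / (alpha + beta) instead of the undefined 0/0.
    return (clicks + alpha) / (views + alpha + beta)

print(smoothed_ctr(0, 0))    # unseen article: prior mean
print(smoothed_ctr(5, 100))  # with data, the estimate approaches the MLE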

The smoothing scenario becomes complicated when multiple related probability distributions are involved. For instance, instead of just estimating how likely a user is to click on an article in general, you may want to estimate how likely a user is to click on an article at a particular time of day (e.g., morning or night) or in a particular geographical location (e.g., a coffee shop), i.e., P( a_{i}=1 \, | \, u, t ) or P( a_{i} = 1 \, | \, u, g ), where t and g represent a particular context. Of course, we could use MLE again to infer these probabilities, but in general they will suffer from even more severe data sparsity, and stronger and smarter smoothing techniques are required.

Given related probabilities like P (a_{i} = 1 \, | \, u ), P( a_{i}=1 \, | \, u, t ) and P( a_{i} = 1 \, | \, u, g ), it is not straightforward to define a hierarchical structure that imposes a prior distribution over all of them. The general idea is that they should all be smoothed over each other.

Traditionally, smoothing over a complex structure, usually one that can be defined as a graph, is a well-studied sub-area of machine learning called Graph-based Semi-Supervised Learning (SSL); good references are [2, 3]. One example of how such algorithms can be applied to probability distributions is [4]. In general, SSL consists of the following two steps:

  1. Construct a graph over the quantities we care about, in this case probability distributions.
  2. Perform a specific form of label propagation on the graph, resulting in stationary values of the quantities we care about (a minimal sketch follows).
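For reference, here is a minimal sketch of the squared-error propagation of [2] (the affinity matrix W and seed matrix Y would come from the application; rows of Y could hold, say, unnormalized click statistics, with zero rows for unlabeled nodes; W is assumed symmetric with no isolated nodes):

import numpy as np

def label_propagation(W, Y, alpha=0.9, iters=200):
    # F <- alpha * S @ F + (1 - alpha) * Y, following [2], where S is
    # the symmetrically normalized affinity matrix D^-1/2 W D^-1/2.
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))
    F = Y.copy()
    for _ in range(iters):
        F = alpha * (S @ F) + (1 - alpha) * Y
    return F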

However, one critical issue with [2, 3, 4] is that the fundamental algorithm is based on squared errors, assuming underlying Gaussian models, which is not appropriate for probability distributions.

A recently proposed framework [5] nicely tackles this problem by learning probability distributions over a graph. The main objective from the paper is:

(2)   \begin{equation*}\mathcal{C}_{KL} = \sum_{i=1}^{l} D_{KL} (r_{i} || p_{i}) + \mu \sum_{i=1}^{m} \sum_{j \in \mathcal{N}_{i}} w_{ij} D_{KL}(p_{i} || p_{j}) - \nu \sum_{i=1}^{m}H(p_{i})\end{equation*}

where r_{i} is an original probability distribution, p_{i} the corresponding smoothed distribution, D_{KL} the standard KL divergence between two distributions and H the Shannon entropy; \mu and \nu are two hyper-parameters, l is the number of labeled nodes and m the number of all nodes. Equation (2) is a general framework to learn distributions over labeled and unlabeled nodes. The first term imposes closeness between the smoothed and original distributions, the second term enforces smoothness over the neighborhood graph, and the third term penalizes sharp distributions, encouraging uniform ones.

In [5], the authors propose alternating minimization techniques and also mention the method of multipliers to solve the problem.
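As a minimal illustration (my own sketch, with a naive neighborhood convention where neighbors are the non-zero entries of each row of W), the objective in Equation (2) is easy to evaluate directly; any simplex-constrained optimizer could then be applied, though the alternating updates of [5] are far more efficient:

import numpy as np

def kl(p, q, eps=1e-12):
    return np.sum(p * np.log((p + eps) / (q + eps)))

def c_kl(R, P, W, n_labeled, mu=1.0, nu=0.1):
    # Eq. (2): fit to the seed distributions R on the labeled nodes,
    # plus KL smoothness over graph edges W, minus an entropy bonus.
    fit = sum(kl(R[i], P[i]) for i in range(n_labeled))
    smooth = sum(W[i, j] * kl(P[i], P[j])
                 for i in range(len(P)) for j in np.flatnonzero(W[i]))
    entropy = sum(-np.sum(p * np.log(p + 1e-12)) for p in P)
    return fit + mu * smooth - nu * entropy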

References

  1. Chengxiang Zhai and John Lafferty. 2001. A study of smoothing methods for language models applied to ad-hoc information retrieval. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’01). ACM, New York, NY, USA, 334-342.
  2. D. Zhou, O. Bousquet, T.N. Lal, J. Weston and B. Schölkopf. Learning with Local and Global Consistency. Advances in Neural Information Processing Systems (NIPS) 16, 321-328. (Eds.) S. Thrun, L. Saul and B. Schölkopf, MIT Press, Cambridge, MA, 2004.
  3. Xiaojin Zhu, Zoubin Ghahramani, and John Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In The 20th International Conference on Machine Learning (ICML), 2003.
  4. Qiaozhu Mei, Duo Zhang, and ChengXiang Zhai, A General Optimization Framework for Smoothing Language Models on Graph Structures. in Proceedings of the 31th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’08), pp. 611-618, 2008.
  5. Amarnag Subramanya and Jeff Bilmes. 2011. Semi-Supervised Learning with Measure Propagation. The Journal of Machine Learning Research 12 (November 2011), 3311-3370.

Unbiased Offline Evaluation Process of Contextual Multi-armed Bandit Models

In this post, I would like to discuss some strategies for evaluating contextual multi-armed bandit (cMAB) models in an unbiased offline setting. The setting is not only applicable to cMAB models but can also serve as a standard process for evaluating any machine learning model on online feedback data.

Review of Contextual Multi-armed Bandit Problems

A cMAB problem is a sequential decision-making process. Bandit problems involve making a decision in each round. Once a decision is made, an observation is collected and the corresponding reward is computed. The contextual-bandit formalism generalizes this classic setting by introducing contextual information into the interaction loop.

Formally, we define a set of actions \mathcal{A} = \{1,2,\cdots, K\}, a contextual vector \mathbf{x}_{t} \in \mathcal{X}, a reward vector \mathbf{r}_{t} = (\mathbf{r}_{t,1}, \cdots, \mathbf{r}_{t,K}), where each \mathbf{r}_{t,a} \in [0,1], and a policy \pi : \mathcal{X} \rightarrow \mathcal{A}. A contextual bandit describes a round-by-round interaction between a learner and the environment. At each round t, the interaction decomposes into the following steps:

  • The environment chooses (\mathbf{x}_{t}, \mathbf{r}_{t}) from some unknown distribution \mathbb{D}. Only \mathbf{x}_{t} is revealed to the learner; the reward vector is not.
  • Upon observing \mathbf{x}_{t}, the learner uses some policy \pi to choose an action a \in \mathcal{A}, and in return observes the corresponding reward \mathbf{r}_{t,a}.
  • (Optional) The policy \pi is revised with the data collected for this round.

Here, \pi is parameterized by an unknown parameter \theta. Ideally, we would like to choose the policy maximizing the expected reward:

    \begin{align*}\arg\max_{\pi} \mathbb{E}_{(\mathbf{x}, \mathbf{r}) \sim \mathbb{D}} \Bigl[ \mathbf{r}_{\pi(\mathbf{x};\theta)}\Bigr]\end{align*}

An important fact in cMAB is that only the rewards of chosen actions are observed. An online learner must therefore find a good balance between exploration and exploitation, a core challenge in bandit problems. For offline policy evaluation, such partial observation raises further difficulties. Data in a contextual bandit is often of the form (\mathbf{x}, a, \mathbf{r}_{a}), where a is the action chosen for context \mathbf{x} when collecting the data and \mathbf{r}_{a} is the corresponding reward. If this data is used to evaluate a policy \pi that chooses a different action \pi(\mathbf{x}) \neq a, then we simply do not have the reward signal to evaluate the policy in that context.

Policy Evaluation

In order to evaluate different policies in an unbiased manner, we need to gather data using a logging policy specially designed so that the data can be re-used for offline evaluation. As pointed out in [1], randomized data collection has long been known to be a critical condition for drawing valid causal conclusions. The authors of [1] proposed the following data collection procedure. At each round,

  • The environment chooses (\mathbf{x}, \mathbf{r}) i.i.d. from some unknown distribution \mathbb{D} and reveals only the context \mathbf{x}.
  • Based on \mathbf{x}, one computes a multinomial distribution \mathbf{p} over the actions \mathcal{A}. A random action a is drawn according to the distribution, and the corresponding reward \mathbf{r}_{a} and probability mass \mathbf{p}_{a} are logged. Note that \mathbf{p} may depend on \mathbf{x}.

The key point here is that rather than choosing the optimal action at each round, we choose a random action according to some distribution over actions, and the probability of selecting that action is recorded.

At the end of the process, we have a dataset \mathcal{D} containing data of the form ( \mathbf{x}, a, \mathbf{r}_{a}, \mathbf{p}_{a} ). In statistics, the probabilities \mathbf{p}_{a} are also known as propensity scores.

In order to evaluate the value of a policy \pi on such a dataset, the following estimator is used:

(1)   \begin{align*}\hat{V}_{\mbox{offline}}(\pi) = \frac{1}{|\mathcal{D}|}\sum_{(\mathbf{x},a,\mathbf{r}_{a},\mathbf{p}_{a}) \in \mathcal{D}} \frac{\mathbf{r}_{a} \mathbb{I}(\pi(\mathbf{x}) = a)}{\mathbf{p}_{a}}\end{align*}

This evaluation method is also used in [2] and [3].
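To make the procedure concrete, here is a minimal end-to-end simulation sketch (a toy environment invented for this post): log data with a uniform-random policy while recording propensities, then evaluate a deterministic target policy with the estimator of Equation (1).

import numpy as np

rng = np.random.default_rng(1)
K, T = 3, 200000                       # arms, logged rounds

def reward(x, a):
    # Toy environment: the arm matching the context pays off more often.
    return float(rng.random() < (0.8 if a == x else 0.2))

logs = []
for _ in range(T):
    x = rng.integers(K)                # context revealed by the environment
    p = np.full(K, 1.0 / K)            # logging policy: uniform over arms
    a = rng.choice(K, p=p)             # random action, propensity recorded
    logs.append((x, a, reward(x, a), p[a]))

pi = lambda x: x                       # deterministic target policy
V_hat = np.mean([r * (pi(x) == a) / pa for x, a, r, pa in logs])
print(V_hat)                           # close to 0.8, the true value of pi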

Unbiased Offline Evaluation

We define the value of a policy \pi as:

(2)   \begin{align*}V(\pi) &= \mathbb{E}_{(\mathbf{x},\mathbf{r}) \sim \mathbb{D}}\Bigl[ \textstyle\sum_{a} \mathbf{r}_{a} P(\pi(\mathbf{x}) = a \, |\, \mathbf{x}) \Bigr] \\&= \int_{(\mathbf{x},\mathbf{r})} \sum_{a} \mathbf{r}_{a} P(\pi(\mathbf{x}) = a \, |\, \mathbf{x}) P(\mathbf{x},\mathbf{r}) \, d\mathbf{x} \, d\mathbf{r}\end{align*}

When the policy is deterministic, meaning that given a contextual vector \mathbf{x} the arm a = \pi(\mathbf{x}) is always chosen, the above value becomes:

(3)   \begin{align*}V(\pi) &= \mathbb{E}_{(\mathbf{x},\mathbf{r}) \sim \mathbb{D}}\Bigl[ \mathbf{r}_{\pi(\mathbf{x})} \Bigr] \\&= \int_{(\mathbf{x},\mathbf{r})} \mathbf{r}_{\pi(\mathbf{x})} P(\mathbf{x},\mathbf{r}) \, d\mathbf{x} \, d\mathbf{r}\end{align*}

In both cases, we want to prove:

(4)   \begin{align*}\mathbb{E}_{\mathcal{D} \sim \mathbb{D}} \Bigl[ \hat{V}_{\mbox{offline}}(\pi)\Bigr] = V(\pi)\end{align*}

In the following discussion, we focus on the deterministic case. Let us expand the left hand side as:

(5)   \begin{align*}\int_{\mathcal{D}} \Bigl[ \frac{1}{|\mathcal{D}|} \sum_{(\mathbf{x},a,\mathbf{r}_{a},\mathbf{p}_{a}) \in \mathcal{D}} \frac{\mathbf{r}_{a} \mathbb{I}(\pi(\mathbf{x}) = a)}{\mathbf{p}_{a}} \Bigr] P(\mathcal{D}) \, d\mathcal{D}\end{align*}

The important step in the proof is that we need to make use of the following quantity:

(6)   \begin{align*}\mathbb{E}_{a \sim \mathbf{p}} \Bigl[ \frac{\mathbf{r}_{a} \mathbb{I}(\pi(\mathbf{x}) = a)}{\mathbf{p}_{a}} \Bigr] = \sum_{a} \mathbf{p}_{a} \, \frac{\mathbf{r}_{a} \mathbb{I}(\pi(\mathbf{x}) = a)}{\mathbf{p}_{a}} = \mathbf{r}_{\pi(\mathbf{x})} \end{align*}

Thus, on expectation, we have:

(7)   \begin{align*}\int_{\mathcal{D}} \Bigl[ \frac{1}{|\mathcal{D}|} \sum_{(\mathbf{x},a,\mathbf{r}_{a},\mathbf{p}_{a}) \in \mathcal{D}} \frac{\mathbf{r}_{a} \mathbb{I}(\pi(\mathbf{x}) = a)}{\mathbf{p}_{a}} \Bigr] P(\mathcal{D}) \, d\mathcal{D} = \int_{\mathcal{D}} \Bigl[ \frac{1}{|\mathcal{D}|} \sum_{(\mathbf{x}, \mathbf{r}) \in \mathcal{D}} \mathbf{r}_{\pi(\mathbf{x})} \Bigr] P(\mathcal{D}) \, d\mathcal{D}\end{align*}

As the pairs (\mathbf{x},\mathbf{r}) in \mathcal{D} are sampled i.i.d. from \mathbb{D}, the average of \mathbf{r}_{\pi(\mathbf{x})} over \mathcal{D} has expectation V(\pi), which completes the proof. Note that the key step is Equation (6): we need to select arms stochastically during logging.

References

  1. L. Li, S. Chen, J. Kleban, and A. Gupta. Counterfactual estimation and optimization of click metrics for search engines. Technical Report MSR-TR-2014-32, Microsoft Research, 2014.
  2. L. Li, W. Chu, J. Langford, and X. Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, WSDM ’11, pages 297–306, New York, NY, USA, 2011. ACM.
  3. A. L. Strehl, J. Langford, L. Li, and S. Kakade. Learning from logged implicit exploration data. In NIPS, pages 2217–2225, 2010.

Embedding Pig Latin in Python – Practical Guide

Introduction

Hadoop is a de-facto parallel computing platform for many industrial companies. However, it is non-trivial to write MapReduce programs on Hadoop for complex data analysis tasks. Therefore, high-level query languages like Pig help data scientists perform analysis tasks without being distracted by the details of MapReduce programs on Hadoop. One problem with Pig is that it lacks control-flow constructs such as "if", "while" and "switch", which are widely supported by general-purpose programming languages. One solution to this issue is to embed Pig in a general-purpose programming language.

There are several great introductory references on this topic, for example [here], [here], [here] and [here]. In this post, we do not discuss the basic idea of embedding Pig into Python but instead demonstrate some important practices.

One important fact to pay attention to is that "Python" here is essentially Jython, a JVM-based Python implementation that uses the same syntax as Python. However, standard libraries available in Python may or may not be available in Jython.

Pig User-Defined Functions

Within Pig, user-defined functions (UDFs) are usually used to perform complicated tasks that are not well supported by Pig's native functions. One example is drawing a Gaussian random variable for each row of the data source, centered at the value of some field.

Many languages can be used to write Pig UDFs. One easy choice is Jython. For general information about Jython UDFs, please see here.
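For instance, the Gaussian-drawing UDF mentioned above might look like the following sketch (the schema string and field are made up for illustration; when Pig registers the file with REGISTER 'udfs.py' USING jython AS udfs;, it makes the outputSchema decorator available to the script):

import random

@outputSchema('sample:double')
def gauss_around(center):
    # Draw a Gaussian random variable centered at the value of a field.
    return random.gauss(float(center), 1.0)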

When functions are included in the main script of the embedding code, they are by default treated as Pig UDFs. For instance,

def abc():
    # Defined in the main script, so Pig treats it as a UDF rather
    # than a normal Jython function.
    print 'Hello World'

if __name__ == '__main__':
    from org.apache.pig.scripting import Pig
    P = Pig.compileFromFile(...)   # compile the Pig Latin script
    Q = P.bind(...)                # bind parameters into the script
    results = Q.run()              # launch the job

Here, the function "abc" will be treated as a Pig UDF, not as a normal Jython function.

Python User-Defined Functions

As normal functions defined in the main script are treated as Pig UDFs, how can we write ordinary functions without putting too much code into the main script? One way is to put those functions into a separate ".py" file and use it as a module; we can then call the modules and their corresponding functions. For instance,

import ComplexTask

if __name__ == '__main__':
    # ComplexTask is imported as a module, so its functions run as
    # normal Jython code rather than being registered as Pig UDFs.
    tasks = ComplexTask.ParseTasks(...)
    from org.apache.pig.scripting import Pig
    P = Pig.compileFromFile(...)
    Q = P.bind(...)
    results = Q.run()

Here, "ComplexTask" refers to another Python script, "ComplexTask.py", in the same directory. In this way, we can put different functionalities into different modules.

Summary

Once we can use Python to control the workflow of Pig scripts, it would be even better if we could start and configure different workflows from different configuration files. Unfortunately, no such tools exist so far, so we probably need to read and parse XML configuration files by hand; a minimal sketch follows.
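Using Jython's standard ElementTree module, such hand-rolled parsing might look like this (the config layout here is hypothetical):

import xml.etree.ElementTree as ET

def load_workflow(path):
    # Parse a hypothetical config such as:
    #   <workflow><param name="script" value="job.pig"/> ... </workflow>
    root = ET.parse(path).getroot()
    return dict((p.get('name'), p.get('value'))
                for p in root.findall('.//param'))

if __name__ == '__main__':
    from org.apache.pig.scripting import Pig
    params = load_workflow('workflow.xml')
    P = Pig.compileFromFile(params['script'])
    Q = P.bind(params)          # bind the remaining parameters into Pig
    results = Q.run()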


Grid Representations of Texts

One everlasting theme of text mining is to model text in a proper space. A simple yet powerful choice is a one-dimensional space in which words are generated from an uncorrelated array of terms. This is usually called the bag-of-words (BOW) representation.

BOW is also the foundation of more complicated approaches like latent Dirichlet allocation (LDA) and latent semantic indexing (LSI). In such models, the terms in a document are essentially generated from different 1D arrays (topics), where each array has its own distribution over the whole vocabulary. However, as in simple BOW, these 1D arrays are uncorrelated.

The idea of grid representations of texts is essentially to model correlations between words. One such example is "Multidimensional counting grids" [1] and its "admixture" version, "Componential counting grids" [2]. The basic assumption is that texts are generated by a window moving over a grid, where each grid point is a distribution over terms. One benefit of such a representation is that thematic shifts can be handled more naturally in the framework. Also, depending on the window size, n-gram effects are automatically taken into account.

Although on the surface the true advantage of this representation over BOW is not very clear, it is nevertheless an interesting idea to explore.

References

[1] Jojic, N., Perina, A.: Multidimensional counting grids: Inferring word order from disordered bags of words. In UAI 2011, 547-556. [PDF]
[2] Perina, A., Jojic, N., Bicego, M. and Turski, A.: Documents as Multiple Overlapping Windows into a Grid of Counts. In NIPS 2013. [PDF]


Combining Regression and Ranking

It might seem hard to imagine wanting to combine regression and ranking problems: they are inherently different. Traditionally, ranking problems only concern the order of the ranked items. Even if the scores of the individual items are very poor, only the final order matters.

However, there are situations where we care about both the scores of the items and their order. One might argue that this is ultimately a regression problem, because if you predict every item with 100% accuracy, the order will be correct as well. That is true only if you have a perfect regression model; there are many cases where a nearly perfect regression model produces terrible rankings. For instance, on a very imbalanced dataset where the majority class has y=0, one can achieve near-perfect regression performance by always predicting 0, which gives a useless ranking.

D. Sculley provided an easy solution to this problem (published at KDD 2010). Basically, we combine the two problems into a single optimization problem, balanced by a parameter:

    \[\min_{\mathbf{w} \in \mathbb{R}^{m}} \alpha L(\mathbf{w}, D) + (1-\alpha)L(\mathbf{w}, P) + \frac{\lambda}{2}||\mathbf{w}||^{2}_{2}\]


where \alpha \in [0,1], L(\mathbf{w}, D) is the regression loss over the dataset D, and L(\mathbf{w}, P) is the pairwise ranking loss over the set of candidate pairs P. The overall optimization problem is solved through Stochastic Gradient Descent (SGD).

The paper also describes how to sample efficiently from P, along with other scalability issues.
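A minimal SGD sketch of this combined objective (using squared loss for both parts; the paper explores several losses and pair-sampling schemes):

import numpy as np

def combined_sgd(X, y, alpha=0.5, lam=1e-4, lr=0.01, steps=100000, seed=0):
    # Minimize alpha * L(w, D) + (1 - alpha) * L(w, P) + lam/2 * ||w||^2.
    # Each step flips a biased coin: with probability alpha take a
    # regression step on an example from D, otherwise a ranking step on
    # a random candidate pair from P, mirroring the stochastic sampling
    # in Sculley's paper.
    rng = np.random.default_rng(seed)
    n, m = X.shape
    w = np.zeros(m)
    for _ in range(steps):
        if rng.random() < alpha:
            i = rng.integers(n)
            g = (X[i] @ w - y[i]) * X[i]
        else:
            i, j = rng.integers(n, size=2)
            d = X[i] - X[j]
            g = (d @ w - (y[i] - y[j])) * d
        w -= lr * (g + lam * w)
    return w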