Posts


Some Recent Papers About Topic Models

In this post, I would like to talk about several recent papers about topic models. These papers may not belong to the same direction of applying or extending topic models. However, some of them are quite interesting and worth to be discussed here.

The first one is

Enhong Chen, Yanggang Lin, Hui Xiong, Qiming Luo, and Haiping Ma. 2011. Exploiting probabilistic topic models to improve text categorization under class imbalanceJournal of Information Processing and Management. 47, 2 (March 2011), 202-214.

The idea is straightforward and simple. The author proposed a two-step approach to mitigate the problem of unbalanced data. The first step is to learn topic models from the existing unbalanced data. Here, for each class label, a separate set of topics is learned. Once the models are obtained, synthetic documents or new samples are drawn from learned models. This is possible since topic distribution and word distribution are fixed after learning process. The number of new samples is determined by the difference between the dominant class and the rare class. A more aggressive method is also proposed, which is used to avoid noisy labeled data. The idea is to use all synthetic samples to train a classifier, rather than original samples. The experimental results demonstrate some performance improvement of this method over other ones that are proposed to tackle the same problem.

The second paper is

Wayne Xin Zhao, Jing Jiang, Hongfei Yan, and Xiaoming Li. 2010. Jointly modeling aspects and opinions with a MaxEnt-LDA hybrid. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP ’10). Association for Computational Linguistics, Stroudsburg, PA, USA, 56-65.

The paper is interesting because it also demonstrates a method to incorporate term-level features into a topic model. The list of features for each term is embedded through a Maximum Entropy Model. The supervised learning part of the model learns the fixing weights of these features and Gibbs sampling for the topic model uses these weights. For details, please refer to the paper.

The next one is

Xin Zhao, Jing Jiang, Jianshu Weng, Jing He, Ee-Peng Lim, Hongfei Yan and Xiaoming Li. Comparing Twitter and traditional media using topic models. In Proceedings of the 33rd European Conference on Information Retrieval (ECIR’11) (full paper), 2011.

The paper has several interesting aspects. First, it is claimed as a first study of topics obtained on Twitter and other traditional media. The authors use a standard LDA model to discover topics from NewYorkTimes corpus and a modified topic model for Twitter, separately. Then, they proposed a heuristic method to map Twitter topics onto NYT topics.  In addition, they manually assigned topic types to all the topics found by models. By doing all these, common topics and corpus-specific topics are obtained heuristically. It’s a little bit strange that they do not consider any techniques to mine topics from multiple corpus. Secondly, they do not compare to the method where only LDA is used. Note, the same Twitter-LDA is used in:

Xin Zhao, Jing Jiang, Jing He, Yang Song, Palakorn Achanauparp, Ee-Peng Lim and Xiaoming Li. Topical keyphrase extraction from Twitter. To appear in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT’11) (long paper), 2011.

 


Reviews on Binary Matrix Decomposition 2

In this post, I would like to review several existing techniques to binary matrix decomposition.

 

  • Andrew I. Schein, Lawrence K.  Saul, and Lyle H. Ungar. A Generalized Linear Model for Principal Component Analysis of Binary Data. Appeared in Proceedings of the 9’th International Workshop on Artificial Intelligence and Statistics. January 3-6, 2003. Key West, FL.
    This paper introduced a logistic version of PCA to binary data. The model assumes that each observation is from a single latent factor and there exists multiple latent factors. The model is quite straightforward and the inference is been done by Alternative Least Square.
  • Tao Li. 2005. A general model for clustering binary data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining (KDD ’05). ACM, New York, NY, USA, 188-197.
    In this paper, the author introduced the problem of “binary data decomposition”. The paper demonstrated several techniques that are popular for normal matrix factorization to binary data, like k-means, spectral clustering. The proposed method is to factorize the binary matrix into two binary matrices, where the binary indicators suggest membership.
  • Tomas Singliar and Milos Hauskrecht. 2006. Noisy-OR Component Analysis and its Application to Link AnalysisJ. Mach. Learn. Res. 7 (December 2006), 2189-2213.
    This paper introduced a probabilistic view of binary data. Like other latent factor models, each observation can be viewed as a sample from multiple binary latent Bernoulli factors, essentially a mixture model. A variational inference is conducted in the paper. The weak part of the paper is that the comparison of the model with PLSA and LDA is not quite convincing.
  • Zhongyuan Zhang, Tao Li, Chris Ding, and Xiangsun Zhang. 2007. Binary Matrix Factorization with Applications. In Proceedings of the 2007 Seventh IEEE International Conference on Data Mining (ICDM ’07). IEEE Computer Society, Washington, DC, USA, 391-400.
    This paper indeed introduced a variant of Non-negative Matrix Factorization to binary data, meaning that a binary matrix will be always decomposed into two matrices bounded by 0 to 1. The proposed method is a modification of NMF. However, in a document clustering problem, the performance difference between proposed method and NMF is very small.
  • Miettinen, P.; Mielikainen, T.; Gionis, A.; Das, G.; Mannila, H.; , “The Discrete Basis Problem,Knowledge and Data Engineering, IEEE Transactions on , vol.20, no.10, pp.1348-1362, Oct. 2008.
    Miettinen, P.; , “Sparse Boolean Matrix Factorizations,” Data Mining (ICDM), 2010 IEEE 10th International Conference on , vol., no., pp.935-940, 13-17 Dec. 2010
    These two papers stated another view of factorization of binary data. Rather than directly using some SVD based or NMF based methods, these papers introduced a “cover” based discrete optimization method to the problem. However, through experiments, the performance advantages over traditional SVD or NMF methods are not very clear. Another drawback of their method is that some other existing methods are difficult to be incorporated with.
  • Andreas P. Streich, Mario Frank, David Basin, and Joachim M. Buhmann. 2009. Multi-assignment clustering for Boolean data. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML ’09). ACM, New York, NY, USA, 969-976.
    This paper introduced a probabilistic view of the binary data. The observation is assumed to be generated either by “signal” or by “noise”, both are Bernoulli distributions. The switch variable is also sampled from the third Bernoulli distribution. This is essentially a simplified PLSA. The inference is done by deterministic annealing.
  • Ata Kaban, Ella Bingham, Factorisation and denoising of 0-1 data: A variational approach, Neurocomputing, Volume 71, Issues 10-12, Neurocomputing for Vision Research; Advances in Blind Signal Processing, June 2008, Pages 2291-2308, ISSN 0925-2312.
    This paper is somewhat similar “Noisy-OR” model and Logistic PCA as well. However, unlike Logistic PCA, the proposed model is a mixture model, meaning that a single observation is “generated” by multiple latent factors. The authors put a Beta prior over latent factors and the inference is done by Variational Inference.
    Ella Bingham, Ata Kaban, and Mikael Fortelius. 2009. The aspect Bernoulli model: multiple causes of presences and absences. Pattern Anal. Appl. 12, 1 (January 2009), 55-78.
    This paper goes back to the assumption that each observation is sampled from a simple factor. The inference is done by EM.

In all, it seems that the performance advantages of specifically designed binary data models are small. However, the biggest advatange of these model is that they can give better interpretations sometimes. For computational models, NMF seems a good approximation. For probablistic models, a modified PLSA or LDA seems quite resonable.



Reviews on User Modeling in Topic Models

In this post, I would like to review several papers that wish to extend standard topic models with incorporating user information. The first paradigm or group of papers is introduced by M. Rosen-Zvi et al.

These three papers define a “Author-Topic” model, a simple extension of LDA. The generation process is as follows:

  1. For each document $latex d$:
    1. For each word position:
      1. Sample an author $latex x$ uniformly sampled from the group of authors \mathbf{a}_{d} for this document.
      2. Sample an topic assignment $latex z$ from per-author multinomial distribution over topics $latex \theta_{x}$.
      3. Sample a word $latex w$ from topic $latex z$, a multinomial distribution over words.

The inference of the model is done by Gibbs Sampling. The biggest drawback of the model is that it loses the distribution over topics for documents. In “Learning Author-Topic Models from Text Corpora“, the authors proposed a heuristic solution to this problem: adding a fictitious author for each document. The second group of papers is from UMass.

They  proposed several models. The first one is “Author-Recipient-Topic” model, which is suitable for message data, like emails. The generation process is as follows:

  1. For each document $latex d$, we observe its author $latex a_{d}$ and a set of recipients $latex \mathbf{r}_{d}$:
    1. For each word position:
      1. Sample a recipient $latex x$ uniformly sampled from $latex \mathbf{r}_{d}$.
      2. Sample an topic assignment $latex z$ from author-recipient multinomial distribution over topics $latex \theta_{a_{d},x}$.
      3. Sample a word $latex w$ from topic $latex z$, a multinomial distribution over words.

This model is further extended into “Role-Author-Recipient-Topic” model. The idea is that each author or recipient may play different roles in the exchange of messages. Therefore, it is better to explicitly model them. Three possible variants are introduced. The first variant is that for each word position, we first sample a role for author and for the sampled recipient as well. Once the roles are sampled, the topic assignments are sampled from role-role pair-determined multinomial distribution over topics. The second variant is that only one role is generated for the author of the message. However, for recipients, each one has a role. For each word position, a recipient with his corresponding role is firstly sampled and a topic assignment is sampled from author-role author-role pair multinomial distribution over topics. The third variant is that all recipients share a single role. The third model is “Author-Persona-Topic” model. The generation process is as follows:

  1. For each author $latex a$:
    1. Sample a multinomial distribution over persona $latex \eta_{a}$.
    2. For each persona $latex g$, sample a multinomial distribution over topics $latex \theta_{g}$.
  2. For each document $latex d$ with author $latex a_{d}$:
    1. Sample a persona $latex g_{d}$ from $latex \eta_{a_{d}}$.
    2. For each word position:
      1. Sample an topic assignment $latex z$ from $latex \theta_{g_{d}}$.
      2. Sample a word $latex w$ from topic $latex z$, a multinomial distribution over words.

All these models do not have a per-document distribution for topics.

The third group of papers is from Noriaki Kawamae. Both models introduced in these papers extended the ideas of “Author-Topic” model and “Author-Persona-Topic” model in particular.

The first model is “Author-Interest-Topic” model. It introduced a notion of “document-class”. The authors have a distribution over document-classes and for each document class, it has a distribution over topics. Here, we can think of document-class as “persona” in previous models. For each document, it firstly samples a document-class from per-author distibution over document classes. Then, by using this document-class, we can draw topics from this particular class. The difference between “Author-Interest-Topic” model and “Author-Persona-Topic” model is that the distribution over topics for each persona is under author level in “Author-Persona-Topic” but they are global variables in “Author-Interest Topic” model. The “Latent-Interest-Topic” model is much complicated than all previous models. It adds another layer of abstraction, author-classes. For each author, it has variable to indicate his author-class, which is drawn from a multinomial distribution. For each author-class, there is a multinomial distribution over topics. For each document, we first draw a document-class from its author’s per author-class distribution over document-classes. Then, the later generation process is same as “Author-Interest-Topic“. The key for “Author-Interest-Topic” and “Latent-Interest-Topic” models is that they are clustering models, in the sense that authors or documents are forced clustered into either author classes or document classes.

The last group of papers is from Jie Tang et al. All the proposed models are based on “Author-Topic” model.

They firstly proposed three variants of “Author-Conference-Topic” model. For each author, there is a multinomial distribution over topics. For each token in the document, an author is uniformly sampled and the topic assignment is sampled from per-author multinomial distribution over topics. The differences between three variants are how the conference stamp is generated. We omit the discussion here.


WSDM 2011 Paper Reading

In this post, I would like to review several papers from WSDM 2011, which worth to read, in my opinion.

  • Personalizing Web Search using Long Term Browsing History” by Nicolaas Matthijs and Filip Radlinski
    This paper investigated the possibility to incorporate the whole browsing history into the personalization framework such that the ranking performance can be significantly improved. A user is represented as a “user profile”, which consists of  terms extracted from visited web pages, the queries and some meta information (e.g., clicks, time-stamps). These features are weighted with several weighting schemes discussed, such as TF-IDF, modified BM25 and raw term frequencies. In addition, several re-ranking strategies are discussed as well. The experiments are rich. The authors conducted both off-line and on-line experiments against non-trivial baselines, including non-personalized version of Google ranking results and a previous study of personalized ranking. All results show significant improvement of ranking results by using long-term browsing history. This paper is interesting and also surprising in the sense that the approach is straightforward and simple while the result is strong. I’m really impressed that no one did this before.
  • Quality-Biased Ranking of Web Documents” By Michael Bendersky, W. Bruce Croft and Yanlei Diao
    This paper introduced a way to incorporate quality factors into ranking functions. It is a little bit unexpected that quality factors are never considered in most ranking models. The proposed method is straightforward, based on Markov Random Field IR model, which is claimed one of the state-of-the-art IR models. It is surprisingly easy to embed these new features into the ranking model. The experiments demonstrated a significant improvement over baselines and also the one with PageRank. Overall, this is an paper worth to read.
  • Mining Named Entities with Temporally Correlated Bursts from Multilingual Web News Streams” by Alexander Kotov et al.
    This paper provides a novel approach to a heated discussed research topic, mining bursts events or name entities in this paper, from correlated text streams. Many previous methods are based on topic models. Here, the authors proposed a method based on Markov Modulated Poisson Process (MMPP). Their approach is a two-stage approach where the first stage is to fit the MMPP model and use the fitted model to align correlated bursts by dynamic programming. The complexity of the approach is much simpler than topic models. Although it is overall interesting, the paper lacks of certain comparison with other similar methods, yielding the results that we do not know how well this approach is in reality.
  • Everyone’s an Influencer: Quantifying Influence on Twitter” by Eytan Bakshy et al.
    This paper analyzed how users influence their followers on Twitter for a particular type of messages, the messages containing URLs. More specifically, they focused on only “bit.ly” URLs. The authors conducted three finds of experiments. First, by using several simple features, they found that the past influence provides the most informative  features as well as the number of followers. The second set of experiments was conducted on a small dataset where the authors asked Amazon Mechanical Turks to classify these messages into topical categories (including spams). Unfortunately, they found the predictive performance decreased as content-based features added. They claimed that the content features are noise to detect the influential role of individual post. The third set of experiments is “targeting strategies”, namely how a hypothetical marketer might optimize the diffusion of information by systematically targeting certain classes of users. A simple cost function is proposed and the authors demonstrated how different assumptions may lead to different costs. Overall, I feel this part a little bit shallow and premature to be included in the paper. In this paper, “influencer” is not pre-defined, but rather as a function of several features.
  • Identifying Topical Authorities in Microblogs” by Aditya Pal and Scott Counts
    This paper has the similar goal as the previous one. However, they proposed a totally different approach. On high level, they utilized a clustering algorithm (Gaussian Mixture Model) with a set of wide range of features. However, the authors only focused on the clusters which have high average values of three features (Topical Signal, Retweet Impact and Mention Impact), discarding all other clusters. In addition, they proposed a ranking mechanism using the CDF of Gaussian function for each feature and combined all features using multiplication. In the experiments, they first compared their methods with graph-based ranking methods (e.g., PageRank). They found that empirically, their method can discover users who are more interesting and more authoritative. In their later experiments, they focused on comparing the ratings (from human judges) of different methods. I feel that these comparisons less intuitive. Overall, the paper does not really demonstrate something new or totally surprising.
  • #TwitterSearch: A Comparison of Microblog Search and Web Search” by Jaime Teevan, Daniel Ramage and Meredith Ringel Morris
    This paper in general explored a question that whether information seeking behavior on Twitter is different from Web search. They first initialized their study by a user-study, conducted within Microsoft. Although it might be interesting to know how people use Twitter search, the small scale of user-study and the highly biased sample prevent us to generalize the conclusion made in the paper. In later study, they focused on a large query log from Bing Toolbar. Several interesting things: 1) They found that Twitter search more focused on celebrities and temporal events. 2) Users do issue the same queries to both Web search and Twitter search, implying some underlying information needs associated with. 3) Users do issue the same queries to Twitter search, indicating re-finding needs. They paper suggested several directions that future search tools can improve upon 1) Enhancing Temporal Queries 2) Enriching People Search 3) Leveraging Hashtags 4) Employing User History. 5) Providing Query Disambiguation.
  • Using Graded-Relevance Metrics for Evaluating Community QA Answer Selection” by Tetsuya Sakai et al.
    This paper introduced a graded-relevance metrics for QA answer retrieval task. They hired a number of human judges to evaluate the relevance of answers and adopted a number of graded-relevance metrics, like NDCG, to answer retrieval. Two major conclusions made in the paper: 1) they can detect many substantial differences between systems that would have been overlooked by Best-Answer based evaluation. 2) they can better identify hard questions compared to Best-Answer based evaluation. Although these two points are novel to CQA community, it is not totally surprising that graded-relevance is better than binary relevance, if we consider the experiences in IR.
  • Recommender Systems with Social Regularization” by Hao Ma et al.
    The idea of the paper is pretty straightforward. The social connections have the influence on the results of recommendation. Two models are proposed to the basic matrix factorization framework. The first model assumes that the user’s tasts should be simlar to the average of his or her friends. The second model assumes the regularization should be put on individual user pairs. Both models can utilize user-defined similarity functions where in the paper, the authors showed the results on consine similarity and Pearson Correlation Coefficient. The paper is easy to understand.
  • Low-order Tensor Decompositions for Social Tagging Recommendation” by Yuanzhe Cai et al.
    This paper is interesting. It provides an improvement to the popular tensor factorization method to the problem social tagging recommendation. The basic idea is that the 3rd decomposition of a tensor can be improved by zero-order, 1st order and 2nd order. Indeed, these lower orders can be seen as an average to the corresponding dimensions. For instance, zero-order is essentially the average over all elements. Thus, these order statistics are indeed adding biases into the model, which is a popular technique in matrix factorization based recommendation systems. The paper also discussed how to handle missing value problem. Overall, this paper is worth to read.
  • eBay: An E-Commerce Marketplace as a Complex Network” by Zeqian Shen and Neel Sundaresan
    This paper is a good reference to understand eBay as a complex network. There’s nothing strikingly new here but it confirms a lot of things with other types of media, such as Web and Wikipedia. Two interesting facts: 1) in terms of Bow-Tie structure, the Strongly Connected Component (SCC) part and IN part on eBay is very small. 2) in terms of popular triad structures, the most significant motif is “two sellers sell to the same buyer and they also sell products to each other”. In addition, they found that sellers have more interactions than buyers. The paper is worth to skip.
  • Let Web Spammers Expose Themselves” by Zhicong et al.
    This paper is interesting. It provides another view of identifying spammers. The basic idea is to mine web forums of SEO information since spammers may seek link exchange in those forums. They formulated the problem into an optimization framework with semi-supervised learning techniques. They demonstrated substantial improvement of performance.
  • Improving Social Bookmark Search Using Personalized Latent Variable Language Models” by Morgan Harvey, Ian Ruthven and Mark J. Carman
    This paper has at least two interesting point. First, the two models proposed in the paper is essentially like Probabilistic Tensor Factorization. Second, the superior of the second model demonstrated that it might be a better choice to “generate” response variables from latent factors, rather than the other way around. Thus, it would be nice to see the third model that the user is also the result of latent factors, a fully resemble to Tensor Factorization. The drawbacks of the paper is evaluation. It does not use any standard datasets.
  • Topical Semantics of Twitter Links” by Michael J. Welch et al.
    There are several interesting points in the paper, though it is straightforward overall and sometimes it seems lacking of rigorous experiments to support some ideas. First, the authors demonstrated that the PageRank for follow relationships is significantly different from Retweet induced graph. In fact, the authors would better to provide Kendall’s tau or something to demonstrate this. They showed that there is a drop in both rankings although the paper did not go deep in this point. I personally would think that this may due to the fact that the core of the Twitter network is extremely dense, compared to the other parts. The second part of the paper tried to demonstrate that retweet induced links carry much stronger semantic meanings, although the results are not very convincing.
  • A Framework for Quantitative Analysis of Cascades on Networks” by Rumi Ghosh and Kristina Lerman
    This paper is interesting and worth to be studied more thoroughly. First, the authors proposed a “cascade generating function”, which characterizes the cumulative effect on current node, gathering from its connections.  This simple function can be used to calculate a number of cascading properties, such as size, diameter, number of paths and average length. Then, the authors introduced a computational framework for the generating function. More specifically, they construct a cascade graph from cascades and then generate two matrices, contagion and length matrices. These two matrices encoded all information of all cascades. In addition, we can reconstruct or approximate the original cascades from the matrices. Equipped with these tools, the authors examined the cascades from Digg and showed some interesting results.
  • Linking Online News and Social Media” by Manos Tsagkias, Maarten de Rijke and Wouter Weerkamp
    First, the idea of the paper is interesting while I think the experiments are a little bit confusing. The authors tried to link social media to news articles. The original research question posed by the authors is: given a news article, find social media utterances that implicitly reference it. However, in the end, the task becomes to retrieve blog posts by using news articles as queries and tried to explore what kind of query representation is better. In the end, it seems that the query model based on article itself outperforms all others.
  • Predicting Future Reviews: Sentiment Analysis Models for Collaborative Filtering” by Noriaki Kawamae
    This paper introduced a fairly complicated extension of Topic Models to collaborative filtering and sentiment analysis. It introduced a number of new latent variables to the model to explain words, items and ratings. One interesting point is that this model also use the formalism that response variables are fully explained by latent factors. To be honest, as topic models, the evaluation is not totally convincing.

Reviews on Probabilistic Models for User Profiles

In this post, I would like to review some probabilistic models for user profiling. More specifically, I’m looking at the models that taking users’ preferences into account and try to predict certain quantities based on these preferences, which is a normal scenario for collaborative filtering.

  • Latent semantic models for collaborative filtering” by Thomas Hofmann, ACM Transactions on Information Systems, 2004
    The proposed model is based on pLSA. In order to incorporate ratings, the authors propose the following decomposition scheme:
    \( p(v|u,y) = \sum_{z}p(z|u)p(v|z,y) \) where \( p(v|z,y) \) follows Gaussian distribution. The paper also introduced practical techniques to normalize user ratings. The model is learned through (tempered) EM.
  • Modeling User Rating Profiles For Collaborative Filtering” by Benjamin Marlin, NIPS 2003
    The model proposed in the paper is essentially  LDA in the context of collaborative filtering. The original document-term matrix was replaced by a user-item (user-rating) matrix. Unlike this pLSA model for collaborative filtering, this model introduced the decomposition scheme as:\( p(r|u)=\sum_{z} p(r|z)p(z|u) \)where no “dummy” variable \( y \) gets involved. The model is learned through variational inference.
  • Flexible Mixture Model for Collaborative Filtering” by Luo Si and Rong Jin, ICML 2003
    The model proposed in the paper is an extension of two-side clustering model of pLSA. It assumes that users are belong to multiple clusters and items are also belong to multiple clusters. The rating of a particular item is based on the user clusters and item clusters. Therefore, \( p(x,y,r) = \sum_{z_{x}} \sum_{z_{y}} p(z_{x})p(y_{x})p(x|z_{x})p(y|z_{y})p(r|z_{x},z_{y}) \) where \( z_{x} \) are latent factors for users and \( z_{y}\) are latent factors for items. All distributions here are multinomial distributions. The model is learned through EM.
    A full Bayesian treatment of the model is introduced in “Latent Grouping Models for User Preference Prediction” by Eerika Savia, Kai Puolamaki and Samuel Kaski in Machine Learning 2009, which is learned through Gibbs Sampling.
  • The Multiple Multiplicative Factor Model For Collaborative Filtering” by Benjamin Marlin and Richard S. Zemel, ICML 2004
    Rather than using a same set of latent factors to “explain” all ratings, the Multiple Multiplicative Factor Model (MMF) tries to use different latent factors to “explain” different ratings. Therefore, for each user, the model has a \( K\)-dimensional binary vector \( Z \) where each element \( z_{k} \) represents whether \( k \)-th factor is “active” or not. For rating \( X \), the authors introduced to use softmax function to map arbitrary real values to simplex. The model is learned through variational inference.
  • Efficient Bayesian Hierarchical User Modeling For Recommendation Systems” by Yi Zhang and Jonanthan Koren, SIGIR 2007
    The model introduced in this paper is similar to the pLSA version of user profiles. The rating \( r \) of a item \( y \) by user \( u \) is decomposed as:\( p(r|y,u)=p(u)p(r|u,y) \) where \( p(u) \) is a Gaussian distribution and \( p(r|u,y) \) is modeled through a Generalized Linear Model \( r=u^{T}y+\epsilon \), essentially another Gaussian distribution in the paper. The novel part of the model is that \( y \) is the document representation of an item. Therefore, the authors assume that the rating is weighted sum of terms of documents where the weights are user specific. The model is learned through EM.

A study about several variants of pLSA on collaborative filtering is done by Rong Jin, Luo Si and Chengxiang Zhai.

A study about how to normalize ratings is done by Rong Jin and Luo Si:


Passing Parameters and Arguments to Mapper and Reducer in Hadoop 2

In Hadoop, it is sometimes difficult to pass arguments to mappers and reducers. If the number of arguments is huge (e.g., big arrays), DistributedCache might be a good choice. However, here, we’re discussing small arguments, usually a hand of configuration parameters.

In fact, the way to configure these parameters is simple. When you initialize “JobConf” object to launch a mapreduce job, you can set the parameter by using “set” method like:

1
2
JobConf job = (JobConf)getConf();
job.set("NumberOfDocuments", args[0]);

Here, “NumberOfDocuments” is the name of parameter and its value is read from “args[0]”, a command line argument. Once you set this arguments, you can retrieve its value in reducer or mapper as follows:

1
2
3
4
private static Long N;
public void configure(JobConf job) {
     N = Long.parseLong(job.get("NumberOfDocuments"));
}

Note, the tricky part is that you cannot set parameters like this:

1
2
Configuration con = new Configuration();
con.set("NumberOfDocuments", args[0]);

and hope that all mappers or reducers can retrieve this parameter. This will fail in running.


Notes on Calculating Log Sum of Exponentials 1

Sometimes, you want to calculate the logarithm of a sum of exponentials, like:
\[
\begin{align}
\log \sum_{i=1}^{n} \exp (x_{i})
\end{align}
\]The problem to directly calculate this equation is that if one \( x_{i} \) gets sufficient large, it will overflow when you calculate the exponential of the number. We can get more arithmetic stability with a little algebra. First, find the maximum value \( m \) in \( \mathbf{x}\):
\[
\begin{align}
m = \underset{x^{\prime}}{\arg\max} \, x_{i}
\end{align}
\]Then, we have:
\[
\begin{align}
\log \sum_{i=1}^{n} \exp (x_{i}) &= \log \Bigr( \sum_{i=1}^{n} \frac{\exp (m)}{\exp (m)} \exp(x_{i}) \Bigl) \\
&= \log \Bigr( \exp (m) \sum_{i=1}^{n} \frac{1}{\exp (m)} \exp(x_{i}) \Bigl) \\
&= \log \exp( m) + \log \Bigr( \sum_{i=1}^{n} \exp(-m) \exp(x_{i}) \Bigl) \\
&= m + \log \Bigr( \sum_{i=1} \exp( x_{i} -m ) \Bigl)
\end{align}
\]So the largest value passed to the exponential function is 0. If there are really tiny values after subtraction, they’ll become zero and drop out, as they should with limited precision arithmetic.

By using the technique introduced here, we can also calculate:

\[
\begin{align}
\log \sum_{i=1}^{n} y_{i} & = \log \Bigr( \sum_{i=1}^{n} \exp \log (y_{i}) \Bigl) \\
&= \log m + \log \Bigr( \sum_{i=1}^{n} \exp ( \log y_{i} – \log m) \Bigl)
\end{align}
\]where \( m = \underset{y^{\prime}}{\arg\max} \, y_{i} \).

Reference

[1] Log Sum of Exponentials
[2] log-sum-exp trick


SIGIR 2010 Paper Reading

In this post, I would like to talk about several interesting papers from SIGIR 2010. Note, this only reflects my view of these scientific work and does not necessarily correct and thorough.

  • On Statistical Analysis and Optimization of Information Retrieval Effectiveness Metrics
    This paper is more theoretical rather than practical. The main contribution is that the authors argue that the optimal ranking problem should be factorized into two distinct yet interrelated stages: the relevance prediction stage and ranking decision stage. The paper shows that a number of IR metrics (e.g., Average Precision, DCG, Reciprocal Rank) can be decomposed into the two stages mentioned above. Therefore, the overall strategy is to directly optimize the decomposed metrics. The authors show the improved performance over simple language models. However, the paper does not compare to Learning to Rank techniques where the metrics are also optimized. In all, this is an interesting paper for whose really work in Ad-Hoc retrieval fields.
  • Evaluating and Predicting Answer Quality in Community QA
    This paper is straightforward. The authors wanted to predict the best answers (in their words, “answer quality”) in the Community QA sites.  They firstly used a number of subjective features obtained from Amazon Technical Turks and found it difficulty to do so. Then, they used a bunch of automatically extracted features (most are meta-information features) and show the improved performance. The work is simple and indeed related to my work in SIGIR 2009.  They still do not answer a question that whether a so-called “best answer” really a true “best” answer among all others to the corresponding questions. Moreover, classification approaches are not compared to retrieval-based methods in this paper.
  • Segmentation of Multi-Sentence Questions: Towards Effective Question Retrieval in cQA Services
    This is another paper in QA. This work is one extension to many previous work. For example, in “question detection”, the authors proposed a one-class SVM method to obtain the training dataset. In addition, the authors proposed a graph-based method to segment questions into multiple sub-questions. Overall, the authors show that their method can give a significant boost to question matching and retrieval, compared to traditional Bag-of-Word methods. Additionally, the authors show that the Sequential Patterns Mining and Syntactical Patterns Mining can also improve the performance of question detection. One thing is not clear is that which retrieval model the authors used in the paper.
  • Multi-Style Language Model for Web Scale Information Retrieval
    This paper is interesting. It introduces two interesting points. First, it shows the significant gap between query language model and document model where the paper also demonstrated that the anchor and title language model are more near the queries. The second point made by this paper is how to estimate a language model by considering an open vocabulary, namely, an infinite vocabulary. The problem for an open vocabulary language model is how to assign probability mass to unseen terms and how to adjust the mass to seen terms. This paper show one simple method with closed form expressions. This “smoothed” language model is also embedded with a multi-component language model where the model utilizes multiple fields for a document.
  • Mining the Blogosphere for Top News Stories Identification
    This paper is straightforward and interesting. The problem addressed in the paper is to rank news stories according to blogosphere in a given day. Here, the authors treated the “date” as the query. The overall framework falls into language model framework. In order to know how likely all blog posts relevant to the query date, the authors utilize a clustering method to group blog posts into topics and estimate the query language model from these clusters. News headline language model is estimated by a standard Dirichlet smoothed language model. Then, the distance between language model is calculated through KL-divergence. The authors proposed two heuristics to identify the importance of news stories. In all, the paper is well-written and well-organized. However, it is not clear why the authors do not use Multiple-Document representation for a blog, compared to a clustering algorithm. In addition, there are several important parameters are tuned manually, for example, the spread of a news story. This prevent the system used in real applications.
  • Serendipitous Recommendations via Innovators
    This paper reveals one interesting yet not heavily explored area in recommendation systems, the “surprise” of recommendations. The author argues that a recommender which achieved high accuracy may not help users a lot since most recommended items are popular items that can be discovered by users anyway. If a recommender wants to show something really interesting, it should provide provide some items that may not be found by users without any help. Therefore, the author proposed to use “time” as a measure to identify the success of recommendation.  However, the algorithm proposed in the paper is not very intuitive. Anyway, I think it’s still an interesting paper and worth to read.
  • Temporal Diversity in Recommender Systems
    This paper is simple and easy to follow. The main idea of the paper is to show that the temporal dynamics of recommender systems, especially in Netflix. One “obvious” observation of the paper is that users lose “patient” when they see same recommendations over time. Therefore, the authors claim that diversity should be taken into account by recommenders.

Notes on Forest Fire Model 1

Basic Model

The Forest Fire model was firstly proposed in [1] and [2]. It shows its applications in [3] and [4]. The basic setting of the model is as follows. The model has two parameters, a forward burning probability p and a backward burning ratio r. Consider a node v joining the network at time t > 1, and let G_{t} be the graph constructed thus far. G_{1} will consist of just a single node. Node v forms outlinks to nodes in G_{t} according to the following process:

(a) v first chooses an ambassador node w uniformly at random and forms a link to w.

(b) Generate two random numbers, x and y, that are geometrically distributed with means \frac{p}{1-p} and \frac{rp}{1- rp}, respectively. Node v selects x outlinks and y in-links of w incident to nodes that were not yet visited. Let w_{1}, w_{2},\cdots, w_{x+y} denotes the other ends of these selected links. If not enough in or outlinks are available, v selects as many as it can.

(c) v forms outlinks to w_{1}, w_{2} , \cdots, w_{x+y} and then applies step (b) recursively to each of w_{1}, w_{2} , \cdots, w_{x+y}. As the process continues, nodes cannot be visited a second time, preventing the construction from cycling.

Thus, the burning of links in the Forest Fire model begins at w, spreads to w_{1}, w_{2} , \cdots, w_{x+y} and proceeds recursively until it dies out. The original authors claim that this model can preserve the following three properties of a graph as growing:

  • Heavy-tailed in-degree and out-degree distribution
  • Densification power law
  • Shrinking diameter

Application to Graph Sampling

In [3], the authors described how to apply Forest Fire model to graph sampling as follows. We first choose node v uniformly at random. We then generate a random number x that is geometrically distributed with mean \frac{p_{f}}{1-p_{f}}. Node v selects x out-links incident to nodes that were not yet visited. Let w_{1}, w_{2}, \cdots, w_{x} denote the other ends of these selected links. We then apply this step recursively to each of w_{1}, w_{2}, \cdots, w_{x} until enough nodes have been burned. As the process continues, nodes cannot be visited a second time, preventing the construction from cycling. If the fire dies, then we restart it, i.e. select new node v uniformly at random. We call the parameter p_{f} the forward burning probability. In [4], the authors start from a set of “seed” users, which are chosen in three ways: 1) totally random 2) geo-location random 3) randomly according to user activities.

Remarks

Forest Fire model is indeed an ad-hoc model. As stated in [2], it lacks of rigorous analysis of the model itself.

Reference

[1] J. Leskovec, J. Kleinberg, and C. Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD ’05, pages 177–187, New York, NY, USA, 2005. ACM.
[2] J. Leskovec, J. Kleinberg, and C. Faloutsos. Graph evolution: Densification and shrinking diameters. ACM Transactions on Knowledge Discovery from Data (TKDD), 1, March 2007.
[3] J. Leskovec and C. Faloutsos. Sampling from large graphs. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’06, pages 631–636, New York, NY, USA, 2006. ACM.
[4] M. De Choudhury, Y.-R. Lin, H. Sundaram, K. S. Candan, L. Xie, and A. Kelliher. How Does the Data Sampling Strategy Impact the Discovery of Information Diffusion in Social Media? In Proceedings of the 4th International AAAI Conference on Weblogs and Social Media, May 2010.


CIKM 2010 Paper Reading

In this post, I would like to talk about several interesting papers from CIKM 2010.  Note, this is only a personal view of the research conducted in those papers, which might be incorrect and biased.

  1. Web Search Solved? All Result Rankings the Same?
    This paper is a kind of “meta” research. It compares search results from three major search engines, from a sampled 1000 queries in late 2008. The main takeaway is the performance of  frequent queries, especially navigational queries are well served on all three engines. In addition, there is no such an engine that significantly outperform on all other two engines. In fact, all three engines can beat others in certain queries. One interesting point made in the paper is that, based on the analysis, the authors gave an order to prioritize investment on which type of queries search engines should put “more money”, brining this work more practical sense. Two major problems are a) only 1000 queries (although the authors argue this is “big”) and b) dataset in 2008.
  2. SHRINK: A Structural Clustering Algorithm for Detecting Hierarchical Communities in Networks
    The idea of this paper is simple and clear. It uses “density”-based method to group nodes and utilizes Modularity to measure the “goodness” of the grouping. The approach is essentially an extension of previously two-phase greedy approixmation of Modularity clustering, except that the inner loop has changed to a “density” method. In the experiments, the authors showed that this approach can overcome the “resolution limitation” of Modularity clustering. Overall, the paper is interesting but we really don’t know that how this can be applied to real large graphs. Right now, all experiments seem too “small”.
  3. What can Quantum Theory bring to Information Retrieval?
    Wow, Quantum Theory for IR! I came across Rijsbergen’s book on the same topic before but this paper is more concrete.  After reading the paper, I would say that most of the techniques they considered in the paper is indeed to “hide” a series of well-known IR methods under a Quantum-based framework. Although it may provide some theoretical advantages to do so, the real benefit is still questionable. In fact, shown in their experimental part, the proposed method cannot outperform a standard BM25 method. Additionally, there are places that authors use heuristics, for example, how to construct the representation of documents and queries, which are not really justified. In a word, the work is interesting but really does not show the justification of the new framework.
  4. PTM: Probabilistic Topic Mapping Model for Mining Parallel Document Collections
    The method proposed is simple and straightforward. However, there is one “implicit” assumption in the paper. Every word in target collection should be mapped firstly into a topic in source collection. This work is clearly related to translation topic models and collection topic models.
  5. Mining Topic-level Influence in Heterogeneous Networks
    The method proposed in the paper consists of two steps. The first step is a “linked” topic model, discovering topics from a linked network. The second step is somewhat “strange” in the sense that it is very like PageRank or random walk on the topic-level. If that’s the case, the novelty of the work will be re-judged. However, the authors do not provide any clue on it.  The experiments are conducted on small-size datasets. It is really interesting to see the comparison with a variety of topical PageRank algorithms.
  6. Collaborative Dual-PLSA: Mining Distinction and Commonality across Multiple Domains for Text Classification
    This work is similar to (4) yet more general. The latent variables to generate documents and words are decomposed. Therefore, the latent variable associated to documents can be easily considered as “labels” in traditional classification settings. That’s where they showed the power of their method. It is a little surprising that they only conducted experiments on 20-Newsgroup dataset. Similar to (4), one drawback for the model is  that all domains share the same number and same set of latent variables, enforcing the topics matched across domains. The authors are aware of this limitation and discussed extensions in the paper.
  7. Network Growth and the Spectral Evolution Model
    The result presented in this paper is somewhat unexpected. The authors show in several cases, the evolution of networks can be captured by an eigenvalue evolution model where most of eigenvectors remain the same. The authors also introduced an approach to automatically learn the link/prediction function for each eigenvalues. The models proposed also links to many existing methods. This paper and related techniques require more time to be read.
  8. Mining Interesting Link Formation Rules in Social Networks
    This paper goes beyond studying a single link formation pattern but mining a group of formation patterns. The authors extended the state-of-the-art sub graph pattern mining tool — gSpan to the link pattern scenario. One interesting step forward is to investigate the correctness of these patterns. Currently, through the experiments, there is no justification of the patterns found by their tool. In addition, it is not clear that how these found patterns really characterize the evolution of social networks. Nevertheless, this paper provides a method to gather these patterns.
  9. Learning a User-Thread Alignment Manifold for Thread Recommendation in Online Forum
    The method proposed in the paper is fairly complicated. First, the problem considered in the paper is a ranking problem, to rank threads to users according to their interests. The framework consists of three factors, user-user interactions; thread-thread similarities and thread-user alignment. User-user relationships are captured through a weighted graph. For thread-thread relationship, a low-rank representation (embedding) of threads is found, induced by a local thread matrix. The mapping between users and threads are formulated into another manifold learning problem, induced by a user-thread matrix. It is not clear that whether the intuition can be captured by simpler models. But the authors indeed show that the performance is good, compared to some simple methods.
  10. Latent Interest-Topic Model: Finding the Causal Relationships behind Dyadic Data
    Indeed, this is an extend work of a very similar work, published in SIGIR 2010. The model adds two layer between authors and documents. A document-class layer encodes the the distributions over topics for documents. Each document is only belong to one document-class. The choice of document-class depends on another layer, author-class, where it controls that how a document-class is chosen for a particular author. In the end, both authors and documents are naturally associated with topics. It seems that the model can be efficiently estimated through standard Gibbs sampling. However, the experimental results from ACM papers do not really correspond to authors’ expertise, in my opinion.