Research in General


Notes on Calculating Log Sum of Exponentials [1]

Sometimes, you want to calculate the logarithm of a sum of exponentials, like:
\[
\begin{align}
\log \sum_{i=1}^{n} \exp (x_{i})
\end{align}
\]The problem with directly calculating this expression is that if one \( x_{i} \) gets sufficiently large, computing its exponential will overflow. We can get more numerical stability with a little algebra. First, find the maximum value \( m \) in \( \mathbf{x} \):
\[
\begin{align}
m = \max_{i} \, x_{i}
\end{align}
\]Then, we have:
\[
\begin{align}
\log \sum_{i=1}^{n} \exp (x_{i}) &= \log \Bigl( \sum_{i=1}^{n} \frac{\exp (m)}{\exp (m)} \exp(x_{i}) \Bigr) \\
&= \log \Bigl( \exp (m) \sum_{i=1}^{n} \frac{1}{\exp (m)} \exp(x_{i}) \Bigr) \\
&= \log \exp(m) + \log \Bigl( \sum_{i=1}^{n} \exp(-m) \exp(x_{i}) \Bigr) \\
&= m + \log \Bigl( \sum_{i=1}^{n} \exp( x_{i} - m ) \Bigr)
\end{align}
\]Now the largest value passed to the exponential function is 0, so it can never overflow. If some terms become very negative after subtracting \( m \), their exponentials underflow to zero and drop out, as they should with limited-precision arithmetic.
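
As a minimal sketch of this trick (plain Python with only the standard library; the function name and test values are my own, not taken from the referenced posts), the stable computation might look like this:

```python
import math

def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x_i))) via the max-subtraction trick."""
    m = max(xs)                                        # largest value in xs
    # After subtracting m, the largest argument to exp() is 0, so no overflow.
    return m + math.log(sum(math.exp(x - m) for x in xs))

# math.exp(1000) overflows a float, but the stable version handles it fine:
print(log_sum_exp([1000.0, 999.0, 998.0]))             # about 1000.408
```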

By using the technique introduced here, we can also calculate:

\[
\begin{align}
\log \sum_{i=1}^{n} y_{i} &= \log \Bigl( \sum_{i=1}^{n} \exp ( \log y_{i} ) \Bigr) \\
&= \log m + \log \Bigl( \sum_{i=1}^{n} \exp ( \log y_{i} - \log m) \Bigr)
\end{align}
\]where \( m = \max_{i} \, y_{i} \). This is useful when the \( y_{i} \) are only available as logarithms (for example, very small probabilities) and summing them directly would underflow.
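
A similar sketch for this log-domain sum (again plain Python; the function name and the example values are illustrative, not from the original post). Note that this is just the same log-sum-exp computation applied to \( \log y_{i} \):

```python
import math

def log_sum_from_logs(log_ys):
    """log(sum(y_i)) given the values log(y_i), without ever leaving log space."""
    m = max(log_ys)                                    # this is the log of the largest y_i
    return m + math.log(sum(math.exp(ly - m) for ly in log_ys))

# Summing probabilities so small that exp() would underflow them to 0.0:
print(log_sum_from_logs([-800.0, -801.0, -802.0]))     # about -799.592
```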

References

[1] Log Sum of Exponentials
[2] log-sum-exp trick


SIGIR 2010 Paper Reading

In this post, I would like to talk about several interesting papers from SIGIR 2010. Note that this only reflects my view of this scientific work and is not necessarily correct or thorough.

  • On Statistical Analysis and Optimization of Information Retrieval Effectiveness Metrics
    This paper is more theoretical than practical. The main contribution is the argument that the optimal ranking problem should be factorized into two distinct yet interrelated stages: a relevance prediction stage and a ranking decision stage. The paper shows that a number of IR metrics (e.g., Average Precision, DCG, Reciprocal Rank) can be decomposed into these two stages. The overall strategy is therefore to directly optimize the decomposed metrics. The authors show improved performance over simple language models. However, the paper does not compare against Learning-to-Rank techniques, where these metrics are also optimized. In all, this is an interesting paper for those who really work in ad-hoc retrieval.
  • Evaluating and Predicting Answer Quality in Community QA
    This paper is straightforward. The authors want to predict the best answers (in their words, “answer quality”) on community QA sites. They first used a number of subjective features obtained from Amazon Mechanical Turk and found it difficult to do so. Then they used a set of automatically extracted features (mostly meta-information features) and showed improved performance. The work is simple and indeed related to my work in SIGIR 2009. They still do not answer the question of whether a so-called “best answer” is really the best answer among all answers to the corresponding question. Moreover, classification approaches are not compared against retrieval-based methods in this paper.
  • Segmentation of Multi-Sentence Questions: Towards Effective Question Retrieval in cQA Services
    This is another paper on QA. This work extends much previous work. For example, for “question detection” the authors propose a one-class SVM method to obtain the training dataset. In addition, the authors propose a graph-based method to segment questions into multiple sub-questions. Overall, the authors show that their method gives a significant boost to question matching and retrieval compared to traditional bag-of-words methods. Additionally, the authors show that sequential pattern mining and syntactic pattern mining can also improve the performance of question detection. One thing that is not clear is which retrieval model the authors used in the paper.
  • Multi-Style Language Model for Web Scale Information Retrieval
    This paper is interesting and makes two main points. First, it shows a significant gap between the query language model and the document language model, and demonstrates that the anchor and title language models are much closer to queries. The second point is how to estimate a language model over an open vocabulary, namely an infinite vocabulary. The problem for an open-vocabulary language model is how to assign probability mass to unseen terms and how to adjust the mass of seen terms. The paper shows one simple method with closed-form expressions. This “smoothed” language model is also embedded in a multi-component language model that utilizes multiple fields of a document.
  • Mining the Blogosphere for Top News Stories Identification
    This paper is straightforward and interesting. The problem addressed is to rank news stories according to the blogosphere on a given day; here, the authors treat the “date” as the query. The overall approach falls within the language modeling framework. To estimate how relevant blog posts are to the query date, the authors use a clustering method to group blog posts into topics and estimate the query language model from these clusters. The news headline language model is estimated with standard Dirichlet smoothing, and the distance between language models is calculated via KL-divergence. The authors also propose two heuristics to identify the importance of news stories. In all, the paper is well-written and well-organized. However, it is not clear why the authors do not use a multiple-document representation for a blog instead of a clustering algorithm. In addition, several important parameters are tuned manually, for example the spread of a news story, which prevents the system from being used in real applications.
  • Serendipitous Recommendations via Innovators
    This paper explores an interesting yet not heavily explored area of recommendation systems: the “surprise” of recommendations. The author argues that a recommender that achieves high accuracy may not help users much, since most recommended items are popular items that users could discover anyway. If a recommender wants to show something really interesting, it should provide items that users may not find without any help. Therefore, the author proposes to use “time” as a measure of the success of a recommendation. However, the algorithm proposed in the paper is not very intuitive. Anyway, I think it is still an interesting paper and worth reading.
  • Temporal Diversity in Recommender Systems
    This paper is simple and easy to follow. The main idea is to study the temporal dynamics of recommender systems, especially on Netflix. One “obvious” observation is that users lose patience when they see the same recommendations over time. Therefore, the authors claim that diversity should be taken into account by recommenders.

CIKM 2010 Paper Reading

In this post, I would like to talk about several interesting papers from CIKM 2010. Note that this is only a personal view of the research conducted in those papers, which might be incorrect or biased.

  1. Web Search Solved? All Result Rankings the Same?
    This paper is a kind of “meta” research. It compares search results from three major search engines, using a sample of 1,000 queries from late 2008. The main takeaway is that frequent queries, especially navigational queries, are well served by all three engines. In addition, no single engine significantly outperforms the other two; in fact, each engine beats the others on certain queries. One interesting point is that, based on the analysis, the authors give an ordering of which types of queries search engines should prioritize investing in, which lends the work more practical value. Two major problems are a) only 1,000 queries (although the authors argue this is “big”) and b) the dataset is from 2008.
  2. SHRINK: A Structural Clustering Algorithm for Detecting Hierarchical Communities in Networks
    The idea of this paper is simple and clear. It uses a “density”-based method to group nodes and utilizes modularity to measure the “goodness” of the grouping. The approach is essentially an extension of the previous two-phase greedy approximation of modularity clustering, except that the inner loop is replaced with a “density” method. In the experiments, the authors show that this approach can overcome the “resolution limit” of modularity clustering. Overall, the paper is interesting, but we really do not know how this can be applied to real, large graphs; right now, all experiments seem too “small”.
  3. What can Quantum Theory bring to Information Retrieval?
    Wow, Quantum Theory for IR! I came across van Rijsbergen’s book on the same topic before, but this paper is more concrete. After reading the paper, I would say that most of the techniques considered in it essentially “hide” a series of well-known IR methods under a quantum-based framework. Although doing so may provide some theoretical advantages, the real benefit is still questionable. In fact, as shown in the experimental section, the proposed method cannot outperform standard BM25. Additionally, there are places where the authors use heuristics, for example in how to construct the representations of documents and queries, that are not really justified. In short, the work is interesting but does not really justify the new framework.
  4. PTM: Probabilistic Topic Mapping Model for Mining Parallel Document Collections
    The method proposed is simple and straightforward. However, there is one “implicit” assumption in the paper: every word in the target collection must first be mapped to a topic in the source collection. This work is clearly related to translation topic models and collection topic models.
  5. Mining Topic-level Influence in Heterogeneous Networks
    The method proposed in the paper consists of two steps. The first step is a “linked” topic model that discovers topics from a linked network. The second step is somewhat “strange” in the sense that it looks very much like PageRank, or a random walk, at the topic level; if that is the case, the novelty of the work should be re-judged, but the authors do not provide any clue about it. The experiments are conducted on small datasets. It would be really interesting to see a comparison with a variety of topical PageRank algorithms.
  6. Collaborative Dual-PLSA: Mining Distinction and Commonality across Multiple Domains for Text Classification
    This work is similar to (4) yet more general. The latent variables that generate documents and words are decoupled, so the latent variable associated with documents can easily be treated as the “labels” of traditional classification settings; that is where they show the power of their method. It is a little surprising that they only conducted experiments on the 20 Newsgroups dataset. Similar to (4), one drawback of the model is that all domains share the same number and the same set of latent variables, enforcing matched topics across domains. The authors are aware of this limitation and discuss extensions in the paper.
  7. Network Growth and the Spectral Evolution Model
    The result presented in this paper is somewhat unexpected. The authors show that in several cases the evolution of a network can be captured by an eigenvalue evolution model in which most of the eigenvectors remain the same. The authors also introduce an approach to automatically learn the link-prediction function for each eigenvalue. The proposed model also links to many existing methods. This paper and the related techniques require more time to read carefully.
  8. Mining Interesting Link Formation Rules in Social Networks
    This paper goes beyond studying a single link-formation pattern and instead mines groups of formation patterns. The authors extend the state-of-the-art subgraph pattern mining tool, gSpan, to the link-pattern scenario. One interesting step forward would be to investigate the correctness of these patterns; currently, the experiments provide no justification of the patterns found by the tool. In addition, it is not clear how the discovered patterns really characterize the evolution of social networks. Nevertheless, this paper provides a method to gather such patterns.
  9. Learning a User-Thread Alignment Manifold for Thread Recommendation in Online Forum
    The method proposed in the paper is fairly complicated. The problem considered is a ranking problem: ranking threads for users according to their interests. The framework consists of three factors: user-user interactions, thread-thread similarities, and user-thread alignment. User-user relationships are captured through a weighted graph. For thread-thread relationships, a low-rank representation (embedding) of threads is found, induced by a local thread matrix. The mapping between users and threads is formulated as another manifold learning problem, induced by a user-thread matrix. It is not clear whether the intuition could be captured by simpler models, but the authors do show good performance compared to some simple baselines.
  10. Latent Interest-Topic Model: Finding the Causal Relationships behind Dyadic Data
    Indeed, this is an extension of a very similar work published in SIGIR 2010. The model adds two layers between authors and documents. A document-class layer encodes the distribution over topics for documents, and each document belongs to exactly one document class. The choice of document class depends on another layer, the author class, which controls how a document class is chosen for a particular author. In the end, both authors and documents are naturally associated with topics. It seems that the model can be estimated efficiently through standard Gibbs sampling. However, in my opinion, the experimental results on ACM papers do not really correspond to the authors’ expertise.