Daily Archives: March 9, 2011


Reviews on User Modeling in Topic Models

In this post, I would like to review several papers that wish to extend standard topic models with incorporating user information. The first paradigm or group of papers is introduced by M. Rosen-Zvi et al.

These three papers define a “Author-Topic” model, a simple extension of LDA. The generation process is as follows:

  1. For each document $latex d$:
    1. For each word position:
      1. Sample an author $latex x$ uniformly sampled from the group of authors \mathbf{a}_{d} for this document.
      2. Sample an topic assignment $latex z$ from per-author multinomial distribution over topics $latex \theta_{x}$.
      3. Sample a word $latex w$ from topic $latex z$, a multinomial distribution over words.

The inference of the model is done by Gibbs Sampling. The biggest drawback of the model is that it loses the distribution over topics for documents. In “Learning Author-Topic Models from Text Corpora“, the authors proposed a heuristic solution to this problem: adding a fictitious author for each document. The second group of papers is from UMass.

They  proposed several models. The first one is “Author-Recipient-Topic” model, which is suitable for message data, like emails. The generation process is as follows:

  1. For each document $latex d$, we observe its author $latex a_{d}$ and a set of recipients $latex \mathbf{r}_{d}$:
    1. For each word position:
      1. Sample a recipient $latex x$ uniformly sampled from $latex \mathbf{r}_{d}$.
      2. Sample an topic assignment $latex z$ from author-recipient multinomial distribution over topics $latex \theta_{a_{d},x}$.
      3. Sample a word $latex w$ from topic $latex z$, a multinomial distribution over words.

This model is further extended into “Role-Author-Recipient-Topic” model. The idea is that each author or recipient may play different roles in the exchange of messages. Therefore, it is better to explicitly model them. Three possible variants are introduced. The first variant is that for each word position, we first sample a role for author and for the sampled recipient as well. Once the roles are sampled, the topic assignments are sampled from role-role pair-determined multinomial distribution over topics. The second variant is that only one role is generated for the author of the message. However, for recipients, each one has a role. For each word position, a recipient with his corresponding role is firstly sampled and a topic assignment is sampled from author-role author-role pair multinomial distribution over topics. The third variant is that all recipients share a single role. The third model is “Author-Persona-Topic” model. The generation process is as follows:

  1. For each author $latex a$:
    1. Sample a multinomial distribution over persona $latex \eta_{a}$.
    2. For each persona $latex g$, sample a multinomial distribution over topics $latex \theta_{g}$.
  2. For each document $latex d$ with author $latex a_{d}$:
    1. Sample a persona $latex g_{d}$ from $latex \eta_{a_{d}}$.
    2. For each word position:
      1. Sample an topic assignment $latex z$ from $latex \theta_{g_{d}}$.
      2. Sample a word $latex w$ from topic $latex z$, a multinomial distribution over words.

All these models do not have a per-document distribution for topics.

The third group of papers is from Noriaki Kawamae. Both models introduced in these papers extended the ideas of “Author-Topic” model and “Author-Persona-Topic” model in particular.

The first model is “Author-Interest-Topic” model. It introduced a notion of “document-class”. The authors have a distribution over document-classes and for each document class, it has a distribution over topics. Here, we can think of document-class as “persona” in previous models. For each document, it firstly samples a document-class from per-author distibution over document classes. Then, by using this document-class, we can draw topics from this particular class. The difference between “Author-Interest-Topic” model and “Author-Persona-Topic” model is that the distribution over topics for each persona is under author level in “Author-Persona-Topic” but they are global variables in “Author-Interest Topic” model. The “Latent-Interest-Topic” model is much complicated than all previous models. It adds another layer of abstraction, author-classes. For each author, it has variable to indicate his author-class, which is drawn from a multinomial distribution. For each author-class, there is a multinomial distribution over topics. For each document, we first draw a document-class from its author’s per author-class distribution over document-classes. Then, the later generation process is same as “Author-Interest-Topic“. The key for “Author-Interest-Topic” and “Latent-Interest-Topic” models is that they are clustering models, in the sense that authors or documents are forced clustered into either author classes or document classes.

The last group of papers is from Jie Tang et al. All the proposed models are based on “Author-Topic” model.

They firstly proposed three variants of “Author-Conference-Topic” model. For each author, there is a multinomial distribution over topics. For each token in the document, an author is uniformly sampled and the topic assignment is sampled from per-author multinomial distribution over topics. The differences between three variants are how the conference stamp is generated. We omit the discussion here.