In this post, I would like to review several papers that extend standard topic models by incorporating user information. The first group of papers is by M. Rosen-Zvi et al.
- The Author-Topic Model for Authors and Documents. M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. Proceedings of the 20th Annual Conference on Uncertainty in Artificial Intelligence (UAI), 2004.
- Probabilistic Author-Topic Models for Information Discovery. M. Steyvers, P. Smyth, M. Rosen-Zvi, and T. Griffiths. Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004.
- Learning Author-Topic Models from Text Corpora. M. Rosen-Zvi, C. Chemudugunta, T. Griffiths, P. Smyth, and M. Steyvers. ACM Transactions on Information Systems, 28(1), 1-38, 2010.
These three papers define the “Author-Topic” model, a simple extension of LDA. Its generative process is as follows (a simulation sketch follows the list):
- For each document $latex d$:
- For each word position:
- Sample an author $latex x$ uniformly from the set of authors of this document.
- Sample a topic assignment $latex z$ from the per-author multinomial distribution over topics $latex \theta_{x}$.
- Sample a word $latex w$ from topic $latex z$, a multinomial distribution over words.
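To make this concrete, here is a minimal simulation sketch of the generative process in Python. This is not the authors’ code: the corpus sizes, hyperparameter values, and all variable and function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

V, K, A = 1000, 20, 5        # vocabulary size, topics, authors (illustrative)
alpha, beta = 0.1, 0.01      # assumed symmetric Dirichlet hyperparameters

theta = rng.dirichlet(alpha * np.ones(K), size=A)  # per-author distributions over topics
phi = rng.dirichlet(beta * np.ones(V), size=K)     # per-topic distributions over words

def generate_document(authors, n_words):
    """Generate one document under the Author-Topic model."""
    words = []
    for _ in range(n_words):
        x = rng.choice(authors)                # author sampled uniformly from this document's authors
        z = rng.choice(K, p=theta[x])          # topic from that author's distribution
        words.append(rng.choice(V, p=phi[z]))  # word from the sampled topic
    return words

doc = generate_document(authors=[0, 3], n_words=50)
```

Note that the document’s own topic proportions never appear anywhere in this process, which is exactly the drawback discussed next.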
Inference in the model is done by Gibbs sampling. The biggest drawback of the model is that it loses the per-document distribution over topics. In “Learning Author-Topic Models from Text Corpora”, the authors propose a heuristic solution to this problem: adding a fictitious author for each document. The second group of papers is from UMass.
- The Author-Recipient-Topic Model for Topic and Role Discovery in Social Networks: Experiments with Enron and Academic Email. Andrew McCallum, Andres Corrada-Emmanuel, and Xuerui Wang. NIPS Workshop on Structured Data and Representations in Probabilistic Models for Categorization; also UMass Technical Report UM-CS-2004-096, 2004.
- Topic and Role Discovery in Social Networks. Andrew McCallum, Andres Corrada-Emmanuel, and Xuerui Wang. Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI), pp. 786-791, 2005.
- Topic and Role Discovery in Social Networks with Experiments on Enron and Academic Email. Andrew McCallum, Xuerui Wang, and Andres Corrada-Emmanuel. Journal of Artificial Intelligence Research, Vol. 30, pp. 249-272, 2007.
- Expertise Modeling for Matching Papers with Reviewers. David Mimno and Andrew McCallum. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), San Jose, CA, 2007.
They propose several models. The first is the “Author-Recipient-Topic” model, which is suited to message data such as emails. Its generative process is as follows (a sketch follows the list):
- For each document $latex d$, we observe its author $latex a_{d}$ and a set of recipients $latex \mathbf{r}_{d}$:
- For each word position:
- Sample a recipient $latex x$ uniformly from $latex \mathbf{r}_{d}$.
- Sample a topic assignment $latex z$ from the author-recipient multinomial distribution over topics $latex \theta_{a_{d},x}$.
- Sample a word $latex w$ from topic $latex z$, a multinomial distribution over words.
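Again, a minimal Python sketch under assumed sizes, hyperparameters, and names; the only structural change from the Author-Topic sketch is that topic distributions are now indexed by (author, recipient) pairs:

```python
import numpy as np

rng = np.random.default_rng(1)

V, K, A = 1000, 20, 10       # vocabulary size, topics, people (illustrative)
alpha, beta = 0.1, 0.01      # assumed symmetric Dirichlet hyperparameters

# One distribution over topics per (author, recipient) pair.
theta = rng.dirichlet(alpha * np.ones(K), size=(A, A))
phi = rng.dirichlet(beta * np.ones(V), size=K)      # per-topic distributions over words

def generate_message(author, recipients, n_words):
    """Generate one message under the Author-Recipient-Topic model."""
    words = []
    for _ in range(n_words):
        x = rng.choice(recipients)              # recipient sampled uniformly
        z = rng.choice(K, p=theta[author, x])   # topic from the (author, recipient) pair
        words.append(rng.choice(V, p=phi[z]))   # word from the sampled topic
    return words

email = generate_message(author=2, recipients=[0, 5, 7], n_words=40)
```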
This model is further extended into the “Role-Author-Recipient-Topic” model. The idea is that each author or recipient may play different roles in the exchange of messages, so it is better to model these roles explicitly. Three variants are introduced. In the first, for each word position we sample a role for the author and a role for the sampled recipient; once the roles are sampled, the topic assignment is drawn from the multinomial distribution over topics determined by that role pair. In the second, only one role is generated for the author of the message, while each recipient has their own role; for each word position, a recipient with the corresponding role is first sampled, and the topic assignment is drawn from the multinomial over topics determined by the author-role and recipient-role pair. In the third, all recipients share a single role. The third model is the “Author-Persona-Topic” model. Its generative process is as follows (a sketch follows the list):
- For each author $latex a$:
- Sample a multinomial distribution over personas $latex \eta_{a}$.
- For each persona $latex g$, sample a multinomial distribution over topics $latex \theta_{g}$.
- For each document $latex d$ with author $latex a_{d}$:
- Sample a persona $latex g_{d}$ from $latex \eta_{a_{d}}$.
- For each word position:
- Sample a topic assignment $latex z$ from $latex \theta_{g_{d}}$.
- Sample a word $latex w$ from topic $latex z$, a multinomial distribution over words.
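A sketch of this process under the same illustrative assumptions as before (sizes, hyperparameters, and function names are mine, not the authors’); the key difference is that a single persona is drawn once per document:

```python
import numpy as np

rng = np.random.default_rng(2)

V, K, G = 1000, 20, 4                # vocabulary size, topics, personas per author (illustrative)
alpha, beta, gamma = 0.1, 0.01, 1.0  # assumed Dirichlet hyperparameters

phi = rng.dirichlet(beta * np.ones(V), size=K)  # per-topic distributions over words

def make_author():
    """Author-level parameters: a persona mixture and per-persona topic mixtures."""
    eta = rng.dirichlet(gamma * np.ones(G))               # distribution over personas
    theta = rng.dirichlet(alpha * np.ones(K), size=G)     # per-persona distributions over topics
    return eta, theta

def generate_document(eta, theta, n_words):
    """Generate one document under the Author-Persona-Topic model."""
    g = rng.choice(G, p=eta)  # one persona for the whole document
    return [rng.choice(V, p=phi[rng.choice(K, p=theta[g])]) for _ in range(n_words)]

eta_a, theta_a = make_author()
doc = generate_document(eta_a, theta_a, n_words=30)
```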
Note that none of these models has a per-document distribution over topics.
The third group of papers is from Noriaki Kawamae. The models introduced in these papers extend the ideas of the “Author-Topic” model and, in particular, the “Author-Persona-Topic” model.
- Author Interest Topic Model. Noriaki Kawamae. Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’10), pp. 887-888, 2010.
- Latent Interest-Topic Model: Finding the Causal Relationships behind Dyadic Data. Noriaki Kawamae. Proceedings of the 19th ACM International Conference on Information and Knowledge Management (CIKM ’10), pp. 649-658, 2010.
The first model is the “Author-Interest-Topic” model, which introduces the notion of a “document class”. Each author has a distribution over document classes, and each document class has a distribution over topics; we can think of a document class as the “persona” of the previous models. For each document, we first sample a document class from the per-author distribution over document classes, and then draw topics from that particular class. The difference between the “Author-Interest-Topic” and “Author-Persona-Topic” models is that the per-persona distributions over topics live at the author level in “Author-Persona-Topic”, whereas the per-document-class distributions are global variables in “Author-Interest-Topic”. The “Latent-Interest-Topic” model is considerably more complicated than all the previous models. It adds another layer of abstraction: author classes. Each author has a variable indicating his author class, drawn from a multinomial distribution, and each author class has a multinomial distribution over topics. For each document, we first draw a document class from the author’s per-author-class distribution over document classes; the rest of the generative process is the same as in “Author-Interest-Topic” (a sketch follows). The key point about the “Author-Interest-Topic” and “Latent-Interest-Topic” models is that they are clustering models, in the sense that authors and documents are forced into clusters (author classes and document classes, respectively).
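A rough sketch of the “Latent-Interest-Topic” clustering hierarchy as described above. All sizes, hyperparameters, and names are assumptions, and the per-author-class distributions over topics are omitted since they do not enter this simplified document-generation path:

```python
import numpy as np

rng = np.random.default_rng(3)

V, K = 1000, 20            # vocabulary size, topics (illustrative)
C_a, C_d = 3, 5            # number of author classes and document classes (assumed)
alpha, beta = 0.1, 0.01    # assumed Dirichlet hyperparameters

phi = rng.dirichlet(beta * np.ones(V), size=K)        # global per-topic word distributions
theta = rng.dirichlet(alpha * np.ones(K), size=C_d)   # global per-document-class topic distributions
pi = rng.dirichlet(np.ones(C_a))                      # distribution over author classes
psi = rng.dirichlet(np.ones(C_d), size=C_a)           # per-author-class distributions over document classes

def generate_document(author_class, n_words):
    """Generate one document under a Latent-Interest-Topic-style hierarchy."""
    c = rng.choice(C_d, p=psi[author_class])          # document class given the author's class
    zs = rng.choice(K, size=n_words, p=theta[c])      # topics from that document class
    return [rng.choice(V, p=phi[z]) for z in zs]

y = rng.choice(C_a, p=pi)                             # the author's latent class
doc = generate_document(y, n_words=25)
```

Dropping the author-class layer (so that `psi` is indexed by author rather than author class) recovers the “Author-Interest-Topic” structure.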
The last group of papers is from Jie Tang et al. All the proposed models are based on the “Author-Topic” model.
- ArnetMiner: Extraction and Mining of Academic Social Networks. Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’08), pp. 990-998, 2008.
- Topic Level Expertise Search over Heterogeneous Networks. Jie Tang, Jing Zhang, Ruoming Jin, Zi Yang, Keke Cai, Li Zhang, and Zhong Su. Machine Learning Journal, 82(2), pp. 211-237, 2011.
They first propose three variants of the “Author-Conference-Topic” model. Each author has a multinomial distribution over topics. For each token in a document, an author is sampled uniformly, and the topic assignment is drawn from that author’s multinomial distribution over topics. The three variants differ in how the conference stamp is generated; we omit that discussion here, but a sketch of the shared core follows.
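A sketch of the per-token core described above. The conference-stamp step shown here, drawing the stamp per token from a per-topic distribution, is only an assumed placeholder, since the papers’ three variants differ precisely in that step:

```python
import numpy as np

rng = np.random.default_rng(4)

V, K, A, C = 1000, 20, 8, 6         # vocabulary, topics, authors, conferences (illustrative)
alpha, beta, mu = 0.1, 0.01, 0.1    # assumed Dirichlet hyperparameters

theta = rng.dirichlet(alpha * np.ones(K), size=A)  # per-author distributions over topics
phi = rng.dirichlet(beta * np.ones(V), size=K)     # per-topic distributions over words
psi = rng.dirichlet(mu * np.ones(C), size=K)       # per-topic conference distributions (assumed variant)

def generate_paper(authors, n_words):
    """Generate one paper under an Author-Conference-Topic-style process."""
    words, confs = [], []
    for _ in range(n_words):
        x = rng.choice(authors)                # author sampled uniformly per token
        z = rng.choice(K, p=theta[x])          # topic from that author's distribution
        words.append(rng.choice(V, p=phi[z]))  # word from the sampled topic
        confs.append(rng.choice(C, p=psi[z]))  # conference stamp; the variants differ here
    return words, confs

words, confs = generate_paper(authors=[1, 4], n_words=60)
```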