Grid Representations of Texts


One long-standing theme of text mining is modeling text in a proper space. A simple yet powerful choice is a one-dimensional space in which words are generated from a single uncorrelated array of terms. This is usually called the bag-of-words (BOW) representation.
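To make this concrete, here is a minimal Python sketch (the toy document and vocabulary are my own, purely for illustration) that builds a BOW count vector; note how word order is discarded entirely:

```python
from collections import Counter

def bag_of_words(tokens, vocabulary):
    # Map a token list to per-term counts; word order is thrown away.
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

doc = "the cat sat on the mat".split()
vocab = sorted(set(doc))
print(vocab)                     # ['cat', 'mat', 'on', 'sat', 'the']
print(bag_of_words(doc, vocab))  # [1, 1, 1, 1, 2]
```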

BOW is also the foundation of more sophisticated approaches such as latent Dirichlet allocation (LDA) and latent semantic indexing (LSI). In such models, terms in a document are essentially generated from several 1D arrays (topics), where each array has its own distinct distribution over the whole vocabulary. However, just as in plain BOW, these 1D arrays are uncorrelated with one another.
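A rough generative sketch of this topic-model view (the toy sizes K, V, and doc_len are my own assumptions, not taken from any paper) shows each topic as an independent 1D distribution over the vocabulary:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, doc_len = 3, 10, 20                   # topics, vocabulary size, doc length

topics = rng.dirichlet(np.ones(V), size=K)  # K independent 1D term distributions
theta = rng.dirichlet(np.ones(K))           # per-document topic mixture

# Each word picks a topic, then a term from that topic's array;
# the topic arrays themselves share no structure with one another.
words = [rng.choice(V, p=topics[rng.choice(K, p=theta)]) for _ in range(doc_len)]
```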

The idea behind grid representations of texts is essentially to model correlations between words. One such example is "Multidimensional counting grids" [1], along with its "admixture" version, "Componential counting grids" [2]. The basic assumption is that texts are generated by a window moving over a grid, where each grid cell is a distribution over terms. One benefit of such a representation is that thematic shifts are handled more naturally within the framework: as the window slides, the word distribution changes gradually. In addition, depending on the window size, n-gram-like effects are captured automatically.
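Here is a hedged sketch of that generative story (the grid size G, window size W, and the averaging step reflect my reading of the counting-grid papers, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)
G, V, W, doc_len = 8, 10, 3, 20   # grid side, vocabulary, window side, doc length

# Each grid cell carries its own distribution over terms; because windows
# overlap, nearby locations yield correlated word statistics.
grid = rng.dirichlet(np.ones(V), size=(G, G))

# A document sits at a latent window position; its words are drawn from
# the average of the cell distributions the window covers.
i, j = rng.integers(0, G - W + 1, size=2)
window_dist = grid[i:i + W, j:j + W].reshape(-1, V).mean(axis=0)
words = rng.choice(V, p=window_dist, size=doc_len)
```

Sliding the window by one cell replaces only a fraction of the covered distributions, which is why gradual thematic shifts fall out of the model so naturally.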

Although on the surface the true advantage of such a representation over BOW is not entirely clear, it is nevertheless an interesting idea to explore.

References

[1] Jojic, N., Perina, A.: Multidimensional Counting Grids: Inferring Word Order from Disordered Bags of Words. In: UAI 2011, pp. 547-556.
[2] Perina, A., Jojic, N., Bicego, M., Turski, A.: Documents as Multiple Overlapping Windows into a Grid of Counts. In: NIPS 2013.
