Additive regularization of topic models with fast text vectorization

A probabilistic topic model of a text document collection finds two matrices: a matrix of conditional probabilities of topics in documents and a matrix of conditional probabilities of words in topics. Each document is represented by a multiset of words, also called a “bag of words”, which assumes that word order is not important for revealing the latent topics of the document. Under this assumption, the problem reduces to a low-rank non-negative matrix factorization governed by likelihood maximization. In general, this problem is ill-posed and has an infinite set of solutions. To regularize the solution, a weighted sum of optimization criteria is added to the log-likelihood. When modeling large text collections, storing the first matrix appears impractical, since its size is proportional to the number of documents in the collection. At the same time, the topical vector representation (embedding) of documents is necessary for many text analysis tasks, such as information retrieval, clustering, classification, and summarization of texts. In practice, the topical embedding of a document is calculated “on-the-fly”, which may require dozens of iterations over all the words of the document. In this paper, we propose a way to calculate a topical embedding quickly, in a single pass over the document words. For this purpose, an additional constraint is introduced into the model in the form of an equation that calculates the first matrix from the second one in linear time. Although this constraint is not formally an optimization criterion, it effectively plays the role of a regularizer and can be combined with other regularizers within the additive regularization framework ARTM. Experiments on three text collections have shown that the proposed method improves the model in terms of the sparseness, difference, logLift and coherence measures of topic quality. The open source libraries BigARTM and TopicNet were used for the experiments.
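The abstract does not give the constraint equation itself, so the following is only a minimal sketch of what a one-pass topical embedding could look like, not the authors' formula and not the BigARTM or TopicNet API. It assumes that the embedding is the count-weighted average of per-word topic posteriors p(t|w), obtained from the word-in-topic matrix by Bayes' rule with a topic prior that defaults to uniform. The function name `one_pass_embedding`, the dictionary layout of `phi`, and the toy data are all hypothetical.

```python
def one_pass_embedding(word_counts, phi, topic_prior=None):
    """Estimate p(t|d) for one document in a single pass over its words.

    word_counts: dict mapping word -> count in the document (bag of words).
    phi: dict mapping word -> list of p(w|t), one value per topic.
    topic_prior: list of p(t); uniform if omitted (an assumption of this sketch).
    """
    num_topics = len(next(iter(phi.values())))
    if topic_prior is None:
        topic_prior = [1.0 / num_topics] * num_topics

    theta = [0.0] * num_topics
    total = 0
    for word, count in word_counts.items():
        row = phi.get(word)
        if row is None:
            continue  # out-of-vocabulary word: skipped
        # Bayes' rule: p(t|w) is proportional to p(w|t) * p(t)
        weights = [p_wt * p_t for p_wt, p_t in zip(row, topic_prior)]
        norm = sum(weights)
        if norm == 0.0:
            continue  # word has zero probability in every topic
        for t in range(num_topics):
            theta[t] += count * weights[t] / norm
        total += count

    if total == 0:
        return list(topic_prior)  # no usable words: fall back to the prior
    return [value / total for value in theta]


# Toy word-in-topic matrix with two topics (hypothetical data).
phi = {
    "cat": [0.5, 0.0],
    "dog": [0.5, 0.0],
    "stock": [0.0, 0.6],
    "bank": [0.0, 0.4],
}
document = {"cat": 2, "bank": 1, "dog": 1}
print(one_pass_embedding(document, phi))
```

Under these assumptions the cost is proportional to the number of distinct words in the document times the number of topics, with no iterative refinement, which is the property the abstract contrasts with the usual “on-the-fly” computation.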

Keywords: natural language processing, unsupervised learning, topic modeling, additive regularization of topic models, EM-algorithm, PLSA, LDA, ARTM, BigARTM, TopicNet
Citation in English: Irkhin I.A., Bulatov V.G., Vorontsov K.V. Additive regularization of topic models with fast text vectorization // Computer Research and Modeling, 2020, vol. 12, no. 6, pp. 1515-1528
DOI: 10.20537/2076-7633-2020-12-6-1515-1528

Full-text version of the journal is also available on the web site of the scientific electronic library eLIBRARY.RU

The journal is included in the Russian Science Citation Index

The journal is included in the List of Russian peer-reviewed journals publishing the main research results of PhD and doctoral dissertations.

Indexed in Scopus