Semantic structuring of text documents based on patterns of natural language entities

 pdf (195K)

The technology of creating patterns from natural language words (concepts) based on text data in the bag of words model is considered. Patterns are used to reduce the dimension of the original space in the description of documents and search for semantically related words by topic. The process of dimensionality reduction is implemented through the formation of patterns of latent features. The variety of structures of document relations is investigated in order to divide them into themes in the latent space.

It is considered that a given set of documents (objects) is divided into two non-overlapping classes, for the analysis of which it is necessary to use a common dictionary. The belonging of words to a common vocabulary is initially unknown. Class objects are considered as opposition to each other. Quantitative parameters of oppositionality are determined through the values of the stability of each feature and generalized assessments of objects according to non-overlapping sets of features.

To calculate the stability, the feature values are divided into non-intersecting intervals, the optimal boundaries of which are determined by a special criterion. The maximum stability is achieved under the condition that the boundaries of each interval contain values of one of the two classes.

The composition of features in sets (patterns of words) is formed from a sequence ordered by stability values. The process of formation of patterns and latent features based on them is implemented according to the rules of hierarchical agglomerative grouping.

A set of latent features is used for cluster analysis of documents using metric grouping algorithms. The analysis applies the coefficient of content authenticity based on the data on the belonging of documents to classes. The coefficient is a numerical characteristic of the dominance of class representatives in groups.

To divide documents into topics, it is proposed to use the union of groups in relation to their centers. As patterns for each topic, a sequence of words ordered by frequency of occurrence from a common dictionary is considered.

The results of a computational experiment on collections of abstracts of scientific dissertations are presented. Sequences of words from the general dictionary on 4 topics are formed.

Keywords: topic modeling, hierarchical agglomerative grouping, ontology, general vocabulary, content authenticity
Citation in English: Ignatev N.A., Tuliev U.Y. Semantic structuring of text documents based on patterns of natural language entities // Computer Research and Modeling, 2022, vol. 14, no. 5, pp. 1185-1197
Citation in English: Ignatev N.A., Tuliev U.Y. Semantic structuring of text documents based on patterns of natural language entities // Computer Research and Modeling, 2022, vol. 14, no. 5, pp. 1185-1197
DOI: 10.20537/2076-7633-2022-14-5-1185-1197

Indexed in Scopus

Full-text version of the journal is also available on the web site of the scientific electronic library eLIBRARY.RU

The journal is included in the Russian Science Citation Index

The journal is included in the RSCI

International Interdisciplinary Conference "Mathematics. Computing. Education"