All issues
- 2024 Vol. 16
- 2023 Vol. 15
- 2022 Vol. 14
- 2021 Vol. 13
- 2020 Vol. 12
- 2019 Vol. 11
- 2018 Vol. 10
- 2017 Vol. 9
- 2016 Vol. 8
- 2015 Vol. 7
- 2014 Vol. 6
- 2013 Vol. 5
- 2012 Vol. 4
- 2011 Vol. 3
- 2010 Vol. 2
- 2009 Vol. 1
Identification of the author of the text by segmentation method
The paper describes a method for recognizing authors of literary texts by the proximity of fragments into which a separate text is divided to the standard of the author. The standard is the empirical frequency distribution of letter combinations, built on a training sample, which included expertly selected reliably known works of this author. A set of standards of different authors forms a library, within which the problem of identifying the author of an unknown text is solved. The proximity between texts is understood in the sense of the norm in L1 for the frequency vector of letter combinations, which is constructed for each fragment and for the text as a whole. The author of an unknown text is assigned the one whose standard is most often chosen as the closest for the set of fragments into which the text is divided. The length of the fragment is optimized based on the principle of the maximum difference in distances from fragments to standards in the problem of recognition of «friend–foe». The method was tested on the corpus of domestic and foreign (translated) authors. 1783 texts of 100 authors with a total volume of about 700 million characters were collected. In order to exclude the bias in the selection of authors, authors whose surnames began with the same letter were considered. In particular, for the letter L, the identification error was 12%. Along with a fairly high accuracy, this method has another important property: it allows you to estimate the probability that the standard of the author of the text in question is missing in the library. This probability can be estimated based on the results of the statistics of the nearest standards for small fragments of text. The paper also examines statistical digital portraits of writers: these are joint empirical distributions of the probability that a certain proportion of the text is identified at a given level of trust. The practical importance of these statistics is that the carriers of the corresponding distributions practically do not overlap for their own and other people’s standards, which makes it possible to recognize the reference distribution of letter combinations at a high level of confidence.
Indexed in Scopus
Full-text version of the journal is also available on the web site of the scientific electronic library eLIBRARY.RU
The journal is included in the Russian Science Citation Index
The journal is included in the RSCI
International Interdisciplinary Conference "Mathematics. Computing. Education"