Development of and research on an algorithm for distinguishing features in Twitter publications for a classification problem with known markup

 pdf (1690K)

Social media posts play an important role in demonstration of financial market state, and their analysis is a powerful tool for trading. The article describes the result of a study of the impact of social media activities on the movement of the financial market. The top authoritative influencers are selected. Twitter posts are used as data. Such texts usually include slang and abbreviations, so methods for preparing primary text data, including Stanza, regular expressions are presented. Two approaches to the representation of a point in time in the format of text data are considered. The difference of the influence of a single tweet or a whole package consisting of tweets collected over a certain period of time is investigated. A statistical approach in the form of frequency analysis is also considered, metrics defined by the significance of a particular word when identifying the relationship between price changes and Twitter posts are introduced. Frequency analysis involves the study of the occurrence distributions of various words and bigrams in the text for positive, negative or general trends. To build the markup, changes in the market are processed into a binary vector using various parameters, thus setting the task of binary classification. The parameters for Binance candlesticks are sorted out for better description of the movement of the cryptocurrency market, their variability is also explored in this article. Sentiment is studied using Stanford Core NLP. The result of statistical analysis is relevant to feature selection for further binary or multiclass classification tasks. The presented methods of text analysis contribute to the increase of the accuracy of models designed to solve natural language processing problems by selecting words, improving the quality of vectorization. Such algorithms are often used in automated trading strategies to predict the price of an asset, the trend of its movement.

Keywords: text analysis, natural language processing, Twitter activity, frequency analysis, feature selection, classification problem, financial markets
Citation in English: Makarov I.S., Bagantsova E.R., Iashin P.A., Kovaleva M.D., Gorbachev R.A. Development of and research on an algorithm for distinguishing features in Twitter publications for a classification problem with known markup // Computer Research and Modeling, 2023, vol. 15, no. 1, pp. 171-183
Citation in English: Makarov I.S., Bagantsova E.R., Iashin P.A., Kovaleva M.D., Gorbachev R.A. Development of and research on an algorithm for distinguishing features in Twitter publications for a classification problem with known markup // Computer Research and Modeling, 2023, vol. 15, no. 1, pp. 171-183
DOI: 10.20537/2076-7633-2023-15-1-171-183

Indexed in Scopus

Full-text version of the journal is also available on the web site of the scientific electronic library eLIBRARY.RU

The journal is included in the Russian Science Citation Index

The journal is included in the RSCI

International Interdisciplinary Conference "Mathematics. Computing. Education"