Результаты поиска по 'reinforcement learning':
Найдено статей: 8
  1. Editor’s note
    Computer Research and Modeling, 2023, v. 15, no. 2, pp. 229-233
  2. Vostrikov D.D., Konin G.O., Lobanov A.V., Matyukhin V.V.
    Influence of the mantissa finiteness on the accuracy of gradient-free optimization methods
    Computer Research and Modeling, 2023, v. 15, no. 2, pp. 259-280

    Gradient-free optimization methods or zeroth-order methods are widely used in training neural networks, reinforcement learning, as well as in industrial tasks where only the values of a function at a point are available (working with non-analytical functions). In particular, the method of error back propagation in PyTorch works exactly on this principle. There is a well-known fact that computer calculations use heuristics of floating-point numbers, and because of this, the problem of finiteness of the mantissa arises.

    In this paper, firstly, we reviewed the most popular methods of gradient approximation: Finite forward/central difference (FFD/FCD), Forward/Central wise component (FWC/CWC), Forward/Central randomization on $l_2$ sphere (FSSG2/CFFG2); secondly, we described current theoretical representations of the noise introduced by the inaccuracy of calculating the function at a point: adversarial noise, random noise; thirdly, we conducted a series of experiments on frequently encountered classes of problems, such as quadratic problem, logistic regression, SVM, to try to determine whether the real nature of machine noise corresponds to the existing theory. It turned out that in reality (at least for those classes of problems that were considered in this paper), machine noise turned out to be something between adversarial noise and random, and therefore the current theory about the influence of the mantissa limb on the search for the optimum in gradient-free optimization problems requires some adjustment.

  3. Gladin E.L., Borodich E.D.
    Variance reduction for minimax problems with a small dimension of one of the variables
    Computer Research and Modeling, 2022, v. 14, no. 2, pp. 257-275

    The paper is devoted to convex-concave saddle point problems where the objective is a sum of a large number of functions. Such problems attract considerable attention of the mathematical community due to the variety of applications in machine learning, including adversarial learning, adversarial attacks and robust reinforcement learning, to name a few. The individual functions in the sum usually represent losses related to examples from a data set. Additionally, the formulation admits a possibly nonsmooth composite term. Such terms often reflect regularization in machine learning problems. We assume that the dimension of one of the variable groups is relatively small (about a hundred or less), and the other one is large. This case arises, for example, when one considers the dual formulation for a minimization problem with a moderate number of constraints. The proposed approach is based on using Vaidya’s cutting plane method to minimize with respect to the outer block of variables. This optimization algorithm is especially effective when the dimension of the problem is not very large. An inexact oracle for Vaidya’s method is calculated via an approximate solution of the inner maximization problem, which is solved by the accelerated variance reduced algorithm Katyusha. Thus, we leverage the structure of the problem to achieve fast convergence. Separate complexity bounds for gradients of different components with respect to different variables are obtained in the study. The proposed approach is imposing very mild assumptions about the objective. In particular, neither strong convexity nor smoothness is required with respect to the low-dimensional variable group. The number of steps of the proposed algorithm as well as the arithmetic complexity of each step explicitly depend on the dimensionality of the outer variable, hence the assumption that it is relatively small.

  4. Kazorin V.I., Kholodov Y.A.
    Framework sumo-atclib for adaptive traffic control modeling
    Computer Research and Modeling, 2024, v. 16, no. 1, pp. 69-78

    This article proposes the sumo-atclib framework, which provides a convenient uniform interface for testing adaptive control algorithms with different limitations, for example, restrictions on phase durations, phase sequences, restrictions on the minimum time between control actions, which uses the open source microscopic transport modeling environment SUMO. The framework shares the functionality of controllers (class TrafficController) and a monitoring and detection system (class StateObserver), which repeats the architecture of real traffic light objects and adaptive control systems and simplifies the testing of new algorithms, since combinations of different controllers and vehicle detection systems can be freely varied. Also, unlike most existing solutions, the road class Road has been added, which combines a set of lanes, this allows, for example, to determine the adjacency of regulated intersections, in cases when the number of lanes changes on the way from one intersection to another, and therefore the road graph is divided into several edges. At the same time, the algorithms themselves use the same interface and are abstracted from the specific parameters of the detectors, network topologies, that is, it is assumed that this solution will allow the transport engineer to test ready-made algorithms for a new scenario, without the need to adapt them to new conditions, which speeds up the development process of the control system, and reduces design overhead. At the moment, the package contains examples of MaxPressure algorithms and the Q-learning reinforcement learning method, the database of examples is also being updated. The framework also includes a set of SUMO scripts for testing algorithms, which includes both synthetic maps and well-verified SUMO scripts such as Cologne and Ingolstadt. In addition, the framework provides a set of automatically calculated metrics, such as total travel time, delay time, average speed; the framework also provides a ready-made example for visualization of metrics.

  5. Rudenko V.D., Yudin N.E., Vasin A.A.
    Survey of convex optimization of Markov decision processes
    Computer Research and Modeling, 2023, v. 15, no. 2, pp. 329-353

    This article reviews both historical achievements and modern results in the field of Markov Decision Process (MDP) and convex optimization. This review is the first attempt to cover the field of reinforcement learning in Russian in the context of convex optimization. The fundamental Bellman equation and the criteria of optimality of policy — strategies based on it, which make decisions based on the known state of the environment at the moment, are considered. The main iterative algorithms of policy optimization based on the solution of the Bellman equations are also considered. An important section of this article was the consideration of an alternative to the $Q$-learning approach — the method of direct maximization of the agent’s average reward for the chosen strategy from interaction with the environment. Thus, the solution of this convex optimization problem can be represented as a linear programming problem. The paper demonstrates how the convex optimization apparatus is used to solve the problem of Reinforcement Learning (RL). In particular, it is shown how the concept of strong duality allows us to naturally modify the formulation of the RL problem, showing the equivalence between maximizing the agent’s reward and finding his optimal strategy. The paper also discusses the complexity of MDP optimization with respect to the number of state–action–reward triples obtained as a result of interaction with the environment. The optimal limits of the MDP solution complexity are presented in the case of an ergodic process with an infinite horizon, as well as in the case of a non-stationary process with a finite horizon, which can be restarted several times in a row or immediately run in parallel in several threads. The review also reviews the latest results on reducing the gap between the lower and upper estimates of the complexity of MDP optimization with average remuneration (Averaged MDP, AMDP). In conclusion, the real-valued parametrization of agent policy and a class of gradient optimization methods through maximizing the $Q$-function of value are considered. In particular, a special class of MDPs with restrictions on the value of policy (Constrained Markov Decision Process, CMDP) is presented, for which a general direct-dual approach to optimization with strong duality is proposed.

  6. Salenek I.A., Seliverstov Y.A., Seliverstov S.A., Sofronova E.A.
    Improving the quality of route generation in SUMO based on data from detectors using reinforcement learning
    Computer Research and Modeling, 2024, v. 16, no. 1, pp. 137-146

    This work provides a new approach for constructing high-precision routes based on data from transport detectors inside the SUMO traffic modeling package. Existing tools such as flowrouter and routeSampler have a number of disadvantages, such as the lack of interaction with the network in the process of building routes. Our rlRouter uses multi-agent reinforcement learning (MARL), where the agents are incoming lanes and the environment is the road network. By performing actions to launch vehicles, agents receive a reward for matching data from transport detectors. Parameter Sharing DQN with the LSTM backbone of the Q-function was used as an algorithm for multi-agent reinforcement learning.

    Since the rlRouter is trained inside the SUMO simulation, it can restore routes better by taking into account the interaction of vehicles within the network with each other and with the network infrastructure. We have modeled diverse traffic situations on three different junctions in order to compare the performance of SUMO’s routers with the rlRouter. We used Mean Absoluter Error (MAE) as the measure of the deviation from both cumulative detectors and routes data. The rlRouter achieved the highest compliance with the data from the detectors. We also found that by maximizing the reward for matching detectors, the resulting routes also get closer to the real ones. Despite the fact that the routes recovered using rlRouter are superior to the routes obtained using SUMO tools, they do not fully correspond to the real ones, due to the natural limitations of induction-loop detectors. To achieve more plausible routes, it is necessary to equip junctions with other types of transport counters, for example, camera detectors.

  7. Chen J., Lobanov A.V., Rogozin A.V.
    Nonsmooth Distributed Min-Max Optimization Using the Smoothing Technique
    Computer Research and Modeling, 2023, v. 15, no. 2, pp. 469-480

    Distributed saddle point problems (SPPs) have numerous applications in optimization, matrix games and machine learning. For example, the training of generated adversarial networks is represented as a min-max optimization problem, and training regularized linear models can be reformulated as an SPP as well. This paper studies distributed nonsmooth SPPs with Lipschitz-continuous objective functions. The objective function is represented as a sum of several components that are distributed between groups of computational nodes. The nodes, or agents, exchange information through some communication network that may be centralized or decentralized. A centralized network has a universal information aggregator (a server, or master node) that directly communicates to each of the agents and therefore can coordinate the optimization process. In a decentralized network, all the nodes are equal, the server node is not present, and each agent only communicates to its immediate neighbors.

    We assume that each of the nodes locally holds its objective and can compute its value at given points, i. e. has access to zero-order oracle. Zero-order information is used when the gradient of the function is costly, not possible to compute or when the function is not differentiable. For example, in reinforcement learning one needs to generate a trajectory to evaluate the current policy. This policy evaluation process can be interpreted as the computation of the function value. We propose an approach that uses a smoothing technique, i. e., applies a first-order method to the smoothed version of the initial function. It can be shown that the stochastic gradient of the smoothed function can be viewed as a random two-point gradient approximation of the initial function. Smoothing approaches have been studied for distributed zero-order minimization, and our paper generalizes the smoothing technique on SPPs.

  8. Chuvilin K.V.
    The use of syntax trees in order to automate the correction of LaTeX documents
    Computer Research and Modeling, 2012, v. 4, no. 4, pp. 871-883

    The problem is to automate the correction of LaTeX documents. Each document is represented as a parse tree. The modified Zhang-Shasha algorithm is used to construct a mapping of tree vertices of the original document to the tree vertices of the edited document, which corresponds to the minimum editing distance. Vertex to vertex maps form the training set, which is used to generate rules for automatic correction. The statistics of the applicability to the edited documents is collected for each rule. It is used for quality assessment and improvement of the rules.

    Citations: 5 (RSCI).

Indexed in Scopus

Full-text version of the journal is also available on the web site of the scientific electronic library eLIBRARY.RU

The journal is included in the Russian Science Citation Index

The journal is included in the RSCI

International Interdisciplinary Conference "Mathematics. Computing. Education"