Search results for 'reinforcement':
Articles found: 16
  1. Rudenko V.D., Yudin N.E., Vasin A.A.
    Survey of convex optimization of Markov decision processes
    Computer Research and Modeling, 2023, v. 15, no. 2, pp. 329-353

    This article reviews both historical achievements and modern results in the field of Markov Decision Processes (MDP) and convex optimization. This review is the first attempt to cover the field of reinforcement learning in Russian in the context of convex optimization. It considers the fundamental Bellman equation and the optimality criteria for policies based on it, i.e., strategies that make decisions using the currently known state of the environment. The main iterative algorithms of policy optimization based on solving the Bellman equations are also considered. An important part of the article is devoted to an alternative to the $Q$-learning approach: direct maximization of the agent's average reward, under the chosen strategy, from interaction with the environment. The resulting convex optimization problem can be stated as a linear programming problem. The paper demonstrates how the apparatus of convex optimization is applied to the Reinforcement Learning (RL) problem. In particular, it is shown how the concept of strong duality allows a natural modification of the RL problem statement, establishing the equivalence between maximizing the agent's reward and finding its optimal strategy. The paper also discusses the complexity of MDP optimization with respect to the number of state–action–reward triples obtained from interaction with the environment. Optimal bounds on the complexity of solving an MDP are presented for an ergodic process with an infinite horizon, as well as for a non-stationary process with a finite horizon that can be restarted several times in a row or run in parallel in several threads. The review also covers the latest results on reducing the gap between the lower and upper complexity bounds for optimizing MDPs with average reward (Average-reward MDP, AMDP). In conclusion, real-valued parametrization of the agent's policy is considered, together with a class of gradient optimization methods based on maximizing the $Q$-value function. In particular, a special class of MDPs with constraints on the policy value (Constrained Markov Decision Process, CMDP) is presented, for which a general primal-dual optimization approach with strong duality is proposed.
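
    For reference, the two central objects mentioned above, written in standard notation (ours, not necessarily the paper's): the Bellman optimality equation for the discounted value function,

    $$V^*(s) = \max_{a \in A}\Big[r(s,a) + \gamma \sum_{s' \in S} P(s' \mid s,a)\, V^*(s')\Big],$$

    and the equivalent linear program over occupation measures $\mu(s,a) \ge 0$,

    $$\max_{\mu \ge 0}\ \sum_{s,a} \mu(s,a)\, r(s,a) \quad \text{s.t.} \quad \sum_{a} \mu(s',a) = (1-\gamma)\,\rho(s') + \gamma \sum_{s,a} P(s' \mid s,a)\,\mu(s,a) \ \ \forall s',$$

    where $\rho$ is the initial state distribution and an optimal policy is recovered as $\pi(a \mid s) \propto \mu(s,a)$. Strong LP duality is what ties maximizing the reward to finding the optimal strategy.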

  2. Lyubimov A.K., Kozhanov D.A.
    Modeling a structural element of flexible woven composites under static tension using the finite element method in ANSYS
    Computer Research and Modeling, 2016, v. 8, no. 1, pp. 113-120

    The article gives an example of finite-element modeling of a structural element of flexible woven composites. The reinforcing fabric is a plain weave of threads assembled into bundles. The threads are represented by an elastic material. The matrix of the material is a soft polymer that admits irreversible deformations. The possibility of damage arising in the material structure under high loads is taken into account. A detailed deformation diagram under uniaxial tension is constructed. The accuracy of the model is confirmed by in situ experiments.

    Views (last year): 1. Citations: 7 (RSCI).
  3. Aksenov A.A., Zhluktov S.V., Kashirin V.S., Sazonova M.L., Cherny S.G., Drozdova E.A., Rode A.A.
    Numerical modeling of raw material atomization and vaporization by the flow of heat carrier gas in furnace technical carbon production in FlowVision
    Computer Research and Modeling, 2023, v. 15, no. 4, pp. 921-939

    Technical carbon (soot) is a product obtained by thermal decomposition (pyrolysis) of hydrocarbons (usually oil) in a stream of heat carrier gas. Technical carbon is widely used as a reinforcing component in the production of rubbers and plastics; tire production consumes 70% of all carbon produced. In furnace carbon production, the liquid hydrocarbon feedstock is injected through nozzles into the stream of natural gas combustion products. The raw material is atomized and vaporized, with subsequent pyrolysis. It is important that the raw material evaporates completely before the pyrolysis process starts; otherwise coke is formed, contaminating the product. Improving the carbon production technology, in particular providing complete evaporation of the raw material prior to pyrolysis, is impossible without mathematical modeling of the process itself. Mathematical modeling is the most important way to obtain complete and detailed information about the peculiarities of reactor operation.

    A three-dimensional mathematical model and a calculation method for raw material atomization and evaporation in the heat carrier gas flow are developed in the FlowVision software package. Water is selected as the raw material for working out the modeling technique. The working substances in the reactor chamber are the combustion products of natural gas. The motion and evaporation of raw material droplets in the gas stream are modeled within the Eulerian approach to the interaction between dispersed and continuous media. Simulation results for raw material atomization and evaporation in a real reactor for technical carbon production are presented. The numerical method makes it possible to determine an important atomization characteristic, the Sauter mean diameter, which can be obtained from the droplet size distribution of the raw material at each moment of spray formation.
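
    For reference, the Sauter mean diameter $D_{32}$ is the standard volume-to-surface mean of a droplet sample, $D_{32} = \sum_i d_i^3 / \sum_i d_i^2$ for sampled droplet diameters $d_i$. A minimal Python sketch with a hypothetical droplet sample (illustrative values, not data from the paper):

    def sauter_mean_diameter(diameters):
        # D32 = sum(d^3) / sum(d^2): the diameter of a droplet whose
        # volume-to-surface ratio matches that of the whole spray.
        num = sum(d ** 3 for d in diameters)
        den = sum(d ** 2 for d in diameters)
        return num / den

    # Illustrative diameters in micrometers:
    print(sauter_mean_diameter([40.0, 55.0, 70.0, 90.0]))  # ~73.9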

  4. Salenek I.A., Seliverstov Y.A., Seliverstov S.A., Sofronova E.A.
    Improving the quality of route generation in SUMO based on data from detectors using reinforcement learning
    Computer Research and Modeling, 2024, v. 16, no. 1, pp. 137-146

    This work provides a new approach for constructing high-precision routes based on data from transport detectors inside the SUMO traffic modeling package. Existing tools such as flowrouter and routeSampler have a number of disadvantages, such as the lack of interaction with the network in the process of building routes. Our rlRouter uses multi-agent reinforcement learning (MARL), where the agents are incoming lanes and the environment is the road network. By performing actions that launch vehicles, agents receive a reward for matching the data from transport detectors. Parameter Sharing DQN with an LSTM backbone for the Q-function was used as the multi-agent reinforcement learning algorithm; a sketch of this architecture is given below.
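
    A minimal PyTorch sketch of a parameter-shared DQN with an LSTM backbone, as named above; the observation size, action set and network widths are our illustrative assumptions, not the authors' configuration:

    import torch
    import torch.nn as nn

    class SharedLSTMQNet(nn.Module):
        # One Q-network shared by all agents (incoming lanes).
        def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
            super().__init__()
            self.encoder = nn.Linear(obs_dim, hidden)
            self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
            self.q_head = nn.Linear(hidden, n_actions)

        def forward(self, obs_seq, state=None):
            # obs_seq: (batch, time, obs_dim) -- one agent's recent observations
            x = torch.relu(self.encoder(obs_seq))
            x, state = self.lstm(x, state)
            return self.q_head(x[:, -1]), state  # Q-values after the last step

    # Every agent queries the same network; an agent id can be appended to
    # the observation so the shared policy can still specialize per lane.
    net = SharedLSTMQNet(obs_dim=8, n_actions=2)   # e.g., {wait, launch vehicle}
    q, _ = net(torch.randn(4, 10, 8))              # 4 agents, 10-step histories
    actions = q.argmax(dim=-1)

    Parameter sharing keeps a single set of weights for all agents, which both shrinks the number of trainable parameters and lets experience from every lane update the same Q-function.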

    Since the rlRouter is trained inside the SUMO simulation, it can restore routes better by taking into account the interaction of vehicles with each other and with the network infrastructure. We have modeled diverse traffic situations on three different junctions in order to compare the performance of SUMO's routers with the rlRouter. We used the Mean Absolute Error (MAE) as the measure of deviation from both cumulative detector data and route data. The rlRouter achieved the highest compliance with the data from the detectors. We also found that by maximizing the reward for matching the detectors, the resulting routes also get closer to the real ones. Although the routes recovered using rlRouter are superior to the routes obtained using SUMO tools, they do not fully correspond to the real ones, due to the natural limitations of induction-loop detectors. To achieve more plausible routes, it is necessary to equip junctions with other types of transport counters, for example, camera detectors.

  5. Chen J., Lobanov A.V., Rogozin A.V.
    Nonsmooth Distributed Min-Max Optimization Using the Smoothing Technique
    Computer Research and Modeling, 2023, v. 15, no. 2, pp. 469-480

    Distributed saddle point problems (SPPs) have numerous applications in optimization, matrix games and machine learning. For example, the training of generative adversarial networks is represented as a min-max optimization problem, and training regularized linear models can be reformulated as an SPP as well. This paper studies distributed nonsmooth SPPs with Lipschitz-continuous objective functions. The objective function is represented as a sum of several components that are distributed between groups of computational nodes. The nodes, or agents, exchange information through some communication network that may be centralized or decentralized. A centralized network has a universal information aggregator (a server, or master node) that directly communicates with each of the agents and therefore can coordinate the optimization process. In a decentralized network, all the nodes are equal, the server node is not present, and each agent only communicates with its immediate neighbors.

    We assume that each of the nodes locally holds its objective and can compute its value at given points, i.e., it has access to a zero-order oracle. Zero-order information is used when the gradient of the function is costly or impossible to compute, or when the function is not differentiable. For example, in reinforcement learning one needs to generate a trajectory to evaluate the current policy; this policy evaluation process can be interpreted as the computation of the function value. We propose an approach that uses a smoothing technique, i.e., applies a first-order method to a smoothed version of the initial function. It can be shown that the stochastic gradient of the smoothed function can be viewed as a random two-point gradient approximation of the initial function. Smoothing approaches have been studied for distributed zero-order minimization, and our paper generalizes the smoothing technique to SPPs.
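
    For reference, the standard form of the smoothing construction mentioned above (common notation, not taken from the paper): for a Lipschitz function $f$ on $\mathbb{R}^d$, the smoothed version is

    $$\hat f_\tau(x) = \mathbb{E}_{u \sim U(B^d)}\big[f(x + \tau u)\big],$$

    where $B^d$ is the unit ball, and the two-point estimator with a direction $e$ drawn uniformly from the unit sphere,

    $$g(x, e) = \frac{d}{2\tau}\big(f(x + \tau e) - f(x - \tau e)\big)\, e,$$

    is an unbiased estimate of $\nabla \hat f_\tau(x)$: two function evaluations per step stand in for one stochastic gradient, which is exactly how a first-order method can be run with only a zero-order oracle.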

  6. Chuvilin K.V.
    The use of syntax trees in order to automate the correction of LaTeX documents
    Computer Research and Modeling, 2012, v. 4, no. 4, pp. 871-883

    The problem is to automate the correction of LaTeX documents. Each document is represented as a parse tree. A modified Zhang-Shasha algorithm is used to construct a mapping of the tree vertices of the original document to the tree vertices of the edited document that corresponds to the minimum editing distance. The vertex-to-vertex maps form a training set, which is used to generate rules for automatic correction. Applicability statistics on edited documents are collected for each rule and used for quality assessment and improvement of the rules; a sketch of this last step is given below.
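
    A minimal Python sketch of the rule-statistics step, under our own assumptions about the data layout (label pairs extracted from the minimum-edit-distance mapping; not the author's code):

    from collections import Counter

    def extract_rules(vertex_pairs):
        # vertex_pairs: (original_label, edited_label) tuples taken from
        # the Zhang-Shasha vertex-to-vertex mapping of the two parse trees.
        rules = Counter()
        for src, dst in vertex_pairs:
            if src != dst:              # only genuine corrections become rules
                rules[(src, dst)] += 1  # candidate rule: rewrite src -> dst
        return rules

    # Hypothetical mapping fragment from a pair of LaTeX parse trees:
    pairs = [('--', '---'), ('...', '\\dots'), ('--', '---'), ('x', 'x')]
    for (src, dst), n in extract_rules(pairs).most_common():
        print(f'{src} -> {dst}: seen {n} time(s)')

    The more often a rule is confirmed across edited documents, the more evidence there is for applying it automatically; rarely confirmed rules can be demoted or dropped.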

    Citations: 5 (RSCI).