Quantum-inspired episode selection for Monte Carlo reinforcement learning via QUBO optimization


Monte Carlo (MC) reinforcement learning suffers from high sample complexity, especially in environments with sparse rewards, large state spaces, and strongly correlated trajectories that reduce the statistical efficiency of return estimation. These well-known limitations often lead to slow convergence and unstable learning dynamics, particularly in settings where only a small fraction of collected trajectories is actually informative for policy improvement. A key challenge is therefore to identify a compact yet diverse subset of episodes that contributes most to the accuracy of value estimates while preserving sufficient exploration of the environment. To address this challenge, we reformulate episode selection as a Quadratic Unconstrained Binary Optimization (QUBO) problem and solve it using quantum-inspired sampling techniques. Our method, MC+QUBO, inserts a combinatorial filtering step into the standard MC policy-evaluation pipeline: given a batch of trajectories, it selects a subset that maximizes cumulative reward and encourages broad state-space coverage. This selection procedure is expressed as a QUBO model, where linear terms favor high-return episodes, quadratic terms penalize redundancy between trajectories, and additional coupling terms can be used to enforce coverage-related constraints or promote structural diversity. Within this framework, we investigate two black-box QUBO solvers: Simulated Quantum Annealing (SQA), which emulates tunneling-based exploration of the search landscape, and Simulated Bifurcation (SB), a dynamical-systems-based iterative optimization method. Both solvers demonstrate the ability to efficiently navigate the combinatorial structure of the trajectory-selection problem and to handle batch sizes that are otherwise computationally expensive for exhaustive or deterministic search.
Experiments in a finite-horizon GridWorld environment show that MC+QUBO consistently outperforms vanilla MC in convergence speed, stability of return estimates, and final policy quality. These results highlight the promise of quantum-inspired optimization as a practical decision-making subroutine within reinforcement-learning algorithms, offering a scalable way to improve sample efficiency without modifying the underlying learning paradigm.

Keywords: Monte Carlo method, quantum annealing, quantum computation, reinforcement learning, QUBO
Citation in English: Kholodov Y.A., Salloum H., Jnadi A., Khubiev K.Yu., Petrenko A. Quantum-inspired episode selection for Monte Carlo reinforcement learning via QUBO optimization // Computer Research and Modeling, 2026, vol. 18, no. 2, pp. 273-288
DOI: 10.20537/2076-7633-2026-18-2-273-288

Copyright © 2026 Kholodov Y.A., Salloum H., Jnadi A., Khubiev K.Yu., Petrenko A.
