Fast and accurate x86 disassembly using a graph convolutional network model

 pdf (1670K)

Disassembly of stripped x86 binaries is an important yet non-trivial task. Disassembly is difficult to perform correctly without debug information, especially on x86 architecture, which has variablesized instructions interleaved with data. Moreover, the presence of indirect jumps in binary code adds another layer of complexity. Indirect jumps impede the ability of recursive traversal, a common disassembly technique, to successfully identify all instructions within the code. Consequently, disassembling such code becomes even more intricate and demanding, further highlighting the challenges faced in this field. Many tools, including commercial ones such as IDA Pro, struggle with accurate x86 disassembly. As such, there has been some interest in developing a better solution using machine learning (ML) techniques. ML can potentially capture underlying compiler-independent patterns inherent for the compiler-generated assembly. Researchers in this area have shown that it is possible for ML approaches to outperform the classical tools. They also can be less timeconsuming to develop compared to manual heuristics, shifting most of the burden onto collecting a big representative dataset of executables with debug information. Following this line of work, we propose an improvement of an existing RGCN-based architecture, which builds control and flow graph on superset disassembly. The enhancement comes from augmenting the graph with data flow information. In particular, in the embedding we add Jump Control Flow and Register Dependency edges, inspired by Probabilistic Disassembly. We also create an open-source x86 instruction identification dataset, based on a combination of ByteWeight dataset and a selection open-source Debian packages. Compared to IDA Pro, a state of the art commercial tool, our approach yields better accuracy, while maintaining great performance on our benchmarks. It also fares well against existing machine learning approaches such as DeepDi.

Keywords: disassembly, machine learning, graph neural network, x86
Citation in English: Strygin N.A., Kudasov N.D. Fast and accurate x86 disassembly using a graph convolutional network model // Computer Research and Modeling, 2024, vol. 16, no. 7, pp. 1779-1792
Citation in English: Strygin N.A., Kudasov N.D. Fast and accurate x86 disassembly using a graph convolutional network model // Computer Research and Modeling, 2024, vol. 16, no. 7, pp. 1779-1792
DOI: 10.20537/2076-7633-2024-16-7-1779-1792

Indexed in Scopus

Full-text version of the journal is also available on the web site of the scientific electronic library eLIBRARY.RU

The journal is included in the Russian Science Citation Index

The journal is included in the RSCI

International Interdisciplinary Conference "Mathematics. Computing. Education"