Automated citation graph building from a corpora of scientific documents


This paper treats the automated construction of a citation graph from a collection of scientific documents as a sequence of machine learning tasks. It describes the overall data processing pipeline, which consists of six stages: preprocessing, metainformation extraction, extraction of bibliography lists, splitting of bibliography lists into separate bibliography records, standardization of each bibliography record, and record linkage. The paper surveys approaches and algorithms suitable for each stage, motivates the choice of the best combination of algorithms, and adapts some of them to multilingual bibliography processing. For some of the tasks, new algorithms and heuristics are proposed and evaluated on mixed English and Russian document corpora.
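
To make two of the six stages concrete, here is a minimal Python sketch of splitting a bibliography list into records and of record linkage. It is an illustration under simple assumptions, not the algorithms proposed in the paper: records are assumed to start with a marker such as "1." or "[1]", and two records are linked when the Jaccard similarity of their word sets reaches a threshold. The names split_records, jaccard, link_records and the threshold value 0.6 are hypothetical.

```python
import re


def split_records(bibliography: str) -> list[str]:
    """Split a bibliography list into records.

    Assumption: each record starts on a new line with a marker
    such as '1.' or '[1]'. Real bibliographies need richer rules.
    """
    parts = re.split(r"(?m)^\s*(?:\[\d+\]|\d+\.)\s+", bibliography)
    # Collapse internal whitespace and drop empty fragments.
    return [" ".join(p.split()) for p in parts if p.strip()]


def tokens(record: str) -> set[str]:
    """Lowercased word tokens; \\w is Unicode-aware in Python 3,
    so Latin and Cyrillic text are both tokenized."""
    return set(re.findall(r"\w+", record.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the token sets of two records."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def link_records(records: list[str], threshold: float = 0.6) -> list[tuple[int, int]]:
    """Return index pairs of records judged to cite the same work.

    Assumption: a fixed similarity threshold stands in for a
    trained matching model.
    """
    edges = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if jaccard(records[i], records[j]) >= threshold:
                edges.append((i, j))
    return edges
```

The pairwise comparison is quadratic in the number of records, so a sketch like this only suits small collections; a larger corpus would need some form of blocking or indexing before comparison.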

Keywords: text mining, machine learning, information extraction, citation graph, bibliography, matching, record linkage, labeling, segmentation, conditional random fields
Citation in English: Polezhaev V.A. Automated citation graph building from a corpora of scientific documents // Computer Research and Modeling, 2012, vol. 4, no. 4, pp. 707-719
DOI: 10.20537/2076-7633-2012-4-4-707-719

Full-text version of the journal is also available on the web site of the scientific electronic library eLIBRARY.RU

The journal is included in the Russian Science Citation Index

The journal is included in the List of Russian peer-reviewed journals publishing the main research results of PhD and doctoral dissertations.



Indexed in Scopus