Volume 3, Issue 3, September 2019, Page: 62-71
MLPV: Text Representation of Scientific Papers Based on Structural Information and Doc2vec
Yonghe Lu, School of Information Management, Sun Yat-sen University, Guangzhou, China
Yuanyuan Zhai, School of Information Management, Sun Yat-sen University, Guangzhou, China
Jiayi Luo, School of Information Management, Sun Yat-sen University, Guangzhou, China
Yongshan Chen, School of Information Management, Sun Yat-sen University, Guangzhou, China
Received: Jul. 15, 2019; Accepted: Aug. 12, 2019; Published: Aug. 28, 2019
DOI: 10.11648/j.ajist.20190303.12
Text representation is the key to text processing. Scientific papers have significant structural features. Their internal components, mainly titles, abstracts, keywords and main texts, carry different degrees of importance. The external structural features of scientific papers, such as topics and authors, are also valuable for analysis. However, most traditional methods for analyzing scientific papers rely on keyword co-occurrence and citation links, which capture only part of the available information. Research on the textual information and external structural information of scientific papers is lacking, which has made it difficult to explore their inherent regularities in depth. This paper therefore proposes the Multi-Layers Paragraph Vector (MLPV), a text representation method for scientific papers based on Doc2vec and on both the internal and external structural information of the papers, and constructs five text representation models: PV-NO, PV-TOP, PV-TAKM, MLPV and MLPV-PSO. The results show that the MLPV model performs much better than the PV-NO, PV-TOP and PV-TAKM models. Its accuracy is both higher and more stable, averaging 91.71%, which supports its validity. Building on the MLPV model, the optimized MLPV-PSO model achieves an accuracy 3.33% higher than that of the MLPV model, which demonstrates the effectiveness of the optimization algorithm.
MLPV Model, Scientific Papers, Text Representation, Doc2vec, Structural Features
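The abstract does not state how the vectors of the individual components are combined. As a minimal sketch of the general idea that components such as title, abstract, keywords and main text carry different importance, the Python example below forms a single document vector as a weighted average of per-component vectors. The function names, the weights and the toy vectors are all hypothetical; in practice the component vectors would come from a Doc2vec model, and the weights are the kind of parameter an optimizer could tune. This is an illustration under those assumptions, not the authors' implementation.

```python
from math import sqrt


def combine_components(vectors, weights):
    """Weighted average of per-component vectors (hypothetical helper).

    vectors: dict mapping a component name to a list of floats
    weights: dict mapping the same names to non-negative importance weights
    """
    total = sum(weights[name] for name in vectors)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(next(iter(vectors.values())))
    combined = [0.0] * dim
    for name, vec in vectors.items():
        if len(vec) != dim:
            raise ValueError("all component vectors must share one dimension")
        share = weights[name] / total  # normalised importance of this component
        for i, value in enumerate(vec):
            combined[i] += share * value
    return combined


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


# Toy 2-dimensional vectors standing in for Doc2vec output (assumed values).
paper = {"title": [1.0, 0.0], "abstract": [0.0, 1.0]}
# Assumed weights: the title counts three times as much as the abstract.
importance = {"title": 3.0, "abstract": 1.0}

doc_vector = combine_components(paper, importance)
```

The combined vector can then be compared with other papers through `cosine`, or passed to a classifier; changing the weights changes how strongly each structural component influences the representation.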
To cite this article
Yonghe Lu, Yuanyuan Zhai, Jiayi Luo, Yongshan Chen, MLPV: Text Representation of Scientific Papers Based on Structural Information and Doc2vec, American Journal of Information Science and Technology. Vol. 3, No. 3, 2019, pp. 62-71. doi: 10.11648/j.ajist.20190303.12
Copyright © 2019 Authors retain the copyright of this article.
This article is an open access article distributed under the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.