Automatic Keyword Extraction from Persian Short Texts Using word2vec

Article type: Research article

Authors

1 PhD student in Artificial Intelligence, Amirkabir University of Technology

2 Assistant Professor, Amirkabir University of Technology

Abstract

With the ever-growing volume of electronic documents and texts in Persian, fast and inexpensive methods for reaching the desired texts within this large collection are becoming more important. Extracting keywords that express the main theme of a text is a highly effective way to reach this goal. The number of times a word is repeated in a text cannot, on its own, indicate the word's importance or whether it is a keyword. Moreover, most keyword extraction methods ignore the concept and meaning of the text. In addition, the unstructured nature of new texts in news and electronic documents makes these words difficult to extract. This paper proposes an automatic, unsupervised method for extracting these words from Persian text that lacks a proper structure. The method not only considers the probability of a word occurring in the text and its number of repetitions, but also captures the concept and meaning of the text by training a word2vec model on it. In the proposed method, which combines a statistical model with a machine learning model, word2vec is first trained on the text and the words that lie at a small distance from the other words are extracted; a statistical relation based on co-occurrence and frequency is then proposed for computing a score. Finally, a threshold is applied and the words with higher scores are taken as keywords. The evaluations show an F-measure of 53.92% for the method, an 11% increase over other keyword extraction methods.
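The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the scoring relation, so the score used here (frequency multiplied by the number of distinct co-occurring words, normalized by the maximum) is a placeholder, and the function name `extract_keywords`, its parameters, and the toy two-dimensional vectors are all hypothetical. In practice the vectors would come from a word2vec model trained on the text.

```python
import math
from collections import Counter, defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def extract_keywords(tokens, vectors, top_candidates=4, window=2, threshold=0.5):
    """Hypothetical two-stage extractor: semantic filter, then statistical score."""
    vocab = sorted(set(tokens) & set(vectors))

    # Stage 1 (semantic filter): keep the words whose mean cosine similarity
    # to all other words is highest, i.e. whose distance to the rest is small.
    def centrality(word):
        others = [o for o in vocab if o != word]
        if not others:
            return 0.0
        return sum(cosine(vectors[word], vectors[o]) for o in others) / len(others)

    candidates = sorted(vocab, key=centrality, reverse=True)[:top_candidates]

    # Stage 2 (statistical score): frequency and co-occurrence within a window.
    freq = Counter(tokens)
    neighbours = defaultdict(set)
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        neighbours[word].update(t for t in tokens[lo:hi] if t != word)

    # Placeholder score (an assumption, not the paper's equation).
    raw = {w: freq[w] * len(neighbours[w]) for w in candidates}
    best = max(raw.values(), default=0)
    if best == 0:
        return []

    # Keep candidates whose normalized score reaches the threshold.
    ranked = sorted(candidates, key=lambda w: raw[w], reverse=True)
    return [w for w in ranked if raw[w] / best >= threshold]


# Toy input: made-up tokens and 2-D vectors standing in for trained embeddings.
tokens = ["economy", "bank", "inflation", "bank", "rate", "weather", "bank", "inflation"]
vectors = {
    "economy": (1.0, 0.1),
    "bank": (0.9, 0.2),
    "inflation": (0.8, 0.3),
    "rate": (0.7, 0.2),
    "weather": (-0.1, 1.0),
}
print(extract_keywords(tokens, vectors))
```

The semantic filter runs before the statistical score so that a frequent but semantically isolated word (here, `weather`) never reaches the scoring stage.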


Article title [English]

Automatic Keyword Extraction from Persian Short Text Using word2vec

Authors [English]

  • O. Hajipoor 1
  • S. S. Sadidpour 2
1 Malek-Ashtar University of Technology, Tehran, Iran
2 Malek-Ashtar University of Technology, Tehran, Iran
Abstract [English]

With the growing number of Persian electronic documents and texts, quick and inexpensive methods for retrieving the desired texts from this extensive collection are becoming more important. One effective technique is to extract keywords that represent the main concept of the text. The frequency of a word in the text is not, on its own, a reliable indication of its significance or of whether it is a keyword. Moreover, most keyword extraction methods ignore the concepts and semantics of the text. In addition, the unstructured nature of new texts in news and electronic documents makes these words difficult to extract. This paper proposes an automated, unsupervised method for extracting keywords from Persian text that lacks a proper structure. The method not only takes into account the probability of occurrence of a word and its frequency in the text, but also captures the concepts and semantics of the text by training a word2vec model on it. In the proposed method, which combines statistical and machine learning methods, word2vec is first trained on the text and the words that have the smallest distance to other words are extracted. Then, a statistical equation is proposed to calculate the score of each extracted word using co-occurrence and frequency. Finally, a threshold is applied and the words with the highest scores are selected as keywords. The evaluations indicate an F-measure of 53.92% for the method, an 11% improvement over other keyword extraction methods.
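For readers unfamiliar with the reported metric, the following is a minimal sketch of how an F-measure is commonly computed for keyword extraction. It assumes exact-match comparison between extracted and reference keywords and the balanced case (beta = 1); the abstract does not state which variant or matching rule the authors used. The function name `f_measure` and the sample keyword lists are hypothetical.

```python
def f_measure(predicted, gold, beta=1.0):
    """Return (precision, recall, F) for exact-match keyword sets."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0, 0.0, 0.0
    precision = true_positives / len(predicted)  # share of extracted words that are correct
    recall = true_positives / len(gold)          # share of reference words that were found
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Made-up example: two of three extracted words appear among four reference words.
precision, recall, f1 = f_measure(
    ["bank", "inflation", "rate"],
    ["bank", "inflation", "economy", "interest"],
)
print(f"precision={precision:.4f} recall={recall:.4f} F1={f1:.4f}")
```

With beta = 1 the score is the harmonic mean of precision and recall, so it is high only when both are high.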
 

Keywords [English]

  • Keyword Extraction
  • Persian Language
  • Text Mining
  • Word Similarity
  • Word2vec