Multi-document news summarization based on news features and part of speech tagging
DOI:
https://doi.org/10.26594/register.v4i2.1251
Keywords:
grammatical information, multi-document summarization, news document, news feature, part of speech tagging
Abstract
News Feature Scoring (NeFS) is a sentence weighting method commonly used to weight sentences in document summarization based on news features. These features include word frequency, sentence position, Term Frequency-Inverse Document Frequency (TF-IDF), and the similarity of a sentence to the title. NeFS selects important sentences by calculating word frequency and measuring word similarity between each sentence and the title. However, NeFS weighting alone is not sufficient, because it ignores the informative words a sentence contains, and those words can indicate that the sentence is important. This study aims to weight sentences in multi-document news summarization using an approach that combines news features with grammatical information (NeFGIS). The grammatical information carried by part of speech tagging (POS tagging) can indicate the presence of informative content. Weighting sentences with news features and grammatical information is expected to select representative sentences more reliably and to improve the quality of the resulting summaries. The study comprises four stages: news selection, text preprocessing, sentence scoring, and summary compilation. Summaries are evaluated with Recall-Oriented Understudy for Gisting Evaluation (ROUGE) using four variants: ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-SU4. Summaries produced by the proposed method (NeFGIS) are compared with those produced by a sentence weighting method based on news features and trending issues (NeFTIS). NeFGIS gives better results, increasing recall on ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-SU4 by 20.37%, 33.33%, 1.85%, and 23.14%, respectively.
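The recall figures reported in the abstract come from ROUGE. As a rough illustration of what ROUGE-N recall measures, the sketch below counts the n-grams of a reference summary that also appear in a candidate summary (with clipped counts) and divides by the number of n-grams in the reference. It is a minimal sketch, not the evaluation setup used in the study: the function names, the lowercase whitespace tokenization, and the example sentences are all assumptions made for illustration, and the official ROUGE package applies its own preprocessing.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: clipped n-gram matches / n-grams in the reference.

    Assumption: tokens are lowercase, whitespace-separated words.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0  # reference too short to contain any n-gram
    # Each reference n-gram is matched at most as often as it occurs
    # in the candidate (clipping).
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


# Hypothetical example sentences, not taken from the study's data.
reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge_n_recall(candidate, reference, n=1))  # 5 of 6 unigrams matched
print(rouge_n_recall(candidate, reference, n=2))  # 3 of 5 bigrams matched
```

ROUGE-L and ROUGE-SU4 follow the same recall idea but match longest common subsequences and skip-bigrams, respectively, rather than contiguous n-grams.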
License
The rights and licenses that apply to Register: Jurnal Ilmiah Teknologi Sistem Informasi are set out below. By submitting the manuscript of an article, the author(s) agree to this policy. No specific document sign-off is required.
1. License
Non-commercial use of the article is governed by the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
2. Author(s)' Warranties
The author warrants that the article is original, written by stated author(s), has not been published before, contains no unlawful statements, does not infringe the rights of others, is subject to copyright that is vested exclusively in the author and free of any third party rights, and that any necessary written permissions to quote from other sources have been obtained by the author(s).
3. User/Public Rights
Register's spirit is to disseminate published articles as freely as possible. Under the Creative Commons license, Register permits users to copy, distribute, display, and perform the work for non-commercial purposes only. Users must also attribute the authors and Register when distributing works in the journal and other publication media. Unless otherwise stated, the authors are public entities as soon as their articles are published.
4. Rights of Authors
Authors retain all their rights to the published works, such as (but not limited to) the following rights:
Copyright and other proprietary rights relating to the article, such as patent rights,
The right to use the substance of the article in the author's own future works, including lectures and books,
The right to reproduce the article for the author's own purposes,
The right to self-archive the article (please read our deposit policy),
The right to enter into separate, additional contractual arrangements for the non-exclusive distribution of the article's published version (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal (Register: Jurnal Ilmiah Teknologi Sistem Informasi).
5. Co-Authorship
If the article was jointly prepared by more than one author, the author submitting the manuscript warrants that he/she has been authorized by all co-authors to agree to this copyright and license notice (agreement) on their behalf, and agrees to inform his/her co-authors of the terms of this policy. Register will not be held liable for anything that may arise from an internal dispute among the author(s). Register will only communicate with the corresponding author.
6. Royalties
Because Register is an open-access journal that disseminates articles for free under the Creative Commons license terms mentioned above, the author(s) are aware that Register entitles the author(s) to no royalties or other fees.
7. Miscellaneous
Register will publish the article (or have it published) in the journal if the article's editorial process is successfully completed. Register's editors may modify the article to a style of punctuation, spelling, capitalization, referencing, and usage that they deem appropriate. The author acknowledges that the article may be published so that it will be publicly accessible, and that such access will be free of charge for readers, as mentioned in point 3.