Effect of information gain on document classification using k-nearest neighbor
DOI:
https://doi.org/10.26594/register.v8i1.2397
Keywords:
classification, feature selection, information gain, k-Nearest Neighbor, TF-IDF document
Abstract
State universities maintain libraries that support students' education and research and that hold books, journals, and final assignments. Most of these documents are the result of research. An intelligent document classification system is needed to make the collection easier for library visitors in higher education to use, as a form of service to students. Complaints about imbalanced text data and categories, irrelevant document titles, and words with ambiguous meanings encountered when searching for documents are the main reasons such a system is needed. This research uses k-Nearest Neighbor (k-NN) to categorize documents by study interest, with information gain feature selection to handle the unbalanced data and cosine similarity to measure the distance between test and training data. In tests conducted with 276 training data, the best result with information gain feature selection was an accuracy of 87.5%, obtained with 80% training data, 20% test data, and k=5. The highest accuracy overall, 92.9%, was achieved without information gain feature selection, using 90% training data, 10% test data, and k=5, 7, and 9. The paper concludes that the system is more accurate without information gain feature selection than with it, because every word in a document title is considered to play an essential role in forming the classification.
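The pipeline named in the abstract (TF-IDF weighting, optional information gain feature selection, cosine similarity, k-NN voting) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the whitespace tokenisation, the `+ 1.0` in the IDF formula, and the four toy titles with their `"ai"` and `"is"` labels are all assumptions made for illustration only.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenised document, plus the IDF table."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Assumption: add 1 so that terms shared by many documents keep a non-zero weight.
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs], idf

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(term, docs, labels):
    """Class entropy minus the entropy left after splitting on presence of `term`."""
    with_t = [y for d, y in zip(docs, labels) if term in d]
    without = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    cond = sum(len(part) / n * entropy(part) for part in (with_t, without) if part)
    return entropy(labels) - cond

def knn_predict(query, train_docs, train_labels, k=5, keep_terms=None):
    """Majority vote among the k training documents most similar to `query`."""
    if keep_terms is not None:  # optional feature selection step
        train_docs = [[t for t in d if t in keep_terms] for d in train_docs]
        query = [t for t in query if t in keep_terms]
    vectors, idf = tfidf_vectors(train_docs)
    # Query terms never seen in training carry no weight and are dropped.
    q = {t: c * idf[t] for t, c in Counter(query).items() if t in idf}
    ranked = sorted(zip(vectors, train_labels),
                    key=lambda pair: cosine(q, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy titles and study-interest labels, not data from the paper.
train = [
    "image classification with neural network".split(),
    "neural network for face recognition".split(),
    "web based library information system".split(),
    "information system for hospital management".split(),
]
labels = ["ai", "ai", "is", "is"]
query = "neural network for leaf image".split()

# Keep the four terms with the highest information gain.
vocab = sorted({t for d in train for t in d})
keep = set(sorted(vocab, key=lambda t: information_gain(t, train, labels),
                  reverse=True)[:4])

print(knn_predict(query, train, labels, k=3))                   # all terms
print(knn_predict(query, train, labels, k=3, keep_terms=keep))  # selected terms only
```

Passing `keep_terms=None` corresponds to the configuration without feature selection. With short titles, discarding low-gain terms can leave some documents with very few features, which is one plausible reading of why the abstract reports better accuracy when every title word is kept.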
License
Please find the rights and licenses in Register: Jurnal Ilmiah Teknologi Sistem Informasi. By submitting the article/manuscript of the article, the author(s) agree with this policy. No specific document sign-off is required.
1. License
Non-commercial use of the article is governed by the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
2. Author(s)' Warranties
The author warrants that the article is original, written by stated author(s), has not been published before, contains no unlawful statements, does not infringe the rights of others, is subject to copyright that is vested exclusively in the author and free of any third party rights, and that any necessary written permissions to quote from other sources have been obtained by the author(s).
3. User/Public Rights
Register's spirit is to disseminate published articles as freely as possible. Under the Creative Commons license, Register permits users to copy, distribute, display, and perform the work for non-commercial purposes only. Users must also attribute the authors and Register when distributing works in the journal and other media of publication. Unless otherwise stated, the authors are public entities as soon as their articles are published.
4. Rights of Authors
Authors retain all their rights to the published works, such as (but not limited to) the following rights:
Copyright and other proprietary rights relating to the article, such as patent rights,
The right to use the substance of the article in the author's own future works, including lectures and books,
The right to reproduce the article for the author's own purposes,
The right to self-archive the article (please read our deposit policy),
The right to enter into separate, additional contractual arrangements for the non-exclusive distribution of the article's published version (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal (Register: Jurnal Ilmiah Teknologi Sistem Informasi).
5. Co-Authorship
If the article was jointly prepared by more than one author, the author submitting the manuscript warrants that he/she has been authorized by all co-authors to agree to this copyright and license notice (agreement) on their behalf, and agrees to inform his/her co-authors of the terms of this policy. Register will not be held liable for anything that may arise from an internal dispute among the author(s). Register will only communicate with the corresponding author.
6. Royalties
Because Register is an open access journal that disseminates articles for free under the Creative Commons license terms mentioned above, the author(s) are aware that Register entitles the author(s) to no royalties or other fees.
7. Miscellaneous
Register will publish the article (or have it published) in the journal if the article's editorial process is successfully completed. Register's editors may modify the article to a style of punctuation, spelling, capitalization, referencing, and usage that they deem appropriate. The author acknowledges that the article may be published so that it will be publicly accessible, and that such access will be free of charge for readers, as mentioned in point 3.