Clickbait detection: A literature review of the methods used

Online news portals are currently among the fastest sources of information available to the public. Their impact stems from the credibility of the news produced by actors in the media industry.


Introduction
Digital media has largely replaced print media, and the number of news portals that provide a variety of information has increased. The growth of online media also has negative effects, such as clickbait, which refers to the use of exaggerated or sensational headlines that aim to attract traffic and clicks in order to increase site revenue. Such headlines are usually written as misleading and provocative sentences [1]. A title is an essential element of a news item: it gives the initial impression and influences the reader's perception [2]. According to the theory proposed by Lowenstein, clickbait exploits the knowledge gap created by a person's curiosity about a certain matter. This gap is capable of affecting one's emotions [3].
In 2016, Potthast et al. conducted an initial study on detecting clickbait titles using machine learning methods. The study used a dataset obtained from Twitter consisting of 2992 tweets, 767 of which were identified as clickbait [4]. In addition, Chakraborty et al. analyzed 15,000 clickbait and non-clickbait titles using Stanford CoreNLP [5] to determine the characteristics that distinguish the two categories, the first of which is sentence structure. Clickbait titles tend to have a longer sentence structure consisting of hyperbolic words and slang, with 62% used as titles and containing one to 40 words [6]. Rony et al. used two datasets: the headline dataset provided by Chakraborty et al. [6] and a media corpus obtained through the Facebook Graph API. The research aimed to analyze the impact of clickbait and showed that clickbait receives more user responses [7].
A recent study on clickbait detection was conducted by Dong et al. [8], who measured the similarity (consistency) between titles and news content. The study used a Bidirectional Gated Recurrent Unit (BiGRU) as the classification method to determine the correlation between headlines and news content, and obtained accuracy values above 85% on two different datasets. Research on clickbait detection in Indonesian-language articles is still very limited. Maulidi & Palandi [9] conducted research using the Naive Bayes (NB) classification algorithm, while Yavi [10] used the Neural Network (NN) method and obtained an average accuracy of 56%.
There are several studies on clickbait detection, but a review of these studies is not yet available. A literature review is needed to survey the methods and approaches used previously. Therefore, this literature review 1) provides a summary of the earlier methods used, and 2) analyzes several of these methods and the associated research. The remainder of this article is structured into methodology, results and discussion, and conclusions.

Methodology
This study followed the guidelines by Kitchenham & Charters [11], which explain the steps for carrying out a systematic literature review. The identification process resulted in 21 works of literature, which are presented in Table 1.

Result and Discussion
Several detection steps were identified from the literature: (1) collection of clickbait and non-clickbait news articles from several social media platforms and sites, with labeling initially carried out manually by several volunteers [17]; (2) data preprocessing in the form of cleaning; (3) feature analysis and selection; and (4) classification, which covers the techniques used in clickbait detection.

Data collection
Data were collected from several online or social media sites such as Facebook, Twitter, and Reddit. Samples representing clickbait titles were obtained from sources such as /r/SavedYouAClick (Reddit), @HuffPoSpoilers (Twitter), and StopClickbait (Facebook), which provide information on clickbait articles and educate the public on ways to avoid such news. Meanwhile, non-clickbait news was obtained from Wikinews [6], a trusted site with a set of specific writing procedures that each of its contributors must follow. It also verifies articles to ensure their reliability. The online datasets found in the literature are in English (EN), Thai (TH), and Chinese (ZH). Table 2 describes the datasets that are available online.

Data preprocessing
Preprocessing is applied to the data to improve classification performance. Raw data contains noise (errors or unexpected values) and inconsistencies. Preprocessing is therefore carried out to obtain a dataset whose size, structure, and format suit each algorithm [26]. It is carried out by erasing all special characters (such as punctuation), leaving only alphanumeric characters [14]. In addition, the characters in news titles are converted to lowercase to reduce errors during data interpretation, and stopwords are removed [9,14].
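The cleaning steps described above can be illustrated with a minimal sketch. The stopword list and the example title are invented for illustration and are not taken from any of the reviewed studies; the character filter assumes English titles written in ASCII.

```python
import re

# Tiny illustrative stopword list (an assumption; real studies use larger lists).
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "you", "this", "will"}

def preprocess(title, remove_stopwords=True):
    """Lowercase a title, keep only alphanumeric characters, optionally drop stopwords."""
    lowered = title.lower()
    # Replace every character that is not a letter, digit, or whitespace with a space.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", lowered)
    tokens = cleaned.split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

print(preprocess("You Won't BELIEVE What This Cat Did!"))
# ['won', 't', 'believe', 'what', 'cat', 'did']
```

The `remove_stopwords` flag is kept optional because stopword usage can itself be a signal that separates clickbait from non-clickbait titles.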
The data analysis by Chakraborty et al. showed that clickbait articles use more stopwords than non-clickbait ones. In a study on a Thai-language dataset [20], the researchers segmented words using the maximal matching algorithm, because Thai writing, unlike Latin-script writing, does not use spaces between words. Another study observed that clickbait news titles often contain characters such as "!" and "?", and therefore used two datasets to determine the impact of punctuation: the first consists of titles containing punctuation, while the second contains none [23].
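To make the segmentation step more concrete, the following is a minimal sketch of maximal matching, understood here as choosing the dictionary-based segmentation with the fewest words. The dictionary and the unspaced English string are stand-ins invented for illustration; the cited study worked on Thai text with a Thai dictionary, and its exact implementation is not described in this review.

```python
def maximal_matching(text, dictionary):
    """Segment unspaced text into the fewest dictionary words (None if impossible)."""
    n = len(text)
    max_len = max((len(w) for w in dictionary), default=0)
    # best[i] holds the shortest word list that exactly covers text[:i].
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if best[start] is not None and word in dictionary:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]

# Hypothetical toy dictionary standing in for a real word list.
toy_dictionary = {"the", "cat", "sat", "a", "t", "at", "s"}
print(maximal_matching("thecatsat", toy_dictionary))
# ['the', 'cat', 'sat']
```

Dynamic programming is used here so that every candidate segmentation is considered without enumerating them explicitly.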

Feature analysis
Feature analysis needs to be carried out before classification to determine the syntactic and semantic patterns in clickbait titles. It is also needed because of the differences in writing structures. Table 3 describes the features used in each clickbait detection study. Another technique used for feature extraction is word embedding, a distributed representation of words. This technique is often used in Natural Language Processing (NLP) [27] to represent words as vectors that capture context, semantic and syntactic similarity, and relationships with other words. Table 4 shows some of the word embedding models and algorithms used in clickbait detection.
Anand et al. [1] used Word2Vec with the Continuous Bag of Words (CBOW) model to obtain lexical and semantic features. The combination of BiLSTM, distributed word embedding, and character-level word embedding performed better than BiRNN and BiGRU. Word2Vec comprises CBOW and SkipGram [28], with CBOW used to predict a word from its context. Mikolov et al. also developed Extended SkipGram, a new distributed word embedding model derived from SkipGram [29]. This technique maps sentences to vectors and uses softmax as the classifier [7]. Rony et al. stated that it measures the distance between the title and the first paragraph (intro). The test results showed that the method works better than ordinary SkipGram.
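As a small illustration of how CBOW frames its prediction task, the sketch below builds the (context, target) pairs on which such a model would be trained. It covers only the data preparation, not the neural network itself; the window size and the example title are assumptions chosen for readability.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs as used to train CBOW."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # words before the target
        right = tokens[i + 1:i + 1 + window]     # words after the target
        context = left + right
        if context:
            pairs.append((context, target))
    return pairs

# Hypothetical tokenized title.
title = ["you", "will", "not", "believe", "this"]
for context, target in cbow_pairs(title, window=1):
    print(context, "->", target)
```

SkipGram uses the same windows in the opposite direction, predicting each context word from the target.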
Another algorithm used in word embedding is GloVe, an unsupervised learning algorithm for obtaining the vector representation of a word [30]. Pandey et al. compared the GloVe and Word2Vec algorithms, and the results showed that the combination of BiLSTM and GloVe has the best accuracy [21].
Machine learning is now widely used in various case studies, including clickbait detection, because it allows a program to learn from datasets [31]. One machine learning method for detecting clickbait is the classification algorithm. According to Table 5, several studies use machine learning classification techniques to detect clickbait. Initial research on clickbait detection through a machine learning approach was conducted by Potthast et al. In that study, 215 features were grouped into three categories, namely 1) the teaser message, 2) the linked web page, and 3) the meta-information [4]. The study compared three machine learning classification algorithms, namely Logistic Regression, NB, and Random Forest.
Logistic Regression (LR) is a classification algorithm that fits a nonlinear (logistic) curve to the data in order to predict a categorical target variable [32]. NB is one of the simplest methods for supervised learning and data mining [33]. It is derived from Bayes' theorem and works well on large volumes of data [34]. Random Forest (RF) is a popular algorithm because it remains accurate even with limited samples and a fairly large number of features [35]. The comparison conducted by Potthast et al. showed that RF performed best of the three [4].
Chakraborty et al. performed a linguistic analysis of the dataset using Stanford CoreNLP [5] to obtain the features used for classification with the Support Vector Machine (SVM), Decision Tree (DT), and RF algorithms. The study used sentence structure, word patterns, language characteristics, and n-grams as features.
SVM is a supervised learning algorithm used for classification or regression [36] that determines the maximum-margin hyperplane [37]. DT uses a tree structure to deduce classification rules from practical examples [38]. Each node in the tree represents a parameter, while each branch represents a possible value of the node above it [38]. Chakraborty et al. found different sentence structures in clickbait and non-clickbait titles. Clickbait titles tend to be wordy, hyperbolic, and rich in stopwords and slang, with 40 frequent words used as the classification features [6]. Furthermore, Biyani et al. defined several types of clickbait titles using GBD [39]. Clickbait titles are usually written as hyperbolic, ambiguous, or unclear statements that affect the emotions of the reader (increasing the reader's curiosity) [12].
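Of the classifiers surveyed above, Naive Bayes is the simplest to sketch from first principles. The following is a minimal multinomial Naive Bayes with add-one smoothing over word counts. It is an illustration of the general technique under invented toy data, not a reproduction of any reviewed study; the function names, labels, and titles are all hypothetical.

```python
import math
from collections import Counter

def train_nb(titles, labels):
    """Count documents per class and words per class from tokenized titles."""
    class_counts = Counter(labels)
    word_counts = {c: Counter() for c in class_counts}
    for tokens, label in zip(titles, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def predict_nb(model, tokens):
    """Return the class with the highest log posterior, using add-one smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for c in class_counts:
        total_words = sum(word_counts[c].values())
        score = math.log(class_counts[c] / total_docs)  # log prior
        for w in tokens:
            if w in vocab:  # ignore words never seen in training
                score += math.log((word_counts[c][w] + 1) / (total_words + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

# Invented toy corpus, for illustration only.
titles = [
    ["you", "wont", "believe", "this", "trick"],
    ["this", "will", "shock", "you"],
    ["parliament", "passes", "budget", "bill"],
    ["central", "bank", "raises", "interest", "rates"],
]
labels = ["clickbait", "clickbait", "non-clickbait", "non-clickbait"]
model = train_nb(titles, labels)
print(predict_nb(model, ["you", "wont", "believe"]))
print(predict_nb(model, ["bank", "raises", "rates"]))
```

Working in log space avoids numerical underflow when many word probabilities are multiplied together.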

Clickbait classification
One study of clickbait detection in Indonesian-language articles used NB as the classification method [10] and measured the effect of clickbait based on the number of "shares" and "likes" of a news article on Facebook, while another similar study used a Neural Network (NN) with the Backpropagation algorithm [9]. A Neural Network is a technology that is able to learn from both previous and newly obtained input data. Learning and prediction are carried out by a network of connected neurons arranged in a layered structure [40]. Backpropagation operates on a hierarchical design consisting of layers or rows of processing units that are fully interconnected [41]. Table 6 shows some methods of the deep learning (DL) approach, which is built on and functions based on Artificial Neural Networks (ANN). The science of human cognition is used to understand DL [42]. The study in [15] used the Convolutional Neural Network (CNN) as a classification method. CNN automatically produces a vector that is used as a representation of a news headline. It calculates a phrase vector for every phrase in a sentence [43]. In clickbait detection, its purpose is to represent news headlines as fixed-size, continuous, real-valued vectors for classification [15]. Another method used to detect clickbait is the Feed-Forward Neural Network (FNN), which consists of several layers of neurons, each determined by a set of appropriate synaptic weights [44]. The Recurrent Neural Network (RNN) is a class of ANN that makes decisions by taking previously seen information into account. It processes data sequentially and is not well suited to long sequences, because its learning process is hampered. The Long Short-Term Memory (LSTM) model was developed to overcome this problem [45].
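The layered structure of a feed-forward network can be illustrated with a minimal forward pass. The sketch below is an assumption-laden toy: the weights are hand-picked rather than learned, the sigmoid activation is one common choice among several, and training with Backpropagation is omitted entirely.

```python
import math

def sigmoid(x):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases):
    """One fully connected layer: sigmoid(W x + b), with one weight row per neuron."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(features, layers):
    """Pass a feature vector through each (weights, biases) layer in order."""
    activation = features
    for weights, biases in layers:
        activation = dense(activation, weights, biases)
    return activation

# Hypothetical hand-picked weights: 3 input features -> 2 hidden neurons -> 1 output.
layers = [
    ([[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]], [0.0, 0.1]),
    ([[1.2, -0.7]], [0.05]),
]
score = forward([1.0, 0.0, 1.0], layers)[0]
print("clickbait" if score >= 0.5 else "non-clickbait", round(score, 3))
```

In a trained model the weights would be adjusted from labeled titles, and the input features would come from an embedding or feature extraction step.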
Manjesh et al. conducted a study using LSTM and analyzed sentiment to classify whether a sentence is positive or negative. The analysis found that non-clickbait titles tend to use more neutral language [17]. The study also analyzed the average number of words used in titles and showed that non-clickbait titles tend to be shorter because they represent the content of the news directly [17].
Manjesh et al. [17] compared several machine learning algorithms with deep learning methods (LSTM and MLP). The study showed that the deep learning methods work better than the machine learning algorithms [17]. Figure 1 shows a mind map of the machine learning methods found, grouped by similarity.

Differences in each approach
The explanation above shows that there are two approaches to classification: machine learning and deep learning. The steps of each approach, together with the current state of the art, are as follows:

a. Machine Learning
The steps for the GNB, BNB, and MNB algorithms are as follows [17]:

1) Data collection/corpus
This stage consists of collecting data from several news sites and social media and classifying it. The initial classification was conducted manually by several volunteers through voting.

2) Analysis
Feature analysis is performed to determine semantic and syntactic differences in sentence structure.

3) Classification and testing
At this stage, the research dataset was divided into training, validation, and testing sets with a ratio of 70:20:10 [17]. The phrases that appear frequently in clickbait and non-clickbait articles are counted using N-grams for each title. Training is carried out until the accuracy reaches a maximum and stabilizes, after which the model is evaluated on the test set.

b. Deep Learning and Neural Network
Anand et al. [1] and Klairith & Tanachutiwat [20] used a deep learning approach to process information hierarchically and automatically while effectively capturing new levels of data abstraction. Although a deep learning model is capable of handling high-dimensional data, it also requires more examples to explore the data fully [46]. The steps are as follows:

1) Data collection/corpus
Data is collected from several sites or social media. The initial classification is conducted manually using a machine learning method.

2) Embedding layer
The embedding layer serves as input to the hidden layer and transforms each word into embedded features.

3) Hidden layer and Model
The hidden layer sits between the input and output layers and processes and learns from the input data provided by the embedding layer.

4) Output Layer
At this layer, the model is able to classify clickbait and non-clickbait titles.

5) Testing
Tests were carried out to determine the performance of the model.

c. State of The Art
Dong et al. [8] analyzed the similarity between the title and content of a story. The study used a deep learning approach carried out through the following stages:

1) Analysis of latent representation
At this stage, the researchers preprocessed the data and transformed it into vectors (word embedding), with BiGRU used to obtain the hidden representations.

2) Analysis of similarities
Researchers calculated the suitability of the title and content using the cosine similarity algorithm.

3) Prediction
The results obtained in stages 1 and 2 were combined using a fully connected layer to map hidden representations as input to the Multilayer Perceptron.

4) Testing
At this stage, testing was performed on two different datasets, yielding accuracy above 85%.
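The similarity analysis in stage 2 relies on cosine similarity, which can be sketched as follows. The two vectors are invented placeholders for the learned title and content representations; in the reviewed study these would come from the BiGRU hidden layer.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical representations of a title and of the article body.
title_vec = [0.9, 0.1, 0.3]
body_vec = [0.8, 0.2, 0.4]
print(round(cosine_similarity(title_vec, body_vec), 3))
```

A value near 1 indicates that the title and content point in the same direction in the embedding space, while a low value suggests a mismatch between what the title promises and what the article delivers.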

Conclusions
Considerable research has recently been conducted on clickbait detection. However, research on detecting clickbait in Indonesian-language articles is limited because few datasets are available online; most are in English, Thai, and Chinese. The 21 reviewed studies showed that clickbait is detected using machine learning classification algorithms as well as deep learning and neural network approaches. Machine learning-based research often uses the Random Forest algorithm, while deep learning and neural network research makes use of CNN, BiLSTM, BiGRU, and Multilayer Perceptron.