Sentiment Analysis of Public Transportation Services on Twitter Social Media Using the Naïve Bayes Classifier Method

Public transportation services in Indonesia, especially in Jabodetabek (Greater Jakarta), use social media, particularly Twitter, as a channel for improving their services. Because online transportation services have become an everyday need, a sentiment analysis is necessary to find out how people respond to them. This research analyzes public responses in the form of tweets, filtered and classified by the system built for the study. Summed over the four services analyzed, the positive final probability was 0.507843137 and the negative final probability was 1.4132493. The results show that the level of negative sentiment in public tweets is greater than the level of positive sentiment.


Introduction
Customer service is, in general, any activity aimed at satisfying customers; through service, customers' desires and needs can be fulfilled [1]. For an organization, especially a service organization, customer service helps retain existing customers and attract new ones. It can take the form of complaint management or of an information channel between the organization and its customers. The large and growing use of social media in Indonesia has led organizations to adopt it as a service channel, using it to share information and receive complaints. By exploiting the strengths of social media, organizations can manage customer complaints more actively, which makes customer satisfaction easier to achieve. Services on social media also make it easier for customers and prospective customers to get the information they need, and easier for customers to submit reviews and complaints about an organization's services. The focus of this research is how to make use of the reviews and complaints that customers post on social media about public transportation services. People in Indonesia, especially in Greater Jakarta, use several public transportation services, such as Transjakarta, Commuter Line (KRL), Gojek, and Grab, among others. These four public transportation organizations all provide services on social media, especially Twitter. Services provided on Twitter appear transparent and effective: the information customers submit reveals how well each organization's services perform. Customers can post reviews or complaints on Twitter directly, either as an ordinary tweet or addressed to the organization's account.
The advantage of Twitter in this regard is that reviews and complaints can be seen by anyone. These reviews and complaints can be used to assess how good or bad each organization's services are, which motivated the researchers to study the services of public transportation organizations through Twitter.
This study focuses on assessing public transportation services based on the reviews and complaints that customers submit on Twitter. The researchers performed sentiment analysis on Twitter data crawled for a keyword for each of the four public transportation services studied, and then classified the documents. Document classification techniques include the Naive Bayes Classifier, Decision Trees, and Support Vector Machines [2]; among these, the Naive Bayes Classifier is the most popular. The classification yields positive and negative values from the reviews of the four public transportation organizations on Twitter, one set for each service keyword.

Sentiment Analysis Framework
Sentiment analysis is a technique used to identify how a sentiment is expressed in text and how that sentiment can be categorized as positive or negative [3]. A similar view is expressed in [4], where sentiment analysis is used to understand comments made by internet users and to explain how they receive a product or brand. More generally, sentiment analysis is the process of determining the opinions, emotions, and attitudes reflected in a text, which are usually classified as negative or positive.

Naive Bayes Classification
Naïve Bayes Classification is a classification method rooted in Bayes' theorem. Bayes' theorem, proposed by the British scientist Thomas Bayes, is a probabilistic and statistical method that predicts future probabilities based on past experience. The main feature of the Naïve Bayes Classifier is its very strong (naïve) assumption that each condition or event is independent of the others.
According to [6], Naïve Bayes calculates, for each decision class, the probability that the decision class is true given the object's information vector. The algorithm assumes that object attributes are independent. The probabilities involved in producing the final estimate are calculated as the sum of the frequencies from the "master" decision table [6]. The Naive Bayes equation used in this study, based on Bayes' theorem (Bustami 2013), is [7]:
$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)} \tag{1}$$
where:
X : Data with unknown class
H : Hypothesis that the data belongs to a specific class
P(H|X) : Probability of hypothesis H given condition X (posterior probability)
P(H) : Probability of hypothesis H (prior probability)
P(X|H) : Probability of X given hypothesis H
P(X) : Probability of X
The advantage of this method is that it requires only a small amount of training data to estimate the parameters needed for classification. Because the variables are assumed to be independent, only the variance of each variable within a class is needed to determine the classification, not the entire covariance matrix.
Benefits of Naive Bayes:
a) Classifying text documents such as news texts or academic texts.
b) Serving as a machine learning method that uses probability.
c) Making medical diagnoses automatically.
d) Detecting or filtering spam.
Advantages of Naive Bayes:
a) Can be used for both quantitative and qualitative data.
b) Does not require a large amount of data.
c) Does not require extensive training.
d) If there is a missing value, it can be ignored in the calculation.
e) The calculation is fast and efficient.
f) Easy to understand.
g) Easy to build.
h) Document classification can be personalized, tailored to the needs of each person.
i) When implemented in a programming language, the code is simple.
j) Can be used for binary or multiclass classification problems.
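As an illustration of how Bayes' theorem and the independence assumption combine into a classifier, the following is a minimal Python sketch, not the system built for this study. The training tweets are hypothetical, the function names train and classify are introduced here for illustration, and the add-one smoothing is an assumption that the paper does not describe; it is included only so that unseen words do not produce a logarithm of zero.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """Estimate class counts and per-class word counts from (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in samples:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def classify(tokens, class_counts, word_counts, vocabulary):
    """Return the class with the highest log P(H) + sum of log P(x_i | H)."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior P(H)
        n_words = sum(word_counts[label].values())
        for token in tokens:
            # Add-one smoothing (an assumption, not from the paper) avoids log(0).
            likelihood = (word_counts[label][token] + 1) / (n_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training data; not taken from the study's tweets.
TRAINING = [
    (["driver", "ramah", "cepat"], "positive"),
    (["pelayanan", "bagus"], "positive"),
    (["driver", "telat", "lama"], "negative"),
    (["aplikasi", "error", "lama"], "negative"),
]
MODEL = train(TRAINING)
```

With this toy model, classify(["bagus", "cepat"], *MODEL) selects the positive class and classify(["error", "lama"], *MODEL) selects the negative class. Working in log space is a common design choice because products of many small probabilities underflow quickly.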

Research Instruments
Given its scope, this research takes a quantitative approach, studying numerical data from Twitter. The research was conducted on Twitter in Indonesia. The data come from observation and calculation of data collected from Twitter.

Data Collection Procedure
Data were collected using the Python programming language. Observations on Twitter were carried out using text mining techniques for 5 hours on each keyword. We searched for tweets or comments from Twitter users by category, one for each service keyword: Gojek, Grab, Commuter Line, and Transjakarta.

Data Analysis Method
The research was carried out and developed using the Python programming language. The process consists of data collection from Twitter, data cleaning, and Naive Bayes classification. Python is used for the sentiment analysis because it can collect, process, classify, and analyze data, and can apply loops and conditionals to many types of input.
a) Data Collection. The raw data is collected from Indonesian-language tweets from users across Indonesia.
b) Data Cleaning. Original data tends to be incomplete, noisy, and inconsistent. Data cleaning detects missing values, smooths noise while identifying outliers, and corrects inconsistencies in the data [8].
c) Data Integration. Data mining often requires data integration. Careful integration helps reduce and avoid redundancies and inconsistencies in the resulting data set, which can increase the accuracy and speed of the data mining process.
d) Data Reduction. Complex analysis and mining of large amounts of data can take a long time, making the analysis impractical. Data reduction techniques produce a reduced representation of a data set that is much smaller in volume but still maintains the integrity of the data.
e) Data Transformation. In data transformation, data is transformed or consolidated into a form suitable for mining.
f) Classification. At this stage, the data is classified with Naive Bayes.
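The cleaning, reduction, and transformation stages listed above could be sketched in Python roughly as follows. This is a hedged illustration rather than the study's actual code: the function names clean_tweet and prepare are hypothetical, and the specific rules (lowercasing, removing URLs, @mentions, and punctuation, dropping exact duplicates) are assumptions about what cleaning tweets typically involves.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text):
    """Lowercase a tweet and strip URLs, @mentions, and punctuation."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)  # also drops the '#' of hashtags
    return " ".join(text.split())       # collapse repeated whitespace


def prepare(tweets):
    """Clean every tweet, then drop empty results and exact duplicates."""
    seen = set()
    cleaned = []
    for tweet in tweets:
        text = clean_tweet(tweet)
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text.split())  # transform into a token list
    return cleaned
```

For example, prepare(["Bagus!", "bagus", "@akun"]) keeps a single token list, [["bagus"]], because the second tweet is a duplicate after cleaning and the third is empty once the mention is removed.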

Results And Discussion
Data were collected from Twitter using the Python programming language. The collected tweets are processed with positive and negative classification filters. The resulting data gives a percentage for each word occurrence, both positive and negative. The following is an analysis and discussion of each category using the Naive Bayes calculation method. The Naive Bayes equation used is as follows.

$$P(c \mid x) \propto P(x_1 \mid c) \times \cdots \times P(x_n \mid c) \times P(c) \tag{2}$$
where:
P(c|x) : Probability of class c given the observed words x (posterior probability)
P(x_i|c) : Probability of word x_i given class c
P(c) : Probability of class c as a whole (prior probability)

Gojek Data Analysis
The following is the crawled data for the keyword 'Gojek', collected using the Python programming language. There are 14 tweets, each with a classification. There are 4 positive tweets with a probability of 0.285714286, and 10 negative tweets with a probability of 0.714285714. The percentage for each type of tweet is 2.997994069% for positive and 10.13561835% for negative. The classification of word occurrence data is given in the following table. Table 3 lists the word occurrences with their respective counts. The probability of a word occurrence is the number of occurrences of that word divided by the total number of words that appear in the tweets, which is 17. The probability of a positive word occurrence is 0.29 and of a negative one is 0.71. The following is the result of calculating the total probability using the Naive Bayes method. The final probability through the Naive Bayes calculation is 0.084033613 for positive and 0.504201681 for negative. It can be concluded that there are more negative tweets than positive tweets. Likewise, the percentage of occurrences of negative words in Table 4 is much larger than the percentage of occurrences of positive words.
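The Gojek figures above can be reproduced with a short Python sketch that multiplies the class prior by the word-occurrence probability. The tweet counts come from the text. The word counts of 5 positive and 12 negative out of 17 are an assumption, inferred from the rounded values 0.29 and 0.71; they are used here because they reproduce the reported final probabilities. The function name final_probability is hypothetical.

```python
from fractions import Fraction


def final_probability(class_tweets, total_tweets, class_words, total_words):
    """Class prior P(c) multiplied by the word-occurrence probability for c."""
    prior = Fraction(class_tweets, total_tweets)
    word_probability = Fraction(class_words, total_words)
    return prior * word_probability


# Tweet counts are from the text (14 tweets, 4 positive, so 10 negative).
# Word counts 5 and 12 out of 17 are assumptions inferred from 0.29 and 0.71.
gojek_positive = final_probability(4, 14, 5, 17)
gojek_negative = final_probability(10, 14, 12, 17)

print(f"positive: {float(gojek_positive):.9f}")  # 0.084033613
print(f"negative: {float(gojek_negative):.9f}")  # 0.504201681
```

Exact fractions are used so that rounding happens only when the result is printed.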

Grab Data Analysis
The following is the crawled data for the keyword 'Grab', collected using the Python programming language. There are 9 tweets, each with a classification. There are 3 positive tweets with a probability of 0.333333333, and negative tweets with a probability of 1. The percentage for each type of tweet is 1.371980676% for positive and 10.26900957% for negative. The classification of word occurrence data is given in the following table. Table 7 lists the word occurrences with their respective counts. The probability of a word occurrence is the number of occurrences of that word divided by the total number of words that appear in the tweets, which is 15. The probability of a positive word occurrence is 0.2 and of a negative one is 0.79. The following is the result of calculating the total probability using the Naive Bayes method. The final probability through the Naive Bayes calculation is 0.066666667 for positive and 0.79 for negative. It can be concluded that there are more negative tweets than positive tweets. Likewise, the percentage of occurrences of negative words in Table 8 is much larger than the percentage of occurrences of positive words.

Commuter Line Data Analysis
The following is the crawled data for the keyword 'Commuter Line', collected using the Python programming language. There are 3 tweets, each with a classification. All 3 tweets are positive, giving a probability of 1 for positive tweets and 0 for negative tweets. The percentage for each type of tweet is 11.5151515% for positive and 0% for negative. The probability and percentage show that the positive value is greater, which means that there are more positive tweets than negative tweets. In fact, no negative tweets were found in the data we processed. The classification of word occurrence data is given in the following table. Table 11 lists the word occurrences with their respective counts. The probability of a word occurrence is the number of occurrences of that word divided by the total number of words that appear in the tweets, which is 3. The probability of a positive word occurrence is 0.33 and of a negative one is 0. The following is the result of calculating the total probability using the Naive Bayes method. The final probability through the Naive Bayes calculation is 0.333333333 for positive and 0 for negative. It can be concluded that there are more positive tweets than negative tweets. Likewise, the percentage of occurrences of positive words in Table 12 is greater than the percentage of occurrences of negative words.

Transjakarta Data Analysis
The following is the crawled data for the keyword 'Transjakarta', collected using the Python programming language. There are 6 tweets, each with a classification. There are 2 positive tweets with a probability of 0.333333333, and negative tweets with a probability of 0.666666667. The percentage for each type of tweet is 4.615384615% for positive and 4.524410774% for negative. The probabilities show that the negative value is greater, which means that there are more negative tweets than positive tweets. The classification of word occurrence data is given in the following table. Table 15 lists the word occurrences with their respective counts. The probability of a word occurrence is the number of occurrences of that word divided by the total number of words that appear in the tweets, which is 6. The probability of a positive word occurrence is 0.07 and of a negative one is 0.18. The following is the result of calculating the total probability using the Naive Bayes method. The final probability through the Naive Bayes calculation is 0.023809524 for positive and 0.119047619 for negative. It can be concluded that there are more negative tweets than positive tweets. Likewise, the percentage of occurrences of negative words in Table 16 is much larger than the percentage of occurrences of positive words.
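The word-occurrence probability used in each of the analyses above could be computed along the following lines. This is only a sketch under stated assumptions: the two lexicons are hypothetical placeholders, since the study's actual word lists are not reproduced here, and the denominator is assumed to be the number of lexicon words found in the tweets, which is one possible reading of "the total number of words that appear in the tweet".

```python
from collections import Counter

# Hypothetical sentiment lexicons; the study's actual word lists are not given.
POSITIVE_WORDS = {"bagus", "cepat", "nyaman"}
NEGATIVE_WORDS = {"telat", "lama", "penuh"}


def word_occurrence_probabilities(tokenized_tweets):
    """Share of positive and negative word occurrences among all counted words."""
    counts = Counter()
    for tokens in tokenized_tweets:
        for token in tokens:
            if token in POSITIVE_WORDS:
                counts["positive"] += 1
            elif token in NEGATIVE_WORDS:
                counts["negative"] += 1
    total = counts["positive"] + counts["negative"]
    if total == 0:
        # No lexicon words were found, so both shares are reported as zero.
        return {"positive": 0.0, "negative": 0.0}
    return {label: counts[label] / total for label in ("positive", "negative")}
```

For instance, the tweets [["bus", "telat", "lama"], ["nyaman"]] contain one positive and two negative lexicon words, so the function returns one third for positive and two thirds for negative.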

Implications of the Four Analysis Results
Through these observations and analyses, the researchers sought to determine which transportation services the Twitter user community considers good and which it considers bad. The value of 'good' is taken from the percentage of occurrences of positive words analyzed in each of the sections above, and the value of 'bad' is taken from the percentage of occurrences of negative words obtained from the data collection and calculations carried out.

Criticism and Suggestions
It is hoped that the results of these observations and analyses will be useful to each organization in monitoring the service reviews that the public submits on Twitter. This research is also expected to serve as a consideration for improving each organization's public transportation services in the future. The researchers suggest that each public transportation service organization improve the services it provides on Twitter, so that reviews from the public and customers can be monitored and the relationship with customers becomes more interactive.

Conclusion
From data collection carried out for 5 hours on each keyword and calculations using the Naive Bayes method, the Commuter Line transportation service had the highest positive final probability, 0.333333333. Next is Gojek, with a positive final probability of 0.084033613, followed by Grab with 0.066666667 and finally Transjakarta with 0.023809524. The analysis also produced negative final probabilities. The highest was Grab's, at 0.79. Next is Gojek, with a negative final probability of 0.504201681, then Transjakarta with 0.119047619. Finally, the Commuter Line had no negative tweets, which gives it a negative final probability of 0. Based on the system built, the total positive final probability is 0.507843137 and the total negative final probability is 1.4132493. The results show that the level of negative sentiment in public tweets is greater than the level of positive sentiment.
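The totals and the ordering stated in the conclusion can be checked with a few lines of Python. The values below are the final probabilities reported above; the variable names are introduced here for illustration only.

```python
# Final probabilities as reported in the conclusion: (positive, negative).
FINAL_PROBABILITIES = {
    "Commuter Line": (0.333333333, 0.0),
    "Gojek": (0.084033613, 0.504201681),
    "Grab": (0.066666667, 0.79),
    "Transjakarta": (0.023809524, 0.119047619),
}

total_positive = sum(pos for pos, _ in FINAL_PROBABILITIES.values())
total_negative = sum(neg for _, neg in FINAL_PROBABILITIES.values())

# Services ordered from highest to lowest positive final probability.
ranking = sorted(
    FINAL_PROBABILITIES,
    key=lambda name: FINAL_PROBABILITIES[name][0],
    reverse=True,
)

print(f"total positive: {total_positive:.9f}")
print(f"total negative: {total_negative:.7f}")
print(ranking)
```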