Sentiment Prediction using Naive Bayes Algorithm

Introduction

This is a post about sentiment prediction work that I did with the Naive Bayes classifier. The dataset I used for the experiments was this sentiment labelled dataset, which contains 3 types of review datasets:
  1. IMDB Movie Reviews
  2. Amazon Product Reviews
  3. Yelp Reviews
I used the IMDB Movie Reviews dataset, whose textual variant can be found here.

The Jupyter Notebook having the outcome of my experimentation is committed in this GitHub Repository.

Dataset

The dataset consists of review and sentiment pairs as follows:
Figure 1: The IMDB review dataset

The dataset contains movie reviews with both positive and negative sentiments. Each review is labelled with its respective sentiment, with positive as 0 and negative as 1.

Goals

The major goal of this project was to understand the working of the Naive Bayes classifier.
The following are the key milestones of the project:
  1. Predicting the sentiment for a given review.
  2. Dividing the dataset into train, dev and test.
  3. Building the vocabulary list.
  4. 5-fold cross-validation.
  5. Comparing the effectiveness of smoothing.
  6. Deriving the top 10 words that predict the positive and negative sentiments.
  7. Using the optimal hyperparameters from cross-validation to evaluate on the dev and test splits we made.

The Experimentation

First, I imported the dataset from Kaggle using the link here, as provided in the Kaggle documentation page.
Second, I used 80% of the data for training and split the remainder in half for dev and test.
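A minimal sketch of this split, using placeholder data in place of the real reviews (the variable names here are illustrative, not taken from the notebook):

```python
import random

# Placeholder (review, sentiment) pairs standing in for the real IMDB data
data = [(f"review {i}", i % 2) for i in range(1000)]

random.seed(42)      # fixed seed so the split is reproducible
random.shuffle(data)

n = len(data)
train = data[:int(0.8 * n)]        # 80% for training
rest = data[int(0.8 * n):]         # remaining 20%...
dev = rest[:len(rest) // 2]        # ...half for dev
test = rest[len(rest) // 2:]       # ...half for test

print(len(train), len(dev), len(test))  # 800 100 100
```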

Next, I split the training data into K pairs of train and test sets for conducting K-fold cross-validation. Let's talk about it...

K-fold cross-validation

It is a technique used to reduce the bias toward the training data, i.e. to reduce overfitting of the model. Using the entire dataset for training would fit the model too closely to the training set and produce more errors when predicting on new, unseen data.

So what we do is split the training dataset into K folds S1, ..., Sk, giving K pairs of training and test sets. Each of the K pairs will be as follows:
  • Train_1: S2, S3, S4, ..., Sk and Test_1: S1
  • Train_2: S1, S3, S4, ..., Sk and Test_2: S2
  • ... up to Train_K: S1, S2, S3, S4, ..., Sk-1 and Test_K: Sk
Each training set contains more data than its corresponding test set.
We validate each split by training on the train set, predicting on the test set, and computing the accuracy.
Using accuracy as the comparison, we find the hyperparameters with the best accuracy and use them to further validate on the dev and test splits.
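The fold construction above can be sketched like this (a simple sketch; the notebook may build the folds differently):

```python
def k_folds(data, k=5):
    """Split `data` into k (train, test) pairs: fold i is the test set,
    and the remaining k-1 folds together form the training set."""
    fold_size = len(data) // k
    folds = [data[i * fold_size:(i + 1) * fold_size] for i in range(k)]
    pairs = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        pairs.append((train, test))
    return pairs

pairs = k_folds(list(range(100)), k=5)
print([(len(tr), len(te)) for tr, te in pairs])  # [(80, 20), (80, 20), (80, 20), (80, 20), (80, 20)]
```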

Next, I started designing the Naive Bayes algorithm. For that, we need to understand its parameters...

Bayes Theorem

The Bayes Theorem states that the conditional probability is given by:

P(Sentiment | Sentence (collection of words)) = P(Sentence | Sentiment) * P(Sentiment) / P(Sentence)

The numerator is the significant term: the denominator is the same for both sentiments, so it cancels out during comparison.

Naive Bayes Assumption

To compute the numerator in the above equation, Naive Bayes assumes the words are conditionally independent given the sentiment, so the probability:

P(Sentence | Sentiment)  = P(Word_1,Word_2,...,Word_n | Sentiment)

P(Word_1,Word_2,...,Word_n | Sentiment) = P(Word_1 | Sentiment) * P(Word_2 | Sentiment) * ... * P(Word_n | Sentiment)

With this, we created the vocabulary list in our notebook...

For training, we supply the reviews and sentiments, split each review into words, and count the following:
  1. Word Frequency
  2. Positive Sentiment Word Frequency
  3. Negative Sentiment Word Frequency
With the above counts we can estimate the following:
  1. P(Word)
  2. P(Sentiment = Positive)
  3. P(Word | Sentiment = Positive)
  4. P(Sentiment = Negative)
  5. P(Word | Sentiment = Negative)
These are the parameters of the Naive Bayes model...
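The counting step can be sketched as follows (the function and variable names are illustrative; the sentiment labels follow the 0 = positive, 1 = negative encoding described earlier):

```python
from collections import Counter

def train_counts(reviews, sentiments):
    """Collect the three frequency tables listed above, plus per-class review counts."""
    word_freq = Counter()                      # 1. word frequency
    class_freq = {0: Counter(), 1: Counter()}  # 2-3. per-sentiment word frequency
    class_count = Counter()                    # number of reviews per sentiment
    for review, label in zip(reviews, sentiments):
        words = review.lower().split()
        word_freq.update(words)
        class_freq[label].update(words)
        class_count[label] += 1
    return word_freq, class_freq, class_count

word_freq, class_freq, class_count = train_counts(
    ["a great movie", "an awful movie"], [0, 1])
# From these, e.g. P(Word = "movie") = word_freq["movie"] / sum(word_freq.values())
```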

During prediction, we compute the two probabilities, one for each sentiment, and the higher value decides the predicted sentiment.
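A sketch of this prediction step, using toy counts in place of the trained frequencies. Two practical details are added here: log probabilities are summed instead of multiplied (to avoid numerical underflow on long reviews), and the add-one smoothing described later in the post is applied so unseen words do not produce zero probabilities:

```python
import math
from collections import Counter

# Toy counts standing in for the trained frequencies (0 = positive, 1 = negative)
class_freq = {0: Counter({"great": 3, "fun": 2, "movie": 4}),
              1: Counter({"awful": 3, "boring": 2, "movie": 4})}
class_count = Counter({0: 5, 1: 5})

def predict(review):
    """Return the sentiment with the higher P(sentiment) * prod P(word | sentiment)."""
    total_reviews = sum(class_count.values())
    best_label, best_score = None, float("-inf")
    for label in class_count:
        total_words = sum(class_freq[label].values())
        score = math.log(class_count[label] / total_reviews)  # log P(sentiment)
        for word in review.lower().split():
            # add-one smoothed log P(word | sentiment)
            score += math.log((class_freq[label][word] + 1) / (total_words + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(predict("a great fun movie"))  # 0 (positive)
```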

Once we calculate the above, we just need to use it to run the K-fold cross-validation and get the accuracy.

Finding the accuracy is simply calculating (number of correct predictions) / (total number of examples in the dataset).

We do this for all K folds and find which setting yields the maximum accuracy.

The best setting is then used on the dev and test splits to get the final accuracy.

Figure 2: The Naive Bayes parameters before smoothing



Figure 3: Accuracy before smoothing

Smoothing

If we look closely at Figure 2, there are NaN (Not a Number) values in the negative counts; this means these words have no negative-sentiment records in the dataset.

Also, there may be words that do not appear in the training dataset but are encountered in the validation set.

This just means there is missing data in the dataset. The Naive Bayes classifier can handle these situations well using smoothing techniques.

Smoothing is a technique that handles missing data by adding small compensating counts to the dataset.

In this experiment, I used add-one smoothing, the simplest case of Laplace (additive) smoothing.

All it does is add +1 to the following counts:

  1. Word Frequency
  2. Positive Sentiment Word Frequency
  3. Negative Sentiment Word Frequency

Also, +2 is added to the following totals, which sit in the denominators and must compensate for the +1 in the numerators, so that the probability of even the most frequent word stays below 1:
  1. Total words
  2. Total Positive sentiments
  3. Total Negative sentiments
  4. Total sentiments
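The scheme above (+1 in the numerator, +2 in the denominator) can be sketched in one line:

```python
def smoothed_prob(count, total):
    """Add-one smoothing as described above: an unseen word gets a small
    non-zero probability, and even a word seen every time stays below 1."""
    return (count + 1) / (total + 2)

print(smoothed_prob(0, 100))    # unseen word: ~0.0098 instead of 0 (or NaN)
print(smoothed_prob(100, 100))  # maximally frequent word: ~0.99 instead of 1.0
```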
Figure 4: The Naive Bayes parameters after smoothing

Figure 5: Accuracy after smoothing

We can see from the results in Figures 4 & 5 that there are no NaNs in the vocabulary list, and the accuracy has improved significantly.

Top words used for prediction

To get these, we order the vocabulary list by P(Sentiment | Word) in descending order and list the two sentiments separately.
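The ranking can be sketched as follows, estimating P(Sentiment | Word) as count(word, sentiment) / count(word); toy counts stand in for the real vocabulary:

```python
from collections import Counter

word_freq = Counter({"great": 4, "awful": 3, "movie": 10})
class_freq = {0: Counter({"great": 4, "movie": 5}),   # 0 = positive
              1: Counter({"awful": 3, "movie": 5})}   # 1 = negative

def top_words(label, n=10):
    """Words ranked by P(sentiment | word), highest first."""
    probs = {w: class_freq[label][w] / word_freq[w] for w in word_freq}
    return sorted(probs, key=probs.get, reverse=True)[:n]

print(top_words(0, n=2))  # ['great', 'movie']
```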
Figure 6: Most used positive words

Figure 7: Most used negative words


From the above two listings we can infer the following:
  1. The words at the top of both lists are common, generic words that are not useful for prediction.
  2. Some words have a high probability under both positive and negative sentiments.

This shows that these words need to be removed.

As future work, we can remove stop words using the stop-word list in Python's NLTK library. We can also remove frequent domain words like "movie" and "film", which appear in both the positive and negative lists; that is logical, since this is a movie review dataset...

Removing Stop words and specific positive & negative words [Bonus Experiment]:

For this, as described in the inference above, we use the NLTK library and add the most frequent positive and negative words, such as "movie", "film", "on"..., to the stop-word list.
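A sketch of the filtering step. The post uses NLTK's stop-word corpus (`nltk.corpus.stopwords.words('english')`, available after `nltk.download('stopwords')`); a small hand-rolled set stands in for it here so the snippet is self-contained:

```python
# Stand-in for NLTK's English stop-word list
STOP_WORDS = {"the", "a", "an", "is", "it", "of", "and", "to", "in", "on"}
# Domain-specific frequent words added per the inference above
STOP_WORDS |= {"movie", "film"}

def remove_stop_words(review):
    """Drop stop words before counting, so they never enter the vocabulary."""
    return [w for w in review.lower().split() if w not in STOP_WORDS]

print(remove_stop_words("The movie is a great film"))  # ['great']
```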
Figure 8: Accuracy after smoothing and removing stop words


I found that the accuracy increased by a further 3%.
Also, the most useful positive and negative words now look reliable. The results are as follows...


Figure 9: Most used positive words


Figure 10: Most used negative words


Conclusion

From the experiment, we can conclude that smoothing and removing stop words improved the accuracy substantially, by about 6% in total.

Also, the Naive Bayes classifier handles missing data well and is a good fit for textual analysis.
