A Comparison of Machine Learning Models to Prioritise Emails using Emotion Analysis for Customer Service Excellence

ABSTRACT

There has been little research on machine learning for email prioritization for customer service excellence. To fill this gap, we propose and assess the efficacy of various machine learning techniques for classifying emails into three degrees of priority: high, low, and neutral, based on the emotions inherent in the email content. It is predicted that after emails are classified into those three categories, recipients will be able to respond to emails more efficiently and provide better customer service. We use the NRC Emotion Lexicon to construct a labeled email dataset of 517,401 messages for our proposal. Following that, we train and test four prominent machine learning models, MNB, SVM, LogR, and RF, and an Ensemble of MNB, LSVC, and RF classifiers, on the labeled dataset. Our main findings suggest that machine learning may be used to classify emails based on their emotional content. However, some models outperform others. During the testing phase, we also discovered that the LogR and LSVC models performed the best, with an accuracy of 72%, while the MNB classifier performed the poorest. Furthermore, classification performance differed depending on whether the dataset was balanced or imbalanced. We conclude that machine learning models that employ emotions for email classification are a promising avenue that should be explored further.

I. Introduction
... stress, and work-family imbalance. Email overload has direct negative consequences on employee productivity and must be addressed.
In various contexts, emotion detection from written text, such as emails, may be used to improve work performance and customer relationships [6]. Emotion indicates the psychological state, which is impacted by the discernment of someone's surroundings, health, and intent [7], and email contents are often filled with emotional cues. Through automatic emotion analysis, it is possible to obtain valuable information on how a specific audience feels about a given product, person, or service offered by a business. In other words, automated emotion detection systems can be employed by businesses to track and recognize emotional reactions to their goods and services. For instance, in power marketing, the user's feelings from speech data have been analysed for improved customer service [8]. In other cases, customer service agents can use automated anger detection systems in customer care emails to recognize unhappy consumers more quickly and take the necessary prompt actions to boost customer retention rates [9]. Without measures that track customer emotions, businesses risk adverse consequences for their reputation and related financial impacts, such as the loss of clients [10].
Emotion analysis differs from sentiment analysis, which categorizes textual data as positive, neutral, or negative. Instead, emotion analysis provides information about an individual's feelings or emotions through a series of "emotional connotations" like joy, sadness, or anger. Many proposed emotion models are reported in [11][12][13]. Each of those emotion models proposes a list of emotions that humans express. A popular emotion model is the wheel of emotions defined by Robert Plutchik [14]. As shown in Figure 1, the wheel of emotions lists several emotions that an individual usually expresses. Each emotion can have a different intensity, as illustrated by the different wheel cones. Robert Plutchik also noted that individuals could express one or more of eight primary emotions, as shown in Table 1.
Following the reasoning that frustrated customers will express primarily negative emotions, it should be possible for machine learning to detect emails with negative content and classify them as high priority compared to emails that contain neutral or positive emotions. To date, however, not much attention has been given to the use of emotions to classify emails according to ... [18] demonstrate machine learning techniques for email spam detection. A hybrid approach to spam detection is further found in the work of [19] and [20], and [21] evaluated the use of semantic features for spam detection in emails. In addition, a detailed review of spam detection techniques can be found in the works of [22][23][24][25]. Filtering spam emails targets unwanted emails but does not set any priority scheme for emails [15]. As stated by [26], there is a clear distinction between spam detection and email prioritization. The prioritization of emails aims at personalizing non-spam emails by estimating their relevance. Wang [26] also states that email prioritization can be split into two main groups depending on the targeted outcome: action prediction and priority label prediction, both of which require a classification task. To the researchers' knowledge, research on using machine learning and emotion analysis for email prioritization is scarce. One such study can be found in [27]. The authors used Naïve Bayes to categorize several emails according to their importance. They hypothesized that assigning different weights to selected terms from email contents makes it possible to calculate the overall importance or priority of these emails. However, the authors did not report any implementation results.
In this study, we investigate the possibility of using machine learning to analyse the emotions expressed in emails in order to assign a priority ranking to different emails. It is posited that customers will send emails containing different expressed emotions, which, when detected, can further help classify those emails into three main groups: high priority, neutral, and low priority. Our work contrasts with previous studies in that most works on email classification have focused on spam detection. The main contributions of this work are as follows. We create a labelled dataset of emails using emotions from the NRC Emotion Lexicon. There is currently no email dataset labelled with emotions. We then devise a novel algorithm to assign three priority levels, namely high, low, and neutral, to the messages in our dataset. Once the priority labels are assigned, we subject our dataset to some preprocessing stages. We then train, test, and compare different supervised machine learning models for their ability to correctly classify different email messages according to the three priority levels set for this study.
The rest of the paper is organized as follows. In Section II, we provide details on our proposed methodology to use emotions and machine learning to classify emails according to three levels of priority. In Section III, we present and discuss the results obtained. Finally, in Section IV, we conclude our work with some future recommendations.

II. Method
This study aims to evaluate the efficacy of machine learning in prioritizing emails based on the emotional content of their text. The general process flow for our proposal is depicted in Figure 2.

A. Data Acquisition
No publicly accessible email dataset is labelled with emotions like happiness, sadness, or anger. Hence, a labelled dataset has to be created for this study. To this end, the Enron email dataset is selected because it is a large email dataset that has already been used in several related studies such as [19], [20], [28], [29], and [30]. The Enron email dataset at https://www.cs.cmu.edu/~./enron/ includes 517,401 emails sent by Enron Corporation employees. The "Federal Energy Regulatory Commission" collected it as part of its inquiry into Enron's downfall. The dataset is saved as a CSV file and was obtained from Kaggle.

B. Data Cleaning and Pre-processing
The process of data cleaning aims to eliminate irrelevant contents from the dataset. In the context of this project, irrelevant content refers to any part of the email that is not valuable when the learning algorithm assigns a class to the email. Not only will data cleaning make the task of classification easier for the classification model, but it may also significantly reduce the processing time in the training stage. As stated by [20], data pre-processing is essential to yield a better outcome. Data preprocessing aims at curtailing noise and can help tackle the dimensionality curse reported by [31] and [32].
For data cleaning, duplicate and irrelevant fields were removed from the raw dataset. As for data pre-processing, the following was applied to the cleaned email dataset: lower casing, noise removal, stop words removal, and tokenization. The curse of dimensionality constraint is dealt with by including text normalization and lemmatization techniques in the pre-processing phase to help in dimensionality reduction. The steps have been curated and adapted from [19] and [20].
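The cleaning and pre-processing steps listed above can be illustrated with a minimal sketch. It uses only the Python standard library, and the stop-word list, the noise patterns, and the function names are assumptions made for illustration, not the pipeline used in this study. Lemmatization is left out because it needs an external lexical resource.

```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption); a real
# pipeline would rely on the full list shipped with an NLP library.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "it", "of",
              "on", "the", "this", "to", "was", "we", "you"}


def remove_duplicates(messages):
    """Data cleaning: keep only the first occurrence of each message body."""
    seen = set()
    unique = []
    for message in messages:
        if message not in seen:
            seen.add(message)
            unique.append(message)
    return unique


def preprocess(text):
    """Lower-case, remove noise, tokenize, and drop stop words."""
    text = text.lower()                                # lower casing
    text = re.sub(r"https?://\S+|\S+@\S+", " ", text)  # noise: URLs, addresses
    text = re.sub(r"[^a-z\s]", " ", text)              # noise: digits, punctuation
    tokens = text.split()                              # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]  # stop-word removal


emails = ["The invoice is WRONG, see http://example.com/bill!",
          "The invoice is WRONG, see http://example.com/bill!",
          "Thank you for the quick reply."]
cleaned = [preprocess(m) for m in remove_duplicates(emails)]
print(cleaned)
```

Keeping each step as its own line makes it easy to swap in a proper tokenizer or add a lemmatizer later without changing the rest of the pipeline.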

C. Annotation and Priority Labeling
Annotation preparation is a crucial step as the emails in the dataset must be labelled with their relevant emotions to enable the use of supervised machine learning. It was reported by [20] that lexicon labelling provides clear and uniform results. Several existing sentiment lexicons have been employed in developing different systems and algorithms. Some examples are VADER, AFINN, and Sentiment140. In this study, the NRC Word-Emotion Association Lexicon at https://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm is used for the emotion detection process since it provides a list of words associated with different emotions.
It should be noted that the NRC Word-Emotion Association Lexicon provides multiple emotions, each of which is associated with a polarity (positive/negative number) weight based on the contents of the analysed text. Once labeled, each email is tagged with a priority label according to the emotion detected. The pseudocode for assigning the labels "High Priority", "Low Priority", and "Neutral" is as follows.

    ...
    Else If weight sum bad_emotion > weight sum good_emotion Then return "High Priority"
    Else return "Neutral"
    END

An example of emotion polarity weights obtained from the NRC lexicon for different messages is shown in Figure 3.
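The labelling rule can be sketched in Python as follows. Two things here are assumptions rather than details taken from the text: the grouping of emotion categories into "good" and "bad" sets, and the first branch of the rule (good outweighs bad, giving "Low Priority"), which is inferred from the two branches that are stated. The function name `assign_priority` is hypothetical.

```python
# Assumed grouping of emotion categories; the text does not list which
# emotions count as "good" or "bad", so these sets are illustrative only.
BAD_EMOTIONS = {"anger", "disgust", "fear", "sadness"}
GOOD_EMOTIONS = {"joy", "trust", "anticipation", "surprise"}


def assign_priority(emotion_scores):
    """Map a dict of {emotion: weight} to one of the three priority labels."""
    bad = sum(w for e, w in emotion_scores.items() if e in BAD_EMOTIONS)
    good = sum(w for e, w in emotion_scores.items() if e in GOOD_EMOTIONS)
    if good > bad:
        # Assumed first branch: mostly positive emotions can wait.
        return "Low Priority"
    elif bad > good:
        # Stated branch: negative emotions outweigh positive ones.
        return "High Priority"
    # Stated fallback: equal weights or no emotion words at all.
    return "Neutral"


print(assign_priority({"anger": 3, "joy": 1}))
print(assign_priority({"joy": 2, "trust": 1, "fear": 1}))
print(assign_priority({}))
```

Emails with no emotion words, or with equal good and bad weight sums, fall through to "Neutral", which matches the final branch of the pseudocode.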

D. Feature Extraction and Selection
Machine learning algorithms are unable to work directly on raw text. Hence, feature extraction methods, otherwise known as vectorization, are conducted to transform text to numerical data, more specifically into a vector of features using Term Frequency-Inverse Document Frequency (TF-IDF), which was initially designed for text categorization [33].
TF-IDF weighting takes frequency feature vectors as input and assesses the weight of the features/words by using both TF and IDF. Term Frequency (TF) is the number of times a term appears in a text and Inverse Document Frequency (IDF) assesses a term's significance [34]. The formulas used to calculate the TF and IDF are given by (1) and (2).
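For reference, one common textbook form of the TF and IDF formulas is sketched below, under the assumption that the standard definitions are intended. The symbols are chosen here for illustration: $f_{t,d}$ is the number of times term $t$ appears in document $d$, $N$ is the total number of documents, and $n_t$ is the number of documents containing $t$.

```latex
\begin{align*}
\mathrm{TF}(t, d) &= f_{t,d} \\
\mathrm{IDF}(t) &= \log \frac{N}{n_t} \\
\mathrm{TF\text{-}IDF}(t, d) &= \mathrm{TF}(t, d) \times \mathrm{IDF}(t)
\end{align*}
```

Implementations often differ in the details. With its default settings, scikit-learn's TfidfVectorizer uses the smoothed variant $\mathrm{IDF}(t) = \ln\frac{1 + N}{1 + n_t} + 1$ and then normalizes each document vector to unit Euclidean length.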
TF-IDF relies on a statistical approach that weights and ranks each unigram and n-gram based on the number of times it appears in the text [35]. In this study, TF-IDF is used to execute this conversion as recommended by [18][19][20][35]. Table 2 provides some more details on the hyperparameters used for the TfidfVectorizer available in Python.

E. Model Training
In this step, the vectors generated during the feature extraction phase are used to train and test the machine learning models selected for this study. The dataset is uniformly and randomly split into an 80% train set and a 20% test set. We shall train and test the performance of the following popular machine learning models: SVM, NB, LogR, and RF. Those classifiers have been chosen for their good performance scores as reported in [35][36][37][38][39]. As recommended by [40], we will also investigate whether an ensemble method may yield better performance than the selected machine learning algorithms alone. Stacking is an ensemble method that learns how best to combine the predictions from several machine learning models. Here, the MNB, LSVC, and RF models will be stacked to build a new ensemble model. The ensemble method chooses the best classification model to use on the test set after each one has been evaluated on the training set. The main goal of an ensemble method is to integrate the outputs of several classifiers to build a strong one [41].

Table 2. Hyperparameters used for the TfidfVectorizer
Hyperparameter | Description
max_df = 0.90 | Set a threshold to ignore words with document frequency greater than 0.90
min_df = 2 | Set a threshold to ignore words with document frequency lower than 2
max_features = 1000 | To consider the top 1000 features in the corpus
stop_words | To remove the words from the stop words list
ngram_range = (1, 1) | To get features composed of single tokens
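The following is a minimal sketch of how the vectorization, the 80/20 split, and the individual and stacked classifiers could be wired together with scikit-learn. The toy corpus is hypothetical, and several settings are assumptions rather than values taken from this study: `stop_words="english"`, the `LogisticRegression` final estimator, `cv=3`, `n_estimators=50`, and the random seeds.

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: 20 short "emails" per priority class.
CLASS_WORDS = {
    "High Priority": ["angry", "furious", "refund", "broken"],
    "Low Priority": ["happy", "delighted", "excellent", "service"],
    "Neutral": ["meeting", "schedule", "report", "attached"],
}
texts, labels = [], []
for label, words in CLASS_WORDS.items():
    for i in range(20):
        # "customer email" occurs in every document, so max_df drops it.
        texts.append(f"{words[i % 4]} {words[(i + 1) % 4]} customer email")
        labels.append(label)

vectorizer = TfidfVectorizer(
    max_df=0.90,           # ignore terms in more than 90% of documents
    min_df=2,              # ignore terms in fewer than 2 documents
    max_features=1000,     # keep at most the top 1000 features
    stop_words="english",  # assumption: the stop-word list is not specified
    ngram_range=(1, 1),    # unigrams only
)
X = vectorizer.fit_transform(texts)

# Random 80% / 20% split of the vectorized data.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.20, random_state=42
)

models = {
    "MNB": MultinomialNB(),
    "LSVC": LinearSVC(),
    "LogR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=50, random_state=42),
    # Stacking of MNB, LSVC, and RF; the final estimator is an assumption.
    "Stacking": StackingClassifier(
        estimators=[
            ("mnb", MultinomialNB()),
            ("lsvc", LinearSVC()),
            ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
        cv=3,
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy on the test set
    print(f"{name}: test accuracy = {scores[name]:.2f}")
```

Note that `StackingClassifier` trains a final estimator on the cross-validated predictions of the base models, which is one standard reading of "stacking"; it may differ from the exact ensemble procedure used in this study.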

F. Model Evaluation
The selected machine learning models will be trained and tested on the Enron email dataset labelled with the NRC lexicon. For evaluation purposes, the accuracy and F1-score obtained for each model will be used to compare the performance of the implemented algorithms. Accuracy refers to the ratio of correctly categorized data to the overall classifications. The formula used to calculate accuracy is $\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$, where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. The F1-score, alternatively termed the F-measure, is the "harmonic mean" of the Precision and Recall. In other words, the F1-score combines the share of positive predictions that were correct (Precision) with the share of actual positives that were identified (Recall) into a single value.
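As a small worked example, the two metrics can be computed directly from the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The function names and the counts below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all classifications that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """Share of positive predictions that were correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of actual positives that were identified."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts for one class.
tp, tn, fp, fn = 40, 30, 20, 10
print(f"accuracy  = {accuracy(tp, tn, fp, fn):.3f}")
print(f"precision = {precision(tp, fp):.3f}")
print(f"recall    = {recall(tp, fn):.3f}")
print(f"F1-score  = {f1_score(tp, fp, fn):.3f}")
```

With these counts, accuracy is 70/100, precision is 40/60, recall is 40/50, and the F1-score is 80/110, which lies between precision and recall but closer to the lower of the two.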

III. Results and Discussions
We used Python 3.9.2, Jupyter Notebook, and the Anaconda distribution to implement our proposed email prioritization approach. Table 3 lists the different Python libraries we used to execute some of the main processes described in Section II.

A. Calculating Raw Emotion Scores for Annotation and Priority Labeling
Once we obtained the Enron email dataset, as explained in Section IIA, we cleaned the data and applied several pre-processing operations as described in Section IIB. We then used the "top_emotion" module from NRCLex to view the highest polarities from the email text for training our machine learning models. A snapshot of the resulting email messages and the associated emotions is shown in Figure 4.
The "raw_emotion_scores" module from NRCLex was used to obtain the polarities of the different emotions. The results were then transformed into a Pandas DataFrame, and the array of the different polarities was classified according to each emotion using the "pandas.DataFrame.from_records" method.
The score obtained for each emotion set was then used to decide on the priority label (high, low, neutral) to assign to each email message according to the algorithm described in Section IIC. The resulting dataset was then inspected for data distribution. Figure 5 shows the class sizes for the complete dataset and for the dataset after removing duplicates.
As observed, the pre-processing phase and priority labels were applied to two versions of the Enron email dataset. In the first, we kept all the records, while in the second, we removed all duplicate messages. Both versions were imbalanced, which can further influence the classification performance. In other words, the classifiers may try to improve the accuracy of the larger class to the detriment of the smaller classes.
The data was further sampled to balance the dataset, as recommended by [29], to address the issue of the classifier biasing towards the majority class. The sampling methods used were random oversampling, where data from the "minority class" were duplicated randomly, and random undersampling, where data from the "majority class" were randomly removed. The same sampling techniques were applied to both the full dataset and the dataset with duplicates removed. Figure 6 shows the class distribution for the dataset with no duplicates after undersampling and oversampling, respectively. Moreover, a similar balanced class distribution was obtained for the full dataset.
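The two sampling strategies can be illustrated with a short standard-library sketch. The function names, the seed, and the toy samples are hypothetical; the snippet only shows the idea of duplicating minority-class records or discarding majority-class records until all classes have the same size.

```python
import random
from collections import defaultdict


def _group_by_label(samples):
    """Group (text, label) pairs by their label."""
    groups = defaultdict(list)
    for text, label in samples:
        groups[label].append((text, label))
    return groups


def random_oversample(samples, seed=0):
    """Randomly duplicate records until every class matches the largest one."""
    rng = random.Random(seed)
    groups = _group_by_label(samples)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        balanced.extend(group + extra)
    rng.shuffle(balanced)
    return balanced


def random_undersample(samples, seed=0):
    """Randomly drop records until every class matches the smallest one."""
    rng = random.Random(seed)
    groups = _group_by_label(samples)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# Hypothetical imbalanced toy data: 5 high, 2 low, 3 neutral.
samples = ([(f"high {i}", "High Priority") for i in range(5)]
           + [(f"low {i}", "Low Priority") for i in range(2)]
           + [(f"neutral {i}", "Neutral") for i in range(3)])
over = random_oversample(samples)
under = random_undersample(samples)
print(len(over), len(under))
```

Oversampling keeps every original record but repeats some minority-class messages, whereas undersampling discards information from the larger classes; this trade-off is one reason to evaluate both.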

B. Feature Extraction and Selection
For feature extraction, the "TfidfVectorizer()" class from the scikit-learn library was employed. The lemmatized text was fitted into the TfidfVectorizer. The main purpose of this approach was to improve the computation and training processes. Once the TF-IDF representation of the dataset was generated, the dataset was split into an 80% train set and a 20% test set using sklearn's "train_test_split" function. The feature vectors generated by the TfidfVectorizer were then used as input to train the ML classification models.
As mentioned earlier, the following classifiers are used to fit the training data: NB, SVM, LogR and RF. Thus, the inbuilt classes, namely MultinomialNB, LinearSVC, LogisticRegression, and RandomForestClassifier from the "SciKit Learn" library are used to train the models on the dataset, both before and after the removal of duplicates, and evaluate whether the performance on a larger data set is improved.

C. Model Training and Evaluation
In Python, we used sklearn's "train_test_split" function to split our dataset uniformly and randomly into an 80% train set and a 20% test set. The feature vectors generated by the TfidfVectorizer and the labeled datasets were used as input to train all the ML classification models selected. The vectorizer and models were then pickled using Python's pickle library to enable saving and loading of the classifiers. We then obtained the training and testing classification scores for the different datasets and models when classifying emails into different priority categories using emotions. The relevant confusion matrix was generated for each model to calculate the corresponding TP, TN, FP, and FN values. The F1-score and overall accuracy for each model and the corresponding dataset were calculated from those values. The confusion matrices for the MNB, LogR, and LSVC classifiers corresponding to the full oversampled testing set are shown in Figure 7. Similar confusion matrices were obtained for the other datasets.
We used different performance scores to match the dataset used. For an imbalanced dataset, the F1-score gives a more representative idea of the performance of a classifier model, whereas, for balanced datasets, we used the accuracy metric. We also prefer to consult the macro average for the F1-score as this metric treats all classes equally. The classification performance scores obtained for the full imbalanced dataset with and without duplicates are shown in Table 4. Table 5 provides the accuracy results for all the models for the balanced datasets with and without duplicates.
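To make the difference between accuracy and the macro-averaged F1-score concrete, the sketch below builds a three-class confusion matrix from hypothetical true and predicted labels and computes both metrics. The helper names and the toy labels are assumptions for illustration only.

```python
LABELS = ["High Priority", "Low Priority", "Neutral"]


def confusion_matrix(y_true, y_pred, labels=LABELS):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true, pred in zip(y_true, y_pred):
        matrix[index[true]][index[pred]] += 1
    return matrix


def overall_accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def macro_f1(matrix):
    """Unweighted mean of the per-class (one-vs-rest) F1-scores."""
    n = len(matrix)
    f1_scores = []
    for k in range(n):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                       # true k, predicted other
        fp = sum(matrix[r][k] for r in range(n)) - tp  # predicted k, true other
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / n


# Hypothetical imbalanced example: the single "Low Priority" email is missed.
y_true = ["High Priority"] * 4 + ["Low Priority", "Neutral"]
y_pred = ["High Priority"] * 4 + ["High Priority", "Neutral"]
cm = confusion_matrix(y_true, y_pred)
print(f"accuracy = {overall_accuracy(cm):.3f}")
print(f"macro F1 = {macro_f1(cm):.3f}")
```

In this toy case the accuracy is 5/6 because the majority class dominates, while the macro F1 is 17/27 because the completely missed minority class contributes an F1 of zero with the same weight as the other two classes.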
The performance scores for the RF and stacking classifiers exhibit model overfitting, with a perfect 100% score in training but a reduced performance score on the testing set. Similarly, as seen in Table 5, the RF and stacking classifiers obtained 100% accuracy on the training set for all the balanced datasets. However, depending on the dataset, the testing accuracy drops to between 72% and 99%. The perfect training score therefore creates a misleading sense of high accuracy, which can mostly be attributed to model overfitting. In other words, both the RF and stacking models overfit the training set at the expense of inferior performance on the testing set. Recall that the stacking model was built using the MNB, LSVC, and RF classifiers. It is therefore reasonable to assume that the RF classifier within the stacking model contributed to the overfitting, which is why the stacking model fails to perform well on unseen data.
In contrast, the performance scores obtained for the other models, i.e., MNB, LSVC, and LogR, appear to be more reliable. For the imbalanced datasets (Table 4), the LogR classifier gives a slightly better performance score of 0.67 compared to MNB and LSVC. Overall, all three models gave similar performance scores in their training and testing phases.
Likewise, for the balanced datasets (Table 5), the LogR classifier is again seen to provide a good classification performance score. A maximum accuracy of 0.73, close to that of the LSVC classifier, was observed across the balanced datasets, making LogR and LSVC the two most suitable priority classifiers for emails using emotions. Since the MNB classifier gave the worst performance for both the balanced and imbalanced datasets, we deduce that this model is not the most suitable for this type of task.
In general, therefore, it is found that machine learning models are good candidates for classifying emails into different priority levels based on the emotional content of the email. Previous studies have mostly focused on using machine learning techniques for spam detection. This study used the NRC Emotion Lexicon to label an otherwise unlabeled email dataset. The best performance score obtained is encouraging but not good enough for deployment in a real organizational setting. Several improvements can still be made to obtain a better-performing email prioritization solution to the email overload problem. For instance, as discussed in [12], other emotion models can be used for the data labeling step. Using fewer emotion categories could also increase accuracy, as observed by [6]. Last but not least, as investigated by [42], other machine learning models like RNN can be evaluated for their performance in detecting emotions in email contents.

IV. Conclusion
Email overload is a growing organizational problem that has been overlooked. For businesses, it represents a considerable loss in productivity, poorer customer service, and increasing psychological stress imposed on employees. To address this problem, the efficacy of four machine learning models, namely MNB, LSVC, RF, and LogR, and an Ensemble of MNB, LSVC, and RF classifiers was evaluated for their performance in prioritising messages from the Enron email dataset. The dataset was labelled using the NRC Emotion Lexicon and, following several experiments on both imbalanced and balanced datasets, it was discovered that supervised machine learning could be used to detect emotions in email contents and assign priorities to emails accordingly. It was also noticed that data balancing influenced the classification performance and that the RF and the Ensemble methods tended to overfit the data. In parallel, it was found that the LogR and LSVC classifiers gave the best classification scores while the MNB classifier performed the poorest. However, the highest performance scores obtained in this study are not considered good enough to be effective in a real-life organizational setting. Thus, there is a need for more research into the use of emotions in email content when setting up a priority reply list. In future work, it is recommended that deep learning models and alternative emotion lexicons be tested for the possibility of achieving better performance scores. In addition, the approach discussed in this paper considered email content written in the English language only. The same techniques may not work well for other written languages, which may require other considerations for text cleaning and preprocessing. In this case, further research is warranted.