Exploring the Influence of Pre-Processing techniques in Obtaining Labelled Data from Twitter Data

Mursi, Japheth Kiplang’at; Subramaniam, Prabhakar Rontala; Govender, Irene

dc.contributor.author	Mursi, Japheth Kiplang’at
dc.contributor.author	Subramaniam, Prabhakar Rontala
dc.contributor.author	Govender, Irene
dc.date.accessioned	2025-10-09T13:02:30Z
dc.date.issued	2023
dc.description	Journal article
dc.description.abstract	Pre-processing input text plays a crucial role in text classification by reducing dimensionality and removing unnecessary content. Different text pre-processing techniques affect the input vocabulary and documents of prediction models. Many of the decisions that affect model performance are made during data pre-processing. The notion that data pre-processing affects the outcome of a prediction task is widely accepted, yet little work has been done on measuring this impact. In this study, six different text pre-processing techniques were applied, resulting in five types of labelled datasets used in classification. Three machine learning classifiers, Naïve Bayes (NB), Random Forest and Logistic Regression (LG), were used. The accuracy of each classifier on the different datasets was calculated. Results showed that the accuracy of Naïve Bayes, Random Forest and Logistic Regression improved significantly when only stemming and Stop Word removal were applied. Naïve Bayes achieved its highest accuracy of 90.71% when the dataset was stemmed and Stop Words were removed. Similarly, Random Forest and Logistic Regression achieved higher accuracies of 94.5% and 93.5%, respectively, when the dataset was stemmed and Stop Words were removed. In addition, the accuracy of the classifiers on the labelled dataset that was tokenized and lemmatized dropped to 88.44% for Naïve Bayes, 92.94% for Random Forest and 92.23% for Logistic Regression. The study concludes that the removal of Stop Words, stemming and lemmatization affect data labelling and prediction model accuracy.
dc.identifier.citation	Mursi, J. K., Subramaniam, P. R. & Govender, I. (2023). Exploring the Influence of Pre-Processing techniques in Obtaining Labelled Data from Twitter Data. IEEE AFRICON
dc.identifier.uri	https://repository.daystar.ac.ke/handle/123456789/7927
dc.language.iso	en
dc.publisher	IEEE AFRICON
dc.subject	pre-processing
dc.subject	labelled data
dc.subject	machine learning
dc.subject	accuracy
dc.subject	stemming pre-processing
dc.subject	stemming
dc.title	Exploring the Influence of Pre-Processing techniques in Obtaining Labelled Data from Twitter Data
dc.type	Article
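
The abstract reports that stemming combined with Stop Word removal gave the best classifier accuracy. As a rough illustration of what those two steps do to a tweet, here is a minimal sketch in plain Python. It is not the authors' pipeline: the stop-word list, the suffix rules and the function names (`tokenize`, `remove_stop_words`, `toy_stem`, `preprocess`) are all assumptions made for this example, and the suffix stripper is a simplified stand-in for a real stemmer such as Porter's.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"the", "is", "a", "an", "and", "to", "of", "in", "was", "are"}

# Toy suffix rules standing in for a real stemming algorithm (assumption).
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [toy_stem(t) for t in remove_stop_words(tokenize(text))]


if __name__ == "__main__":
    print(preprocess("The classifiers are learning and labelled the tweets"))
```

Both steps shrink the vocabulary a classifier sees, which is the dimensionality reduction the abstract refers to. A crude stemmer like this one also produces non-words (for example, "labelled" becomes "labell"), which is one reason stemming and lemmatization can lead to different downstream accuracy.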

Files

Original bundle

Name: Exploring the Influence of Pre-Processing techniques in Obtaining Labelled Data from Twitter Data.pdf
Size: 488.15 KB
Format: Adobe Portable Document Format

Collections

Journal Articles