Exploring the Influence of Pre-Processing techniques in Obtaining Labelled Data from Twitter Data

dc.contributor.author: Mursi, Japheth Kiplang’at
dc.contributor.author: Subramaniam, Prabhakar Rontala
dc.contributor.author: Govender, Irene
dc.date.accessioned: 2025-10-09T13:02:30Z
dc.date.issued: 2023
dc.description: Journal article
dc.description.abstract: Pre-processing input text plays a crucial role in text classification by reducing dimensionality and removing unnecessary content. Different text pre-processing techniques affect prediction models' input vocabulary and documents. Many of the decisions that affect model performance are made during data pre-processing. The notion that data pre-processing affects the outcome of a prediction task is widely accepted, yet little work has been done on measuring this impact. In this study, six different text pre-processing techniques were applied, resulting in five types of labelled datasets used in classification. Three machine learning classifiers, Naïve Bayes (NB), Random Forest and Logistic Regression (LG), were used. The accuracy of each classifier on the different datasets was calculated. Results showed that the accuracy of Naïve Bayes, Random Forest and Logistic Regression significantly improved after using only stemming and removing Stop Words. Naïve Bayes achieved its highest accuracy of 90.71% when the dataset was stemmed and Stop Words were removed. Similarly, Random Forest and Logistic Regression achieved higher accuracies of 94.5% and 93.5%, respectively, when the dataset was stemmed and Stop Words were removed. In addition, the accuracy of the classifiers on the labelled dataset that was tokenized and lemmatized dropped to 88.44% for Naïve Bayes, 92.94% for Random Forest and 92.23% for Logistic Regression. The study concludes that the removal of Stop Words, stemming and lemmatization affect data labelling and prediction model accuracy.
dc.identifier.citation: Mursi, J. K., Subramaniam, P. R. & Govender, I. (2023). Exploring the Influence of Pre-Processing techniques in Obtaining Labelled Data from Twitter Data. IEEE AFRICON.
dc.identifier.uri: https://repository.daystar.ac.ke/handle/123456789/7927
dc.language.iso: en
dc.publisher: IEEE AFRICON
dc.subject: pre-processing
dc.subject: labelled data
dc.subject: machine learning
dc.subject: accuracy
dc.subject: stemming pre-processing
dc.subject: stemming
dc.title: Exploring the Influence of Pre-Processing techniques in Obtaining Labelled Data from Twitter Data
dc.type: Article

Files

Original bundle

Name: Exploring the Influence of Pre-Processing techniques in Obtaining Labelled Data from Twitter Data.pdf
Size: 488.15 KB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed to upon submission