Exploring the Influence of Pre-Processing techniques in Obtaining Labelled Data from Twitter Data
| dc.contributor.author | Mursi, Japheth Kiplang’at | |
| dc.contributor.author | Subramaniam, Prabhakar Rontala | |
| dc.contributor.author | Govender, Irene | |
| dc.date.accessioned | 2025-10-09T13:02:30Z | |
| dc.date.issued | 2023 | |
| dc.description | Journal article | |
| dc.description.abstract | Pre-processing input text plays a crucial role in text classification by reducing dimensionality and removing unnecessary content. Different text pre-processing techniques affect prediction models' input vocabulary and documents. Many of the decisions that affect model performance are made during data pre-processing. The notion that data pre-processing affects the outcome of a prediction task is widely accepted, yet little work has been done on measuring this impact. In this study, six different text pre-processing techniques were applied, resulting in five types of labelled datasets used in classification. Three machine learning classifiers, Naïve Bayes (NB), Random Forest and Logistic Regression (LG), were used. The accuracy of each classifier on the different datasets was calculated. Results showed that the accuracy of Naïve Bayes, Random Forest and Logistic Regression improved significantly after applying only stemming and Stop Word removal. Naïve Bayes achieved its highest accuracy of 90.71% when the dataset was stemmed and Stop Words were removed. Similarly, Random Forest and Logistic Regression achieved higher accuracies of 94.5% and 93.5%, respectively, when the dataset was stemmed and Stop Words were removed. In addition, the accuracy of the classifiers on the labelled dataset that was tokenized and lemmatized fell to 88.44% for Naïve Bayes, 92.94% for Random Forest and 92.23% for Logistic Regression. The study concludes that the removal of Stop Words, stemming and lemmatization affect data labelling and prediction model accuracy. | |
| dc.identifier.citation | Mursi, J. K., Subramaniam, P. R. & Govender, I. (2023). Exploring the Influence of Pre-Processing techniques in Obtaining Labelled Data from Twitter Data. IEEE AFRICON | |
| dc.identifier.uri | https://repository.daystar.ac.ke/handle/123456789/7927 | |
| dc.language.iso | en | |
| dc.publisher | IEEE AFRICON | |
| dc.subject | pre-processing | |
| dc.subject | labelled data | |
| dc.subject | machine learning | |
| dc.subject | accuracy | |
| dc.subject | stemming pre-processing | |
| dc.subject | stemming | |
| dc.title | Exploring the Influence of Pre-Processing techniques in Obtaining Labelled Data from Twitter Data | |
| dc.type | Article |
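
The abstract notes that pre-processing changes a model's input vocabulary and reduces dimensionality. The following is a minimal sketch of that effect, not the authors' pipeline: the stop-word list, the suffix list, the `toy_stem` helper and the example tweets are all hypothetical, and the stemmer is a crude suffix stripper rather than the Porter algorithm or whichever stemmer the study used.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "are", "i", "in", "is", "of", "the", "this", "to", "was"}

# Illustrative suffixes for a toy stemmer; this is NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, remove_stop_words=True, stem=True):
    """Tokenize, then optionally drop stop words and stem what remains."""
    tokens = tokenize(text)
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if stem:
        tokens = [toy_stem(t) for t in tokens]
    return tokens


def vocabulary(docs, **options):
    """Collect the set of distinct tokens across all documents."""
    return {tok for doc in docs for tok in preprocess(doc, **options)}


# Made-up example tweets, not data from the study.
tweets = [
    "The service is improving and the delays are ending",
    "Delayed flights and delays ruined this trip",
    "I loved the improved service",
]

raw_vocab = vocabulary(tweets, remove_stop_words=False, stem=False)
clean_vocab = vocabulary(tweets)
print(len(raw_vocab), len(clean_vocab))
```

On these made-up tweets the vocabulary shrinks once stop words are removed and word forms such as "delays" and "delayed" collapse to one stem, which is the kind of dimensionality reduction the abstract describes. Whether a smaller vocabulary improves classifier accuracy is an empirical question that this sketch does not address.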
Files
Original bundle
- Name: Exploring the Influence of Pre-Processing techniques in Obtaining Labelled Data from Twitter Data.pdf
- Size: 488.15 KB
- Format: Adobe Portable Document Format