
Content based spam detection in short text messages with emphasis on dealing with imbalanced datasets

Publication Type : Conference Proceedings

Publisher : IEEE

Source : Fourth International Conference on Computing Communication Control and Automation (ICCUBEA)

Url : https://ieeexplore.ieee.org/abstract/document/8697372

Campus : Bengaluru

School : School of Computing

Year : 2018

Abstract : Short text messages are an important means by which people connect with each other via their cell phones. The growing popularity of these messages is at times hindered by unwanted messages and advertisements, called spam, that are also sent as text messages. This behaviour can be irritating for the recipient. Automatic spam filters are used to identify these unwanted messages and keep them out of users' inboxes. Approaches to the spam detection problem have been either content based or heuristic based. The proposed work puts forward a content based machine learning approach with special emphasis on the fact that the datasets are imbalanced, which reflects the real-world scenario in spam detection. Experiments were performed with popular machine learning algorithms such as SVM, AdaBoost, Bagging and J48 to find the classifier that deals best with imbalanced datasets, and that classifier was then used to experiment with techniques for handling imbalance. Identifying the discriminating features, applying feature reduction techniques and dealing with issues related to imbalanced datasets are the major milestones in the proposed work. The SMOTE technique is applied to deal with imbalanced datasets. SVM in combination with SMOTE exhibited the best performance, with an improvement of 7 points on the JSC dataset and 3 points on the UCI dataset over the imbalanced datasets, with results reported in average class accuracy.
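
As an illustration of two ideas named in the abstract, the sketch below shows a simplified SMOTE-style oversampler and one plausible reading of "average class accuracy". It is not the authors' implementation. The function names, the toy data, and the interpretation of average class accuracy as the mean of per-class recall are all assumptions made here for illustration.

```python
import random


def average_class_accuracy(y_true, y_pred):
    """Mean of per-class recall.

    Assumption: this is how 'average class accuracy' is read here;
    the abstract does not define the metric.
    """
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


def smote_sketch(minority, n_new, k=2, seed=0):
    """Simplified SMOTE: build n_new synthetic minority samples.

    Each synthetic point lies on the line segment between a randomly
    chosen minority point and one of its k nearest minority neighbours.
    Requires at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Candidate neighbours: every other minority point, nearest first.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


if __name__ == "__main__":
    # Toy imbalanced labels: a classifier that always answers "ham"
    # looks good on plain accuracy but not on average class accuracy.
    y_true = ["ham"] * 8 + ["spam"] * 2
    y_pred = ["ham"] * 10
    plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    print("plain accuracy:", plain)
    print("average class accuracy:", average_class_accuracy(y_true, y_pred))

    # Hypothetical 2-D feature vectors for the minority (spam) class.
    spam_vectors = [[0.0, 0.0], [1.0, 1.0], [1.0, 0.0]]
    print(smote_sketch(spam_vectors, n_new=4))
```

On the toy labels, always predicting the majority class scores 0.8 on plain accuracy but only 0.5 on the per-class average, which is why a class-averaged metric suits imbalanced spam data. In practice, a maintained implementation such as the `SMOTE` class in the imbalanced-learn package would normally be preferred over a hand-written version.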

Cite this Research Publication : Aich, P., Venugopalan, M., & Gupta, D. (2018). Content based spam detection in short text messages with emphasis on dealing with imbalanced datasets. In 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA) (pp. 1-5). IEEE.
