Entity Extraction of Hindi-English and Tamil-English Code-Mixed Social Media Text

Publication Type : Conference Paper

Publisher : Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Source : Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Springer Verlag, Volume 10478 LNCS, p.206-218 (2018)

Url : https://www.scopus.com/inward/record.uri?eid=2-s2.0-85041849478&doi=10.1007%2f978-3-319-73606-8_16&partnerID=40&md5=015290ce32cfa7ede3f79affbf15881e

ISBN : 9783319736051

Keywords : Clustering algorithms, Code-mixed text, Codes (symbols), Data mining, Entity extractions, extraction, Learning systems, Social media, Social networking (online), Support vector machines, Text processing, Tri grams, Word embedding

Campus : Coimbatore

School : School of Engineering

Center : Computational Engineering and Networking

Department : Electronics and Communication

Year : 2018

Abstract : Social media play an important role in today’s society. They give people a platform to express opinions on many topics in natural language. Social media text often contains code-mixed content, because users tend to mix multiple languages in their conversations instead of writing in their native script with Unicode characters. Entity extraction, the task of extracting useful entities such as Person, Location and Organization, is an important primary task in social media text analytics, and it is difficult for code-mixed social media text. This paper proposes three methodologies for extracting entities from Hindi-English and Tamil-English code-mixed data. The work was submitted to the shared task on Code-Mix Entity Extraction for Indian Languages (CMEE-IL) at the Forum for Information Retrieval Evaluation (FIRE) 2016. The proposed systems include approaches based on embedding models and a feature-based model. BIO-tag formatting is done as a pre-processing step, and trigram embeddings are extracted during feature extraction. The systems are built using a Support Vector Machine-based classifier. In the CMEE-IL task, we secured second position for Tamil-English data and third for Hindi-English. Additionally, the accuracies for the primary entities were analyzed in detail to guide further improvement of the system. © Springer International Publishing AG 2018.
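
Illustration : The abstract names BIO-tag formatting and trigram-based features but does not describe how they are implemented. The sketch below is only an illustration under stated assumptions, not the authors' code: the helper names `to_bio` and `trigram_windows`, the `<PAD>` token, the span format and the example sentence are all hypothetical. It shows one common way to turn entity spans into BIO tags and to pair each token with its previous and next neighbour, which is one plausible reading of "trigram" context.

```python
from typing import List, Tuple


def to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert entity spans (start, end_exclusive, label) into BIO tags.

    Assumed span format: token indices, end is exclusive, spans do not overlap.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


def trigram_windows(tokens: List[str], pad: str = "<PAD>") -> List[Tuple[str, str, str]]:
    """Return (previous, current, next) for every token, padding the edges."""
    padded = [pad] + list(tokens) + [pad]
    return [tuple(padded[i:i + 3]) for i in range(len(tokens))]


# Hypothetical Hindi-English code-mixed sentence, for illustration only.
tokens = ["Sachin", "Tendulkar", "ne", "Mumbai", "mein", "match", "khela"]
spans = [(0, 2, "PERSON"), (3, 4, "LOCATION")]

tags = to_bio(tokens, spans)
windows = trigram_windows(tokens)

for token, tag, window in zip(tokens, tags, windows):
    print(token, tag, window)
```

In a full pipeline, each window would typically be mapped to word embeddings and passed with its BIO tag to a classifier such as an SVM; that step is omitted here because the abstract gives no details about it.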

Cite this Research Publication : R. G. Devi, Veena, P. V., M. Kumar, A., and Dr. Soman K. P., “Entity Extraction of Hindi-English and Tamil-English Code-Mixed Social Media Text”, in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2018, vol. 10478 LNCS, pp. 206-218.
