Publication Type : Conference Paper
Publisher : CEUR Workshop Proceedings, CEUR-WS.
Source : CEUR Workshop Proceedings, CEUR-WS, Volume 1737, p.321-324 (2016)
Keywords : Artificial intelligence, Codes (symbols), Context-based, Cross validation, Data mining, Entity extractions, Fires, Indian languages, Information Retrieval, Mixed supports, Named entity recognition, Social networking (online), Support vector machines, Training data, Word embedding
Campus : Coimbatore
School : School of Engineering
Center : Computational Engineering and Networking
Department : Electronics and Communication
Year : 2016
Abstract : This paper presents the working methodology and results for the Code Mix Entity Extraction in Indian Languages (CMEE-IL) shared task of FIRE-2016. The aim of the task is to identify entities such as person, organization, movie, and location names in code-mixed tweets. The code-mixed tweets are written in English mixed with Hindi or Tamil. In this work, an entity extraction system is implemented for both Hindi-English and Tamil-English code-mixed tweets. The system employs context-based character embedding features to train a Support Vector Machine (SVM) classifier. The training data was tokenized such that each line contains a single word. These words were further split into characters. Embedding vectors of these characters are appended with the I-O-B tags and used for training the system. During the testing phase, we use context embedding features to predict the entity tags for characters in the test data. We observed that the cross-validation accuracy using character embedding was higher for the Hindi-English Twitter dataset than for the Tamil-English Twitter dataset.
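The preprocessing described in the abstract (one word per line, words split into characters, characters paired with I-O-B tags, context taken around each character) might be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `<pad>` symbol, the window size, the `B-PERSON` label, and the choice to let each character inherit its word's tag are all assumptions. The embedding lookup and SVM training steps are omitted.

```python
from typing import List, Tuple

PAD = "<pad>"  # hypothetical padding symbol for window edges


def words_to_chars(tagged_words: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Split each (word, tag) pair into (character, tag) pairs.

    Assumption: every character inherits the I-O-B tag of its word.
    """
    return [(ch, tag) for word, tag in tagged_words for ch in word]


def context_windows(chars: List[str], size: int = 1) -> List[List[str]]:
    """Return, for each character, itself plus `size` neighbours on each side.

    Assumption: windows run across word boundaries and are padded at the ends.
    """
    padded = [PAD] * size + list(chars) + [PAD] * size
    return [padded[i:i + 2 * size + 1] for i in range(len(chars))]


# Illustrative input: one (word, tag) pair per line of tokenized training data.
tagged = [("Modi", "B-PERSON"), ("ji", "O")]
char_tags = words_to_chars(tagged)
chars = [c for c, _ in char_tags]
windows = context_windows(chars, size=1)
```

In a full system, each character in a window would be replaced by its embedding vector, the vectors concatenated into one feature row, and the rows passed with their tags to an SVM classifier.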
Cite this Research Publication : S. V. Skanda, Singh, S., G. Devi, R., Veena, P. V., Dr. M. Anand Kumar, and Dr. Soman K. P., “CEN@Amrita FIRE 2016: Context based character embeddings for entity extraction in code-mixed text”, in CEUR Workshop Proceedings, 2016, vol. 1737, pp. 321-324.