Publication Type : Conference Proceedings
Publisher : Advances in Intelligent Systems and Computing, Springer
Source : Advances in Intelligent Systems and Computing, Springer, p.47-58 (2021)
Url : https://link.springer.com/chapter/10.1007/978-981-15-1275-9_5
Keywords : accuracy, Language identification, n-gram, Trigrams
Campus : Bengaluru
School : Department of Computer Science and Engineering, School of Engineering
Department : Computer Science
Year : 2021
Abstract : Language identification determines the language of a given document. Categorizing content by language can improve search results for multilingual documents. In this work, we classify each line of text into a particular language, focusing on short phrases of 2 to 6 words across 15 Indian languages. The system detects whether a given document is multilingual and identifies the Indian languages it contains. The approach combines an n-gram technique with a list of short distinctive words. The n-gram model is language independent, whereas the short-word method requires less computation. The results show the effectiveness of our approach on synthetic data.
Cite this Research Publication : S. Bhaskaran, Geetika Paul, Dr. Deepa Gupta, and Amudha J., “Indian Language Identification for Short Text”, Advances in Intelligent Systems and Computing. Springer, pp. 47-58, 2021.
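
Illustrative sketch : The abstract describes combining a list of short distinctive words with a character n-gram model to label each line of text. The Python below is a minimal sketch of that general idea, not the authors' implementation. The two-language toy corpus, the word lists, the function names (`char_trigrams`, `train_profiles`, `identify`), and the scoring rule (sum of relative trigram frequencies) are all assumptions made for illustration; the paper covers 15 languages and may score candidates differently.

```python
from collections import Counter

# Hypothetical toy data (assumption): two Devanagari-script languages,
# chosen because script alone cannot tell them apart.
TOY_SAMPLES = {
    "hi": ["यह मेरा घर है", "मैं घर जा रहा हूँ"],
    "mr": ["हे माझे घर आहे", "मी घरी जात आहे"],
}
# Hypothetical short distinctive words per language (assumption).
SHORT_WORDS = {
    "hi": {"है", "और", "का"},
    "mr": {"आहे", "आणि"},
}

def char_trigrams(text):
    """Count character trigrams of a line, padded with one space per side."""
    padded = f" {text.strip().lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def train_profiles(samples):
    """Build a relative-frequency trigram profile for each language."""
    profiles = {}
    for lang, lines in samples.items():
        counts = Counter()
        for line in lines:
            counts.update(char_trigrams(line))
        total = sum(counts.values())
        profiles[lang] = {gram: n / total for gram, n in counts.items()}
    return profiles

def identify(line, profiles, short_words):
    """Label one line: cheap short-word lookup first, trigram scoring second."""
    tokens = set(line.split())
    hits = {lang: len(tokens & words) for lang, words in short_words.items()}
    best = max(hits, key=hits.get)
    # Accept the short-word result only if exactly one language leads.
    if hits[best] > 0 and list(hits.values()).count(hits[best]) == 1:
        return best
    # Fallback: sum the profile frequencies of the line's trigrams.
    grams = char_trigrams(line)
    scores = {
        lang: sum(profile.get(gram, 0.0) * n for gram, n in grams.items())
        for lang, profile in profiles.items()
    }
    return max(scores, key=scores.get)

profiles = train_profiles(TOY_SAMPLES)
print(identify("यह घर है", profiles, SHORT_WORDS))
print(identify("मेरा घर", profiles, SHORT_WORDS))
```

Running the short-word lookup first reflects the abstract's remark that it needs less computation; the trigram profiles serve as the language-independent fallback when no distinctive word, or an ambiguous set of them, appears in the line. Labelling a document line by line in this way also exposes multilingual documents, since the set of distinct labels then contains more than one language.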