Publication Type : Journal Article
Publisher : International Journal on Computer Science and Engineering
Source : (IJCSE) International Journal on Computer Science and Engineering, Volume 2, Number 8 (2010)
Campus : Coimbatore
School : School of Artificial Intelligence - Coimbatore, School of Engineering
Center : Computational Engineering and Networking
Department : Computer Science, Electronics and Communication
Year : 2010
Abstract : Developing a full-fledged bilingual machine translation (MT) system for any two natural languages with limited electronic resources and tools is a challenging and demanding task. This paper presents the development of a statistical machine translation (SMT) system from English to South Dravidian languages such as Malayalam and Kannada that incorporates syntactic and morphological information. SMT is a data-oriented statistical framework for translating text from one natural language to another based on knowledge extracted from a bilingual corpus. Although there have been efforts to build such an English to South Dravidian translation system, no efficient translation system exists so far. The first and most important step in SMT is creating a well-aligned parallel corpus for training the system. Experimental research shows that the existing methodology for bilingual parallel corpus creation is not efficient for English to South Dravidian languages in an SMT system. To improve the performance of the translation system, we introduce a new approach to creating the parallel corpus. The main ideas that we implemented and that proved very effective for English to South Dravidian SMT are: (i) reordering the English source sentence according to Dravidian syntax, (ii) applying root-suffix separation to both English and Dravidian words, and (iii) using morphological information, which substantially reduces the corpus size required for training the system. Because full-fledged parsing and morphological tools are unavailable for Malayalam and Kannada, sentence synthesis was done both manually and with an existing morph analyzer created by Amrita University. Our experiments show that the systems perform well and achieve very competitive accuracy for small bilingual corpora. The proposed ideas can be directly used for other South Dravidian languages such as Tamil and Telugu with some minor changes.
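The first two ideas in the abstract, source-side reordering and root-suffix separation, can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the authors' implementation: the role labels (S, V, O) are assumed to be supplied by some upstream parser, the suffix list is hypothetical and deliberately tiny, and the function names are invented for illustration. The target order assumed here is subject-object-verb, which is the typical word order of Malayalam and Kannada.

```python
# Toy illustration only: the suffix list, role labels, and function names
# are assumptions, not the rules or tools used in the paper.
SUFFIXES = ("ing", "ed", "s")  # hypothetical, deliberately tiny


def split_root_suffix(word):
    """Split a word into root and suffix tokens using the toy suffix list."""
    for suffix in SUFFIXES:
        # Require a root of at least three characters so short words stay whole.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return [word[: -len(suffix)], "+" + suffix]
    return [word]


def reorder_svo_to_sov(chunks):
    """Reorder (role, text) chunks from subject-verb-object to subject-object-verb."""
    order = {"S": 0, "O": 1, "V": 2}
    # sorted() is stable, so chunks with the same role keep their relative order.
    return sorted(chunks, key=lambda chunk: order[chunk[0]])


def preprocess(chunks):
    """Reorder the English chunks, then split every word into root and suffix."""
    tokens = []
    for _, text in reorder_svo_to_sov(chunks):
        for word in text.split():
            tokens.extend(split_root_suffix(word))
    return tokens


if __name__ == "__main__":
    sentence = [("S", "she"), ("V", "reads"), ("O", "books")]
    print(preprocess(sentence))  # ['she', 'book', '+s', 'read', '+s']
```

Separating suffixes into their own tokens means that inflected forms of a word share one root token, which is one plausible reason a smaller training corpus can suffice; a real system would rely on proper morphological analysis rather than string matching.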
Cite this Research Publication : Unnikrishnan, P., P. J. Antony, and K. P. Soman. "A novel approach for English to South Dravidian language statistical machine translation system." International Journal on Computer Science and Engineering, 2.08 (2010): 2749-2759.