Publication Type : Conference Paper
Publisher : IEEE
Source : 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT)
Url : https://doi.org/10.1109/icccnt61001.2024.10725521
Campus : Amritapuri
School : School of Computing
Department : Computer Science and Engineering
Year : 2024
Abstract : Programming languages serve as the medium for expressing instructions when creating computer programs. A prevalent convention in programming is categorizing source code into fragments, which aids understanding and simplifies maintenance. This study presents a comparative examination of two models for classifying source code. The first model is a neural network that relies heavily on Abstract Syntax Trees (ASTs). The second model is a neural network that operates on the output of tokenization. The AST-based model decomposes large ASTs into smaller clusters of statement trees. These clusters are then transformed into vectors that capture the lexical and syntactical characteristics of the statements. A bidirectional Recurrent Neural Network (RNN), operating on the collection of statement vectors, captures the naturalness of the statements. The token-based model, by contrast, does not use ASTs and works with the tokens obtained through tokenization. The tokens are preprocessed and vectorized before being fed into a bidirectional RNN that evaluates the naturalness of the statements. The accuracy of the two models is compared to select the better one.
Cite this Research Publication : R Parvathy, Mg Thushara, AST-Based and Token-Based Neural Networks for Source Code Classification: A Comparative Performance Analysis, 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT), IEEE, 2024, https://doi.org/10.1109/icccnt61001.2024.10725521
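
The two input representations described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the source language, tooling, or vectorization scheme, so Python's standard `ast` and `tokenize` modules are assumed here, and the names `statement_trees`, `code_tokens`, and `to_ids` are hypothetical. The bidirectional RNN stage is left out because it would require a deep-learning framework.

```python
import ast
import io
import tokenize

# Illustrative input; the paper's dataset is not described in the abstract.
SAMPLE = "def add(a, b):\n    total = a + b\n    return total\n"


def _node_types(node, is_root=True):
    """Pre-order node-type names, without descending into nested statements."""
    if isinstance(node, ast.stmt) and not is_root:
        return []  # a nested statement becomes its own statement tree
    names = [type(node).__name__]
    for child in ast.iter_child_nodes(node):
        names.extend(_node_types(child, is_root=False))
    return names


def statement_trees(source):
    """Split the AST of `source` into one node-type sequence per statement."""
    tree = ast.parse(source)
    return [_node_types(n) for n in ast.walk(tree) if isinstance(n, ast.stmt)]


def code_tokens(source):
    """Tokenize `source`, dropping layout-only tokens (assumed preprocessing)."""
    skip = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
            tokenize.DEDENT, tokenize.COMMENT, tokenize.ENDMARKER}
    stream = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tok.string for tok in stream if tok.type not in skip]


def to_ids(sequence, vocab):
    """Map symbols to integer ids, growing `vocab`; id 0 is kept for padding."""
    return [vocab.setdefault(symbol, len(vocab) + 1) for symbol in sequence]


if __name__ == "__main__":
    for tree in statement_trees(SAMPLE):
        print(tree)
    tokens = code_tokens(SAMPLE)
    print(tokens)
    print(to_ids(tokens, {}))
```

In a full pipeline, each statement tree (AST-based model) or the id sequence (token-based model) would be embedded and passed to a bidirectional RNN; the integer ids above only stand in for whatever vectorization the paper actually uses.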