Publication Type : Conference Proceedings
Publisher : Lecture Notes in Electrical Engineering
Source : Lecture Notes in Electrical Engineering, 736 LNEE, pp. 41-50.
Url : https://link.springer.com/chapter/10.1007/978-981-33-6987-0_4
Campus : Amritapuri
School : School of Biotechnology
Department : Biotechnology
Year : 2021
Abstract : Pathogenic microorganisms pose a persistent challenge when they form biofilms on submerged surfaces such as pipes, drains, and sewers, where they are difficult to remove with conventional chemical or biological treatments. A fundamental understanding of the biodiversity of the sewage microbiome, or identification of the key species that can be targeted to significantly reduce its pathogenic population, can be critical to advancing and optimizing technology for maintaining environmental health. Finding articles with relevant information about this microbiome and the interactions within it, however, is like finding a needle in a haystack. This creates a need for data mining tools, a key component of which is named entity recognition (NER). Training a NER model requires a relevant dataset in which the required entities are tagged, and no such dataset could be found in the biomedical domain. In this study, we therefore set out to develop a microbiome dataset with all the relevant concepts tagged, for training a NER model intended to be part of a semantic information retrieval tool. To this end, we engineered a dataset focused on keywords related to the characteristics of the wastewater microbiome, which could be used to cluster the relevant information out of the bulk of PubMed literature. The newly engineered dataset was then used to fine-tune NER models based on different BERT variants, to determine which performed best on our dataset. We implemented NER models capable of accurately predicting the concepts tagged in the microbiome dataset, and designed experiments to validate the different models on our dataset and on other open-source biomedical datasets such as JNLPBA and BC5CDR. The results show that, of the three BERT variants, BioBERT performed best, and that even with a dataset fairly limited in size compared with other biomedical NER datasets, we achieved similar scores. The NER model fine-tuned on the microbiome dataset successfully predicted the tagged concepts (named entities) in the datasets.
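As an illustration of what "tagged concepts" can look like in NER training data, the following minimal sketch assumes the common BIO tagging convention (B- starts an entity, I- continues it, O marks tokens outside any entity). The abstract does not state the tagging scheme or the entity types used in the microbiome dataset, so the labels `ORGANISM` and `HABITAT`, the helper name `bio_to_spans`, and the example sentence are all hypothetical.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (label, text) entity spans from parallel token and BIO-tag lists."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    spans: List[Tuple[str, str]] = []
    current_label = None
    current_tokens: List[str] = []
    for token, tag in zip(tokens, tags):
        # An "I-" tag extends the open span only if its label matches.
        if tag.startswith("I-") and tag[2:] == current_label:
            current_tokens.append(token)
            continue
        # Any other tag closes the open span, if there is one.
        if current_label is not None:
            spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
        if tag.startswith("B-"):
            current_label, current_tokens = tag[2:], [token]
        # "O" tags and stray "I-" tags without a matching "B-" are skipped.
    if current_label is not None:
        spans.append((current_label, " ".join(current_tokens)))
    return spans


# Hypothetical sentence and labels, not taken from the paper's dataset.
tokens = ["Escherichia", "coli", "forms", "biofilms", "in", "sewer", "pipes"]
tags = ["B-ORGANISM", "I-ORGANISM", "O", "O", "O", "B-HABITAT", "I-HABITAT"]
print(bio_to_spans(tokens, tags))
```

A fine-tuned token-classification model would predict one such tag per token; grouping the tags back into spans, as here, is a typical step before comparing predictions with the gold annotations.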
Cite this Research Publication : Joshy Alphonse, Anokha N Binosh, Sneha Raj, Sanjay Pal, Nidheesh Melethadathil, “Semantic Retrieval of Microbiome Information Based on Deep Learning” in Fourth International Conference on Computing and Network Communications (CoCoNet'20) & International Conference on Applied Soft Computing and Communication Networks (ACN'20)