Publisher : International Journal of Applied Engineering Research
Campus : Coimbatore
School : School of Engineering
Department : Computer Science
Verified : Yes
Year : 2014
Abstract : Scalability, ease of access, and recovery from data loss (as a result of replication) make Hadoop an indispensable framework for Big Data synthesis. The MapReduce paradigm, a parallel data processing method, has proven to be one of the best strategies for data-intensive processes. Depending on the type of task performed on big data, one must decide whether to run it in a physical or a virtual environment. For performance-intensive and real-time operations, a physical cluster appears more reliable. When data proximity, cloning the characteristics of the name node, or lower communication latency becomes a factor, virtual nodes can be used. Our analysis indicates that a search operation over very large text content is both performance intensive and in need of low communication latency. Building on existing features for efficient search and retrieval of text-based data, we propose a mechanism that enhances them by choosing a suitable environment for the search along with an efficient algorithm to process it. We build an inverted index over the big data and run data retrieval tasks using the Hadoop MapReduce paradigm in both physical and virtual multi-node cluster environments, and we present a detailed performance analysis. Performance measurements of our experimental file system demonstrate that the physical cluster yields faster search results. Hardware constraints are also taken into account, since the data at hand is large and processing it demands more RAM. In addition, we present an algorithm that uses two hash maps to maintain the count and position of words, and a global hash map that keeps the number of occurrences in the relevant document for processing. This, in turn, supports efficient retrieval of data from the Wikipedia documents.
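
The indexing idea mentioned in the abstract (two hash maps for word count and word position, plus a global hash map of per-document occurrences) can be illustrated with a minimal single-machine sketch. This is not the paper's implementation: the function names `index_document`, `build_index`, and `search`, the whitespace tokenization, and the ranking by occurrence count are all assumptions made for illustration, and the Hadoop MapReduce distribution step is omitted.

```python
from collections import defaultdict


def index_document(text):
    """Build the two per-document hash maps (hypothetical helper).

    Returns (counts, positions):
      counts:    word -> number of occurrences in this document
      positions: word -> list of token offsets where the word appears
    Tokenization by lowercasing and whitespace splitting is an assumption.
    """
    counts = defaultdict(int)
    positions = defaultdict(list)
    for offset, word in enumerate(text.lower().split()):
        counts[word] += 1
        positions[word].append(offset)
    return dict(counts), dict(positions)


def build_index(documents):
    """Build a global hash map: word -> {doc_id: occurrence count}.

    `documents` maps a document id to its text. The per-document position
    maps are returned alongside the global map.
    """
    global_index = defaultdict(dict)
    all_positions = {}
    for doc_id, text in documents.items():
        counts, positions = index_document(text)
        all_positions[doc_id] = positions
        for word, n in counts.items():
            global_index[word][doc_id] = n
    return dict(global_index), all_positions


def search(global_index, word):
    """Return (doc_id, count) pairs for `word`, most occurrences first."""
    hits = global_index.get(word.lower(), {})
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    docs = {"d1": "hadoop runs map reduce", "d2": "map map reduce"}
    index, positions = build_index(docs)
    print(search(index, "map"))
    print(positions["d2"]["map"])
```

In a MapReduce setting, a step like `index_document` would plausibly run in the map phase and the merge into the global map in the reduce phase, though the abstract does not specify how the work is divided.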