Publication Type : Conference Proceedings
Publisher : International Conference on Advances in Computing, Communications and Informatics
Source : International Conference on Advances in Computing, Communications and Informatics, ICACCI 2017, pp. 1028-1034, 2017.
Campus : Coimbatore
School : School of Engineering
Center : Research & Projects
Department : Computer Science, Sciences
Verified : Yes
Year : 2017
Abstract : Data is continuously increasing in volume, variety, and velocity, and is becoming a precious and irreplaceable asset. The insights drawn from data analysis are used to transform healthcare, product engineering, cyber security, national intelligence, and business. Data is becoming so large that it is difficult to process with traditional data storage systems because of their centralized data processing behavior. Big Data tools like Hadoop and Apache Spark use MapReduce as their programming paradigm, which has decentralized data processing behavior. Spark performs in-memory cluster computing, which makes it 100x faster than Hadoop. This paper aims to identify various inefficiencies in the current design of the shuffle phase in Spark. The shuffle phase is an all-to-all communication mechanism and can potentially introduce network, disk, memory, and CPU scheduling overheads. A simple strategy to optimize shuffle performance is to use NIO buffers and large buffered reads and writes during shuffling, which improves performance by reducing the number of disk reads and writes when the partition size in a spill is small. Data from the input channel is read into a large buffer with a single read using an NIO buffer. The data in the buffer is then put into the output channel for shuffling. This reduces the number of disk operations as well as the number of copies of data made between JVM and native memory. The method also reduces CPU scheduling overheads through the non-blocking modes that NIO provides for threads.
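The buffering strategy described in the abstract can be illustrated with a minimal Java sketch. This is not the paper's implementation or Spark's shuffle code; the class name `LargeBufferCopy`, the method `copy`, and the `bufferSize` parameter are hypothetical names chosen for illustration. It assumes the input and output are plain files and uses only standard `java.nio` calls: each pass fills one large direct buffer from the input channel, then drains it into the output channel.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, for illustration only; not part of Spark.
final class LargeBufferCopy {

    /** Copies the file at `in` to `out` through one large buffer and returns the bytes copied. */
    static long copy(Path in, Path out, int bufferSize) throws IOException {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        // A direct buffer is allocated outside the JVM heap, so channel I/O
        // can usually avoid an extra heap-to-native copy.
        ByteBuffer buffer = ByteBuffer.allocateDirect(bufferSize);
        long total = 0;
        try (FileChannel src = FileChannel.open(in, StandardOpenOption.READ);
             FileChannel dst = FileChannel.open(out,
                     StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE,
                     StandardOpenOption.TRUNCATE_EXISTING)) {
            // Each read tries to fill the whole buffer; -1 signals end of file.
            while (src.read(buffer) != -1) {
                buffer.flip();                  // switch from filling to draining
                while (buffer.hasRemaining()) { // a write may be partial, so loop
                    total += dst.write(buffer);
                }
                buffer.clear();                 // make the buffer ready for the next read
            }
        }
        return total;
    }
}
```

With a buffer at least as large as a small spill partition, the whole partition would typically move in one read and one write, which is the reduction in disk operations the abstract refers to. The buffer size to use in practice is a tuning assumption that this sketch does not settle.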
Cite this Research Publication : Dr. (Col.) Kumar P. N. and , “Shuffle Phase Optimization in Spark”, International Conference on Advances in Computing, Communications and Informatics, ICACCI 2017, pp. 1028-1034, 2017.