Characteristics of Big Data, Types of Big Data, Technologies for Big Data, Infrastructure for Big Data, Use of Data Analytics, Big Data Challenges, NoSQL, Comparison of SQL and NoSQL, Distributed Computing Challenges. Hadoop Ecosystem: HDFS (Hadoop Distributed File System); MapReduce: Inputs, Outputs, and Data Serialization; Managing Resources with Hadoop YARN; Interacting with the Hadoop Ecosystem.
Functional Programming in Scala: Basic Syntax, Type Inference, Parameters, Recursive Arbitrary Collections, ConsList, Arrays, Tail Recursion, Higher-Order Functions.
MapReduce Programming: Mapper, Reducer, Combiner, Partitioner, Real-Time MapReduce Applications, Data Serialization. Apache Spark: Resilient Distributed Datasets (RDDs), Creating RDDs, Lineage and Fault Tolerance, DAGs, Immutability, Task Division and Partitions, Transformations and Actions, Lazy Evaluation and Optimization, Formatting and Housing Data from Spark RDDs, Persistence.
Hive Architecture: Hive Data Types, Hive File Format, Hive Query Language (HQL), User-Defined Functions (UDF) in Hive. Introduction to Machine Learning with Spark: MLlib, Building a Machine Learning Pipeline in Spark. Pig on Hadoop: Anatomy of Pig, Use Cases for Pig, ETL Processing, Data Types in Pig, Running Pig, Execution Modes of Pig, HDFS Commands, Relational Operators, Piggy Bank.