Introduction to Spark: Spark Architecture, Spark Jobs and APIs. Resilient Distributed Datasets: Creating RDDs, Transformations, Actions. DataFrames: Python to RDD communications, Creating DataFrames, DataFrame queries. MLlib: Loading and transforming the data. Implementation of machine learning algorithms such as classification and clustering using MLlib.
Approaches to Modelling, Importance of Words in Documents, Hash Functions, Indexes, Secondary Storage, The Base of Natural Logarithms, Power Laws, MapReduce. Finding Similar Items: Shingling, LSH, Distance Measures. Mining Data Streams: Stream data model, Sampling data, Filtering streams. Link Analysis: PageRank, Link Spam.
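As an illustration of the shingling and distance-measure topics listed in this unit, the following is a minimal sketch in plain Python, not course material. The function names `shingles` and `jaccard_similarity`, the default shingle length, and the two sample strings are all assumptions made for the example.

```python
def shingles(text, k=5):
    """Return the set of k-character shingles of text.

    Whitespace is collapsed and case is folded first; this normalisation
    is an assumption of the sketch, not a fixed rule.
    """
    normalised = " ".join(text.lower().split())
    return {normalised[i:i + k] for i in range(len(normalised) - k + 1)}


def jaccard_similarity(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)


# Hypothetical sample documents, for illustration only.
doc1 = "the quick brown fox"
doc2 = "the quick brown dog"
sim = jaccard_similarity(shingles(doc1, 3), shingles(doc2, 3))
print(f"Jaccard similarity of 3-shingle sets: {sim:.2f}")
```

The Jaccard distance is one minus this similarity. LSH (for example via minhashing) is typically used to avoid comparing every pair of shingle sets directly; it is not shown here.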
Frequent Itemsets: Market Basket Analysis, A-Priori Algorithm, PCY Algorithm. Big Data Clustering: Clustering in Non-Euclidean Spaces, BFR, CURE. Structured Streaming: Spark Streaming, Application dataflow. Coresets: Coresets for k-means and k-median clustering.
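To give a flavour of the A-Priori topic in this unit, here is a small two-pass sketch for finding frequent pairs, written in plain Python under simplifying assumptions: everything fits in memory, and only pairs (not larger itemsets) are counted. The function name `apriori_pairs`, the support threshold, and the sample baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def apriori_pairs(baskets, min_support):
    """Return {pair: count} for item pairs appearing in >= min_support baskets."""
    # Pass 1: count single items and keep those that meet the threshold.
    item_counts = Counter(item for basket in baskets for item in set(basket))
    frequent_items = {i for i, c in item_counts.items() if c >= min_support}

    # Pass 2: count only pairs whose members are both frequent
    # (monotonicity: a frequent pair cannot contain an infrequent item).
    pair_counts = Counter()
    for basket in baskets:
        candidates = sorted(set(basket) & frequent_items)
        pair_counts.update(combinations(candidates, 2))

    return {pair: c for pair, c in pair_counts.items() if c >= min_support}


# Hypothetical market-basket data, for illustration only.
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers"],
    ["bread", "milk", "beer"],
]
frequent_pairs = apriori_pairs(baskets, min_support=3)
print(frequent_pairs)  # {('bread', 'milk'): 3}
```

The PCY algorithm refines the first pass by also hashing pairs into buckets, so that fewer candidate pairs need to be counted in the second pass; that refinement is omitted from this sketch.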