Machine Learning with Spark MLlib:
MLlib: MLlib is Apache Spark’s machine learning library. It is designed to run in parallel on a cluster, whether single-node or multi-node. MLlib contains a variety of learning algorithms and is accessible from all of Spark’s programming languages. Apache Mahout has likewise started moving from its MapReduce-based solution to a Spark-based one.
MLlib lets you invoke various algorithms on distributed data sets, with all data represented as RDDs. It depends only on the RDD and DataFrame abstractions.
MLlib introduces a few data types of its own, such as labeled points and vectors, but at its core it is simply a set of functions to call on RDDs.
Labeled point: A labeled data point for supervised learning algorithms such as classification and regression. It consists of a feature vector and a label. In MLlib this is the LabeledPoint class, found in the package org.apache.spark.mllib.regression.
Vector: A vector in the mathematical sense. MLlib supports both dense and sparse vectors. A dense vector stores every entry, while a sparse vector stores only the nonzero entries to save space.
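To make the dense/sparse distinction concrete, here is a minimal plain-Python sketch. It does not call Spark: the helper names to_sparse and to_dense and the LabeledExample class are hypothetical stand-ins invented for illustration. In PySpark the real counterparts are Vectors.dense and Vectors.sparse in pyspark.mllib.linalg, and LabeledPoint in pyspark.mllib.regression.

```python
from typing import Dict, List, NamedTuple


def to_sparse(dense: List[float]) -> Dict[int, float]:
    """Keep only the nonzero entries, keyed by their index."""
    return {i: v for i, v in enumerate(dense) if v != 0.0}


def to_dense(size: int, sparse: Dict[int, float]) -> List[float]:
    """Rebuild the full list, filling every missing index with 0.0."""
    return [sparse.get(i, 0.0) for i in range(size)]


class LabeledExample(NamedTuple):
    """Hypothetical stand-in for a labeled point: a label plus its features."""
    label: float
    features: List[float]


dense = [1.0, 0.0, 0.0, 3.0, 0.0]     # every entry is stored
sparse = to_sparse(dense)             # only indices 0 and 3 are stored
point = LabeledExample(label=1.0, features=dense)
```

The sparse form pays off when most entries are zero, which is typical for text features where each message uses only a small part of the vocabulary.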
How to identify spam mail using MLlib:
1. Start with an RDD of strings representing your mail messages.
2. Run one of MLlib’s feature extraction algorithms on the strings to convert the text into numeric features. This gives back an RDD of vectors.
3. Call a classification algorithm on that RDD of vectors to train a model.
4. Finally, evaluate the model on a test data set using one of MLlib’s evaluation functions.
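The four steps above can be sketched on a single machine in plain Python. This is only an illustration of the flow, not MLlib code: it uses lists instead of RDDs, the sample messages are made up, and the functions featurize, train_naive_bayes and predict are hypothetical helpers. In PySpark the corresponding pieces would be, for example, HashingTF from pyspark.mllib.feature for step 2, a classifier such as NaiveBayes from pyspark.mllib.classification for step 3, and a metrics class from pyspark.mllib.evaluation for step 4.

```python
import math
import zlib
from collections import defaultdict

NUM_FEATURES = 1024  # size of the hashed feature space (an arbitrary choice)


def featurize(text):
    """Step 2: hash each word into a fixed-size vector of term counts."""
    vec = [0.0] * NUM_FEATURES
    for word in text.lower().split():
        # crc32 is stable across runs, unlike Python's built-in hash()
        vec[zlib.crc32(word.encode("utf-8")) % NUM_FEATURES] += 1.0
    return vec


def train_naive_bayes(examples, smoothing=1.0):
    """Step 3: fit a multinomial Naive Bayes model on (label, vector) pairs."""
    doc_counts = defaultdict(int)
    term_counts = defaultdict(lambda: [0.0] * NUM_FEATURES)
    for label, vec in examples:
        doc_counts[label] += 1
        for i, v in enumerate(vec):
            term_counts[label][i] += v
    total_docs = sum(doc_counts.values())
    model = {}
    for label, counts in term_counts.items():
        denom = sum(counts) + smoothing * NUM_FEATURES
        log_prior = math.log(doc_counts[label] / total_docs)
        log_likelihood = [math.log((c + smoothing) / denom) for c in counts]
        model[label] = (log_prior, log_likelihood)
    return model


def predict(model, vec):
    """Return the label with the highest log-posterior score."""
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(v * ll for v, ll in zip(vec, log_likelihood))
    return max(model, key=score)


# Step 1: the messages (1 = spam, 0 = normal mail); made-up sample data
train = [
    (1, "win free money now"),
    (1, "free prize claim your money"),
    (0, "meeting agenda for monday"),
    (0, "project status report attached"),
]
test = [(1, "claim free money"), (0, "monday project meeting")]

model = train_naive_bayes([(label, featurize(text)) for label, text in train])
predictions = [predict(model, featurize(text)) for _, text in test]

# Step 4: evaluate with plain accuracy on the held-out messages
accuracy = sum(p == label for p, (label, _) in zip(predictions, test)) / len(test)
```

Hashing words into a fixed number of buckets keeps the vector size constant no matter how large the vocabulary grows, at the cost of occasional collisions between unrelated words.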
Machine Learning Basics:
Training data: ML algorithms attempt to make predictions or decisions based on training data.
Three types of Algorithms:
What are the Classification and Regression algorithms?
Classification and regression are two common forms of supervised learning, in which an algorithm attempts to predict a variable from the features of objects using labeled training data.
What is the major difference between Classification and Regression?
Regression is used to predict continuous values. Classification is used to predict which of a discrete set of classes a data point belongs to.
Naive Bayes is a well-known example of a classification algorithm.
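The difference in output type can be seen in a small plain-Python sketch. It is not MLlib code: the data is made up, and fit_line and nearest_mean_classifier are hypothetical helpers chosen for brevity (a least-squares line for regression and a nearest-class-mean rule for classification, rather than Naive Bayes).

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x


def nearest_mean_classifier(xs, labels):
    """Return a predict function that picks the class whose mean is closest."""
    means = {}
    for label in set(labels):
        members = [x for x, lab in zip(xs, labels) if lab == label]
        means[label] = sum(members) / len(members)
    return lambda x: min(means, key=lambda label: abs(x - means[label]))


# Made-up data: one feature, a continuous target, and a discrete target
sizes = [1.0, 2.0, 3.0, 4.0]
prices = [2.0, 4.0, 6.0, 8.0]
classes = ["small", "small", "large", "large"]

slope, intercept = fit_line(sizes, prices)
predicted_price = slope * 2.5 + intercept   # regression: a continuous number

classify = nearest_mean_classifier(sizes, classes)
predicted_class = classify(3.4)             # classification: one of the classes
```

Both models learn from the same feature, but the regression returns a number on a continuous scale while the classifier can only return one of the labels it saw during training.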