Machine Learning with Apache Mahout and the Scala Programming Language

Saleem Ansari (@tuxdna)

http://tuxdna.in/

Outline

Mahout Algorithms

Mahout Scala primitives

Demo

Apache Mahout Algorithms

Some use-cases:

Product Recommendation: understanding / inferring what your customers are looking for
Topic Modeling: identifying topics from documents
Frequent Patterns Mining: knowing which entities occur together very often
Clustering: grouping similar items, or grouping documents that are likely about the same subject
Regression and Classification: predicting house prices, or identifying the class of an item (e.g. a product, document, or person)
And many more

Basic Ideas

Similarity and Distance metrics
Vectors and Matrices
Statistics
Probability

Similarity / Distance metrics

Different Similarity metrics

Pearson correlation
Euclidean distance
Cosine measure
Spearman correlation
Tanimoto coefficient
Log likelihood test
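
A few of the metrics listed above can be sketched in plain Scala. This is a minimal illustration, not Mahout's implementation: the object name SimilaritySketch and its method names are made up for this example, vectors are assumed to be non-zero and of equal length, and the Tanimoto coefficient is shown in its set form.

```scala
object SimilaritySketch {
  type Vec = Seq[Double]

  private def dot(a: Vec, b: Vec): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  private def norm(a: Vec): Double = math.sqrt(dot(a, a))

  /** Euclidean distance: straight-line distance between two points. */
  def euclideanDistance(a: Vec, b: Vec): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  /** Cosine measure: cosine of the angle between the two vectors.
    * Yields NaN if either vector is all zeros (assumed not to happen here). */
  def cosine(a: Vec, b: Vec): Double = dot(a, b) / (norm(a) * norm(b))

  /** Pearson correlation: the cosine measure of the mean-centred vectors. */
  def pearson(a: Vec, b: Vec): Double = {
    val meanA = a.sum / a.size
    val meanB = b.sum / b.size
    cosine(a.map(_ - meanA), b.map(_ - meanB))
  }

  /** Tanimoto coefficient on sets: |A intersect B| / |A union B|. */
  def tanimoto[T](a: Set[T], b: Set[T]): Double =
    if (a.isEmpty && b.isEmpty) 0.0
    else a.intersect(b).size.toDouble / a.union(b).size
}
```

For example, SimilaritySketch.tanimoto(Set(1, 2, 3), Set(2, 3, 4)) shares two of four distinct items, and two vectors pointing in the same direction have a cosine measure of 1 regardless of their lengths.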

Distance to Similarity conversion ( not the only way )

s = 1 / ( 1 + d )
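
The conversion above maps a distance of 0 to a similarity of 1, and larger distances to values approaching 0. A minimal Scala sketch, where the object and method names are made up for this example:

```scala
object DistanceToSimilarity {
  /** Maps a distance d >= 0 to a similarity in (0, 1]: s = 1 / (1 + d). */
  def similarity(d: Double): Double = {
    require(d >= 0.0, "distance must be non-negative")
    1.0 / (1.0 + d)
  }
}
```

With this mapping, d = 0 gives s = 1, d = 1 gives s = 0.5, and d = 3 gives s = 0.25.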

Similarity / Distance metrics contd...

Similarity Metric Selection

Matrix

Vector

Statistics

What are the stats almost everyone knows?

mean / average / expectation
median
mode

What about these?

variance
standard deviation
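
The statistics named on this slide can be computed in a few lines of plain Scala. This is a sketch under stated assumptions: the input is non-empty, the variance is the population variance (dividing by n rather than n - 1), and the object name StatsSketch is made up for this example.

```scala
object StatsSketch {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  /** Middle value of the sorted data; average of the two middle values if n is even. */
  def median(xs: Seq[Double]): Double = {
    val s = xs.sorted
    val n = s.size
    if (n % 2 == 1) s(n / 2) else (s(n / 2 - 1) + s(n / 2)) / 2.0
  }

  /** Most frequent values; a sample can have more than one mode. */
  def modes(xs: Seq[Double]): Set[Double] = {
    val counts = xs.groupBy(identity).map { case (v, vs) => v -> vs.size }
    val top = counts.values.max
    counts.collect { case (v, c) if c == top => v }.toSet
  }

  /** Population variance: mean squared deviation from the mean. */
  def variance(xs: Seq[Double]): Double = {
    val m = mean(xs)
    xs.map(x => (x - m) * (x - m)).sum / xs.size
  }

  /** Standard deviation: square root of the variance. */
  def stdDev(xs: Seq[Double]): Double = math.sqrt(variance(xs))
}
```

For the sample 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5, the median is 4.5, the only mode is 4, the variance is 4 and the standard deviation is 2.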

Probability

Conditional Probability: P(A|B) = num(A intersection B) / num(B)
Bayes Rule: P(A|B) = P(B|A) P(A) / P(B)
Probability Distribution: PMF for discrete, PDF for continuous variables
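
As a worked example of the first two items, take one roll of a fair die with A = "the roll is even" and B = "the roll is greater than 3". The sketch below computes P(A|B) by counting equally likely outcomes, then checks it against Bayes rule in its full form, P(A|B) = P(B|A) P(A) / P(B). The object and method names are made up for this example.

```scala
object ProbabilitySketch {
  /** P(A|B) by counting: outcomes in both A and B, over outcomes in B.
    * Assumes all outcomes are equally likely and B is non-empty. */
  def conditional[T](a: Set[T], b: Set[T]): Double =
    a.intersect(b).size.toDouble / b.size

  /** Bayes rule: P(A|B) = P(B|A) * P(A) / P(B). */
  def bayes(pBGivenA: Double, pA: Double, pB: Double): Double =
    pBGivenA * pA / pB

  val outcomes: Set[Int] = Set(1, 2, 3, 4, 5, 6)
  val even: Set[Int] = outcomes.filter(_ % 2 == 0)     // A = {2, 4, 6}
  val greaterThan3: Set[Int] = outcomes.filter(_ > 3)  // B = {4, 5, 6}

  // Direct count: {4, 6} out of {4, 5, 6}
  val direct: Double = conditional(even, greaterThan3)

  // Same quantity through Bayes rule
  val viaBayes: Double = bayes(
    conditional(greaterThan3, even),        // P(B|A)
    even.size.toDouble / outcomes.size,     // P(A)
    greaterThan3.size.toDouble / outcomes.size // P(B)
  )
}
```

Both routes give 2/3, since two of the three outcomes greater than 3 are even.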

Mahout Scala API

Vector
Matrix

( see the bindings )

Classification

Clustering

Recommendation Algorithms

User Based

Item Based
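
To make the user-based idea concrete, here is a minimal plain-Scala sketch. It does not use Mahout's recommender classes, and all names in it are made up for this example. It assumes positive ratings, measures user similarity with the cosine measure over co-rated items, and scores each unseen item as the similarity-weighted average of other users' ratings. An item-based variant would compare item columns instead of user rows.

```scala
object UserBasedSketch {
  type Ratings = Map[String, Double] // item -> rating

  /** Cosine measure restricted to the items both users have rated. */
  def similarity(a: Ratings, b: Ratings): Double = {
    val common = a.keySet.intersect(b.keySet).toSeq
    if (common.isEmpty) 0.0
    else {
      val dot = common.map(i => a(i) * b(i)).sum
      val normA = math.sqrt(common.map(i => a(i) * a(i)).sum)
      val normB = math.sqrt(common.map(i => b(i) * b(i)).sum)
      dot / (normA * normB)
    }
  }

  /** Top-n unseen items for `user`, scored by similarity-weighted average rating. */
  def recommend(user: String, data: Map[String, Ratings], n: Int): Seq[(String, Double)] = {
    val mine = data(user)
    // (similarity to `user`, that user's ratings), keeping only similar users
    val neighbours = (data - user).values.toSeq
      .map(r => (similarity(mine, r), r))
      .filter(_._1 > 0.0)
    val candidates = neighbours.flatMap(_._2.keys).toSet -- mine.keySet
    candidates.toSeq.map { item =>
      val raters = neighbours.filter(_._2.contains(item))
      val score = raters.map { case (s, r) => s * r(item) }.sum / raters.map(_._1).sum
      item -> score
    }.sortBy(-_._2).take(n)
  }
}
```

For instance, if alice rated A and B, bob rated A, B and C, and carol rated A, B and D, then recommend("alice", data, 2) proposes C and D, each scored from the ratings of the users who rated it.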

Demo

Naive Bayes Classifier
Clustering the Synthetic Control Data
Recommendation Algorithms

What's upcoming in Mahout 1.0?

No new algorithms will be developed in the MapReduce (Hadoop) style, although the existing ones will remain.
Existing MapReduce algorithms are to be ported from MR1 to MR2.
All new algorithms will use the Scala math DSL, which is meant to run unchanged on Hadoop, Spark, or other back ends.
