Machine Learning with

                           _            __  __       _                 _   
    /\                    | |          |  \/  |     | |               | |  
   /  \   _ __   __ _  ___| |__   ___  | \  / | __ _| |__   ___  _   _| |_ 
  / /\ \ | '_ \ / _` |/ __| '_ \ / _ \ | |\/| |/ _` | '_ \ / _ \| | | | __|
 / ____ \| |_) | (_| | (__| | | |  __/ | |  | | (_| | | | | (_) | |_| | |_ 
/_/    \_\ .__/ \__,_|\___|_| |_|\___| |_|  |_|\__,_|_| |_|\___/ \__,_|\__|
         | |                                                               
         |_|

                             and
                             __       
             ________ ___   / /  ___  
            / __/ __// _ | / /  / _ | 
          __\ \/ /__/ __ |/ /__/ __ | 
         /____/\___/_/ |_/____/_/ | | 
                                  |/  Programming Language

Saleem Ansari (@tuxdna)

http://tuxdna.in/

Presenter Notes

Outline

Mahout Algorithms

Mahout Scala primitives

Demo

Apache Mahout Algorithms

Some use-cases:

  • Product Recommendation: inferring what your customers are looking for
  • Topic Modeling: identifying topics in a set of documents
  • Frequent Pattern Mining: finding which entities often occur together
  • Clustering: grouping similar items, such as documents that discuss the same subject
  • Regression and Classification: predicting house prices, or identifying the class of an item (a product, document, person, etc.)
  • And many more

Basic Ideas

  • Similarity and Distance metrics
  • Vectors and Matrices
  • Statistics
  • Probability

Similarity / Distance metrics

Different Similarity metrics

  • Pearson correlation
  • Euclidean distance
  • Cosine measure
  • Spearman correlation
  • Tanimoto coefficient
  • Log likelihood test
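
Three of the metrics above can be sketched in plain Scala. This is a minimal illustration, not Mahout's implementation: the object name `SimilaritySketch` and all function names are made up for this example, and vectors are assumed to be non-zero and of equal length.

```scala
object SimilaritySketch {
  type Vec = Array[Double]

  // Dot product of two equal-length vectors.
  def dot(a: Vec, b: Vec): Double = {
    require(a.length == b.length, "vectors must have the same length")
    a.zip(b).map { case (x, y) => x * y }.sum
  }

  // Euclidean length of a vector.
  def norm(a: Vec): Double = math.sqrt(dot(a, a))

  // Cosine measure: cosine of the angle between the two vectors.
  // Returns NaN if either vector is all zeros.
  def cosine(a: Vec, b: Vec): Double = dot(a, b) / (norm(a) * norm(b))

  // Pearson correlation: the cosine measure of the mean-centered vectors.
  def pearson(a: Vec, b: Vec): Double = {
    val meanA = a.sum / a.length
    val meanB = b.sum / b.length
    cosine(a.map(_ - meanA), b.map(_ - meanB))
  }

  // Tanimoto coefficient on sets: |A intersect B| / |A union B|.
  def tanimoto[T](a: Set[T], b: Set[T]): Double =
    if (a.isEmpty && b.isEmpty) 0.0
    else (a intersect b).size.toDouble / (a union b).size
}
```

For example, `SimilaritySketch.tanimoto(Set(1, 2, 3), Set(2, 3, 4))` shares two of four distinct items and gives 0.5.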

Distance-to-similarity conversion (not the only way):

s = 1 / ( 1 + d )
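
The conversion s = 1 / (1 + d) can be sketched in plain Scala, with Euclidean distance as the input. The names `DistanceToSimilarity`, `euclidean` and `similarity` are made up for this example and are not part of Mahout.

```scala
object DistanceToSimilarity {
  // Euclidean distance between two equal-length vectors.
  def euclidean(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
  }

  // s = 1 / (1 + d): distance 0 maps to similarity 1,
  // and larger distances approach similarity 0.
  def similarity(d: Double): Double = {
    require(d >= 0.0, "distance must be non-negative")
    1.0 / (1.0 + d)
  }
}
```

The points (0, 0) and (3, 4) are at distance 5, so their similarity under this conversion is 1/6.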

Similarity / Distance metrics contd...

( figure: similarity metric selection )

Matrix

( figure: matrix )

Vector

( figure: vector )

Statistics

What are the stats almost everyone knows?

  • mean / average / expectation
  • median
  • mode

What about these?

  • variance
  • standard deviation
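
All five statistics can be sketched in a few lines of plain Scala. The object name `StatsSketch` is made up for this example. The variance here is the population variance (divide by n); dividing by n - 1 would give the sample variance instead.

```scala
object StatsSketch {
  // Arithmetic mean of a non-empty sequence.
  def mean(xs: Seq[Double]): Double = {
    require(xs.nonEmpty, "need at least one value")
    xs.sum / xs.length
  }

  // Middle value of the sorted data (average of the two middle values if n is even).
  def median(xs: Seq[Double]): Double = {
    require(xs.nonEmpty, "need at least one value")
    val s = xs.sorted
    val n = s.length
    if (n % 2 == 1) s(n / 2) else (s(n / 2 - 1) + s(n / 2)) / 2.0
  }

  // Most frequent value; if several values tie, any one of them may be returned.
  def mode(xs: Seq[Double]): Double = {
    require(xs.nonEmpty, "need at least one value")
    xs.groupBy(identity).maxBy { case (_, group) => group.size }._1
  }

  // Population variance: mean squared deviation from the mean.
  def variance(xs: Seq[Double]): Double = {
    val m = mean(xs)
    xs.map(x => (x - m) * (x - m)).sum / xs.length
  }

  // Standard deviation: square root of the variance.
  def stdDev(xs: Seq[Double]): Double = math.sqrt(variance(xs))
}
```

For the data 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5, the median is 4.5, the mode is 4, the variance is 4 and the standard deviation is 2.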

Probability

  • Conditional Probability: P(A|B) = num(A intersection B) / num(B)
  • Bayes Rule: P(A|B) = P(B|A) P(A) / P(B)
  • Probability Distribution: PMF for discrete, PDF for continuous variables
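
As a small worked sketch in plain Scala, conditional probability can be computed from counts, and Bayes' rule in its full form, P(A|B) = P(B|A) P(A) / P(B), from three probabilities. The object name `BayesSketch` and the numbers in the usage note are made up for illustration.

```scala
object BayesSketch {
  // Conditional probability from counts: P(A|B) = num(A intersection B) / num(B).
  def conditional(countAandB: Long, countB: Long): Double = {
    require(countB > 0, "B must occur at least once")
    require(countAandB >= 0 && countAandB <= countB, "A-and-B count must lie in [0, count of B]")
    countAandB.toDouble / countB
  }

  // Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B).
  def posterior(pBGivenA: Double, pA: Double, pB: Double): Double = {
    require(pB > 0.0, "P(B) must be positive")
    pBGivenA * pA / pB
  }
}
```

With made-up values P(B|A) = 0.5, P(A) = 0.2 and P(B) = 0.4, `BayesSketch.posterior(0.5, 0.2, 0.4)` gives P(A|B) = 0.25.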

Mahout Scala API

  • Vector
  • Matrix

( see the bindings )

Classification

( figure: classification )

Clustering

( figure: clustering )

Recommendation Algorithms

User Based

Item Based

Demo

  • Naive Bayes Classifier
  • Clustering the Synthetic Control Data
  • Recommendation Algorithms

What's upcoming in Mahout 1.0?

  • No new algorithms will be developed in the MapReduce (Hadoop) style, although the existing ones will remain.
  • Existing MapReduce algorithms are to be ported from MR1 to MR2.
  • All new algorithms will use the Scala math DSL, which can run seamlessly over Hadoop, Spark or other backends.

Questions

Thanks and happy coding :-)
